## Creating the dataset

We used the "Realtor" website for our research. Our goal was to explore and examine what affects the price of a house.

While searching for a suitable website, we came across many real estate websites that would not let us pull their house listing data. Some blocked us, and others limited the amount of data that could be pulled.

Thus, we decided to "trick" the website.

The content of the website is not static: it is rendered dynamically by sending a POST request to the Realtor.com API. The response to that request returns the website's data as a JSON-formatted string.

The trick is to “fake” the POST request in our Python script, pretending that we are the website requesting the data from the API.
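
To make the idea concrete, here is a minimal sketch of how such a request can be assembled by hand. It uses only the standard library (`urllib.request`) rather than the Requests library used later in this notebook, so the request can be inspected without touching the network. The URL `https://example.com/api/search` and the payload fields are placeholders chosen for illustration, not the real endpoint or query.

```python
import json
import urllib.request

# Placeholder endpoint and payload (assumptions for illustration only).
API_URL = "https://example.com/api/search"
payload = {"variables": {"query": {"state_code": "NY"}, "limit": 42, "offset": 0}}


def build_post_request(url: str, payload: dict) -> urllib.request.Request:
    """Assemble a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_post_request(API_URL, payload)
print(request.get_method())  # POST
print(request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be: urllib.request.urlopen(request)
```

From the API's point of view, a request built this way looks much like the one the browser sends, as long as the URL, headers and payload match.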

After searching for the desired state (New York in our project), we pressed F12 and clicked on “Network” and then on “Fetch/XHR”. There we can see which requests the website sends when someone accesses it.

When clicking on the “Response” tab, we can see that the response to the request is the data of the real estate listings that are displayed on the web page.

With a POST request, we need to send a request payload that specifies what data we want to retrieve.

We identified the URL to send our request to and the payload to send along with it. We then requested the result pages one by one, collected the JSON data of each page in a list, and looped over that list to extract the data.
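
The paging logic can be sketched as follows. It assumes, as in the payload we captured, that each page holds 42 listings and that the `offset` field counts how many listings to skip. The helper name `page_to_offset` is ours and does not come from the website.

```python
PAGE_SIZE = 42  # listings per page, taken from the "limit" field of the captured payload


def page_to_offset(page_number: int, page_size: int = PAGE_SIZE) -> int:
    """Return how many listings to skip to reach a 1-based page number."""
    if page_number < 1:
        raise ValueError("page_number starts at 1")
    return (page_number - 1) * page_size


for page in (1, 2, 3):
    print(page, page_to_offset(page))
# 1 0
# 2 42
# 3 84
```

This agrees with the captured payload, where page 2 was requested with an offset of 42.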

Each “json_data” response contains a list of real estate listings (under the keys "data", "home_search" and "results"), and our goal was to extract the desired information from it. We looped over each listing, created a feature dictionary for it, appended that dictionary to a list, and finally built a Pandas DataFrame from the list.
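
The shape of that loop is sketched below on a hand-made response. The nesting follows the response we saw, but the two listings are invented sample values, and only a few of the features are extracted to keep the example short.

```python
# Invented sample data with the same nesting as the API response.
sample_response = {
    "data": {
        "home_search": {
            "results": [
                {"list_price": 100000, "description": {"beds": 3, "sqft": 1200}},
                {"list_price": 250000, "description": {"beds": 4, "sqft": None}},
            ]
        }
    }
}

rows = []
for entry in sample_response["data"]["home_search"]["results"]:
    # Flatten the nested listing into one dictionary per row.
    rows.append(
        {
            "price": entry["list_price"],
            "beds": entry["description"]["beds"],
            "sqft": entry["description"]["sqft"],
        }
    )

print(rows[0])
```

A list of flat dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)`, with each key becoming a column.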


We used the json, Pandas, Requests and NumPy libraries to perform the scraping.
The features we scraped for each listing on the website were:

    • price
    • beds
    • baths
    • garage
    • stories
    • house_type
    • lot_sqft
    • sqft
    • year_built
    • address
    • state
    • city
    • lat
    • lon
    • county


Then we saved the DataFrame to the "RealEstateNewYork.csv" file.

## Setup

In [1]:
import requests 
import json 
import pandas as pd
import numpy as np

In [2]:
def send_request(page_number: int, offset_parameter: int):
    url = "https://www.realtor.com/api/v1/hulk?client_id=rdc-x&schema=vesta"
    headers = {"content-type": "application/json"}

    body = (
        r'{"query":"\n\nquery ConsumerSearchMainQuery($query: HomeSearchCriteria!, $limit: Int, $offset: Int, $sort: [SearchAPISort], $sort_type: SearchSortType, $client_data: JSON, $geoSupportedSlug: String!, $bucket: SearchAPIBucket, $by_prop_type: [String])\n{\n  home_search: home_search(query: $query,\n    sort: $sort,\n    limit: $limit,\n    offset: $offset,\n    sort_type: $sort_type,\n    client_data: $client_data,\n    bucket: $bucket,\n  ){\n    count\n    total\n    results {\n      property_id\n      list_price\n      primary_photo (https: true){\n        href\n      }\n      source {\n        id\n        agents{\n          office_name\n        }\n        type\n        spec_id\n        plan_id\n      }\n      community {\n        property_id\n        description {\n          name\n        }\n        advertisers{\n          office{\n            hours\n            phones {\n              type\n              number\n            }\n          }\n          builder {\n            fulfillment_id\n          }\n        }\n      }\n      products {\n        brand_name\n        products\n      }\n      listing_id\n      matterport\n      virtual_tours{\n        href\n        type\n      }\n      status\n      permalink\n      price_reduced_amount\n      other_listings{rdc {\n      listing_id\n      status\n      listing_key\n      primary\n    }}\n      description{\n        beds\n        baths\n        baths_full\n        baths_half\n        baths_1qtr\n        baths_3qtr\n        garage\n        stories\n        type\n        sub_type\n        lot_sqft\n        sqft\n        year_built\n        sold_price\n        sold_date\n        name\n      }\n      location{\n        street_view_url\n        address{\n          line\n          postal_code\n          state\n          state_code\n          city\n          coordinate {\n            lat\n            lon\n          }\n        }\n        county {\n          name\n          fips_code\n        }\n      }\n      '
        r'tax_record {\n        public_record_id\n      }\n      lead_attributes {\n        show_contact_an_agent\n        opcity_lead_attributes {\n          cashback_enabled\n          flip_the_market_enabled\n        }\n        lead_type\n      }\n      open_houses {\n        start_date\n        end_date\n        description\n        methods\n        time_zone\n        dst\n      }\n      flags{\n        is_coming_soon\n        is_pending\n        is_foreclosure\n        is_contingent\n        is_new_construction\n        is_new_listing (days: 14)\n        is_price_reduced (days: 30)\n        is_plan\n        is_subdivision\n      }\n      list_date\n      last_update_date\n      coming_soon_date\n      photos(limit: 2, https: true){\n        href\n      }\n      tags\n      branding {\n        type\n        photo\n        name\n      }\n    }\n  }\n  geo(slug_id: $geoSupportedSlug) {\n    parents {\n      geo_type\n      slug_id\n      name\n    }\n    geo_statistics(group_by: property_type) {\n      housing_market {\n        by_prop_type(type: $by_prop_type){\n          type\n           attributes{\n            median_listing_price\n            median_lot_size\n            median_sold_price\n            median_price_per_sqft\n            median_days_on_market\n          }\n        }\n        listing_count\n        median_listing_price\n        median_rent_price\n        median_price_per_sqft\n        median_days_on_market\n        median_sold_price\n        month_to_month {\n          active_listing_count_percent_change\n          median_days_on_market_percent_change\n          median_listing_price_percent_change\n          median_listing_price_sqft_percent_change\n        }\n      }\n    }\n    recommended_cities: recommended(query: {geo_search_type: city, limit: 20}) {\n      geos {\n        ... '
        r'on City {\n          city\n          state_code\n          geo_type\n          slug_id\n        }\n        geo_statistics(group_by: property_type) {\n          housing_market {\n            by_prop_type(type: [\"home\"]) {\n              type\n              attributes {\n                median_listing_price\n              }\n            }\n            median_listing_price\n          }\n        }\n      }\n    }\n    recommended_neighborhoods: recommended(query: {geo_search_type: neighborhood, limit: 20}) {\n      geos {\n        ... on Neighborhood {\n          neighborhood\n          city\n          state_code\n          geo_type\n          slug_id\n        }\n        geo_statistics(group_by: property_type) {\n          housing_market {\n            by_prop_type(type: [\"home\"]) {\n              type\n              attributes {\n                median_listing_price\n              }\n            }\n            median_listing_price\n          }\n        }\n      }\n    }\n    recommended_counties: recommended(query: {geo_search_type: county, limit: 20}) {\n      geos {\n        ... on HomeCounty {\n          county\n          state_code\n          geo_type\n          slug_id\n        }\n        geo_statistics(group_by: property_type) {\n          housing_market {\n            by_prop_type(type: [\"home\"]) {\n              type\n              attributes {\n                median_listing_price\n              }\n            }\n            median_listing_price\n          }\n        }\n      }\n    }\n    recommended_zips: recommended(query: {geo_search_type: postal_code, limit: 20}) {\n      geos {\n        ... '
        r'on PostalCode {\n          postal_code\n          geo_type\n          slug_id\n        }\n        geo_statistics(group_by: property_type) {\n          housing_market {\n            by_prop_type(type: [\"home\"]) {\n              type\n              attributes {\n                median_listing_price\n              }\n            }\n            median_listing_price\n          }\n        }\n      }\n    }\n  }\n}","variables":{"query":{"status":["for_sale","ready_to_build"],"primary":true,"state_code":"NY"},"client_data":{"device_data":{"device_type":"web"},"user_data":{"last_view_timestamp":-1}},"limit":42,"offset":42,"zohoQuery":{"silo":"search_result_page","location":"New York","property_status":"for_sale","filters":{},"page_index":"2"},"sort_type":"relevant","geoSupportedSlug":"","by_prop_type":["home"]},"operationName":"ConsumerSearchMainQuery","callfrom":"SRP","nrQueryType":"MAIN_SRP","visitor_id":"eff16470-ceb5-4926-8c0b-6d1779772842","isClient":true,"seoPayload":{"asPath":"/realestateandhomes-search/New-York/pg-2","pageType":{"silo":"search_result_page","status":"for_sale"},"county_needed_for_uniq":false}}'
    )
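    json_body = json.loads(body)

    # Point the captured payload (originally page 2) at the requested page.
    # The page index lives under variables -> zohoQuery and is a string in the payload.
    json_body["variables"]["zohoQuery"]["page_index"] = str(page_number)
    json_body["seoPayload"]["asPath"] = f"/realestateandhomes-search/New-York/pg-{page_number}"
    json_body["variables"]["offset"] = offset_parameter

    r = requests.post(url=url, json=json_body, headers=headers)
    r.raise_for_status()  # fail early if the request was blocked or rejected
    json_data = r.json()
    return json_data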

In [3]:
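offset_parameter = 0  # number of listings to skip; grows by one page (42 listings) per request

json_data_list = []

# Request pages 1 to 206 of the New York search results
for page_number in range(1, 207):
    json_data = send_request(page_number, offset_parameter=offset_parameter)
    json_data_list.append(json_data)
    offset_parameter += 42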
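
The loop above uses a fixed number of pages (`range(1, 207)`). Since the query also asks for `total` under `home_search`, the page count could instead be derived from the first response. The sketch below shows only the arithmetic. `pages_needed` is a hypothetical helper, the `total` value is illustrative, and the sketch assumes `total` is the number of matching listings.

```python
import math


def pages_needed(total: int, page_size: int = 42) -> int:
    """Number of pages required to cover `total` listings."""
    return math.ceil(total / page_size)


# Invented response fragment with the same nesting as the API response.
first_response = {"data": {"home_search": {"count": 42, "total": 8651}}}
total = first_response["data"]["home_search"]["total"]
print(pages_needed(total))  # 206
```

With this in place, the loop bound would become `range(1, pages_needed(total) + 1)` instead of a hard-coded number.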

In [4]:
def extract_features(entry: dict):
    feature_dict = {
        "price": entry["list_price"],
        "beds": entry["description"]["beds"],
        "baths": entry["description"]["baths"],
        "garage": entry["description"]["garage"],
        "stories": entry["description"]["stories"],
        "house_type": entry["description"]["type"],
        "lot_sqft": entry["description"]["lot_sqft"],
        "sqft": entry["description"]["sqft"],
        "year_built": entry["description"]["year_built"],
        "address": entry["location"]["address"]["line"],
        "state": entry["location"]["address"]["state_code"],
        "city": entry["location"]["address"]["city"],
    }
    
    # Coordinates and county may be missing for a listing, so add them only when present.
    if entry["location"]["address"]["coordinate"]:
        feature_dict["lat"] = entry["location"]["address"]["coordinate"]["lat"]
        feature_dict["lon"] = entry["location"]["address"]["coordinate"]["lon"]
    if entry["location"]["county"]:
        feature_dict["county"] = entry["location"]["county"]["name"]
    
    return feature_dict
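
`extract_features` indexes directly into nested dictionaries, which raises an error if an intermediate value such as `description` is missing or `None`. If that ever happens, a defensive lookup along the following lines could be used. `dig` is a hypothetical helper written for this sketch, not part of any library, and the sample entry is invented.

```python
from typing import Any


def dig(data: Any, *keys: str, default: Any = None) -> Any:
    """Follow `keys` through nested dicts; return `default` if a step is missing or None."""
    current = data
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
    return default if current is None else current


# Invented listing in which the coordinate and the county are missing.
entry = {"location": {"address": {"city": "Granville", "coordinate": None}, "county": None}}

print(dig(entry, "location", "address", "city"))               # Granville
print(dig(entry, "location", "address", "coordinate", "lat"))  # None
print(dig(entry, "location", "county", "name"))                # None
```

The trade-off is that silent `None` values can hide a change in the response format, so direct indexing remains the stricter choice.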

In [5]:
feature_dict_list = []

for data in json_data_list:
    for entry in data["data"]["home_search"]["results"]:
        feature_dict = extract_features(entry=entry)
        feature_dict_list.append(feature_dict)

df = pd.DataFrame(feature_dict_list)
df


|  | price | beds | baths | garage | stories | house_type | lot_sqft | sqft | year_built | address | state | city | lat | lon | county |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 159900.0 | 4.0 | 2.0 | 1.0 | NaN | mobile | 17424.0 | 1788.0 | 1973.0 | 90 E Main St | NY | Granville | 43.405985 | -73.251559 | Washington |
| 1 | 294900.0 | 3.0 | 2.0 | 2.0 | 2.0 | single_family | 74052.0 | 996.0 | 2011.0 | 16326 Ontario Shores Dr | NY | Sterling | 43.404835 | -76.635019 | Cayuga |
| 2 | 225000.0 | 3.0 | 2.0 | 1.0 | NaN | single_family | 30056.0 | 1224.0 | 1973.0 | 38 Pine Cir | NY | Newfield | 42.357008 | -76.607137 | Tompkins |
| 3 | 149000.0 | 4.0 | 2.0 | 2.0 | NaN | single_family | 223898.0 | 1608.0 | 1900.0 | 8 Gridleyville Rd | NY | Spencer | 42.223019 | -76.430742 | Tioga |
| 4 | 599999.0 | 4.0 | 2.0 | NaN | 2.0 | single_family | 7307.0 | 1827.0 | 1858.0 | 59 Hamilton Ave | NY | Oyster Bay | 40.874207 | -73.531903 | Nassau |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8647 | 975000.0 | 1.0 | 2.0 | NaN | NaN | condos | NaN | 826.0 | 1987.0 | 75 Wall St Apt 24K | NY | New York | 40.705042 | -74.008117 | New York |
| 8648 | 1195000.0 | 2.0 | 2.0 | NaN | 29.0 | condos | NaN | NaN | 1984.0 | 311 E 38th St Apt 10E | NY | New York City | 40.747242 | -73.973217 | New York |
| 8649 | 689000.0 | 3.0 | 3.0 | 1.0 | 2.0 | single_family | 7475.0 | 2100.0 | 2022.0 | 223 Endicott Ave | NY | Elmsford | 41.063712 | -73.809473 | Westchester |
| 8650 | 1862500.0 | 5.0 | 4.0 | 1.0 | NaN | single_family | 4920.0 | 2750.0 | 1955.0 | 147-40 8th Ave | NY | Whitestone | 40.792845 | -73.819437 | Queens |


### Saving to a CSV file

In [6]:
df.to_csv('RealEstateNewYork.csv', index=False)