# Redfin Scraper Construction

This file explores the `redfin_scraper` package in order to build functions that make it easy to pull listing information from across the country.

Import packages and files:

In [16]:
from redfin_scraper import RedfinScraper
import pandas as pd
import os


zip_codes = pd.read_csv('../data/zip_code_database.csv')

The zip code file comes from this website: [US Zip Codes](https://www.unitedstateszipcodes.org/zip-code-database/#). We will filter it down to the locations we are curious about. First, a look at the raw table:

In [21]:
print(len(zip_codes))
zip_codes.head(8)

42735


Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population
0,501,UNIQUE,0,Holtsville,,Internal Revenue Service,NY,Suffolk County,America/New_York,631.0,,US,40.81,-73.04,562
1,544,UNIQUE,0,Holtsville,,Internal Revenue Service,NY,Suffolk County,America/New_York,631.0,,US,40.81,-73.04,0
2,601,STANDARD,0,Adjuntas,,"Colinas Del Gigante, Jard De Adjuntas, Urb San...",PR,Adjuntas Municipio,America/Puerto_Rico,787939.0,,US,18.16,-66.72,0
3,602,STANDARD,0,Aguada,,"Alts De Aguada, Bo Guaniquilla, Comunidad Las ...",PR,Aguada Municipio,America/Puerto_Rico,787939.0,,US,18.38,-67.18,0
4,603,STANDARD,0,Aguadilla,Ramey,"Bda Caban, Bda Esteves, Bo Borinquen, Bo Ceiba...",PR,Aguadilla Municipio,America/Puerto_Rico,787.0,,US,18.43,-67.15,0
5,604,PO BOX,0,Aguadilla,Ramey,,PR,,America/Puerto_Rico,,,US,18.43,-67.15,0
6,605,PO BOX,0,Aguadilla,,,PR,,America/Puerto_Rico,,,US,18.43,-67.15,0
7,606,STANDARD,0,Maricao,,Urb San Juan Bautista,PR,Maricao Municipio,America/Puerto_Rico,787939.0,,US,18.18,-66.98,0


Filtering out decommissioned zip codes (rows where `decommissioned` is not 0):

In [38]:
filtered_zip_codes = zip_codes[zip_codes['decommissioned']==0]
filtered_zip_codes

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population
0,501,UNIQUE,0,Holtsville,,Internal Revenue Service,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,562
1,544,UNIQUE,0,Holtsville,,Internal Revenue Service,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,0
2,601,STANDARD,0,Adjuntas,,"Colinas Del Gigante, Jard De Adjuntas, Urb San...",PR,Adjuntas Municipio,America/Puerto_Rico,787939,,US,18.16,-66.72,0
3,602,STANDARD,0,Aguada,,"Alts De Aguada, Bo Guaniquilla, Comunidad Las ...",PR,Aguada Municipio,America/Puerto_Rico,787939,,US,18.38,-67.18,0
4,603,STANDARD,0,Aguadilla,Ramey,"Bda Caban, Bda Esteves, Bo Borinquen, Bo Ceiba...",PR,Aguadilla Municipio,America/Puerto_Rico,787,,US,18.43,-67.15,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42730,99926,PO BOX,0,Metlakatla,,,AK,Prince of Wales-Outer Ketchikan Borough,America/Metlakatla,907,,US,55.14,-131.49,1140
42731,99927,PO BOX,0,Point Baker,,,AK,Prince of Wales-Hyder Census Area,America/Sitka,907,,US,56.30,-133.57,48
42732,99928,PO BOX,0,Ward Cove,,,AK,Ketchikan Gateway Borough,America/Sitka,907,,US,55.45,-131.79,1530
42733,99929,PO BOX,0,Wrangell,,,AK,Wrangell City and Borough,America/Sitka,907,,US,56.41,-131.61,2145
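
The same boolean-mask pattern extends to the other columns in the table, such as `type` and `state`. The sketch below is one possible way to narrow the list further; the helper name `filter_zip_codes` and its defaults are assumptions for illustration, and the tiny `sample` frame only stands in for the real file.

```python
import pandas as pd

# Tiny stand-in for the zip code table (the real file has many more columns).
sample = pd.DataFrame({
    "zip": [501, 601, 604, 95060],
    "type": ["UNIQUE", "STANDARD", "PO BOX", "STANDARD"],
    "decommissioned": [0, 0, 0, 1],
    "state": ["NY", "PR", "PR", "CA"],
})


def filter_zip_codes(df, states=None, types=("STANDARD",)):
    """Keep active zip codes, optionally restricted by state and zip type.

    Hypothetical helper: pass `types=None` or `states=None` to skip that filter.
    """
    mask = df["decommissioned"] == 0
    if types is not None:
        mask &= df["type"].isin(types)
    if states is not None:
        mask &= df["state"].isin(states)
    return df[mask]


# Active STANDARD zip codes in any state.
active_standard = filter_zip_codes(sample)

# Active zip codes of every type in Puerto Rico.
pr_any_type = filter_zip_codes(sample, states=["PR"], types=None)
```

Dropping `PO BOX` and `UNIQUE` rows is a judgment call: those zip codes rarely correspond to residential listings, but whether to exclude them depends on the analysis.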


Setting up the scraper:

In [22]:
scraper = RedfinScraper()
scraper.setup(zip_database_path="../data/zip_code_database.csv")

Going through an example of a city:

In [39]:
new_df = scraper.scrape(city_states=['New York, NY'])
print('Number of Rows Total', len(new_df))
print('Snippet:')
new_df.head(3)

Number of Rows Total 4725
Snippet:


Unnamed: 0,SALE TYPE,SOLD DATE,PROPERTY TYPE,ADDRESS,CITY,STATE OR PROVINCE,ZIP OR POSTAL CODE,PRICE,BEDS,BATHS,...,STATUS,NEXT OPEN HOUSE START TIME,NEXT OPEN HOUSE END TIME,URL (SEE https://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING),SOURCE,MLS#,FAVORITE,INTERESTED,LATITUDE,LONGITUDE
0,MLS Listing,,Condo/Co-op,30 Park Pl Unit 42C,New York,NY,10007.0,2750000.0,1.0,1.5,...,Active,,,https://www.redfin.com/NY/New-York/30-Park-Pl-...,REBNY,RPLU-33422328259,N,Y,40.71292,-74.008863
1,MLS Listing,,Condo/Co-op,30 Park Pl Unit 58D,New York,NY,10007.0,4550000.0,2.0,2.0,...,Active,,,https://www.redfin.com/NY/New-York/30-Park-Pl-...,REBNY,RPLU-33422296275,N,Y,40.71292,-74.008863
2,MLS Listing,,Condo/Co-op,30 Park Pl Unit 60C,New York,NY,10007.0,4199000.0,2.0,2.5,...,Active,,,https://www.redfin.com/NY/New-York/30-Park-Pl-...,REBNY,RPLU-3346130942,N,Y,40.71292,-74.008863


All the columns of this dataset:

In [35]:
new_df.columns

Index(['SALE TYPE', 'SOLD DATE', 'PROPERTY TYPE', 'ADDRESS', 'CITY',
       'STATE OR PROVINCE', 'ZIP OR POSTAL CODE', 'PRICE', 'BEDS', 'BATHS',
       'LOCATION', 'SQUARE FEET', 'LOT SIZE', 'YEAR BUILT', 'DAYS ON MARKET',
       '$/SQUARE FEET', 'HOA/MONTH', 'STATUS', 'NEXT OPEN HOUSE START TIME',
       'NEXT OPEN HOUSE END TIME',
       'URL (SEE https://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)',
       'SOURCE', 'MLS#', 'FAVORITE', 'INTERESTED', 'LATITUDE', 'LONGITUDE'],
      dtype='object')
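
Several of these column names are awkward to type, the URL column in particular. Below is a minimal sketch of a renaming step, assuming we prefer short snake_case names; the helper `tidy_columns` and the replacement names in `RENAMES` are illustrative choices, not something the `redfin_scraper` package provides.

```python
import pandas as pd

# Full name of the URL column as it appears in the scraped data.
URL_COLUMN = ("URL (SEE https://www.redfin.com/buy-a-home/"
              "comparative-market-analysis FOR INFO ON PRICING)")

# Hand-picked replacements for names that do not convert cleanly (assumed names).
RENAMES = {
    URL_COLUMN: "url",
    "$/SQUARE FEET": "price_per_sqft",
    "HOA/MONTH": "hoa_per_month",
    "MLS#": "mls_id",
}


def tidy_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with shorter, snake_case column names."""
    out = df.rename(columns=RENAMES)
    out.columns = [name.strip().lower().replace(" ", "_") for name in out.columns]
    return out


# Small demonstration frame using a few of the real column names.
demo = pd.DataFrame(columns=["SALE TYPE", "ZIP OR POSTAL CODE", "$/SQUARE FEET", URL_COLUMN, "MLS#"])
tidy = tidy_columns(demo)
```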

Saving the sample to the data folder:

In [37]:
new_df.to_csv('../data/house_listing_sample.csv', index=False)  # index=False avoids an extra 'Unnamed: 0' column on reload

### Function Creation

Now that we have a working demo, we want to wrap this process in one or more functions. We want two things: first, the zip code data itself, to get the location of each city; second, a way to scrape listings nationwide. Keep in mind that each scrape is a live snapshot of the market, so it may eventually be worth scheduling runs and finding a place to store the results. For now, we can focus on getting a static preview of the nation's housing market without worrying about time series.
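
If scheduled runs are added later, each scrape will need to be stored in a way that keeps snapshots apart. The sketch below shows one possible approach, assuming one CSV file per run: it stamps every row with the scrape time and writes to a timestamped file name. The helper `save_snapshot`, the `scraped_at` column, and the `listings_YYYYMMDD_HHMMSS.csv` naming are all assumptions for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

import pandas as pd


def save_snapshot(listings: pd.DataFrame, out_dir: str,
                  when: Optional[datetime] = None) -> Path:
    """Write one scrape to its own timestamped CSV and return the file path."""
    when = when or datetime.now(timezone.utc)

    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)

    # Record when the rows were scraped so snapshots can be stacked later.
    stamped = listings.assign(scraped_at=when.isoformat())

    path = folder / f"listings_{when:%Y%m%d_%H%M%S}.csv"
    stamped.to_csv(path, index=False)
    return path
```

A call such as `save_snapshot(new_df, '../data/snapshots')` after each scrape would then build up a folder of files that can be concatenated into a time series when that becomes relevant.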

Function to scrape *all* cities:

In [53]:
def listing_redfin_scraper(city_list: list[str], zip_database_path: str) -> pd.DataFrame:
    """
    Scrapes real estate listings data from the Redfin website for a 
    list of cities and returns the data as a Pandas DataFrame.

    Args:
        city_list (list[str]):      A list of strings that contains the names of the cities 
                                    for which you want to scrape real estate listings data.

        zip_database_path (str):    A string that specifies the file path to the ZIP code 
                                    database file that the scraper will use.

    Returns:
        pd.DataFrame:               A Pandas DataFrame that contains the real estate listings 
                                    data for all the cities in the `city_list` parameter.
    """


    # set up the scraper package
    scraper = RedfinScraper()
    scraper.setup(zip_database_path=zip_database_path)

    # collect one DataFrame per city and concatenate once at the end,
    # which avoids re-copying the accumulated rows on every iteration
    city_dfs = []
    for city in city_list:
        city_df = scraper.scrape(city_states=[city])
        if city_df is not None:
            city_dfs.append(city_df)

    # nothing scraped: return an empty DataFrame
    if not city_dfs:
        return pd.DataFrame()

    # drop instances of duplicates
    return pd.concat(city_dfs).drop_duplicates()
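
On a nationwide run, a single city that raises an error would stop the whole loop. The sketch below is one way to keep going and report failures afterwards. It is not part of `redfin_scraper`: the function name `scrape_cities_safely` and the `scrape_city` parameter are hypothetical, and the scraping call is passed in as a callable so that the logic can be tried without network access.

```python
import pandas as pd


def scrape_cities_safely(city_list, scrape_city):
    """Scrape each city, skipping the ones that fail.

    `scrape_city` is any callable that takes a "City, ST" string and returns
    a DataFrame (hypothetical parameter, see the usage note below).
    Returns the combined listings and a list of (city, error) pairs.
    """
    frames, failed = [], []
    for city in city_list:
        try:
            city_df = scrape_city(city)
        except Exception as exc:  # keep going, but remember what went wrong
            failed.append((city, repr(exc)))
            continue
        if city_df is not None and not city_df.empty:
            frames.append(city_df)

    if not frames:
        return pd.DataFrame(), failed

    combined = pd.concat(frames, ignore_index=True).drop_duplicates()
    return combined, failed
```

With the scraper set up as above, the call would look like `scrape_cities_safely(city_list, lambda city: scraper.scrape(city_states=[city]))`. Note that this version resets the row index with `ignore_index=True`, which is a design choice rather than a requirement.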

In [52]:
# every unique "City, ST" pair (many zip codes share the same primary city)
all_cities = (filtered_zip_codes['primary_city'] + ', ' + filtered_zip_codes['state']).unique().tolist()
city_list = ['Santa Cruz, CA', 'San Francisco, CA'] # sample city list used for this run
zip_database_path = "../data/zip_code_database.csv"

listing_redfin_scraper(city_list, zip_database_path)

Unnamed: 0,SALE TYPE,SOLD DATE,PROPERTY TYPE,ADDRESS,CITY,STATE OR PROVINCE,ZIP OR POSTAL CODE,PRICE,BEDS,BATHS,...,STATUS,NEXT OPEN HOUSE START TIME,NEXT OPEN HOUSE END TIME,URL (SEE https://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING),SOURCE,MLS#,FAVORITE,INTERESTED,LATITUDE,LONGITUDE
0,MLS Listing,,Single Family Residential,91 Mountain View Rd,Santa Cruz,CA,95065,1700000,5.0,2.5,...,Active,April-22-2023 01:00 PM,April-22-2023 04:00 PM,https://www.redfin.com/CA/Santa-Cruz/91-Mounta...,MLSListings,ML81925230,N,Y,37.054716,-121.976970
1,MLS Listing,,Condo/Co-op,2705 Amberwood Ln,Santa Cruz,CA,95065,600000,2.0,2.5,...,Active,,,https://www.redfin.com/CA/Santa-Cruz/2705-Ambe...,CRMLS,GD23039243,N,Y,36.989184,-121.971728
2,MLS Listing,,Vacant Land,0 Olaughlin Rd,Santa Cruz,CA,95065,649000,,,...,Active,,,https://www.redfin.com/CA/Unknown/Olaughlin-Rd...,MLSListings,ML81923883,N,Y,37.032561,-122.001070
3,MLS Listing,,Single Family Residential,781 Olaughlin,Santa Cruz,CA,95065,1095000,3.0,2.5,...,Active,,,https://www.redfin.com/CA/Santa-Cruz/781-Olaug...,MLSListings,ML81923507,N,Y,37.033047,-122.000159
4,MLS Listing,,Townhouse,3213 Stockbridge Ln,Santa Cruz,CA,95065,900000,3.0,3.0,...,Active,April-22-2023 01:00 PM,April-22-2023 03:00 PM,https://www.redfin.com/CA/Santa-Cruz/3213-Stoc...,"""bridgeMLS, Bay East AOR, or Contra Costa AOR""",41023113,N,Y,36.990164,-121.971828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1508,MLS Listing,,Condo/Co-op,1310 Fillmore St Ph 2-E,San Francisco,CA,94115,879000,1.0,1.0,...,Active,,,https://www.redfin.com/CA/San-Francisco/1310-F...,San Francisco MLS,422696015,N,Y,37.781665,-122.432139
1509,MLS Listing,,Multi-Family (2-4 Unit),1608 -1612 Folsom St,San Francisco,CA,94103,1799000,6.0,4.5,...,Active,,,https://www.redfin.com/CA/San-Francisco/1608-F...,MLSListings,ML81907550,N,Y,37.770781,-122.415372
1510,MLS Listing,,Condo/Co-op,719 Larkin St #502,San Francisco,CA,94109,377285,1.0,1.0,...,Active,,,https://www.redfin.com/CA/San-Francisco/719-La...,San Francisco MLS,422694725,N,Y,37.784526,-122.418088
1511,MLS Listing,,Condo/Co-op,72 Townsend St #807,San Francisco,CA,94107,1549000,2.0,2.0,...,Active,April-23-2023 11:00 AM,April-23-2023 01:00 PM,https://www.redfin.com/CA/San-Francisco/72-Tow...,San Francisco MLS,422693014,N,Y,37.781420,-122.390216
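
The other goal stated at the start of this section, using the zip code data to get a location for each city, can be sketched with a group-by on the zip code table. This is a rough approximation that assumes the mean latitude and longitude of a city's active zip codes is a good enough center point; the helper name `city_locations` and the `city_state` column are illustrative.

```python
import pandas as pd


def city_locations(zip_df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (city, state) with an approximate center point."""
    active = zip_df[zip_df["decommissioned"] == 0]

    # Average the coordinates of every active zip code in the same city.
    centers = active.groupby(["primary_city", "state"], as_index=False)[
        ["latitude", "longitude"]
    ].mean()

    # Same "City, ST" format that the scraper's city_states argument uses above.
    centers["city_state"] = centers["primary_city"] + ", " + centers["state"]
    return centers


# Small stand-in for the zip code table, with made-up coordinates.
zip_sample = pd.DataFrame({
    "zip": [501, 544, 603, 604],
    "decommissioned": [0, 0, 0, 1],
    "primary_city": ["Holtsville", "Holtsville", "Aguadilla", "Aguadilla"],
    "state": ["NY", "NY", "PR", "PR"],
    "latitude": [40.81, 40.83, 18.43, 18.99],
    "longitude": [-73.04, -73.06, -67.15, -67.99],
})
centers = city_locations(zip_sample)
```

The resulting `city_state` column can double as the list of cities to scrape, and the coordinates can later be joined back onto the listings by city and state.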
