# Web Scrape Realtor Data for Properties in Bay Area

In this workbook we scrape realtor.com listings for every zip code in the Bay Area to fetch data on the residential properties currently listed in the following counties:
1. 'Santa Clara County'
2. 'San Mateo County'
3. 'Alameda County'
4. 'Contra Costa County'
5. 'Marin County'
6. 'Napa County'
7. 'San Francisco County'
8. 'Solano County'
9. 'Sonoma County' 

Together, these nine counties make up the Bay Area.

### Goal

The goal of this exercise is to scrape the listing data and use it to understand the current listings: price changes, listing patterns, and how prices vary by area.

It should also show which zip codes are hot in the market right now and what the general trend for properties looks like.

# Please Note - 

Make sure the following requirements are met before you execute this code.

1. Ensure the zip code and income data files are in the same directory as the code.
2. Pass False to the main function, main(False), if you are running the code for the first time. On later runs, pass True, main(True), so that the property links already fetched are reused instead of being searched for again.
3. Change the value of the county_list variable in the main function to run the code for the counties of your interest.



Also, the code does not stop when the website returns a bad response. This was done so that scraping continues even if one of the user-agent headers is blocked. However, by the end of the project the website appeared to have made changes that block all calls from the same IP, irrespective of the header.

The code will end abruptly if it faces an issue while processing the links.
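
Because the code keeps going after a bad response, a blocked IP can produce a long run of failed requests. One way to guard against that is to stop after several consecutive failures. The sketch below is a hypothetical addition, not part of this notebook's code; `FailureGuard` and its threshold are assumed names and values chosen for illustration.

```python
class FailureGuard:
    """Counts consecutive failed requests (hypothetical helper, not in the notebook)."""

    def __init__(self, max_consecutive_failures=5):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, succeeded):
        """Record one request outcome; return True while scraping should go on."""
        if succeeded:
            # Any success resets the streak
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures < self.max_consecutive_failures


guard = FailureGuard(max_consecutive_failures=3)
# Simulated outcomes: True = page fetched, False = e.g. an HTTP 403 response
for outcome in [True, False, False, False, True]:
    if not guard.record(outcome):
        print("Too many consecutive failures, stopping")
        break
```

In a scraping loop, `record` would be called with `soup is not None` after each request, so that a single blocked header is tolerated while a sustained block ends the run.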

Should you have any questions please let us know.

dsaraswat2@horizon.cseastbay.edu

### Import the libraries needed to connect and parse the data

In [1]:
# The libraries below are needed to make the URL requests, wait between requests (time),
# parse the website's response (Beautiful Soup), process the JSON data ingested from the website,
# and work with the data in pandas.

import requests
import re
import time  # Used to sleep between requests
import urllib.request
import urllib.parse
import urllib.error
import ssl
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as BS
import json
import pandas as pd
import csv
import random
import os
from pathlib import Path

# Function to fetch the soup URL.

In [2]:
## This function requests the link passed to it and returns the parsed response.
## It returns a soup object, or None if the request or parsing fails.

# The logic is to randomly select a User-Agent header and use it to request the page from the server.
# If the request fails or the response cannot be parsed by BS, the function returns None.

def home_for_sale_soup(link):
    # Build an SSL context with hostname checking and certificate verification turned off.
    # Note: ctx is not passed to urlopen below, so the request uses the default SSL settings.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    url = link
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 OPR/94.0.0.0'
        #'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'

    ]
    try:
        # Create the Request object and read the response from the website into an object
        usr_agent = random.choice(user_agents)
        req = Request(url, headers={'User-Agent':usr_agent})
        webpage = urlopen(req).read()
        #print(webpage)
        # Parse the object with Beautiful Soup with the html parser.
        soup=BS(webpage,'html.parser')
        return soup
    except Exception  as e:
        print("An error has occured:", str(e))
        print("This is the user agent used ", usr_agent)
        return None

## Function to identify if the zipcode search has any property listed or not.

In [3]:
### This function checks for zip codes where no properties are listed.
# It returns True if the "no matching results" element is present on the page, otherwise False.

def check_if_no_results(soup):
    no_results_find = soup.find("p",{"data-testid":"results-header-nomatch-results"})
    return no_results_find is not None

## Function to Search Property urls on the search page results

In [4]:
## This function takes the search-results soup as input, finds the property page links, and returns
# a list with the link of each property on that realtor.com results page.
def links(soup):
    try:
        # find_all collects every div tag with the class card-image-wrapper
        cards = soup.find_all('div',{"class":"card-image-wrapper"})
        list_of_links = []
        for item in cards:
            # Prefix the site domain and strip the tracking query string from the href
            list_of_links.append('https://www.realtor.com'+item.a.get('href').replace("?from=srp-list-card",""))
        return list_of_links
    except Exception as e:
        print("Issue with fetching the links:", e)
        return None
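
The link cleanup in this function removes one specific query string with `replace`. As an alternative sketch, the standard library's `urllib.parse` can join the relative href to the site root and drop any query string. The helper name `to_property_url` is hypothetical and is not used elsewhere in this notebook.

```python
from urllib.parse import urljoin, urlsplit

SITE_ROOT = "https://www.realtor.com"  # same domain the notebook prefixes to each href

def to_property_url(href):
    """Turn a relative href from a result card into an absolute URL without a query string."""
    absolute = urljoin(SITE_ROOT, href)
    # Drop the query string and fragment, keep scheme, host and path
    return urlsplit(absolute)._replace(query="", fragment="").geturl()


href = "/realestateandhomes-detail/482-Panorama-Dr_Benicia_CA_94510_M24989-57457?from=srp-list-card"
print(to_property_url(href))
```

Unlike the `replace` call, this also removes query strings other than `?from=srp-list-card`, which may or may not be desirable depending on what the site puts there.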

# Function to fetch the json data from the property soup

In [5]:
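# Returns the text of the first <script type="application/json"> tag on the property page,
# or None if the tag is missing.
def data_from_prop_url(soup):
    try:
        script_tag = soup.find('script',{'type':'application/json'})
        if script_tag is None:
            print("No JSON script tag found in the property page")
            return None
        return script_tag.text
    except Exception as e:
        print("Issue with fetching the property data:", e)
        return None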

## Function to get the number of pages for the property search result

Some zip codes have more than one page of results. To identify all the pages, we read the total results count from the results feed; each page holds 42 properties, so we calculate the number of pages from that count and scan that many pages for the zip code.

For example, if a zip code has 45 properties, it has two pages, so we also fetch page 2 for that zip code.
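
The page arithmetic described above can be sketched as follows. This is a minimal illustration, assuming 42 listings per results page as stated above; `pages_for_zip` and `RESULTS_PER_PAGE` are hypothetical names and not part of the notebook's code.

```python
import math

RESULTS_PER_PAGE = 42  # page size stated in the text above

def pages_for_zip(total_results):
    """Return how many result pages are needed for a zip code (hypothetical helper)."""
    if total_results <= 0:
        return 0
    # Round up so that a partly filled last page is still counted
    return math.ceil(total_results / RESULTS_PER_PAGE)


for total in (30, 42, 45, 100):
    print(total, "listings ->", pages_for_zip(total), "page(s)")
```

Note that the notebook's own `num_of_pages_for_zip` returns the unrounded ratio and the caller truncates it with `int`, so this is an alternative formulation rather than the code used here.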


## Function to build the list of urls for each page for a zipcode

In [6]:
## Build the links for the additional result pages that the soup function will be called on.
# We take the base URL of a zip code property search (the first page) as input.
# Since the search result may span more than one page, we iterate over the number of
# additional pages and build the URL for each of them.

def build_zip_links(url, num_of_pages):
    base_url = url
    search_result_url_list = []
    # The first page is the base URL itself and has already been fetched, so it is skipped here.
    # Additional pages start at /pg-2, hence i+1 is appended as the page number.
    for i in range(1, num_of_pages+1):
        link = base_url+'/pg-'+str(i+1)
        search_result_url_list.append(link)
    return search_result_url_list
            
    

# Read zip_code data file and process it in pandas 

This function processes the zip code data file, which contains the zip codes and their details for all areas in the USA.

We keep only the rows for CA and clean up some data issues.

Then we select the zip codes for the counties of interest, passed as input, to get the resulting DataFrame.

In [7]:
## Process the zip code data file and load the rows for California and the counties of interest into a DataFrame.

def zip_file_processing(path, county_list):
    zipcode_df = pd.read_csv(path)
    # Keep only California; copy so the fill below does not act on a view of the original frame
    zipcode_df = zipcode_df[zipcode_df["state"] == "CA"].copy()
    # Fill missing county names with 'San Joaquin County'
    zipcode_df["county"] = zipcode_df["county"].fillna('San Joaquin County')
    # Keep only the zip codes that belong to the counties of interest
    county_zipcode_df = zipcode_df[zipcode_df["county"].isin(county_list)]

    return county_zipcode_df

# Function to write the data to a file

Instead of opening a file and writing the data each time, use this function with the filename, data and file_type arguments; it appends the data to the file as a new line.


In [8]:
# Function to append the data to a file as a new line.
## The file_type functionality was removed later; the parameter remains so that the existing calls keep working.

def write_data_to_file(filename, data, file_type):
    # Open in append mode so earlier lines are preserved
    with open(filename,'a') as file:
        file.write(data+'\n')
        

In [9]:
# Reads the total results count from the search page and returns total / 42 (42 listings per page).
# The value is not rounded; the caller compares it with 1.0 and truncates it with int().
def num_of_pages_for_zip(soup):
    try:
        # find looks for the div tag whose data-testid is total-results
        results = soup.find('div',{"data-testid":"total-results"}).text
        if results is not None:
            num_of_pages = int(results)/42
            return num_of_pages
        else:
            return None
    except Exception as e:
        print("Issue with fetching the counts")
        return None
    

In [10]:
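# Copies the unique lines of filename into outfile. A set is used, so the original line order is not kept.
# Returns True on success and False if reading or writing fails.
def remove_duplicate(filename,outfile):
    try:
        unique_line = set()
        with open(filename,'r') as file:
            for line in file:
                unique_line.add(line.strip())

        with open(outfile,'w') as file1:
            for line in unique_line:
                file1.write(line+'\n')
        return True
    except Exception as e:
        print("Issue with removing duplicates:", e)
        return False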
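
Since `remove_duplicate` collects the lines in a set, the output file does not keep the order in which links were scraped. If that order matters, for example to resume from a known position, a dictionary can be used instead. The sketch below is an assumption-laden alternative, not the notebook's method; `unique_lines` is a hypothetical helper, and unlike the function above it also drops blank lines.

```python
def unique_lines(lines):
    """Return stripped, non-empty lines with duplicates removed, keeping the first occurrence."""
    stripped = (line.strip() for line in lines)
    # dict keys keep insertion order, so the first occurrence of each line wins
    return list(dict.fromkeys(s for s in stripped if s))


sample = ["a\n", "b\n", "a\n", "\n", "c\n", "b\n"]
print(unique_lines(sample))  # ['a', 'b', 'c']
```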



# Process Flow to Fetch the links for all the properties

We process zip_code_database.csv (the "zip data") to fetch the zip codes for the counties of interest.

For each of those zip codes, we build the URL that searches the properties listed on realtor.com.

We iterate over all the search result pages of each zip code and fetch the links of the properties listed on those pages.

The fetched links are stored in the property_links.csv file, which is used later to fetch the details of each individual property:

property details, price, type, tax history, price history, school details, neighbourhood details and much more.


In [11]:
# Function to fetch links for all the properties for the zip code in the counties list.

def web_scrape_search_pages(base_url, zip_file_path, csv_file_path, property_link_file_path,
                           county_list):
    
    # Load the zip code data for the counties of interest.
    county_zipcode_df = zip_file_processing(zip_file_path, county_list)
    
    try:
        # Now start building the urls for each of the zipcodes iteratively
        for county in county_list:
            zips = county_zipcode_df.loc[county_zipcode_df["county"] == county,["zip","primary_city","county"]]
            print("Working on county : ", county)
            zip_codes = list(zips["zip"])
            # Build the links for each zipcode for base url search
            for i in zip_codes:
                time.sleep(random.randint(3, 6))
                url = base_url + str(i)
                print("Working for zipcode : ", i)
                # Check the soup for that zip code using the home_for_sale_soup function by passing the link
                soup = home_for_sale_soup(url)
                #print(soup)

                # Check the return object and notice if we have received the proper response.
                if soup is not None:
                    #print("Hello")

                    # First check whether there are any results for that zip code; if not, continue to the next zip
                    if check_if_no_results(soup):
                        print("There is no results for the zipcode : ", i)
                        continue

                    # parse the data for the main page first and check the number of pages on the main page -

                    num_pages = num_of_pages_for_zip(soup)
                    #print(num_pages)
                    if num_pages is None:
                        print("Could not fetch the number of pages for zip:",i)
                        raise Exception("Sorry, could not fetch the numbers")  

                    # Get the property links from the main page that was already downloaded.
                    # The approach is to write the links to a csv file, so that we have a record
                    # of the links and properties that have been scanned.

                    #write_data_to_file(csv_file_path,url,'csv')

                    # Get the links for the properties of this page -
                    prop_links = links(soup)

                    #Check if the links are fetched properly or not
                    if prop_links is not None:
                        for item in prop_links:
                            write_data_to_file(property_link_file_path,item,'csv')

                    else:
                        print("Stopping due to an error while fetching the property links")
                        raise Exception(f"Issue with property links fetch for {i}")

                    # Now that we have the number of pages and the zip code, build the URLs for the remaining pages
                    if num_pages > 1.0 : # more than one page of results, so build the links for all the pages

                        zip_code_next_pgs_links = build_zip_links(url,int(num_pages))
                        # Lets iterate over the constructed zipcode list of links for all pages
                        for item in zip_code_next_pgs_links: # For all the pages write the link to the zipcode links
                            write_data_to_file(csv_file_path,item,'csv')
                            # Now call the soup to get the property url from each page using item.
                            page_soup = home_for_sale_soup(item)
                            if page_soup is not None:
                                #get the property url's
                                page_prop_link = links(page_soup)
                                if page_prop_link is not None:
                                    # Use a separate loop variable so the zip code in i is not overwritten
                                    for prop_link in page_prop_link:
                                        write_data_to_file(property_link_file_path,prop_link,'csv')
                                else:
                                    print("Breaking the inner loop due to error at the links of property fetch")
                                    break
                            else:
                                print("Issues with getting the soup let's break the loop and get out")
                                break

                else:
                    # The search page could not be fetched for this zip code; move on to the next one
                    print(f"Issue with processing the zipcode for {i} ")
                    continue
        return True
    except Exception as e:
        print("Issue with data processing , exiting abruptly for zip code : ",i)
        return False


## Function to remove the data from property links data file

## Main function

This function runs the whole process flow.

The code that fetches the data for the individual properties has not been moved into a separate function. This avoids continuous unattended running and makes it easier to make ad hoc changes to the flow as needed.

Here you should focus on updating the following values for the main function:

1. county_list --> the list of all the counties for which you need to fetch the property details.
In our case it holds the CA Bay Area counties.
2. result_passed --> pass True or False for it when calling the main function.

Pass False on the first call so that the function searches for the properties listed in each zip code; otherwise it skips the search and fetches the details for the properties whose links were collected earlier.

After one run, once you have all the links, you can call this function as many times as needed by passing True.
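
The choice between True and False could also be derived from whether the links file already exists. The following is only a sketch under that assumption; `links_already_fetched` is a hypothetical helper that is not part of this notebook, and the URL written in the demo is a placeholder.

```python
import tempfile
from pathlib import Path

def links_already_fetched(path):
    """True if the links file exists and is non-empty (hypothetical helper)."""
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0


# Demonstrate on a temporary directory so no real file is touched
with tempfile.TemporaryDirectory() as tmp:
    links_file = Path(tmp) / "property_links.csv"
    before = links_already_fetched(links_file)   # file not created yet
    links_file.write_text("https://www.realtor.com/realestateandhomes-detail/placeholder\n")
    after = links_already_fetched(links_file)    # file now holds one link

print(before, after)
```

With such a helper, the call could be written as `main(links_already_fetched('property_links.csv'))`, although passing the flag by hand keeps the control explicit.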

In [12]:
# Define the variables and call the function to test the process flow.
def main(result_passed):
    base_url = "https://www.realtor.com/realestateandhomes-search/"

    #Get the zipcodes from by reading from the zipcode data file. 
    zip_file_path = "zip_code_database.csv"
    # Define the files to write the zipcode links 
    csv_file_path = 'zip_code_links.csv'
    property_link_file_path = 'property_links.csv'
    
    # Change the county list to fetch the data for the California counties of interest; the full list is in the comments below.
    #county_list = ['Santa Clara County','San Mateo County',
    #               'Alameda County', # Done with this as well
    #               'Contra Costa County','Marin County','Napa County', # already scanned in the first pass
    #               'San Francisco County',  # Done with SFO, will probably scan it again later
    #               'Solano County','Sonoma County']
    county_list=['Solano County']
    
    # File names used to remove the duplicate links from the property_links file.

    filename = 'property_links.csv'
    outfile = 'property_links_clean.csv'
    property_data_dir = 'property_data/'

    
    #result = web_scrape_search_pages(base_url, zip_file_path, csv_file_path, property_link_file_path,county_list)
    if result_passed is True:
        result = True
    else:
        result = web_scrape_search_pages(base_url, zip_file_path, csv_file_path, property_link_file_path,county_list)
    
    # When result_passed is True, result is set to True so that the links already collected are always processed.

    if result:
        # Remove the duplicate properties from the properties links
        if remove_duplicate(filename,outfile):
            
            # Check if we have the folder already created or not. If not then create the folder here.
            os.makedirs(property_data_dir,exist_ok=True)
            
            
            # Process the links to fetch the property data. This could have been a function call; instead the
            # full code is written here, in order to keep control and kill the code manually to avoid IP bans.
            
            with open(outfile,'r') as file3:
                for line in file3:
                    # For each link call the soup function get the soup.
                    # Again we will sleep for random time before we make a new request. 
                    url = line.strip()
                    prop_file_name = property_data_dir+url.rsplit('/',1)[-1]+'.json'
                    # Now let's call the random function to get the sleep before a new request. 
                    #time.sleep(random.randint(2,6))
                    #time.sleep(20)
                    try:
                        prop_soup = home_for_sale_soup(url)
                        #print(prop_soup)
                        if prop_soup is not None:
                            # Call the soup parser to fetch the data fields. 
                            data = data_from_prop_url(prop_soup) # Get the fields data from the soup

                            json_data = json.loads(data)  # Convert the data to a json data 

                            #outcome = get_data_fields_for_property(json_data) # Get the data fields from that JSON

                            print("writing for property: ",prop_file_name)

                            with open(prop_file_name,mode='w') as file:
                                json.dump(json_data,file)
                                file.write('\n')
                                #print("Write the content for the file correctly,", url)
                            #time.sleep(random.randint(4,10))

                        else:
                            print("Issue with fetching the data for the property",url)
                            time.sleep(random.randint(15,16))
                            continue
                    except Exception as e:
                        print("Issue with the soup fetching ", e, url)
                        time.sleep(random.randint(15,16))
                        continue
            #### This is where the fetching should stop
        else:
            print("Issue with removing the duplicates from file : ", filename)
    else:
        print("We should never come here !!!")

## Call the Main function

Call the main function to fetch the property data, passing the flag that matches the state of your current execution.

The value is set to True for now; change it to False if you are running the code for the first time.

In [18]:
# Change the value to False if you are executing this code for the first time.
# This will fetch the property links for the zip codes.

# Note 1: make sure the zip code file is in the same folder as the code.
# Note 2: kill the code manually if it starts showing 403 or similar errors.

# Killing the code is kept manual to make sure we do not end up with a blocked IP.
### This could have been automated; however, in the interest of time and the scope of the project,
### it was left manual, which gives more control over the execution.

main(True)

An error has occured: HTTP Error 403: Forbidden
This is the user agent used  Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50
Issue with fetching the data for the property https://www.realtor.com/realestateandhomes-detail/482-Panorama-Dr_Benicia_CA_94510_M24989-57457


KeyboardInterrupt: 

In [None]:
soup = home_for_sale_soup('https://www.realtor.com/realestateandhomes-search/94510')
results = num_of_pages_for_zip(soup)
results