This notebook scrapes the real estate website Zolo.com using a CSV of addresses across Toronto. Since each listing is on a separate page, multiple pages with unique URLs must be scraped to get the data for each address. Luckily, the URL for each page is the same apart from the unique address at the end. Note that requests_html may not work in PyCharm; however, it does work in an Anaconda Jupyter Notebook.
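
As a minimal sketch of the URL pattern described above: the base URL is the one used in the scraping cell of this notebook, while `listing_url` and the sample slugs passed to it are only illustrative names, not part of the original code.

```python
# Base URL as it appears in the scraping cell of this notebook
BASE_URL = 'https://www.zolo.ca/toronto-real-estate/'

def listing_url(address_slug):
    """Return the listing URL for an address slug such as '7-hilo-road'.
    (listing_url is a hypothetical helper used only for illustration.)"""
    return BASE_URL + address_slug

# Build the URLs for two sample addresses
urls = [listing_url(slug) for slug in ['7-hilo-road', '15-muskoka-avenue']]
print(urls[0])
```

Only the final path segment changes from page to page, which is why the address format has to match the site's format exactly.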

In [3]:
# Import 3rd party libraries
import os
import pandas as pd

from bs4 import BeautifulSoup
import requests

from requests_html import AsyncHTMLSession
import asyncio

# Configure Notebook
import warnings
warnings.filterwarnings('ignore')
%config Completer.use_jedi = False

First, let's read all the addresses from the CSV into a DataFrame. Next, the addresses must be reformatted so that they match the format of the address at the end of each URL.

In [4]:
addresses = pd.read_csv('address-points-4326.csv')
# make every address lower case and replace spaces with hyphens to match the URL format
addresses['address'] = addresses['ADDRESS_FULL'].apply(lambda x: (x.replace(' ','-')).lower())

# map each short-form street type to its full form
street_types = {
    'rd': 'road', 'st': 'street', 'blvd': 'boulevard', 'crcl': 'circle',
    'crct': 'circuit', 'crt': 'court', 'cs': 'close', 'dr': 'drive',
    'gdns': 'gardens', 'grn': 'green', 'grv': 'grove', 'gt': 'gate',
    'hts': 'heights', 'lwn': 'lawn', 'pk': 'park', 'pkwy': 'parkway',
    'pl': 'place', 'ptwy': 'pathway', 'rwdy': 'roadway', 'sq': 'square',
    'ave': 'avenue', 'bdge': 'bridge', 'ter': 'terrace', 'trl': 'trail',
    'wds': 'woods',
}

def expand_street_type(address):
    """Expand the street type only when it is the last hyphen-separated part of the address.
    A plain str.replace would also rewrite the same letters elsewhere in the address
    (for example the 'st' in 'stone')."""
    head, sep, last = address.rpartition('-')
    return head + sep + street_types.get(last, last)

# replace short-form street notation with its full form for every street type
addresses['address'] = addresses['address'].apply(expand_street_type)
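
One pitfall when expanding abbreviations is that `str.replace` rewrites every occurrence of the substring, not only the one at the end of the address. The sketch below contrasts that with expanding only the final hyphen-separated token. The address `10-stone-st` and the helper `expand_last_token` are made up for illustration.

```python
def expand_last_token(address, street_types):
    """Expand the street type only when it is the final hyphen-separated token.
    (expand_last_token is a hypothetical helper used only for illustration.)"""
    head, sep, last = address.rpartition('-')
    return head + sep + street_types.get(last, last)

street_types = {'st': 'street', 'rd': 'road'}  # small illustrative subset

# str.replace touches every 'st', including the one inside 'stone'
naive = '10-stone-st'.replace('st', 'street')

# the token-based version only touches the trailing 'st'
safe = expand_last_token('10-stone-st', street_types)

print(naive)  # 10-streetone-street
print(safe)   # 10-stone-street
```

A URL built from the first result would not point at a real listing, so such an address would silently drop out of the scrape.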

Create a function that extracts the important information from the HTML of each web page and returns a single-row DataFrame, with each type of information as a column name and its value in that column.

In [5]:
def extract_info(script):
    """Take the parsed HTML of a housing page from Zolo.com and return the geographical
    information, as well as the housing information, in the form of a one-row DataFrame"""

    # isolate the <script> block with isResidentialProperty, latitude, longitude, and neighborhood
    loc = script.find_all('script')[1]
    loc_lines = str(loc).split('\n') # split the block into lines once

    info = {} # create empty dictionary

    # lines 5 to 12 of the isolated block hold the location info as "key: value" pairs
    for j in range(5, 13):
        info_rough = loc_lines[j].replace(" ", "").split(':')
        info[info_rough[0]] = info_rough[1]

    # find the <div> elements with class column-label and column-value
    column_labels = script.find_all('div', class_="column-label")
    column_values = script.find_all('div', class_="column-value")

    # clean each column label by removing the HTML code and leaving the label
    column_labels = [
        str(label).replace('<div class="column-label">', "").replace('</div>', "")
        for label in column_labels]

    # clean each column value by removing the HTML code and leaving the value
    column_values = [
        str(value).replace(
            '<div class="column-value"><span class="priv">', "").replace('</span></div>', "")
        for value in column_values]

    # assign each column value to its label; zip stops at the shorter list if the lengths differ
    for label, value in zip(column_labels, column_values):
        info[label] = value

    df = pd.DataFrame([info]) # create a one-row DataFrame from the info dictionary

    df.drop(columns = ['sarea','mapArea','propertyId','searchCity'], inplace = True) # drop unneeded columns

    return df
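
As the output further down shows, the location values keep their quotes and trailing commas after the split (for example `"43.5901",`). The sketch below is one possible way to tidy such `key: value,` lines. `clean_pair` is a hypothetical helper, and the sample lines are modelled on the values in the output rather than copied from a real page.

```python
def clean_pair(line):
    """Split one 'key: value,' line and strip the trailing comma and quotes.
    (clean_pair is a hypothetical helper used only for illustration.)"""
    key, _, value = line.strip().partition(':')
    value = value.strip().rstrip(',').strip('"')
    return key.strip(), value

# illustrative lines, modelled on the values shown in the output table
sample_lines = [
    '    isResidentialProperty: true,',
    '    propertyLat: "43.5901",',
    '    searchNeighborhood: "long-branch",',
]

cleaned = dict(clean_pair(line) for line in sample_lines)
print(cleaned)
```

Cleaned values such as `43.5901` can then be converted to numbers with `float` where that is useful.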

Since there are over 500,000 addresses in our address list, and it takes about an hour to scrape every 5,000, it is useful to scrape in intervals in case any issues occur. Use the cell below to set the range of addresses you want to scrape. As written, the slice `iloc[0:10]` selects only the first 10 addresses; widen it (for example, to `iloc[0:10000]`) to scrape the first 10,000.

In [6]:
addresses_list = list(addresses['address'].iloc[0:10]) # assign range of addresses to scrape through
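
A minimal sketch of how the interval boundaries could be generated instead of typed by hand. `batch_ranges` is a hypothetical helper, and the totals below are illustrative numbers rather than the size of the real address file.

```python
def batch_ranges(total, batch_size):
    """Yield (start, stop) index pairs that cover range(total) in batches.
    (batch_ranges is a hypothetical helper used only for illustration.)"""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

# illustrative numbers: 12,000 addresses scraped in batches of 5,000
ranges = list(batch_ranges(12_000, 5_000))
print(ranges)  # [(0, 5000), (5000, 10000), (10000, 12000)]
```

Each pair can be passed straight to the slice, as in `addresses['address'].iloc[start:stop]`, and reused in the name of the checkpoint file for that batch.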

It is now time to scrape each page. Zolo.com is set up so that some key information (like the sold price) is only visible if you have an account. Once you make an account, you get a 'magic link' which allows you to see the full page. Since this link redirects from the original page to a verification step and then to the full page, typical scraping did not work. Async had to be used to wait for the page to redirect and fully load.

In [6]:
async def main():
    """This function uses async to access the 'magic link' which allows users to see the full page.
    It loops through the addresses in addresses_list and returns the HTML of each page parsed with
    Beautiful Soup, or None for an address whose page could not be fetched"""
    
    scripts = [] # create an empty list for the parsed HTML of every page
    
    # loop through each address in addresses_list and try to fetch and parse its page
    for address in addresses_list:
        try:
            asession = AsyncHTMLSession() # create the AsyncHTMLSession

            # wait for the whole page to redirect and load
            r = await asession.get(
                'https://www.zolo.ca/sign-in?np=d68d4ef8-8ddf-11ed-94e5-bc764e102e1e&nu=' + # magic link
                'https://www.zolo.ca/toronto-real-estate/' # zolo link
                +address) # address portion of the URL
            
            house = r.text # get the text from each web page
            soup = BeautifulSoup(house, 'html.parser') # parse it with Beautiful Soup
            scripts.append(soup) # append to scripts list
        except Exception:
            # if the page cannot be fetched, store None so scripts stays aligned with addresses_list
            scripts.append(None)

    return scripts # return list of all parsed pages
        
data = await main() # assign the output from the main function to a variable, data
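
If a request fails and nothing is stored for it, the list of pages becomes shorter than the list of addresses, and any later lookup by position pairs pages with the wrong address. The sketch below shows one way to keep the two lists aligned by storing a placeholder. `fake_fetch` is a stand-in for the real request and fails on purpose for one made-up address; none of these names come from requests_html.

```python
import asyncio

async def fake_fetch(address):
    """Stand-in for the real page request; fails for one address on purpose."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if address == 'no-such-address':
        raise ValueError('page not found')
    return '<html>' + address + '</html>'

async def fetch_all(address_list):
    """Return one entry per address, using None where the fetch failed."""
    pages = []
    for address in address_list:
        try:
            pages.append(await fake_fetch(address))
        except Exception:
            pages.append(None)  # placeholder keeps positions aligned
    return pages

sample = ['7-hilo-road', 'no-such-address', '15-muskoka-avenue']
pages = asyncio.run(fetch_all(sample))
print(pages)
```

Inside a Jupyter notebook an event loop is already running, so `await fetch_all(sample)` would be used in place of `asyncio.run`.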

Now that we have scraped the HTML for the addresses in addresses_list, we can use our extract_info function to get the information we want into a DataFrame.

In [7]:
dfs = [] # create an empty list

# loop through the addresses and their scripts and extract the important information
for address, script in zip(addresses_list, data):
    try:
        df = extract_info(script) # apply the extract_info function to each script in data
        df['address'] = address # name a column address, and add the address of each house
        df.set_index(['address'],drop=True, inplace=True) # set addresses as index
        dfs.append(df) # add this DataFrame to the dfs list
    except Exception:
        pass # if there is no data in the script, then skip it

final = pd.concat(dfs) # concat all DataFrames in dfs to create one DataFrame
final.head()

address                 isResidentialProperty  propertyLat  propertyLng  searchNeighborhood
27-thirty-sixth-street  true,                  "43.5901",   "-79.5352",  "long-branch",
15-muskoka-avenue       true,                  "43.5915",   "-79.5319",  "long-branch",
399-lake-promenade      true,                  "43.5869",   "-79.5395",  "long-branch",
7-hilo-road             true,                  "43.5883",   "-79.5404",  "long-branch",
387-lake-promenade      true,                  "43.5872",   "-79.539",   "long-branch",

(The remaining columns are empty in these five rows: Days on Market, Date Suspended, List Date, Last Status, Expiry Date, Unavailable Date, ..., Fronting On, Frontage, Lot Depth, Water, Pool, Sewer, Zoning, Cross Street, Municipality District, Lot Code.)


Since we only scraped a portion of the data, let's store that data in its own CSV as a checkpoint.

In [None]:
final.to_csv('save-0-10,000.csv')

After we have scraped the data for all addresses, we can merge these checkpoint files together and create one final CSV with all the house information.

In [None]:
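# collect only the checkpoint files (named 'save-...csv'), not the address list or earlier outputs
house_files = [file for file in os.listdir() if file.startswith('save-') and file.endswith('.csv')]

In [None]:
# read each checkpoint csv with address as the index and concat them into one DataFrame
houses_data = pd.concat(pd.read_csv(file, index_col='address') for file in house_files)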

Finally, let's save all our data to one CSV for future use.

In [None]:
houses_data.to_csv('houses_data.csv')