# Web scraping with BeautifulSoup

In this notebook, we will look at web scraping with the BeautifulSoup package. We will collect real-time data about house prices and organise the information into a tidy data set.

To begin, we need to choose a website that contains the sold prices for properties on Sydney's North Shore. I chose domain.com.au, as it lists the sold prices of many properties in Sydney. As always, before scraping it is important to follow good etiquette and respect the site's robots.txt file. Domain.com.au allows scraping for the particular web page that I am interested in.
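
The robots.txt check can also be done in code. The following is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `sample_rules` are hypothetical and only illustrate the mechanics; they are not the actual contents of the site's robots.txt, and `is_allowed` is a helper name introduced here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only. The real file
# should be read from the site itself before scraping.
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)  # load the rules from a list of lines

def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

print(is_allowed("https://www.domain.com.au/sold-listings/"))
print(is_allowed("https://www.domain.com.au/private/page"))
```

Against the live site, calling `parser.set_url(...)` with the address of the robots.txt file, followed by `parser.read()`, would replace the `parse` call.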

From domain.com.au, we will scrape the following information for each property:
1. Street Address
2. Sold Price
3. Sold Date
4. Sold Method
5. Postcode
6. State
7. Suburb
8. Bedroom
9. Bathroom
10. Parking
11. Area

I have also filtered the search to include only properties on Sydney's North Shore, and only those that do not have the price withheld.

Let's import the libraries that we will need.

In [1]:
# import libraries
from bs4 import BeautifulSoup # cleans up the html retrieved
from requests import get # package used to retrieve html from web page
import re # regex for strings to extract what we want from each html
import pandas as pd

We will now initialise the lists. These lists will be the columns in our cleaned data set.

In [2]:
# Initialising lists, that will form our dataframe
pagenum = []
street = []
soldprice = []
solddate = []
soldmethod = []
postcode = []
state = []
suburb = []
bedroom = []
bathroom = []
parking = []
area = []

We will also need some functions for cleaning some html later, so let us define them now.

In [3]:
# Function to clean a string and remove unwanted characters
def clean_text(string):
    '''removes unwanted characters from a string'''

    chars = "\\`*_{}[]()>#+-.!$,'" # characters to remove from string
    for c in chars:
        string = string.replace(c,'').strip() # remove unwanted characters
    return string
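
To see what this function does in practice, here is a small self-contained check. The input strings are assumed examples, not text taken from the live page. Note that the full stop is among the removed characters, so a price written as `$1.5m` would lose its decimal point.

```python
def clean_text(string):
    '''removes unwanted characters from a string'''
    chars = "\\`*_{}[]()>#+-.!$,'"  # characters to remove from string
    for c in chars:
        string = string.replace(c, '').strip()
    return string

# Assumed sample inputs, for illustration only
print(clean_text("$2,180,000"))            # currency symbol and commas removed
print(clean_text(" 3a Orinoco Street, "))  # trailing comma and spaces removed
print(clean_text("$1.5m"))                 # caveat: the decimal point is removed too
```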

In [None]:
# Function to check if string has numbers
def hasNumbers(inputString):
    '''check if a string has numbers'''
    return any(char.isdigit() for char in inputString)
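
In the scraping loop further down, this check decides whether a "sold" tag contains a date. The sketch below wraps the same regular expressions in a hypothetical helper, `split_sold_tag`, and runs it on assumed tag text so the behaviour can be checked without touching the website.

```python
import re

def hasNumbers(inputString):
    '''check if a string has numbers'''
    return any(char.isdigit() for char in inputString)

def split_sold_tag(line):
    """Return (sold method, sold date); hypothetical helper for illustration."""
    if not hasNumbers(line):
        # No digits means no date, so the whole tag is treated as the method
        return (line or "-", "-")
    method = re.search("(.*) [0-9].* [a-zA-Z].* [0-9].*", line)
    date = re.search(".* ([0-9].* [a-zA-Z].* [0-9].*)", line)
    return (method.group(1) if method else "-",
            date.group(1) if date else "-")

# Assumed tag text, modelled on the values in the final table
print(split_sold_tag("Sold by private treaty 28 Jun 2019"))
print(split_sold_tag("Sold at auction"))
```

The first call separates the method from the date; the second has no digits, so the date falls back to `-`.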

Web scraping works on whatever the website shows at the time it is scraped. On the web page we are interested in, dynamic content (such as an occasional featured listing) means that the number of properties found may not match between the different containers we scrape. (This is clearer in the code.) For this reason, I set two variables as a reconciliation check, which tells us whether there is a featured item.
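
The idea behind the reconciliation check can be sketched on plain lists. This is a simplified variant and not the notebook's exact logic: it assumes the extra entry is always the first one, and it raises an error on any other mismatch where the notebook only prints a message. The names `reconcile`, `details` and `sold_tags` are introduced here for illustration.

```python
# Hypothetical page contents: the featured listing appears among the detail
# containers but has no matching "sold" tag
details = ["featured listing", "3a Orinoco Street", "10 Lynbara Avenue"]
sold_tags = ["Sold by private treaty 28 Jun 2019", "Sold at auction 26 Jun 2019"]

def reconcile(details, sold_tags):
    """Drop a leading featured item when details outnumber sold tags by one."""
    if len(details) == len(sold_tags) + 1:
        details = details[1:]  # assume the extra entry is the first one
    if len(details) != len(sold_tags):
        raise ValueError("inconsistent property count")
    return list(zip(details, sold_tags))

pairs = reconcile(details, sold_tags)
print(pairs)
```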

In [4]:
# reconciliation index
recon1 = 0
recon2 = 0

Finally, we loop through every page and extract the information we need. Some things to note:
- The scraping is done in two loops.
    - Loop 1: To get more listings, we loop through the first 50 pages of listings. Each page has its own URL, so the URL is rebuilt on every pass of this loop.
    - Loop 2: For each web page, we need to loop through each listing. We find a piece of HTML that is a good break point to loop over. In this case, each container represents a new listing, as defined by the HTML classes used in the code below.
- To look at the HTML of a page, you can right click and choose 'Inspect' (or press 'Ctrl + Shift + I' in Google Chrome).
- 'get' from the requests package is used to read in the raw HTML, which is then parsed using 'BeautifulSoup'.
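
Before running against the live site, the container idea from Loop 2 can be tried on a static snippet. The HTML below is a hypothetical, heavily simplified stand-in that reuses the class names from this notebook; the real page markup is more complex and may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML: two listings, the second without a price
sample_html = """
<div class="listing-result__details listing-result__right">
  <p class="listing-result__price">$2,180,000</p>
  <span class="address-line1">3a Orinoco Street,</span>
</div>
<div class="listing-result__details listing-result__right">
  <span class="address-line1">10 Lynbara Avenue,</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Each matching <div> is treated as one listing
containers = soup.find_all(
    "div", class_="listing-result__details listing-result__right")

rows = []
for container in containers:
    price_tag = container.find("p", class_="listing-result__price")
    street_tag = container.find("span", class_="address-line1")
    rows.append({
        # find() returns None when the tag is missing, so guard before .text
        "street": street_tag.text.strip() if street_tag else "-",
        "price": price_tag.text.strip() if price_tag else "-",
    })

print(rows)
```

Passing the full class string to `class_` matches the exact value of the `class` attribute, so the class names must appear in the same order as in the page source.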

In [5]:

# Loop through every web page
for page in range(1,51):
    
    print("Scraping page ", page, " of 50")
    
    # Get the url we want (filtered to only those with prices)
    # We will only get suburbs around the north shore area due to 50 page
    # limitation.    
    url = "https://www.domain.com.au/sold-listings/?suburb=chatswood-nsw-206"+\
    "7,castlecrag-nsw-2068,castle-cove-nsw-2069,lindfield-nsw-2070,lane" +\
    "-cove-north-nsw-2066,crows-nest-nsw-2065,artarmon-nsw-2064,northbr" +\
    "idge-nsw-2063,cammeray-nsw-2062,kirribilli-nsw-2061,lavender-bay-ns"+\
    "w-2060,killara-nsw-2071,gordon-nsw-2072,pymble-nsw-2073,turramurra-"+\
    "nsw-2074,st-ives-nsw-2075,wahroonga-nsw-2076,willoughby-nsw-2068,ro"+\
    "seville-nsw-2069,hornsby-nsw-2077,waverton-nsw-2060,wollstonecraft-"+\
    "nsw-2065,north-sydney-nsw-2060,milsons-point-nsw-2061,warrawee-nsw-"+\
    "2074,east-lindfield-nsw-2070,east-killara-nsw-2071&excludepricewithh"+\
    "eld=1" + "&page=" + str(page)
    
    response = get(url)
    
    # Format using soup
    html_soup = BeautifulSoup(response.text, 'html.parser')
    
    # Grab all the details, separated by each property:
    # right -> all property details except sold method and date
    # left -> property sold method and date
    house_container_right = html_soup.find_all('div',\
        class_="listing-result__details listing-result__right")    
    house_container_left = html_soup.find_all('span',\
        class_="listing-result__tag is-sold")

    # Loop through each property
    for container in house_container_right:
        
        recon1 += 1 # increment reconciliation index
        
        pagenum.append(page) # keep track of which web page, for debugging
        
        # Extract the information we need for each property
        house_price = container.find('p', class_="listing-result__price")
        house_street = container.find('span', class_="address-line1")
        house_location = container.find('span', class_="address-line2")
        house_interior = container.find('div',\
         class_="property-features__default-wrapper")
    
        #==================== Street ================
        if not house_street:
            street.append("-")
        else:
            street.append(clean_text(house_street.text))
        
        #================== Sold price ==============
        if not house_price:
            soldprice.append("-")
        else:
            soldprice.append(re.sub("[^0-9]", "",clean_text(house_price.text)))
            
        #========= Postcode, State, Suburb ==========
        # Location contains postcode, state and suburb.
        # We will need to split out this information.
        
        if not house_location:
            postcode.append("-")
            state.append("-")
            suburb.append("-")
        else: 
            line = house_location.text # grab only the text from house_location
            
            # regex to separate the postcode, state and suburb
            postcode_regex = re.search(".*([0-9][0-9][0-9][0-9])", line)
            state_regex = re.search(".* (.+?) [0-9][0-9][0-9][0-9]", line)
            suburb_regex = re.search("(.*) [A-Z]* [0-9][0-9][0-9][0-9]", line)
            
            # append to each list        
            if not postcode_regex:
                postcode.append("-")
            else:
                postcode.append(postcode_regex.group(1))
            if not state_regex:
                state.append("-")
            else:
                state.append(state_regex.group(1))
            if not suburb_regex:
                suburb.append("-")
            else:
                suburb.append(suburb_regex.group(1))
        
        #==== Bedroom, Bathroom, Parking, Area =====
        if not house_interior:
            bedroom.append(float('NaN'))
            bathroom.append(float('NaN'))
            parking.append(float('NaN'))
            area.append(float('NaN'))
        else:
            line = house_interior.text
            
            # regex to extract the bedroom, bathroom, parking and area values
            bedroom_regex = re.search("([0-9]+) Beds.*", line)
            bathroom_regex = re.search("([0-9]+) Baths.*", line)
            parking_regex = re.search("([0-9]+) Parkings.*", line)
            area_regex = re.search("([0-9]+)m²", line)
            
            if not bedroom_regex:
                bedroom.append(float('NaN'))
            else:
                bedroom.append(bedroom_regex.group(1))
            if not bathroom_regex:
                bathroom.append(float('NaN'))
            else:
                bathroom.append(bathroom_regex.group(1))
            if not parking_regex:
                parking.append(float('NaN'))
            else:
                parking.append(parking_regex.group(1))
            if not area_regex:
                area.append(float('NaN'))
            else:
                area.append(area_regex.group(1))
        
    # Loop through each property
    for container in house_container_left:
        
        recon2 += 1
        
        #=============== Sold date, Sold method ===========
        line = container.text
        
        # Check if the sold information includes a date
        if hasNumbers(line):
            
            # regex to separate the sold method and sold date
            soldmethod_regex = re.search("(.*) [0-9].* [a-zA-Z].* [0-9].*",\
                                         line)
            solddate_regex = re.search(".* ([0-9].* [a-zA-Z].* [0-9].*)", line)
            
            if not soldmethod_regex:
                soldmethod.append("-")
            else:
                soldmethod.append(soldmethod_regex.group(1))
            if not solddate_regex:
                solddate.append("-")
            else:
                solddate.append(solddate_regex.group(1))
        else:
            if not line:
                soldmethod.append("-")
            else:
                soldmethod.append(line)
            
            solddate.append("-")
            
    # If first property is a featured property, it may not have all the
    # properties we need, so pop it off. Note that there is not always a 
    # featured item
    if page == 1 and recon1 == recon2 + 1:
        print("Exclude Featured Item")
        pagenum.pop(0)
        street.pop(0)
        soldprice.pop(0)
        postcode.pop(0)
        state.pop(0)
        suburb.pop(0)
        bedroom.pop(0)
        bathroom.pop(0)
        parking.pop(0)
        area.pop(0)
        recon1 -= 1

    # Check that there is no mismatch in property count
    if not recon1 == recon2:
        print("Page ", page, " had inconsistent property count. Recon1 = ",
              recon1, ", Recon2 = ", recon2)
    recon1 = recon2 = 0

Scraping page  1  of 50
Exclude Featured Item
Scraping page  2  of 50
Scraping page  3  of 50
Scraping page  4  of 50
Scraping page  5  of 50
Scraping page  6  of 50
Scraping page  7  of 50
Scraping page  8  of 50
Scraping page  9  of 50
Scraping page  10  of 50
Scraping page  11  of 50
Scraping page  12  of 50
Scraping page  13  of 50
Scraping page  14  of 50
Scraping page  15  of 50
Scraping page  16  of 50
Scraping page  17  of 50
Scraping page  18  of 50
Scraping page  19  of 50
Scraping page  20  of 50
Scraping page  21  of 50
Scraping page  22  of 50
Scraping page  23  of 50
Scraping page  24  of 50
Scraping page  25  of 50
Scraping page  26  of 50
Scraping page  27  of 50
Scraping page  28  of 50
Scraping page  29  of 50
Scraping page  30  of 50
Scraping page  31  of 50
Scraping page  32  of 50
Scraping page  33  of 50
Scraping page  34  of 50
Scraping page  35  of 50
Scraping page  36  of 50
Scraping page  37  of 50
Scraping page  38  of 50
Scraping page  39  of 50
Scraping pag
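
The regular expressions used inside the loop can be exercised on their own. The sketch below wraps them in two hypothetical helpers, `parse_location` and `parse_features`, and feeds them assumed input strings; the text on the live page may be laid out differently.

```python
import re

def parse_location(line):
    """Split a string like 'SUBURB STATE 1234' into (suburb, state, postcode)."""
    postcode = re.search(".*([0-9][0-9][0-9][0-9])", line)
    state = re.search(".* (.+?) [0-9][0-9][0-9][0-9]", line)
    suburb = re.search("(.*) [A-Z]* [0-9][0-9][0-9][0-9]", line)
    return tuple(m.group(1) if m else "-" for m in (suburb, state, postcode))

def parse_features(line):
    """Pull bedroom, bathroom, parking and area values out of a features string."""
    patterns = {
        "bedroom": "([0-9]+) Beds.*",
        "bathroom": "([0-9]+) Baths.*",
        "parking": "([0-9]+) Parkings.*",
        "area": "([0-9]+)m²",
    }
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, line)
        found[name] = match.group(1) if match else float("nan")
    return found

# Assumed input formats, for illustration only
print(parse_location("CROWS NEST NSW 2065"))
print(parse_features("5 Beds 3 Baths 4 Parkings 404m²"))
print(parse_features("2 Beds 1 Bath 1 Parking 1,003m²"))
```

The last call shows two limitations of these patterns: singular labels such as `1 Bath` do not match, and an area written with a thousands separator is captured as `003`. This may explain some of the blank cells and the `003` area in the final table, though that is a guess.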

We now have all the data that we scraped, across 50 different pages! Next, we convert the lists into one pandas DataFrame, by first making a dictionary.

In [6]:
# Convert lists to a pandas dataframe
d = { # create dictionary of the lists
    'Page':pagenum
    ,'Street':street
    ,'Sold Price':soldprice
    ,'Sold Date':solddate
    ,'Sold Method':soldmethod
    ,'Postcode':postcode
    ,'State':state
    ,'Suburb':suburb
    ,'Bedroom':bedroom
    ,'Bathroom':bathroom
    ,'Parking':parking
    ,'Area':area
}

# Convert dictionary to a dataframe
df = pd.DataFrame(d)
display(df)

| | Page | Street | Sold Price | Sold Date | Sold Method | Postcode | State | Suburb | Bedroom | Bathroom | Parking | Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3a Orinoco Street | 2180000 | 28 Jun 2019 | Sold by private treaty | 2073 | NSW | PYMBLE | 5 | 3 | 4 | 404 |
| 1 | 1 | 1/1115 Curagul Road | 1485000 | 28 Jun 2019 | Sold by private treaty | 2074 | NSW | TURRAMURRA | 4 | 2 | 2 | |
| 2 | 1 | 54/2 Warrangi Street | 712000 | 28 Jun 2019 | Sold by private treaty | 2074 | NSW | TURRAMURRA | 2 | 2 | | |
| 3 | 1 | 13/655A Pacific Highway | 1320000 | 27 Jun 2019 | Sold prior to auction | 2067 | NSW | CHATSWOOD | 3 | 2 | 2 | |
| 4 | 1 | 3/191 West Street | 890000 | 27 Jun 2019 | Sold by private treaty | 2065 | NSW | CROWS NEST | 2 | | | |
| 5 | 1 | 1406/90 George Street | 718000 | 27 Jun 2019 | Sold by private treaty | 2077 | NSW | HORNSBY | 2 | 2 | | |
| 6 | 1 | 103/38 Alfred Street | 905000 | 27 Jun 2019 | Sold prior to auction | 2061 | NSW | MILSONS POINT | | | | |
| 7 | 1 | 10 Lynbara Avenue | 1680000 | 26 Jun 2019 | Sold at auction | 2075 | NSW | ST IVES | 4 | | 2 | 949 |
| 8 | 1 | 8 Toolang Road | 2033000 | 26 Jun 2019 | Sold at auction | 2075 | NSW | ST IVES | 4 | 2 | 2 | 929 |
| 9 | 1 | 15 Kirkpatrick Street | 1800000 | 26 Jun 2019 | Sold at auction | 2074 | NSW | TURRAMURRA | 4 | 3 | 2 | 003 |
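
The scraped values are stored as text, with `-` standing in for missing entries. A possible follow-up step, which is not part of the original notebook, is to convert the columns to numeric and date types. The sketch below uses a small hypothetical frame, `df_sample`, shaped like the scraped lists, and assumes the dates follow the `28 Jun 2019` format seen in the table.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped lists: numbers held as strings,
# "-" or NaN where a value was missing
df_sample = pd.DataFrame({
    "Sold Price": ["2180000", "-", "712000"],
    "Sold Date": ["28 Jun 2019", "-", "27 Jun 2019"],
    "Bedroom": ["5", float("nan"), "2"],
})

# errors="coerce" turns anything unparseable (such as "-") into NaN / NaT
for col in ["Sold Price", "Bedroom"]:
    df_sample[col] = pd.to_numeric(df_sample[col], errors="coerce")
df_sample["Sold Date"] = pd.to_datetime(
    df_sample["Sold Date"], format="%d %b %Y", errors="coerce")

print(df_sample.dtypes)
```

With proper dtypes in place, sorting by date or averaging prices by suburb works without further string handling.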
