## Introduction

This function scrapes property listings from the http://www.rightmove.co.uk website and converts them to a format suitable for analysis. It returns the results as a pandas DataFrame and also saves them to a *csv* file, which can be imported into most analytics platforms.

To use the function, first go to the rightmove website and run a search for whatever property type you are interested in, for example all properties to rent in London. Then copy the long URL from the first results page and pass it to the function as the first argument; the URL must contain the `&index=` parameter, which the function uses to step through the result pages. For the second argument, pass either the string 'rent' or 'buy' to denote what was searched for.

Run the function with the 2 arguments as described and it will extract the price, property type, address details, and individual URL for each property listing. Where the address contains one, the function also extracts the postcode stem (e.g. 'SW1') into a separate column, and it extracts the number of bedrooms from the property type into another.
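
As a rough illustration of these two extraction steps, the sketch below applies the same regular expressions that the function uses to a few sample strings. It uses Python's standard `re` module rather than pandas, and the helper names `postcode_stem` and `bedroom_count` are made up for this example; they are not part of the function. Unlike the function, the sketch converts the bedroom count to an integer.

```python
import re

# Same patterns as used in rightmove_webscrape
POSTCODE_STEM = re.compile(r'\b([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?)\b')
BEDROOMS = re.compile(r'\b([\d][\d]?)\b')


def postcode_stem(address):
    """Return the first postcode stem found in an address, or None."""
    match = POSTCODE_STEM.search(address)
    return match.group(1) if match else None


def bedroom_count(property_type):
    """Return the number of bedrooms as an int, 0 for a studio, or None."""
    if 'studio' in property_type.lower():
        return 0
    match = BEDROOMS.search(property_type)
    return int(match.group(1)) if match else None


print(postcode_stem('Albion Drive, London, E8'))
print(postcode_stem('Parkholme Road, Hackney'))
print(bedroom_count('3 bedroom maisonette'))
print(bedroom_count('Studio flat'))
```

An address without a recognisable stem, such as 'Parkholme Road, Hackney', gives no match, which is why some rows in the results have an empty postcode.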

If your rightmove search returns more than one page of results, the function cycles through the pages and collects all the data (so give it a minute to run if your search criteria return thousands of results!)
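
The paging logic can be sketched on its own. The example below assumes 24 results per page (the value hard-coded in the function, and matching `numberOfPropertiesPerPage=24` in the example URL further down) and derives the `index` value of each page from a total result count. It builds each page URL with the standard library's `urllib.parse`, which is an alternative to the string split the function itself uses and may reorder the query parameters. The helper names `page_indexes`, `with_index` and `page_urls` are hypothetical, and the sample URL is a shortened version of a real search URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

RESULTS_PER_PAGE = 24  # assumption: matches the page size used in the function


def page_indexes(result_count, per_page=RESULTS_PER_PAGE):
    """Return the 'index' offset of every results page for a given result count."""
    # Ceiling division: a partly filled last page still needs its own request
    page_count = (result_count + per_page - 1) // per_page
    return [page * per_page for page in range(page_count)]


def with_index(url, index):
    """Return url with its 'index' query parameter set to the given value."""
    parts = urlsplit(url)
    # Keep every parameter except any existing 'index', then append the new one
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != 'index']
    query.append(('index', str(index)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(first_page_url, result_count):
    """Return one URL per results page."""
    return [with_index(first_page_url, i) for i in page_indexes(result_count)]


# Shortened sample URL, for illustration only
sample_url = ('http://www.rightmove.co.uk/property-to-rent/find.html'
              '?locationIdentifier=REGION%5E70417&index=0&viewType=LIST')

for url in page_urls(sample_url, 50):
    print(url)
```

With 50 results this yields three URLs, one each for the offsets 0, 24 and 48.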

In [1]:
def rightmove_webscrape(rightmove_url,rent_or_buy):

# imports
    from lxml import html, etree
    import requests
    import pandas as pd
    import datetime as dt
    
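# Get the start & end of the web url around the index value
    start,end = rightmove_url.split('&index=')
    url_start = start+'&index='
# Drop the old index value (however many digits it has) and keep the rest of the url
    url_end = end.lstrip('0123456789')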
    
# Initialise the variables which will store the data
    price_pcm, titles, addresses, weblinks =[],[],[],[]

# Initialise a pandas DataFrame to store the results
    df=pd.DataFrame(columns=['price','type','address','url'])

# Get the total number of results returned by the search
    page = requests.get(rightmove_url)
    tree = html.fromstring(page.content)
    xp_result_count = '//span[@class="searchHeader-resultCount"]/text()'
    result_count = int(tree.xpath(xp_result_count)[0].replace(",", ""))
    
# Convert the total number of search results into the number of iterations required for the loop
# (24 results per page; integer division keeps loop_count an int under both Python 2 and 3)
    loop_count = result_count//24
    if result_count%24>0:
        loop_count = loop_count+1
        
# Set the Xpath variables for the loop
    if rent_or_buy=='rent':
        xp_prices = '//span[@class="propertyCard-priceValue"]/text()'
    elif rent_or_buy=='buy':
        xp_prices = '//div[@class="propertyCard-priceValue"]/text()'
    else:
        raise ValueError("rent_or_buy must be either 'rent' or 'buy'")
        
    xp_titles = '//div[@class="propertyCard-details"]//a[@class="propertyCard-link"]//h2[@class="propertyCard-title"]/text()'
    xp_addresses = '//address[@class="propertyCard-address"]/text()'
    xp_weblinks = '//div[@class="propertyCard-details"]//a[@class="propertyCard-link"]/@href'

# Start the loop through the search result web pages
    for pages in range(0,loop_count,1):
        rightmove_url = url_start+str(pages*24)+url_end
        page = requests.get(rightmove_url)
        tree = html.fromstring(page.content)
        
# Reset variables
        price_pcm, titles, addresses, weblinks =[],[],[],[]

# Create data lists from Xpaths
        for val in tree.xpath(xp_prices):
            price_pcm.append(val)
        for val in tree.xpath(xp_titles):
            titles.append(val)
        for val in tree.xpath(xp_addresses):
            addresses.append(val)
        for val in tree.xpath(xp_weblinks):
            weblinks.append('http://www.rightmove.co.uk'+val)

# Convert data to temporary DataFrame
        data = [price_pcm, titles, addresses, weblinks]
        temp_df= pd.DataFrame(data)
        temp_df = temp_df.transpose()
        temp_df.columns=['price','type','address','url']
        
# Drop empty rows from DataFrame which come from placeholders in rightmove html
        temp_df = temp_df[temp_df.url != 'http://www.rightmove.co.uk'+'/property-for-sale/property-0.html']
    
# Join temporary DataFrame to main results DataFrame
        frames = [df,temp_df]
        df = pd.concat(frames)

# Renumber results DataFrame index to remove duplicate index values
    df = df.reset_index(drop=True)

# Convert price column to numeric values for analysis
# (strip non-digit characters; prices with no digits become NaN)
    df['price'] = df['price'].replace(regex=True,to_replace=r'\D',value=r'')
    df['price'] = pd.to_numeric(df['price'],errors='coerce')

# Extract postcode stems to a separate column
    df['postcode'] = df['address'].str.extract(r'\b([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?)\b',expand=False)
    
# Extract number of bedrooms from 'type' to a separate column (studios count as 0 bedrooms)
    df['number_bedrooms'] = df['type'].str.extract(r'\b([\d][\d]?)\b',expand=False)
    df.loc[df['type'].str.contains('studio',case=False,na=False),'number_bedrooms']=0

# Add in search_date column to record the date the search was run (i.e. today's date)
    now = dt.datetime.today().strftime("%d/%m/%Y")
    df['search_date'] = now

# Export the results to CSV 
    csv_filename = 'rightmove_results_'+str(dt.datetime.today().strftime("%Y_%m_%d %H %M %S"))+'.csv'
    df.to_csv(csv_filename,encoding='utf-8')

# Print message to validate search has run showing number of results received and name of csv file.
    print("%d results saved as ' %s '" % (len(df),csv_filename))
    return df

## Example use of the function

In this example I have gone to http://www.rightmove.co.uk/ and performed a search for 1 bedroom flats to rent in the London Fields area of East London, filtering to show only listings added to the website in the last 7 days. From the first page of results I have copied the long URL from the address bar, and am setting it to a variable called *rent_url*:

In [2]:
rent_url = 'http://www.rightmove.co.uk/property-to-rent/find.html?locationIdentifier=REGION%5E70417&numberOfPropertiesPerPage=24&radius=0.0&sortType=6&index=0&propertyTypes=detached%2Csemi-detached%2Cterraced%2Cflat%2Cbungalow&includeLetAgreed=false&viewType=LIST&areaSizeUnit=sqft&currencyCode=GBP'

Now I simply run the function on this variable, passing 'rent' as the second argument since I have searched for rental properties:

In [3]:
df = rightmove_webscrape(rent_url,'rent')

200 results saved as ' rightmove_results_2016_12_19 12 16 17.csv '


We can look at the first few rows of data to see how the results appear:

In [4]:
df.head()

| | price | type | address | url | postcode | number_bedrooms | search_date |
|---|---|---|---|---|---|---|---|
| 0 | 2925 | 3 bedroom apartment | Vibe Apartments, Dalston, E8 | http://www.rightmove.co.uk/property-to-rent/pr... | E8 | 3 | 19/12/2016 |
| 1 | 2817 | 3 bedroom flat | Parkholme Road, London, E8 | http://www.rightmove.co.uk/property-to-rent/pr... | E8 | 3 | 19/12/2016 |
| 2 | 2817 | 3 bedroom maisonette | Parkholme Road, Hackney | http://www.rightmove.co.uk/property-to-rent/pr... | | 3 | 19/12/2016 |
| 3 | 2000 | 2 bedroom flat | Albion Drive, London, E8 | http://www.rightmove.co.uk/property-to-rent/pr... | E8 | 2 | 19/12/2016 |
| 4 | 1712 | 1 bedroom flat | Atkins Square, Hackney Downs, E8 | http://www.rightmove.co.uk/property-to-rent/pr... | E8 | 1 | 19/12/2016 |


And finally just an example of what can be done with the data - let's produce the link(s) for the cheapest listings returned by our search:

In [5]:
import pandas as pd
pd.options.display.max_colwidth = 150
df[df.price==df.price.min()][['price','url']]

| | price | url |
|---|---|---|
| 165 | 700 | http://www.rightmove.co.uk/property-to-rent/property-61807508.html |


### Error checking

If the search does not return the expected results, the XPaths may no longer match the HTML source of the rightmove pages. The code below saves the full HTML of whatever URL you set as the variable *url* to a text file (*html.txt*) for inspection:

In [11]:
from lxml import html, etree
import requests
url = rent_url
page = requests.get(url)
tree = html.fromstring(page.content)
html_text = etree.tostring(tree)
# tostring returns bytes, so write the file in binary mode
with open("html.txt", "wb") as f:
    f.write(html_text)
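
A quicker first check, before reading through the saved file, is to count how many nodes each XPath matches; a count of zero suggests that expression is the one that needs updating. The sketch below runs the function's XPath expressions against a small hand-written HTML fragment. The fragment is invented for this example and is not real rightmove markup, and the helper name `count_matches` is hypothetical.

```python
from lxml import html

# XPath expressions copied from rightmove_webscrape (rental price variant)
XPATHS = {
    'prices': '//span[@class="propertyCard-priceValue"]/text()',
    'titles': '//div[@class="propertyCard-details"]//a[@class="propertyCard-link"]//h2[@class="propertyCard-title"]/text()',
    'addresses': '//address[@class="propertyCard-address"]/text()',
    'weblinks': '//div[@class="propertyCard-details"]//a[@class="propertyCard-link"]/@href',
}


def count_matches(page_source, xpaths=XPATHS):
    """Return the number of nodes matched by each XPath in the page source."""
    tree = html.fromstring(page_source)
    return {name: len(tree.xpath(xp)) for name, xp in xpaths.items()}


# Invented fragment that mimics the class names the XPaths look for
sample = '''
<div>
  <div class="propertyCard-details">
    <a class="propertyCard-link" href="/property-to-rent/property-1.html">
      <h2 class="propertyCard-title">1 bedroom flat</h2>
    </a>
    <address class="propertyCard-address">Example Road, London, E8</address>
  </div>
  <span class="propertyCard-priceValue">1,500 pcm</span>
</div>
'''

for name, count in count_matches(sample).items():
    print(name, count)
```

To check a live page, pass `requests.get(url).content` to `count_matches` in place of `sample`.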