#### Imports

In [25]:
# imports
from lxml import html, etree
import requests
import pandas as pd
import datetime as dt

## Function

The function extracts data from the listings on the http://www.rightmove.co.uk/ property website. Pass it the long url from the first results page of a search and it will extract the price, property type, address details, and url for each property listing. It also extracts the postcode stem from the address details (e.g. 'N1') and the number of bedrooms from the property type, storing each in a separate column. If the search returns more than one page of results, the function automatically cycles through all pages collecting the data, so it can take a while to return when the search criteria match thousands of results.
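
The paging described above comes down to a little arithmetic: with 24 listings per page, the page count is the result count divided by 24 and rounded up, and each page is requested at an index offset of 0, 24, 48 and so on. The snippet below is a minimal sketch of that calculation only; `page_offsets` and `RESULTS_PER_PAGE` are hypothetical names used for illustration and are not part of the function itself.

```python
RESULTS_PER_PAGE = 24  # assumed page size, as in numberOfPropertiesPerPage=24


def page_offsets(result_count, per_page=RESULTS_PER_PAGE):
    """Return the index offset of every results page (0, 24, 48, ...)."""
    # Ceiling division using integers only: -(-a // b) rounds a / b up.
    page_count = -(-result_count // per_page)
    return [page * per_page for page in range(page_count)]


offsets = page_offsets(4335)
print(len(offsets))  # 181
print(offsets[:3])   # [0, 24, 48]
print(offsets[-1])   # 4320
```

A search with 4335 results therefore needs 181 requests, which is why large searches take a while to return.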

In [26]:
def rightmove_webscrape(rightmove_url):
    
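# Get the start & end of the web url around the index value
    start,end = rightmove_url.split('&index=')
    url_start = start+'&index='
# Drop the old index value (any number of digits) up to the next parameter
    url_end = end[end.find('&'):] if '&' in end else ''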
    
# Initialise variables
    price_pcm=[]
    titles=[]
    addresses=[]
    weblinks=[]
    page_counts=[]
    
# Initialise pandas DataFrame for results.
    df=pd.DataFrame(columns=['price','type','address','url'])

# Get the total number of results from the search
    page = requests.get(rightmove_url)
    tree = html.fromstring(page.content)
    xp_result_count = '//span[@class="searchHeader-resultCount"]/text()'
    result_count = int(tree.xpath(xp_result_count)[0].replace(",", ""))
    
# Turn the total number of search results into number of iterations for the loop
# (integer division, rounded up, so the count is a valid range() argument).
    loop_count = result_count // 24
    if result_count % 24 > 0:
        loop_count = loop_count + 1
        
# Set the Xpath variables for the loop
    xp_prices = '//span[@class="propertyCard-priceValue"]/text()'
    xp_titles = '//div[@class="propertyCard-details"]//a[@class="propertyCard-link"]//h2[@class="propertyCard-title"]/text()'
    xp_addresses = '//address[@class="propertyCard-address"]/text()'
    xp_weblinks = '//div[@class="propertyCard-details"]//a[@class="propertyCard-link"]/@href'

# Start the loop through the search result web pages
    for pages in range(0,loop_count,1):
        rightmove_url = url_start+str(pages*24)+url_end
        page = requests.get(rightmove_url)
        tree = html.fromstring(page.content)
        
# Reset variables
        price_pcm=[]
        titles=[]
        addresses=[]
        weblinks=[]

# Create data lists from Xpaths
        for val in tree.xpath(xp_prices):
            price_pcm.append(val)
        for val in tree.xpath(xp_titles):
            titles.append(val)
        for val in tree.xpath(xp_addresses):
            addresses.append(val)
        for val in tree.xpath(xp_weblinks):
            weblinks.append(val)

# Convert data to temporary DataFrame
        data = [price_pcm, titles, addresses, weblinks]
        temp_df= pd.DataFrame(data)
        temp_df = temp_df.transpose()
        temp_df.columns=['price','type','address','url']
        
# Drop empty rows from DataFrame which come from placeholders in html file.
        temp_df = temp_df[temp_df.url != '/property-for-sale/property-0.html']
    
# Join temporary DataFrame to main results DataFrame.
        frames = [df,temp_df]
        df = pd.concat(frames)

# Renumber results DataFrame index to remove duplicate index values.
    df = df.reset_index(drop=True)

# Convert price column to numeric values for analysis.
    df['price'] = df['price'].replace(regex=True, to_replace=r'\D', value=r'')
    df['price'] = pd.to_numeric(df['price'])

# Extract postcode areas to separate column.
    df['postcode'] = df['address'].str.extract(r'\b([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?)\b',expand=True)
    
# Extract number of bedrooms from 'type' column.
    df['number_bedrooms'] = df.type.str.extract(r'\b([\d][\d]?)\b',expand=True)
    df.loc[df['type'].str.contains('studio',case=False),'number_bedrooms']=0

# Add in search_date column with date website was queried (i.e. today's date).
    now = dt.datetime.today().strftime("%d/%m/%Y")
    df['search_date'] = now

# Optional line to export the results to CSV if you wish to inspect them in an alternative program.
#     df.to_csv('rightmove_df.csv',encoding='utf-8')

    print('The search returned a total of ', len(df), ' results.')
    return df
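
The two regular expressions at the end of the function can be tried on plain strings to see what they capture. The following is a minimal sketch using the standard `re` module rather than pandas, with the same patterns; the helper names `postcode_stem` and `bedroom_count` are hypothetical and exist only for this illustration.

```python
import re

# Same patterns as in the function, applied to single strings.
POSTCODE_STEM = re.compile(r'\b([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?)\b')
BEDROOMS = re.compile(r'\b(\d\d?)\b')


def postcode_stem(address):
    """Return the first postcode-like stem in the address, or None."""
    match = POSTCODE_STEM.search(address)
    return match.group(1) if match else None


def bedroom_count(title):
    """Return the bedroom count from a listing title; studios count as 0."""
    if 'studio' in title.lower():
        return 0
    match = BEDROOMS.search(title)
    return int(match.group(1)) if match else None


print(postcode_stem('The Courthouse Horseferry Road London SW1P'))  # SW1P
print(postcode_stem('New Kings Road, London'))                      # None
print(bedroom_count('2 bedroom flat'))                              # 2
print(bedroom_count('Studio flat'))                                 # 0
```

Addresses that do not include a postcode stem yield no match, which is why the postcode column can be empty for some listings.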

## Using the function

To use the function, first go to http://www.rightmove.co.uk/ and run a search with your desired criteria (e.g. 1 bedroom flats to rent in London Fields added to the website in the last 7 days). When the first page of results comes up, copy the long url from the browser's address bar and assign it to the *rightmove_url* variable. The url must contain the `&index=` parameter, which the function uses to step through the result pages. In this example the search is for all residential properties to rent in London added to the website today:

In [33]:
rightmove_url = 'http://www.rightmove.co.uk/property-to-rent/find.html?locationIdentifier=REGION%5E87490&numberOfPropertiesPerPage=24&radius=0.0&sortType=6&index=0&propertyTypes=detached%2Csemi-detached%2Cterraced%2Cflat%2Cbungalow&maxDaysSinceAdded=1&includeLetAgreed=false&viewType=LIST&currencyCode=GBP'
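
The function relies on the `index` parameter in this url to move from page to page, and it handles that with string splitting. As an aside, the same parameter can be set with the standard library's url tools, which do not depend on where `index` sits in the query string. This is a sketch of an alternative, assuming Python 3; `with_index` is a hypothetical helper and the url below is a shortened stand-in for a real search url.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_index(url, index):
    """Return a copy of url whose 'index' query parameter is set to index."""
    parts = urlsplit(url)
    # Keep every parameter except any existing 'index', then append the new one.
    params = [(key, value)
              for key, value in parse_qsl(parts.query, keep_blank_values=True)
              if key != 'index']
    params.append(('index', str(index)))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Shortened, illustrative url (not a complete search url).
example_url = ('http://www.rightmove.co.uk/property-to-rent/find.html'
               '?locationIdentifier=REGION%5E87490&index=0&viewType=LIST')
third_page = with_index(example_url, 48)
print(third_page)
```

The other parameters are decoded and re-encoded along the way, so their values are preserved while `index` moves to the end of the query string.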

Then simply run the function on the url variable to create the dataframe. Here we'll assign the results to the *df* variable:

In [35]:
df = rightmove_webscrape(rightmove_url)

The search returned a total of  4335  results.


We can look at the first few rows of data to check that the function worked as expected:

In [37]:
df.head()

| | price | type | address | url | postcode | number_bedrooms | search_date |
|---|---|---|---|---|---|---|---|
| 0 | 1395 | 2 bedroom flat | Marmora Road | /property-to-rent/property-55346728.html | | 2 | 24/08/2016 |
| 1 | 7800 | 4 bedroom terraced house | New Kings Road, London | /property-to-rent/property-60929015.html | | 4 | 24/08/2016 |
| 2 | 2145 | 2 bedroom flat | Queens Gardens, London | /property-to-rent/property-60006455.html | | 2 | 24/08/2016 |
| 3 | 1842 | 1 bedroom house | Sloane Avenue Mansions, Sloane Avenue, Chelsea... | /property-to-rent/property-43537104.html | | 1 | 24/08/2016 |
| 4 | 3445 | 2 bedroom apartment | The Courthouse Horseferry Road London SW1P | /property-to-rent/property-59310035.html | SW1P | 2 | 24/08/2016 |


And export the full results to *csv* for analysis:

In [38]:
df.to_csv('search_results.csv',encoding='utf-8',index=False)
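
As one example of the kind of analysis the exported data supports, the sketch below computes the median monthly price per bedroom count. To stay self-contained it uses only the five rows shown in the `df.head()` output; a real analysis would presumably start from `pd.read_csv('search_results.csv')`. The variable names `sample` and `median_by_bedrooms` are illustrative.

```python
import pandas as pd

# Prices and bedroom counts copied from the five rows shown by df.head().
sample = pd.DataFrame({
    'price': [1395, 7800, 2145, 1842, 3445],
    'number_bedrooms': [2, 4, 2, 1, 2],
})

# Median monthly price for each bedroom count.
median_by_bedrooms = sample.groupby('number_bedrooms')['price'].median()
print(median_by_bedrooms)
```

With so few rows the figures mean little in themselves; the same `groupby` call applies unchanged to the full results.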

## Optional html export

If the search does not return the expected results, the Xpaths may have changed and need updating; alternatively, you may wish to add further Xpaths to collect more data. The code below exports the full html of whichever url is set as the *rightmove_url* variable to a text file.

In [39]:
page = requests.get(rightmove_url)
tree = html.fromstring(page.content)
# etree.tostring returns bytes, so the file is opened in binary mode.
html_text = etree.tostring(tree)
with open("html.txt", "wb") as html_file:
    html_file.write(html_text)
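
Once the html is to hand, a quick way to see whether an Xpath still matches is to run it against the markup and count the hits. The sketch below does this on a small hand-written snippet; the snippet is a hypothetical, heavily simplified stand-in for one property card, and the real page markup may well differ.

```python
from lxml import html

# Hypothetical, simplified markup for a single property card.
snippet = '''
<div class="propertyCard-details">
  <a class="propertyCard-link" href="/property-to-rent/property-1.html">
    <h2 class="propertyCard-title">2 bedroom flat</h2>
  </a>
  <span class="propertyCard-priceValue">1,395 pcm</span>
</div>
'''

tree = html.fromstring(snippet)
xpaths = {
    'price': '//span[@class="propertyCard-priceValue"]/text()',
    'title': '//h2[@class="propertyCard-title"]/text()',
    'url': '//a[@class="propertyCard-link"]/@href',
}

# An empty list for a name suggests that Xpath no longer matches the markup.
matches = {name: tree.xpath(xp) for name, xp in xpaths.items()}
for name, found in matches.items():
    print(name, len(found), found[:1])
```

Running the same loop against a tree built from the real page gives a quick indication of which Xpaths need attention.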