# Neighbourhood insights for buying a home in Edinburgh

## Table of Contents  
- [Introduction](#introduction)
- [Business problem](#introduction.business_problem)
    - [Static ranking system](#introduction.static_ranking_system)
    - [Dynamic ranking system](#introduction.dynamic_ranking_system)
- [Data](#data)
    - [Overview](#data.overview)
    - [Foursquare venue data](#data.foursquare_venue_data)
    - [Rightmove property sale price data](#data.rightmove_property_sale_price_data)

## Introduction <a name="introduction"/>

### Business problem <a name="introduction.business_problem"/>

A major estate agent in **Edinburgh** would like to provide more information about neighbourhoods to their clients. Local amenities are very important in deciding to buy a home, in addition to the property itself. This information, however, is not provided by the estate agent to the same quality as details about the property itself. **Presenting valuable insights about local amenities to potential buyers could attract more customers,** particularly those new to the city.

The insights about local amenities should be provided to the home buyer in a format that can be directly used in making their decision. Firstly, the information should be quick and easy to understand. Secondly, it should allow intuitive comparison between available properties. Thirdly, it should be objective: grounded in statistics rather than in anyone's opinion.

**This project aims to provide a solution for informing home buyers about the neighbourhoods in Edinburgh.** We will achieve this by generating two ranking systems. Firstly, **a static ranking system** for common preference types, such as favouring nightlife venues over parks or grocery stores over restaurants. Secondly, **a dynamic ranking system** that provides a ranking of neighbourhoods based on the client's personal preferences and purchase price range.

### Static ranking system <a name="introduction.static_ranking_system"/>

The static ranking system will be created using **k-means clustering** of neighbourhoods based on the local amenities and identifying preference categories in the resulting clusters.
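As a rough illustration of the clustering step, the sketch below runs a plain k-means (Lloyd's algorithm) on made-up venue counts and then labels each cluster by its dominant amenity category. The neighbourhood names, categories and counts are hypothetical, the centroids are simply seeded with the first `k` points, and a real analysis would more likely scale the features and use a library implementation.

```python
import math


def kmeans(points, k, n_iter=100):
    """Minimal Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical venue counts per neighbourhood (not real data)
categories = ['nightlife', 'parks', 'grocery']
features = {
    'A': [9, 1, 2],
    'B': [8, 2, 1],
    'C': [1, 7, 3],
    'D': [0, 8, 2],
}
names = list(features)
labels, centroids = kmeans([features[n] for n in names], k=2)

# Name each cluster after the category with the highest mean count
dominant = [categories[c.index(max(c))] for c in centroids]
for name, label in zip(names, labels):
    print(name, '-> cluster', label, '(' + dominant[label] + ')')
```

Here the cluster label attached to a neighbourhood plays the role of a preference category, for example "nightlife-oriented" or "park-oriented".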

### Dynamic ranking system <a name="introduction.dynamic_ranking_system"/>

The dynamic ranking system will rank neighbourhoods by how closely each one matches the client's ideal neighbourhood, as described by their preferences. User input will be quantified relative to the **distribution of each feature across all neighbourhoods**.
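One possible way to do this, sketched below, is to express each neighbourhood's feature value as a percentile within the distribution of that feature, let the client state the percentile they would like for each feature, and rank neighbourhoods by the mean gap between the two. The feature names, values and the scoring rule are assumptions made for illustration only, not a fixed design.

```python
def percentile_ranks(values):
    """Fraction of neighbourhoods whose value is less than or equal to each value."""
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]


def rank_neighbourhoods(features, preferences):
    """Return neighbourhood names ordered from best to worst match."""
    names = list(features)
    # Percentile of every neighbourhood within each feature's distribution
    pct = {}
    for feat in preferences:
        ranks = percentile_ranks([features[n][feat] for n in names])
        pct[feat] = dict(zip(names, ranks))

    # Mismatch: mean absolute gap between desired and actual percentile
    def mismatch(name):
        gaps = [abs(pct[f][name] - want) for f, want in preferences.items()]
        return sum(gaps) / len(gaps)

    return sorted(names, key=mismatch)


# Hypothetical neighbourhood features (not real data)
features = {
    'A': {'parks': 1, 'mean_price': 300000},
    'B': {'parks': 5, 'mean_price': 220000},
    'C': {'parks': 3, 'mean_price': 250000},
}
# The client wants as many parks as possible (1.0) and the lowest prices (0.0)
preferences = {'parks': 1.0, 'mean_price': 0.0}

ranking = rank_neighbourhoods(features, preferences)
print(ranking)
```

Working in percentiles keeps features with very different units, such as venue counts and prices in pounds, on a common scale.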

## Data <a name="data"/>

### Overview <a name="data.overview"/>

For this project we will need data on venues and amenities across Edinburgh, as well as property sale price data. We will acquire the venue and amenity data using the Foursquare API. The sale price data will be acquired by web scraping the Rightmove website.

The neighbourhoods will be overlapping circular grid fields positioned uniformly across Edinburgh. Each venue will be assigned to these artificial neighbourhoods, and a mean property sale price will be computed for each of them. The neighbourhoods, together with the resulting features, will be the subject of the statistical and machine learning methods.
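A minimal sketch of the assignment step is given below, assuming that a point belongs to every neighbourhood whose centre lies within a fixed radius, measured with the haversine great-circle distance. The centre coordinates, the 500 m radius and the sale prices are hypothetical values chosen for the example.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def neighbourhoods_containing(lat, lon, centres, radius_m):
    """Names of all circular neighbourhoods that contain the point."""
    return [
        name for name, (clat, clon) in centres.items()
        if haversine_m(lat, lon, clat, clon) <= radius_m
    ]


# Two hypothetical overlapping neighbourhoods, roughly 620 m apart
centres = {'n0': (55.950, -3.190), 'n1': (55.950, -3.180)}
radius_m = 500

# Hypothetical sold properties: (latitude, longitude, price in pounds)
properties = [(55.950, -3.185, 200000), (55.950, -3.194, 300000)]

# Collect prices per neighbourhood; a property in the overlap counts in both
prices = {name: [] for name in centres}
for lat, lon, price in properties:
    for name in neighbourhoods_containing(lat, lon, centres, radius_m):
        prices[name].append(price)

mean_price = {name: sum(p) / len(p) for name, p in prices.items() if p}
print(mean_price)
```

Because the circles overlap, a venue or property near a boundary contributes to more than one neighbourhood, which smooths the features between adjacent grid fields.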

### Foursquare venue data <a name="data.foursquare_venue_data"/>

Foursquare only returns 50 venues per request. Therefore, to get detailed data on venues and amenities in each neighbourhood, we need to use multiple requests.

We will compute uniformly distributed positions across Edinburgh to serve as the centre points of our Foursquare API calls, which will collect the locations of all venues in the categories of interest (e.g. cafe, Turkish restaurant, night club, grocery store, gym).
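The centre points could be generated along the lines of the sketch below, which lays a square grid over a bounding box and converts the spacing from metres to degrees. The bounding box is only a rough, assumed outline of the city, the 1000 m spacing is arbitrary, and the metres-per-degree conversion is an approximation that is adequate at this scale.

```python
import math

M_PER_DEG_LAT = 111195  # approximate metres per degree of latitude


def grid_centres(lat_min, lat_max, lon_min, lon_max, spacing_m):
    """Square grid of (latitude, longitude) points covering a bounding box."""
    # A degree of longitude shrinks with the cosine of the latitude
    mid_lat = (lat_min + lat_max) / 2
    m_per_deg_lon = M_PER_DEG_LAT * math.cos(math.radians(mid_lat))
    dlat = spacing_m / M_PER_DEG_LAT
    dlon = spacing_m / m_per_deg_lon
    n_lat = int((lat_max - lat_min) / dlat) + 1
    n_lon = int((lon_max - lon_min) / dlon) + 1
    return [
        (lat_min + i * dlat, lon_min + j * dlon)
        for i in range(n_lat)
        for j in range(n_lon)
    ]


# Assumed bounding box, for illustration only
centres = grid_centres(55.90, 55.99, -3.30, -3.10, spacing_m=1000)
print('Number of centre points:', len(centres))
print('First centre point:', centres[0])
```

Choosing a search radius larger than half the grid spacing makes neighbouring circles overlap, so no part of the bounding box is left uncovered along the grid lines.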

We will arrange these venues into a `pandas.DataFrame` containing the venue category, rating, latitude and longitude.

The interface with the Foursquare API will re-use the code from [my Assessment 3 submission](https://nbviewer.jupyter.org/github/sanntann/Coursera_Capstone/blob/master/Assessment_3.ipynb).

### Rightmove property sale price data <a name="data.rightmove_property_sale_price_data"/>

Rightmove is a major UK property website. They provide a list of sale price data going back several years. 

For some of the properties on the list there is a link to a post on the Rightmove website, along with information on the type of the property, including the number of bedrooms. As the latter is a major determinant of sale price, we will only use data on properties where this information is available. This will allow the mean property prices in each neighbourhood to be estimated for the client's specific preferences.

Rightmove provides the address for each property, including the postcode. We will use an Edinburgh postcode latitude and longitude dataset to approximate the latitude and longitude of each property. This will allow assigning each property to one of the artificial neighbourhoods.

In [1]:
from urllib.request import urlretrieve
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import sys
import os

Download and load the Edinburgh postcode table, which contains latitude and longitude

In [2]:
if not os.path.isfile('EdinburghPostcodes.csv'):
    # If the file is not available on disk, download it
    urlretrieve('https://www.doogal.co.uk/AdministrativeAreasCSV.ashx?district=S12000036', 'EdinburghPostcodes.csv')
df_postcodes = pd.read_csv('EdinburghPostcodes.csv', usecols=['Postcode', 'Latitude', 'Longitude'])
df_postcodes.columns = map(str.lower, df_postcodes.columns)

Define functions that extract property data from the BeautifulSoup of an HTML page

In [3]:
def get_property_type_from_sold_property_page(http_address):
    soup = BeautifulSoup(get(http_address).text, 'html.parser')
    return soup.find(id='propertydetails').find_all('h2')[1].text


def get_property_data_from_soup(soup):
    # Extract data from the http soup
    date = []
    address = []
    bedrooms = []
    price = []
    property_type = []
    for soup_property in soup.find_all(class_='soldDetails'):
        # Skip properties for which there is no link to post on RightMove website
        if not soup_property.find(class_='soldAddress').has_attr('href'):
            continue
        else:
            property_http_address = soup_property.find(class_='soldAddress')['href']
        # Skip properties for which there is no number of bedrooms information
        if len(soup_property.find(class_='noBed').text) == 0:
            continue
        # Collect data for the property
        date.append(soup_property.find(class_='soldDate').text)
        address.append(soup_property.find(class_='soldAddress').text)
        bedrooms.append(soup_property.find(class_='noBed').text)
        price.append(soup_property.find(class_='soldPrice').text)
        # Attempt to collect property type
        try:
            property_type.append(get_property_type_from_sold_property_page(property_http_address))
        except Exception:
            # Exception does not cover KeyboardInterrupt or SystemExit, so those still propagate
            property_type.append('')
            print('Error when collecting property type.')
            print(sys.exc_info()[0])
    # Format data into pandas.DataFrame
    df = pd.DataFrame({'date': date, 
                       'address': address, 
                       'bedrooms': bedrooms, 
                       'property_type': property_type, 
                       'price': price}, 
                      columns=['date', 'address', 'bedrooms', 'property_type', 'price'])
    # Sort the DataFrame by date as well as address
    df.sort_values(['date', 'address'], ascending=[False, True], inplace=True)
    
    return df

Create a class to manage web scraping rate

In [4]:
from time import time, sleep

class RateManager(object):
    
    def __init__(self, min_interval, max_interval):
        """
        min_interval - float - minimum delay between calls (in seconds)
        max_interval - float - maximum delay between calls before notification (in seconds)
        """
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.checkpoint = None
        
    def continue_when_ready(self, sleep_interval=0.1, print_interval=False):
        # This is in case of first call to continue_when_ready
        if self.checkpoint is None:
            self.checkpoint = time()
            return None
        # Check if max_interval has been surpassed
        if time() - self.checkpoint > self.max_interval:
            if print_interval:
                print('Interval duration: {}'.format(time() - self.checkpoint))
            self.checkpoint = time()
            return 'timeout'
        # If not over max_interval, wait until min_interval is reached
        if print_interval:
            print('Interval duration: {}'.format(time() - self.checkpoint))
        while time() - self.checkpoint < self.min_interval:
            sleep(sleep_interval)
        self.checkpoint = time()
        return 'intime'


Acquire residential property sale prices from Rightmove.

There are likely duplicates in the resulting DataFrame. These will be dealt with later.

In [5]:
if os.path.isfile('EdinburghPropertiesRaw.p'):
    # If the script has already been run, load the result from disk
    df_property = pd.read_pickle('EdinburghPropertiesRaw.p')
else:
    # List the http addresses for different areas of interest in Edinburgh
    http_addresses = {
        'Stockbridge': 'https://www.rightmove.co.uk/house-prices/detail.html?country=scotland&locationIdentifier=REGION%5E66977&searchLocation=Stockbridge&propertyType=3&year=2&referrer=listChangeCriteria', 
        'NewTown': 'https://www.rightmove.co.uk/house-prices/detail.html?country=scotland&locationIdentifier=REGION%5E79909&searchLocation=New+Town&propertyType=3&year=2&referrer=listChangeCriteria', 
        'Morningside': 'https://www.rightmove.co.uk/house-prices/detail.html?country=scotland&locationIdentifier=REGION%5E86881&searchLocation=Morningside&propertyType=3&year=2&referrer=listChangeCriteria', 
        'EdinburghNorth': 'https://www.rightmove.co.uk/house-prices/detail.html?country=scotland&locationIdentifier=REGION%5E93604&searchLocation=Edinburgh+North&propertyType=3&year=2&referrer=listChangeCriteria', 
        'EdinburghEast': 'https://www.rightmove.co.uk/house-prices/detail.html?country=scotland&locationIdentifier=REGION%5E93601&searchLocation=Edinburgh+East&propertyType=3&year=2&referrer=listChangeCriteria', 
        'EdinburghWest': 'https://www.rightmove.co.uk/house-prices/detail.html?country=scotland&locationIdentifier=REGION%5E93613&searchLocation=Edinburgh+West&propertyType=3&year=2&referrer=listChangeCriteria', 
        'EdinburghSouth': 'https://www.rightmove.co.uk/house-prices/detail.html?country=scotland&locationIdentifier=REGION%5E93610&searchLocation=Edinburgh+South&propertyType=3&year=2&referrer=listChangeCriteria', 
        'Edinburgh': 'https://www.rightmove.co.uk/house-prices/detail.html?country=scotland&locationIdentifier=REGION%5E475&searchLocation=Edinburgh&propertyType=3&year=2&referrer=listChangeCriteria'
        }

    # Specify in-line function for suffix that specifies property list page index
    http_index_suffix = lambda index: '&index={}'.format(index)

    # Specify indices to work through
    indices = list(range(0, 1025, 25))

    # Create empty pandas.DataFrame to append new data to
    df_property = pd.DataFrame({'date': [], 
                                'address': [], 
                                'bedrooms': [], 
                                'property_type': [], 
                                'price': []}, 
                               columns=['date', 'address', 'bedrooms', 'property_type', 'price'])

    # Use RateManager to avoid overwhelming the website
    rate_manager = RateManager(min_interval=5, max_interval=20)
    max_timeouts = 10
    timeout_count = 0

    # Loop through all http addresses of different areas and all possible page indices
    df_prev_property_list = pd.DataFrame({})
    for http_address in [http_addresses[x] for x in http_addresses]:
        for index in indices:
            full_http_address = http_address + http_index_suffix(index)
            print('Visiting webpage:\n' + full_http_address)
            # Make sure webpage is not visited too often and that it is not blocking
            if rate_manager.continue_when_ready(print_interval=True) == 'timeout':
                timeout_count += 1
                if timeout_count > max_timeouts:
                    raise RuntimeError('Too many timeouts.')
            # Get website html as BeautifulSoup
            soup = BeautifulSoup(get(full_http_address).text, 'html.parser')
            # Check if there is a property price data list on this page
            if len(soup.find_all(class_='soldDetails')) == 0:
                print('No properties listed on this page. Stopping index iteration.')
                break
            df_next_property_list = get_property_data_from_soup(soup)
            # If the new DataFrame is equal to the previous one, stop checking further indices
            if df_prev_property_list.equals(df_next_property_list):
                print('Property list repeated. Stopping index iteration.')
                break
            else:
                # Append the new property list to main property list and store to check against next one
                print('Got {} properties.'.format(df_next_property_list.shape[0]))
                # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
                df_property = pd.concat([df_property, df_next_property_list])
                df_prev_property_list = df_next_property_list
    # Save collected property data to disk
    df_property.to_pickle('EdinburghPropertiesRaw.p')
# Report number of properties in raw web scraping result
print('Collected total of {} properties from RightMove.'.format(df_property.shape[0]))

Collected total of 2952 properties from RightMove.


Format `df_property` DataFrame and keep only the essential information

In [6]:
# Define function for extracting number of bedrooms and
# general property type from raw property type data
def get_bedrooms_and_type(raw_property_type):
    pos = raw_property_type.find(' bedroom ')
    raw_property_type = raw_property_type.replace(' bedroom ', ' ')
    bedrooms = raw_property_type[:pos].strip()
    property_type = raw_property_type[pos:].strip()
    
    return bedrooms, property_type

# Reindex DataFrame
df_property.reset_index(drop=True, inplace=True)

# Remove duplicates
df_property.drop_duplicates(keep='first', inplace=True)
# Reindex DataFrame
df_property.reset_index(drop=True, inplace=True)

# Find property_type 'Studio flat' and set it to 1 bedroom flat
idx_studio_flat = df_property['property_type'] == 'Studio flat'
df_property.loc[idx_studio_flat, 'property_type'] = '1 bedroom flat'

# Remove properties for which property_type is not in correct format
indices = [i for i, x in enumerate(df_property['property_type']) if not (' bedroom ' in x)]
df_property.drop(indices, axis=0, inplace=True)
# Reindex DataFrame
df_property.reset_index(drop=True, inplace=True)

# Extract number of bedrooms and general property type from property_type values
bedrooms, property_type = zip(*[get_bedrooms_and_type(x) for x in df_property['property_type']])
df_property['bedrooms'] = list(map(int, bedrooms))
df_property['property_type'] = property_type

# Rename all flat-like property_types to flats
func = lambda x: 'flat' if 'flat' in x or 'apartment' in x or 'penthouse' in x else x
df_property['property_type'] = df_property['property_type'].apply(func)
# Rename all house-like property_types to house
func = lambda x: 'house' if 'house' in x or 'villa' in x or 'duplex' in x or 'bungalow' in x or 'cottage' in x else x
df_property['property_type'] = df_property['property_type'].apply(func)

# Remove all other property_types than flat or house
idx = (df_property['property_type'] != 'flat') & (df_property['property_type'] != 'house')
df_property.drop(df_property[idx].index, axis=0, inplace=True)
# Reindex DataFrame
df_property.reset_index(drop=True, inplace=True)

# Only keep postcode from address
func = lambda x: ' '.join(x.split()[-2:])
df_property['address'] = df_property['address'].apply(func)
# Rename address column to postcode
df_property.rename(columns={'address': 'postcode'}, inplace=True)

# Print remaining number of properties
print('Property sale price samples remaining after filtering the data: {}'.format(df_property.shape[0]))

Property sale price samples remaining after filtering the data: 1914


Add longitude and latitude data into `df_property` based on postcode

In [7]:
# Merge on latitude and longitude values
df_property = df_property.merge(df_postcodes, how='left', on='postcode')
# Drop rows where latitude and longitude were not available for postcode
df_property.dropna(inplace=True)
# Reindex DataFrame
df_property.reset_index(drop=True, inplace=True)
# # Drop postcode column
# df_property.drop('postcode', axis='columns', inplace=True)
print('Property sale price samples remaining that have \n' +
      'latitude and longitude values: {}'.format(df_property.shape[0]))
df_property.head()

Property sale price samples remaining that have 
latitude and longitude values: 1906


|   | date | postcode | bedrooms | property_type | price | latitude | longitude |
|---|---|---|---|---|---|---|---|
| 0 | 31 Dec 2018 | EH9 1PN | 2 | flat | £220,000 | 55.935498 | -3.179898 |
| 1 | 28 Dec 2018 | EH8 7HW | 2 | flat | £251,500 | 55.95252 | -3.146224 |
| 2 | 28 Dec 2018 | EH30 9LE | 5 | house | £512,760 | 55.988611 | -3.38542 |
| 3 | 28 Dec 2018 | EH8 7EB | 3 | house | £268,379 | 55.954279 | -3.154108 |
| 4 | 27 Dec 2018 | EH5 1ER | 2 | flat | £255,000 | 55.980068 | -3.215696 |
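
The `price` column is still stored as text such as `£220,000`, so it has to be converted to numbers before any mean price can be computed. The sketch below shows one way this might be done, using the five rows displayed above; `parse_price` is a hypothetical helper and not part of the notebook.

```python
from statistics import mean


def parse_price(text):
    """Convert a price string such as '£220,000' to an integer number of pounds."""
    return int(text.replace('£', '').replace(',', '').strip())


# (bedrooms, price) pairs copied from the five rows shown above
rows = [
    (2, '£220,000'),
    (2, '£251,500'),
    (5, '£512,760'),
    (3, '£268,379'),
    (2, '£255,000'),
]

# Group the numeric prices by number of bedrooms
prices_by_bedrooms = {}
for bedrooms, price in rows:
    prices_by_bedrooms.setdefault(bedrooms, []).append(parse_price(price))

# Mean sale price for each bedroom count
mean_by_bedrooms = {b: mean(p) for b, p in prices_by_bedrooms.items()}
print(mean_by_bedrooms)
```

Grouping by the number of bedrooms in this way is what allows the mean price of a neighbourhood to be reported for the property size a client is interested in.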
