# MEI Data Science Taught Course

# Lesson 1: House price data

The data for this lesson is from HM Land Registry: https://www.gov.uk/government/organisations/land-registry/about. It records house sales in England and Wales (the area HM Land Registry covers) between 1995 and 2017: over 22 million transactions.

This notebook is an example of working with a very big data set.

The first block of code imports some libraries and gets the path to the data.

In [None]:
# import pandas
import pandas as pd

# import seaborn (a plotting library built on matplotlib)
import seaborn as sns

# display the data files linked to this notebook - input data files are available in the read-only "/kaggle/input/" directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

The next block of code imports the data.

*Note that this will take longer than usual because of the size of the data set.*

In [None]:
# import the data
house_data = pd.read_csv('/kaggle/input/uk-housing-prices-paid/price_paid_records.csv')

# display the data set to check it has imported correctly
house_data

In [None]:
# display the info for the data set
house_data.info()
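
If memory or load time becomes a problem, one option is to read only the fields that are needed. The following is a minimal sketch rather than part of the lesson: it uses a tiny made-up CSV held in memory in place of the real file, and the choice of columns is an assumption about which fields the later analysis uses. `usecols` and `dtype` are standard `pd.read_csv` parameters.

```python
import io
import pandas as pd

# tiny stand-in for price_paid_records.csv (made-up rows, column names as used in this notebook)
csv_text = io.StringIO(
    "Price,Date of Transfer,Property Type,Town/City,PPDCategory Type\n"
    "50000,1995-03-01 00:00,T,LEEDS,A\n"
    "120000,2005-07-15 00:00,S,LEEDS,A\n"
    "300000,2016-11-30 00:00,O,YORK,B\n"
)

# read only the wanted columns and store Property Type as a category to save memory
wanted = ['Price', 'Date of Transfer', 'Property Type', 'Town/City']
small_data = pd.read_csv(
    csv_text,
    usecols=wanted,
    dtype={'Property Type': 'category'},
)

# report memory use, including the contents of text columns
small_data.info(memory_usage='deep')
```

With the real file, the same arguments would be passed alongside the file path used above; how much time or memory this saves has not been measured here.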

## Adding a derived field
The `Date of Transfer` field gives the date of sale in the yyyy-mm-dd hh:mm format. The first 4 characters of this are the year of sale, which can be extracted and written to a new field: `Year`.

In [None]:
# create a new field Year from the first 4 characters of the Date of Transfer field
house_data['Year']=house_data['Date of Transfer'].str[:4]

# display the data to check
house_data
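
Note that slicing the first 4 characters gives `Year` as text (for example `'1995'`), not a number. An alternative, sketched here on a few made-up dates, is to parse the field with `pd.to_datetime` and take the year as an integer. The column name `Year (int)` is hypothetical and used only for this illustration; parsing every date in the full data set is likely to be slower than slicing.

```python
import pandas as pd

# made-up dates in the same format as the Date of Transfer field
sample = pd.DataFrame(
    {'Date of Transfer': ['1995-03-01 00:00', '2005-07-15 00:00', '2016-11-30 00:00']}
)

# approach used above: first 4 characters, giving text values
sample['Year'] = sample['Date of Transfer'].str[:4]

# alternative: parse the dates, then take the year as an integer
parsed = pd.to_datetime(sample['Date of Transfer'], format='%Y-%m-%d %H:%M')
sample['Year (int)'] = parsed.dt.year

# display both versions side by side
sample
```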

# Exploratory Data Analysis

In this section you will explore how the average house price for a specific city (Leeds) has changed over the time period of this data set by finding the mean price for each year and plotting a time series for different property types.

## Extracting rows
A new data set with just the sales from Leeds can be constructed.

In [None]:
# create a new data set called leeds_data where the Town/City value is 'LEEDS'
# adding .copy() tells pandas to make an independent copy - this is not essential in this example but is good practice in general when "slicing" a data set, as it avoids warnings if the new data set is modified later
leeds_data = house_data[house_data['Town/City'] == 'LEEDS'].copy()

# display the data set to show that it has been created correctly
leeds_data

## Exploring the mean and standard deviation for the price grouped by year

In [None]:
leeds_data.groupby('Year').agg({'Price': ['count','mean', 'std']})

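This appears to be broadly consistent with https://en.wikipedia.org/wiki/Affordability_of_housing_in_the_United_Kingdom, which suggests that London median prices went from £80,000 to £300,000 in a similar time frame. Note that those figures are medians for London, whereas the table above shows means for Leeds.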
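
To compare like with like, `'median'` can be added to the list of aggregations. The following is a minimal sketch on made-up prices, not real Leeds data; it also shows how a single very expensive sale pulls the mean well above the median.

```python
import pandas as pd

# made-up prices: one unusually expensive sale in each year
toy = pd.DataFrame({
    'Year': ['1995', '1995', '1995', '2016', '2016', '2016'],
    'Price': [40000, 50000, 150000, 150000, 180000, 600000],
})

# same aggregation as above, with the median added
summary = toy.groupby('Year').agg({'Price': ['count', 'mean', 'median', 'std']})

# display the summary table
summary
```

In this toy table the 1995 mean is 80,000 while the median is 50,000. The same change to the `agg` call should work on `leeds_data`.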

### Exploring the mean price for different property types

Generate a new table with the mean for each property type for each year.

In [None]:
# create a new data set with the mean price for each year for each property type
leeds_year_type_data=leeds_data.groupby(['Property Type','Year'])['Price'].mean().reset_index()

# display the data set
leeds_year_type_data

In [None]:
sns.relplot(kind='line', data=leeds_year_type_data, x='Year', y='Price', hue='Property Type', aspect=2);

The data for *Other* sales appears very different from the data for detached, semi-detached and terraced houses and flats. You can explore whether it is appropriate to disregard these records.

In [None]:
# display the data for 'Other' sales
leeds_data[leeds_data['Property Type']=='O']

There are 1,015 records (out of 252,680 sales). These all appear to have the `PPDCategory Type` set as `B`. A simple query will check whether there are any of type `A`.

In [None]:
# check if there are any sales of 'Other' properties of 'PPD type' A
leeds_data[(leeds_data['Property Type']=='O') & (leeds_data['PPDCategory Type']=='A')]

All the data for 'Other' has `PPDCategory Type` B: Additional Price Paid entry including transfers under a power of sale/repossessions, buy-to-lets (where they can be identified by a Mortgage) and transfers to non-private individuals. In the context of house sales it is appropriate to disregard these data, as they are not representative of conventional transactions.
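
Another way to run this kind of check is to count every combination of the two fields at once with `pd.crosstab`. The following is a minimal sketch on made-up rows; the counts are illustrative and are not taken from the Leeds data.

```python
import pandas as pd

# made-up rows using the same codes as the data set
toy = pd.DataFrame({
    'Property Type': ['D', 'S', 'T', 'F', 'O', 'O', 'T'],
    'PPDCategory Type': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
})

# count the rows for each combination of property type and PPD category
counts = pd.crosstab(toy['Property Type'], toy['PPDCategory Type'])

# display the table of counts
counts
```

A zero in the `A` column for the `O` row would confirm that no 'Other' sales are of type A. In this toy table the `T` row has counts in both columns, which is a reminder that type B can also include other property types.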

Make a new data set of just the property sales of type A: Standard Price Paid entry, which includes single residential properties sold for full market value.

In [None]:
# drop the values that have type 'B'
leeds_data2=leeds_data.drop(leeds_data[leeds_data['PPDCategory Type']=='B'].index)

# display the data to check they have been dropped
leeds_data2
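
An equivalent way to do this is to keep the rows of type `A` with a boolean mask instead of dropping the rows of type `B`. This minimal sketch uses made-up rows and assumes, as the text above suggests, that `A` and `B` are the only values in the field; the names `kept_by_drop` and `kept_by_mask` are used only for this illustration.

```python
import pandas as pd

# made-up rows with the two category codes
toy = pd.DataFrame({
    'Price': [100000, 250000, 900000],
    'PPDCategory Type': ['A', 'B', 'A'],
})

# approach used above: drop the rows whose index matches the type 'B' rows
kept_by_drop = toy.drop(toy[toy['PPDCategory Type'] == 'B'].index)

# alternative: keep the type 'A' rows directly
kept_by_mask = toy[toy['PPDCategory Type'] == 'A'].copy()

# check that both approaches give the same result
print(kept_by_drop.equals(kept_by_mask))
```

The mask version does not rely on index labels being unique, which can matter for data sets built by combining several files.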

Make a new data set of the means for the type A sales.

In [None]:
# recreate the data set of means grouped by property type and year
leeds_year_type_data2=leeds_data2.groupby(['Property Type','Year'])['Price'].mean().reset_index()

# display the data set
leeds_year_type_data2

In [None]:
sns.relplot(kind='line', data=leeds_year_type_data2, x='Year', y='Price', hue='Property Type', aspect=2);

This shows that between 2000 and 2005 the mean price of flats was higher than that of semi-detached houses, but that since 2008 flats have sold for similar prices to terraced houses.

## Exploring the sales for a different town or city

In [None]:
# display the unique values for the Town/City field
# house_data['Town/City'].unique()
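
To repeat the analysis for another town or city, the steps used for Leeds can be wrapped in a function. This is a minimal sketch under stated assumptions: the function name `mean_price_by_year_and_type` is hypothetical, the rows below are made up, and `'YORK'` is used only as an example value that should be checked against the list of unique values first.

```python
import pandas as pd

def mean_price_by_year_and_type(data, town):
    """Return the mean Price per Property Type and Year for type A sales in one town."""
    # keep only the chosen town and the standard (type A) sales
    town_data = data[(data['Town/City'] == town) & (data['PPDCategory Type'] == 'A')]
    # mean price for each property type in each year
    return town_data.groupby(['Property Type', 'Year'])['Price'].mean().reset_index()

# made-up rows with the fields the function needs
toy = pd.DataFrame({
    'Town/City': ['YORK', 'YORK', 'YORK', 'LEEDS', 'YORK'],
    'Property Type': ['T', 'T', 'S', 'T', 'O'],
    'PPDCategory Type': ['A', 'A', 'A', 'A', 'B'],
    'Year': ['1995', '1995', '1995', '1995', '2016'],
    'Price': [40000, 60000, 90000, 45000, 500000],
})

# means for the example town
york_means = mean_price_by_year_and_type(toy, 'YORK')
york_means
```

On the full data set the function would be called with `house_data` in place of `toy`, and the result could be passed to `sns.relplot` in the same way as `leeds_year_type_data2`.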