# MEI Data Science Taught Course

# Lesson 1: UK house prices

The data for this lesson is from HM Land Registry: https://www.gov.uk/government/organisations/land-registry/about. It shows the records of all house sales in the UK between 1995 and 2017 (over 22 million transactions).

This notebook is an example of working with a very big data set.

The first block of code imports some libraries.

In [None]:
# import pandas
import pandas as pd

# import seaborn
import seaborn as sns

The next block of code imports the data.

*Note that this will take longer than usual because of the size of the data set.*

In [None]:
# import the data
house_data = pd.read_csv('/kaggle/input/uk-housing-prices-paid/price_paid_records.csv')

# display the data set to check it has imported correctly
house_data

In [None]:
# display the info for the data set
house_data.info()
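
With a data set this large, memory use can matter. The following is a minimal sketch, not part of the original lesson, of one way to reduce it: read only the columns that are needed and store repeated text values as categories. The sample rows are made up so that the code runs on its own. The column names match those used later in this notebook, and `County` is an assumed extra column included only to show one being skipped.

```python
import io
import pandas as pd

# A tiny made-up sample standing in for price_paid_records.csv (values are illustrative only)
sample_csv = io.StringIO(
    "Price,Date of Transfer,Property Type,Town/City,County\n"
    "25000,1995-08-18 00:00,T,OLDHAM,GREATER MANCHESTER\n"
    "42500,1995-08-09 00:00,S,LEEDS,WEST YORKSHIRE\n"
    "45000,1995-06-30 00:00,T,LEEDS,WEST YORKSHIRE\n"
)

# read only the columns used in this lesson and store repeated text values as categories
small_data = pd.read_csv(
    sample_csv,
    usecols=["Price", "Date of Transfer", "Property Type", "Town/City"],
    dtype={"Property Type": "category", "Town/City": "category"},
)

# memory_usage="deep" counts the memory used by the text values as well
small_data.info(memory_usage="deep")
```

The same `usecols` and `dtype` arguments could be passed when reading the full file. Whether the saving is worthwhile depends on which columns you need.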

## Pre-processing and preparation of data
### Adding a derived feature
The `Date of Transfer` feature gives the date of sale in the yyyy-mm-dd hh:mm format. The first 4 characters are the year of sale, which can be extracted and written to a new column: `Year`.

In [None]:
# create a new column Year from the first 4 characters of the Date of Transfer feature
house_data['Year'] = house_data['Date of Transfer'].str[:4]

# display the data to check
house_data
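
Slicing the first 4 characters gives the year as text. As a possible alternative, sketched here on made-up dates, the column can be parsed with `pd.to_datetime` and the year read off as a whole number. The column name `Year (number)` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# made-up dates in the same yyyy-mm-dd hh:mm format
sample = pd.DataFrame(
    {"Date of Transfer": ["1995-08-18 00:00", "2003-01-07 00:00", "2017-06-29 00:00"]}
)

# option 1: slice the first 4 characters (the result is text)
sample["Year"] = sample["Date of Transfer"].str[:4]

# option 2: parse the dates, then take the year (the result is a whole number)
parsed = pd.to_datetime(sample["Date of Transfer"], format="%Y-%m-%d %H:%M")
sample["Year (number)"] = parsed.dt.year

# display the result to compare the two columns
sample
```

Parsing 22 million dates takes noticeably longer than slicing text, so the string approach used in this lesson is a reasonable choice when only the year is needed.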

### Extracting a subset of the data set to explore
You can explore the data for a single town or city. In this example a new data set with just the sales from Leeds is constructed.

Selecting a subset of rows (or columns) is known as *slicing* a dataframe.

In [None]:
# create a new data set called leeds_data where the Town/City value is 'LEEDS'
# adding .copy() tells pandas to make an independent copy of the selected rows - this is not essential in this example but is good practice in general when "slicing" a data set
leeds_data = house_data[house_data['Town/City'] == 'LEEDS'].copy()

# display the data set to show that it has been created correctly
leeds_data
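
To see what the slicing command does step by step, here is a small sketch on a made-up table. The names `towns`, `is_leeds` and `leeds_only` are hypothetical and used only for this illustration.

```python
import pandas as pd

# a made-up table with the same column names as the house price data
towns = pd.DataFrame({
    "Town/City": ["LEEDS", "YORK", "LEEDS", "BRADFORD"],
    "Price": [60000, 75000, 82000, 50000],
})

# the comparison produces one True/False value per row
is_leeds = towns["Town/City"] == "LEEDS"

# keep only the rows marked True; .copy() makes the result independent of towns
leeds_only = towns[is_leeds].copy()

# changing the copy leaves the original table untouched
leeds_only["Price"] = leeds_only["Price"] * 2

print(is_leeds.tolist())
print(leeds_only)
print(towns)
```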

## Exploratory Data Analysis

In this section you will explore how the average house price for a specific city (Leeds) has changed over the time period of this data set by finding the median price for each year and plotting a time series for different property types.

### Exploring the average price grouped by year

In [None]:
leeds_data.groupby('Year')['Price'].median()

This trend appears to be consistent with https://en.wikipedia.org/wiki/Affordability_of_housing_in_the_United_Kingdom, which suggests that median prices in London went from £80,000 to £300,000 in a similar time frame.

### Exploring the median price per year for different property types

Generate a new table with the median for each property type for each year.

In [None]:
# create a new data set with the median price for each year for each property type
# note that as_index=False is needed when generating a table from a groupby command
leeds_year_type_data = leeds_data.groupby(['Property Type','Year'], as_index=False)['Price'].median()

# display the data set
leeds_year_type_data
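
The effect of `as_index=False` is easier to see on a small example. This sketch uses made-up prices, and the names `sales`, `as_series` and `as_table` are hypothetical.

```python
import pandas as pd

# made-up sales with the same column names as the Leeds data
sales = pd.DataFrame({
    "Property Type": ["F", "F", "T", "T", "T"],
    "Year": ["1995", "1995", "1995", "1996", "1996"],
    "Price": [40000, 50000, 30000, 32000, 36000],
})

# default: the group labels become the index and the result is a Series
as_series = sales.groupby(["Property Type", "Year"])["Price"].median()

# as_index=False: the group labels stay as ordinary columns in a DataFrame
as_table = sales.groupby(["Property Type", "Year"], as_index=False)["Price"].median()

print(as_series)
print(as_table)
```

Keeping `Property Type` and `Year` as ordinary columns is what allows them to be named directly as `x` and `hue` when plotting.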

In [None]:
# plot lines showing the change in median price over the years for different property types
sns.relplot(kind='line', data=leeds_year_type_data, x='Year', y='Price', hue='Property Type', aspect=2);

The data for *Other* sales appears very different from the data for detached, semi-detached, terraced and flats. You can explore whether it is appropriate to disregard these records. 

In [None]:
# display the data for 'Other' sales
leeds_data[leeds_data['Property Type']=='O']

There are 1015 records with `Property Type` set to `O` (out of 252,680 sales). This represents fewer than 1% of the sales. These *Other* property types can be disregarded for our visualisation.
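
The percentage can be checked with a short calculation using the two counts quoted above. The variable names are hypothetical.

```python
# counts quoted in the text above
other_sales = 1015
all_sales = 252680

# express the Other sales as a percentage of all sales
share = other_sales / all_sales * 100
print(f"{share:.2f}% of the sales have Property Type O")
```

On the full data, `leeds_data['Property Type'].value_counts(normalize=True)` would give the share of every property type in one step.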

In [None]:
# create a copy of leeds_year_type_data where the Property Type is not equal to O
leeds_year_type_data2 = leeds_year_type_data[leeds_year_type_data['Property Type'] != 'O'].copy()

# display the new data to check it has been created correctly
leeds_year_type_data2

In [None]:
# plot lines showing the change in median price over the years for different property types
sns.relplot(kind='line', data=leeds_year_type_data2, x='Year', y='Price', hue='Property Type', aspect=2);

This shows that between 2000 and 2005 the median price of flats was higher than that of semi-detached houses, but that since 2008 flats have sold for similar prices to terraced houses.

## Exploring the sales for a different town or city

In [None]:
# display the unique values for the Town/City feature
# house_data['Town/City'].unique()
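
One way to repeat the analysis for another town is to wrap the slicing and grouping steps in a function. This is a minimal sketch on made-up rows. The function name `median_price_by_year` is hypothetical, and it assumes that town names are stored in capital letters, as 'LEEDS' is.

```python
import pandas as pd

def median_price_by_year(data, town):
    """Return the median price per property type and year for one town (hypothetical helper)."""
    # slice the rows for the chosen town, assuming names are stored in capitals
    town_data = data[data["Town/City"] == town.upper()].copy()
    # group by property type and year, keeping the labels as columns
    return town_data.groupby(["Property Type", "Year"], as_index=False)["Price"].median()

# made-up rows standing in for house_data
sample = pd.DataFrame({
    "Town/City": ["YORK", "YORK", "LEEDS", "YORK"],
    "Property Type": ["T", "T", "T", "F"],
    "Year": ["1995", "1995", "1995", "1996"],
    "Price": [50000, 60000, 40000, 70000],
})

york_year_type_data = median_price_by_year(sample, "York")
print(york_year_type_data)
```

With the real data, the call would be `median_price_by_year(house_data, 'York')`, and the result could be plotted with the same `sns.relplot` command used for Leeds.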