# Introduction

### Analysis of the UK House Price Paid data set:
https://www.kaggle.com/hm-land-registry/uk-housing-prices-paid
https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads

I have used Python, R, GGPlot and Plot.ly to analyse over 20 million house sales to compare the number of houses sold in each year and the price of houses. Instead of comparing average house prices by location or any other aggregated analysis, we will directly compare the price of the same house being sold across years.

Because there are so many rows, at least 16GB of RAM is required to run this notebook, and we have taken steps to reduce the computational and visual strain of so much data.

This is the first part of my analysis. Because the later parts use the raw data from the GOV website, I have linked to the second and third parts, which show how I use this data. In summary, there are three sections:

Part 1.1: Data, Exploratory Analysis and Timeline of Monthly House Sales Analysis

Part 1.2: Organising Data for New Features

Part 2: Connecting House Sales Between Years
https://www.philiposbornedata.com/2018/03/07/uk-house-price-analysis-part-2/

![Correlation of the Same House Being Sold between Years 1995 to 1997](https://i.imgur.com/PjkvrnX.gifv)
[Correlation of the Same House Being Sold between Years 1995 to 1997](https://i.imgur.com/PjkvrnX.gifv)

Part 3: Connecting House Sales In the Same Year
https://www.philiposbornedata.com/2018/03/07/uk-house-price-analysis-part-3/

![Heat Map for number of House Sales between Years](https://i.imgur.com/VrmHpEs.png)
[Heat Map for number of House Sales between Years](https://i.imgur.com/VrmHpEs.png)



In [1]:
import pandas as pd
import numpy as np

# plotly.plotly moved to the separate chart_studio package in plotly 4 and is not used in this notebook
import plotly.graph_objs as go


from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

from numpy import arange,array,ones
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

from datetime import datetime

import math

## Part 1: Data and Exploratory Analysis

Import Data


Unfortunately, the data on Kaggle doesn’t contain the address fields that we use later for comparing the houses, but my introductory analysis is still doable with this data.

In [2]:
dataimport = pd.read_csv('../input/price_paid_records.csv', dtype=object)

(dataimport).head()

In [3]:
len(dataimport)
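
Given the memory requirement mentioned in the introduction, one option is to load only the columns this part needs and give them explicit types, rather than reading every column as `object`. The following is a minimal sketch on a made-up two-row CSV, not what this notebook runs; the `Other column` field and the row values are invented for illustration, and the three kept column names are the ones used later in this notebook.

```python
import io

import pandas as pd

# Made-up stand-in for price_paid_records.csv (assumption: same header style).
toy_csv = io.StringIO(
    "Transaction unique identifier,Price,Date of Transfer,Other column\n"
    "{A1},25000,1995-08-18 00:00,x\n"
    "{B2},42500,1995-08-09 00:00,y\n"
)

# Read only the columns needed here, with explicit types instead of dtype=object.
slim = pd.read_csv(
    toy_csv,
    usecols=['Transaction unique identifier', 'Price', 'Date of Transfer'],
    dtype={'Transaction unique identifier': str, 'Price': 'int64'},
    parse_dates=['Date of Transfer'],
)

print(slim.dtypes)
```

With this approach the later `astype(int)` and `pd.to_datetime` steps would no longer be needed, at the cost of a slower parse while reading.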

### Duplicates

First we check to see if there are any duplicate rows in the data. In this data set we identify none.

However, in the full price paid data from the GOV website, searching for duplicates on all columns except the ID shows that approximately 0.05% of rows are in fact duplicates. We can ignore this, as there is no way to identify these with the data provided.

In [4]:
len(dataimport.drop_duplicates())
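
The check on all columns except the ID, described above for the full GOV data, could be sketched as follows. This is a minimal illustration on made-up rows, not the code that produced the figure quoted above; the helper name `duplicate_share` is hypothetical.

```python
import pandas as pd

# Made-up rows: the first two differ only in their ID.
toy = pd.DataFrame({
    'Transaction unique identifier': ['A1', 'B2', 'C3'],
    'Price': ['100000', '100000', '250000'],
    'Date of Transfer': ['1995-01-03 00:00', '1995-01-03 00:00', '1996-05-10 00:00'],
})


def duplicate_share(df, id_column='Transaction unique identifier'):
    """Share of rows that repeat an earlier row on every column except the ID."""
    other_columns = [c for c in df.columns if c != id_column]
    # duplicated() marks the second and later occurrences as True.
    return df.duplicated(subset=other_columns).mean()


toy_share = duplicate_share(toy)
print(round(toy_share, 3))
```

Because the ID is unique per row, `drop_duplicates()` on all columns cannot find these cases, which is why the ID column has to be left out of the comparison.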

## House Price Distribution

Let us take a quick exploratory look at the distribution of house prices. We see that the majority of house prices across all years is less than £500,000.

I first created a histogram using plot.ly, but due to the sheer size of the data this caused performance issues, so I resorted to a simple matplotlib plot instead.

Unfortunately, neither seems to work properly in Kaggle, so I have run it locally to get the information required for our next steps.

In [5]:
# Do not run this in Kaggle, not enough resources.
#hist_trace = go.Histogram( x = dataimport['Price'],
#                          name = 'All House Prices',
#                          xbins = dict(
#                          start = 0,
#                          end = 1000000,
#                          size = 100000
#                          ),
#                          marker = dict(
#                          color = '#EB89B5')
#                         )
#histlayout = go.Layout(
#    title='Distribution of All House Prices',
#    xaxis=dict(
#        title='House Price (£)'
#    ),
#    yaxis=dict(
#        title='Count'
#    ),
#    bargap=0.05,
#    bargroupgap=0.1
#)
#histdata = [hist_trace]
#histfig = go.Figure(data=histdata,layout=histlayout)

# In line plot
#iplot(histfig)

# Shows as new file instead
#plot(histfig)

In [6]:
#Alternative Matplotlib histogram that can't be run in Kaggle due to size of data

#plt.hist(dataimport['Price'])
#plt.title('All House Prices Distribution')
#plt.xlabel('House Price (£)')
#plt.ylabel('Count')

#plt.show()
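
One way around the size problem, sketched here as an assumption rather than what was run for this notebook, is to bin the prices with NumPy first and then plot only the bin counts. The bin settings mirror the `xbins` values in the commented-out plot.ly cell; the function name `binned_price_counts` and the toy prices are made up, and the prices are assumed to be numeric already.

```python
import numpy as np


def binned_price_counts(prices, start=0, end=1_000_000, size=100_000):
    """Count prices in fixed-width bins so that only the counts need plotting."""
    edges = np.arange(start, end + size, size)
    # Prices outside [start, end] are simply not counted.
    counts, edges = np.histogram(np.asarray(prices), bins=edges)
    return counts, edges


toy_prices = [45000, 120000, 130000, 499999, 2500000]
toy_counts, toy_edges = binned_price_counts(toy_prices)
print(toy_counts)
```

The ten counts could then be drawn with something like `plt.bar(toy_edges[:-1], toy_counts, width=100000, align='edge')`, which plots ten bars instead of handing every row to the plotting library.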

## Remove outliers

I then reduced the data by removing extreme house prices. The sales that are kept are those with a price:

- less than £10 Million and,
- greater than £10,000

In [7]:
dataimport['Price'] = dataimport['Price'].astype(str).astype(int)

In [21]:
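# Keep prices strictly between £10,000 and £10 Million; copy so that new columns can be added without a SettingWithCopyWarning
dataimport3 = dataimport.loc[(dataimport['Price'] < 10000000) & (dataimport['Price'] > 10000)].copy()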
len(dataimport3)
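
To see how many rows each bound removes, a small check like the one below may help. It is a sketch on made-up prices, using the same two thresholds as the filter above; the variable names are hypothetical. Note that both comparisons are strict, so a price of exactly £10,000 or £10 Million is dropped.

```python
import pandas as pd

# Made-up prices, including values exactly on both thresholds.
toy = pd.DataFrame({'Price': [500, 10000, 85000, 320000, 10000000, 25000000]})

lower, upper = 10_000, 10_000_000

too_low = int((toy['Price'] <= lower).sum())    # removed by the lower bound
too_high = int((toy['Price'] >= upper).sum())   # removed by the upper bound
kept = toy.loc[(toy['Price'] > lower) & (toy['Price'] < upper)].copy()

print(too_low, too_high, len(kept))
```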

## Convert ‘Date of Transfer’ to a datetime and then create ‘Year’ column


In [22]:
dataimport3['Date of Transfer'] = pd.to_datetime(dataimport3['Date of Transfer'])

dataimport3['Year'] = dataimport3['Date of Transfer'].dt.year

# Final Analysis of Part 1

## Timeline of Monthly House Sales

Lastly for this part, I create a timeline plot in Plot.ly of the number of house sales in each month between 1995 and 2017.

You can interact with this using the slider to select specific time frames. I have also added some simple annotations to highlight two key events: 1) the 2008 Financial Crash and 2) the 2016 UK EU Referendum.


In [37]:
# Create a list with the last day of each month from January 1995 to June 2017
daterange = pd.date_range('1995-01-01','2017-06-30' , freq='1M')
daterange = [d.strftime('%d-%m-%Y') for d in daterange]

# Use group by to calculate the number of sales of each month
fulldatafortimeplot = dataimport3.groupby([(dataimport3["Date of Transfer"].dt.year),(dataimport3["Date of Transfer"].dt.month)]).count()
fulldatafortimeplot2 = pd.DataFrame(fulldatafortimeplot['Transaction unique identifier'])
fulldatafortimeplot2['Dates'] = daterange
fulldatafortimeplot2['Dates'] = pd.to_datetime(fulldatafortimeplot2['Dates'], format = '%d-%m-%Y')
fulldatafortimeplot2.columns = ['Count', 'Dates']
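
The cell above attaches a separately built list of dates to the grouped counts by position, which only lines up if every month in the range has at least one sale. An alternative, sketched here on made-up rows and not what this notebook runs, derives each date from the group itself. The function name `monthly_sales` is hypothetical.

```python
import pandas as pd

# Made-up sales: two in January 1995, none in February, two in March.
toy = pd.DataFrame({
    'Transaction unique identifier': ['a', 'b', 'c', 'd'],
    'Date of Transfer': pd.to_datetime(
        ['1995-01-03', '1995-01-20', '1995-03-15', '1995-03-31']),
})


def monthly_sales(df, date_column='Date of Transfer'):
    """Count sales per calendar month, labelling each month by its last day."""
    months = df[date_column].dt.to_period('M')
    counts = df.groupby(months).size()
    return pd.DataFrame({
        'Count': counts.to_numpy(),
        # End of the month, with the time of day reset to midnight.
        'Dates': counts.index.to_timestamp(how='end').normalize(),
    })


toy_monthly = monthly_sales(toy)
print(toy_monthly)
```

A month with no sales does not appear in the result at all, rather than shifting the later dates by one position.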

In [38]:
# Plot.ly Timeline plot with range slider
trace_time = go.Scatter(
    x=fulldatafortimeplot2['Dates'],
    y=fulldatafortimeplot2['Count'],
    name = "Number of House Sales",
    line = dict(color = '#7F7F7F'),
    opacity = 0.8)

data_timline = [trace_time]

layout_timeline = go.Layout(
    dict(
    title='Timeline of the Number of House Sales in the UK between 1995 and 2017',
    xaxis=dict(

        rangeslider=dict(),
        type='date'
    ),
    annotations = [
        dict(
        x = datetime.strptime('23-06-2016', '%d-%m-%Y'),
        y = 84927,
        xref = 'x',
        yref = 'y',
        text = 'UK Referendum',
        showarrow = True,
        arrowhead = 7,
        ax = 0,
        ay = -40
        ),
        dict(
        x = datetime.strptime('01-12-2007', '%d-%m-%Y'),
        y = 104283,
        xref = 'x',
        yref = 'y',
        text = 'Financial Crash',
        showarrow = True,
        arrowhead = 7,
        ax = 0,
        ay = -40
        )
    ]

)
)

fig = dict(data=data_timline, layout=layout_timeline)
iplot(fig)

Lastly, we export our data to a new CSV so that we can pick it up in the second kernel for this part. We were unable to run the whole process in one kernel due to run time issues.
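
The export step could look roughly like this. It is a minimal sketch on made-up rows; the file name `price_paid_part1_toy.csv` and the temporary directory are assumptions for illustration, not the path used for the real kernel.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned data frame with its Year column.
toy = pd.DataFrame({
    'Transaction unique identifier': ['A1', 'B2'],
    'Price': [85000, 320000],
    'Date of Transfer': pd.to_datetime(['1995-01-03', '2016-06-23']),
})
toy['Year'] = toy['Date of Transfer'].dt.year

# Hypothetical output path; index=False avoids writing the row index as a column.
out_path = os.path.join(tempfile.gettempdir(), 'price_paid_part1_toy.csv')
toy.to_csv(out_path, index=False)

# In the next kernel, parse the date column again while reading.
reloaded = pd.read_csv(out_path, parse_dates=['Date of Transfer'])
print(reloaded.dtypes)
```

A CSV does not store column types, so the date column has to be parsed again on load, as in the last step.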