This notebook is an example of working with a very big data set.

The first block of code imports some libraries and gets the path to the data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed

# import pandas
import pandas as pd

# import matplotlib
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Import the data and check it.

*Note that this will take longer than usual because of the size of the data set.*

In [None]:
# import the data and check it by viewing the top rows
house_data = pd.read_csv('/kaggle/input/uk-housing-prices-paid/price_paid_records.csv')
house_data.head()
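
Because the file is large, loading only the columns that are used later may reduce memory use and load time. The following is a minimal sketch, not part of the original workflow. It assumes the column names that appear later in this notebook, and it reads a tiny in-memory CSV (with invented rows, and a hypothetical `sample_data` name) in place of the real file.

```python
import io
import pandas as pd

# Tiny stand-in for price_paid_records.csv; these rows are invented for illustration.
sample_csv = io.StringIO(
    "Price,Date of Transfer,Property Type,Town/City,County,PPDCategory Type\n"
    "50000,1995-08-18 00:00,T,LEEDS,WEST YORKSHIRE,A\n"
    "72000,1996-03-02 00:00,S,LEEDS,WEST YORKSHIRE,A\n"
    "95000,1996-07-21 00:00,D,YORK,YORK,A\n"
)

# Columns used in the rest of the notebook (assumed from the later cells)
wanted = ['Price', 'Date of Transfer', 'Property Type', 'Town/City', 'PPDCategory Type']

# usecols tells read_csv to keep only the listed columns
sample_data = pd.read_csv(sample_csv, usecols=wanted)
print(sample_data.shape)
```

With the real file, the same `usecols` argument could be passed to the `pd.read_csv` call above.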

The `shape` attribute returns the number of rows and columns, which gives a sense of how big this dataset is.

In [None]:
house_data.shape

# Exploratory Data Analysis

In this section you will explore how the average house price for a specific city (Leeds) has changed over the time period of this data set, by finding the mean price for each year and plotting a time series for each property type.

## Adding a derived field
The `Date of Transfer` field gives the date of sale in the yyyy-mm-dd hh:mm format. The first 4 characters are the year of sale, which can be extracted and written to a new field: `Year`. Note that `Year` is stored as text, not as a number.

In [None]:
house_data['Year']=house_data['Date of Transfer'].str[:4]
house_data.head()
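
As an alternative to string slicing, the dates can be parsed and the year taken as an integer. This is a small sketch on invented rows, assuming the yyyy-mm-dd hh:mm format described above. The `toy` frame and the `Year (int)` column name are hypothetical.

```python
import pandas as pd

# Two invented rows in the same date format as the data set
toy = pd.DataFrame({'Date of Transfer': ['1995-08-18 00:00', '2003-01-05 00:00']})

# String slicing, as in the notebook: Year is stored as text
toy['Year'] = toy['Date of Transfer'].str[:4]

# Alternative: parse the dates, then take the year as an integer
parsed = pd.to_datetime(toy['Date of Transfer'], format='%Y-%m-%d %H:%M')
toy['Year (int)'] = parsed.dt.year

print(toy)
```

Parsing a column this large is likely to be slower than slicing, so the string approach is a reasonable choice when only the year is needed.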

## Extracting rows
A new data set with just the sales from Leeds can be constructed.

In [None]:
leeds_data = house_data[house_data['Town/City'] == 'LEEDS']
leeds_data.shape

Exploring the mean price by year.

In [None]:
print(leeds_data.groupby(['Year'])['Price'].mean())

This upward trend appears to be broadly consistent with https://en.wikipedia.org/wiki/Affordability_of_housing_in_the_United_Kingdom, which suggests that London median prices went from £80,000 to £300,000 in a similar time frame.

Generate a new table with the mean for each property type for each year.

In [None]:
leeds_year_type_data=leeds_data.groupby(['Property Type','Year'])['Price'].mean().reset_index()
leeds_year_type_data.head()

In [None]:
# create a plot with Year on the x axis and mean Price on the y axis, one line per property type
fig, ax = plt.subplots(figsize=(20,10))
leeds_year_type_data.groupby('Property Type').plot(x='Year',y=['Price' ],ax=ax)
ax.legend(labels=leeds_year_type_data.groupby('Property Type').groups.keys())
plt.show()
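
Another way to get one line per property type is to reshape the table so that each property type becomes its own column. A minimal sketch on invented values follows. The `toy_means` and `wide` names are hypothetical, and the frame has the same three columns as the table of means built above.

```python
import pandas as pd

# Invented means, laid out like the grouped table: one row per (type, year)
toy_means = pd.DataFrame({
    'Property Type': ['D', 'D', 'F', 'F'],
    'Year': ['1995', '1996', '1995', '1996'],
    'Price': [100000.0, 110000.0, 40000.0, 45000.0],
})

# Reshape to one row per year and one column per property type
wide = toy_means.pivot(index='Year', columns='Property Type', values='Price')
print(wide)
```

Calling `wide.plot(figsize=(20,10))` on a table shaped like this draws one line per column and takes the legend labels from the column names, so the legend does not need to be set by hand.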

Explore the sales with property type 'O' (Other).

In [None]:
leeds_data[leeds_data['Property Type']=='O']

In [None]:
leeds_data[(leeds_data['Property Type']=='O') & (leeds_data['PPDCategory Type']=='A')]

The second filter returns no rows, so all the data for 'Others' is PPD Category Type B: Additional Price Paid entry, including transfers under a power of sale/repossessions, buy-to-lets (where they can be identified by a Mortgage) and transfers to non-private individuals.
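
A quick way to confirm how the 'O' rows split across categories is to count the values of `PPDCategory Type`. The sketch below uses a few invented rows. The `toy_leeds`, `others`, `counts` and `all_b` names are hypothetical.

```python
import pandas as pd

# Invented rows with the two columns used in the filters above
toy_leeds = pd.DataFrame({
    'Property Type': ['O', 'O', 'T', 'F'],
    'PPDCategory Type': ['B', 'B', 'A', 'B'],
})

# Keep only the 'O' rows, then count each category among them
others = toy_leeds[toy_leeds['Property Type'] == 'O']
counts = others['PPDCategory Type'].value_counts()
print(counts)

# True only if every 'O' row is category B
all_b = (others['PPDCategory Type'] == 'B').all()
print(all_b)
```

The same two steps could be applied to `leeds_data` in place of the toy frame.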

Make a new data set containing only the category A sales: Standard Price Paid entry, which includes single residential properties sold for full market value.

In [None]:
leeds_data2 = leeds_data[leeds_data['PPDCategory Type']=='A']
leeds_data2.shape

Make a dataset of the mean price for each property type and year, using only the category A sales.

In [None]:
leeds_year_type_data2=leeds_data2.groupby(['Property Type','Year'])['Price'].mean().reset_index()
leeds_year_type_data2.head()

In [None]:
#create a single set of axes to plot the series on 
fig, ax = plt.subplots(figsize=(20,10))

# plot the grouped data on the axes
leeds_year_type_data2.groupby('Property Type').plot(x='Year',y=['Price' ],ax=ax)

# set the legend to the keys
ax.legend(labels=leeds_year_type_data2.groupby('Property Type').groups.keys())
plt.show()