# Machine Learning Analysis - London Housing Data

This notebook applies supervised ML models from scikit-learn to analyse the real estate data in `10m_london_houses_.csv`.

## Plan
The analysis is split into 3 sections according to the data used (SQM-only, all data and top features), followed by a conclusion:
- SQM-only: get data, run simple models, plot predictions and show R2 for some cross-validation folds (no HP tuning)
- All data: go into more detail and do HP tuning
    - Prepare data: convert categoric to dummy variables, scale data
    - Create simple models and plot results
    - HP tune the models via RandomizedSearchCV
    - Evaluate the models using R2, RMSE (NMSE?), MAE and MAPE
- Top features: compare the performance of models using just the top features with the last models
    - Run PCA - determine the top third of features
    - Train models on just these features
    - HP tune these models via RandomizedSearchCV
    - Evaluate the models using R2, RMSE (NMSE?), MAE and MAPE
- Conclusion
    - How good are the models?
    - What have we learned?
    - Any key takeaways
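
The four evaluation metrics named in the plan can all be computed with `sklearn.metrics`. The sketch below is only illustrative: the prices and predictions are made-up numbers rather than outputs from this notebook, and the helper name `regression_report` is hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    r2_score,
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
)


def regression_report(y_true, y_pred):
    """Return R2, RMSE, MAE and MAPE for a set of predictions (hypothetical helper)."""
    return {
        "R2": r2_score(y_true, y_pred),
        # RMSE is taken as the square root of the MSE
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": mean_absolute_error(y_true, y_pred),
        # MAPE is returned as a fraction (0.05 means 5%), not a percentage
        "MAPE": mean_absolute_percentage_error(y_true, y_pred),
    }


# Made-up prices (£) and predictions, purely for illustration
y_true = np.array([1_000_000, 2_000_000, 1_500_000, 2_500_000])
y_pred = np.array([1_100_000, 1_900_000, 1_500_000, 2_400_000])

report = regression_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

RMSE and MAE are in the same units as the target (pounds here), while R2 and MAPE are unitless, which makes the latter two easier to compare across datasets.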

Repository link: [Github](https://github.com/rsamconn/London-housing)

In [4]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge

print(np.nan)

nan


In [3]:
# Load the data
housing_df = pd.read_csv('data/10m_london_houses_.csv')

# Reminder of how the data looks
print(f"Dataset shape: {housing_df.shape}")
display(housing_df.head())

Dataset shape: (1000, 17)


| | Address | Neighborhood | Bedrooms | Bathrooms | Square Meters | Building Age | Garden | Garage | Floors | Property Type | Heating Type | Balcony | Interior Style | View | Materials | Building Status | Price (£) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 78 Regent Street | Notting Hill | 2 | 3 | 179 | 72 | No | No | 3 | Semi-Detached | Electric Heating | High-level Balcony | Industrial | Garden | Marble | Renovated | 2291200 |
| 1 | 198 Oxford Street | Westminster | 2 | 1 | 123 | 34 | Yes | No | 1 | Apartment | Central Heating | High-level Balcony | Industrial | City | Laminate Flooring | Old | 1476000 |
| 2 | 18 Regent Street | Soho | 5 | 3 | 168 | 38 | No | Yes | 3 | Semi-Detached | Central Heating | No Balcony | Industrial | Street | Wood | Renovated | 1881600 |
| 3 | 39 Piccadilly Circus | Islington | 5 | 1 | 237 | 53 | Yes | Yes | 1 | Apartment | Underfloor Heating | No Balcony | Classic | Park | Granite | Renovated | 1896000 |
| 4 | 116 Fleet Street | Marylebone | 4 | 1 | 127 | 23 | No | Yes | 2 | Semi-Detached | Central Heating | No Balcony | Modern | Park | Wood | Old | 1524000 |


Our data has 1000 rows and 17 columns.

Of these columns, 6 contain numeric data and 11 contain categoric data.
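
One way to check this split is `DataFrame.select_dtypes`. The sketch below rebuilds the first two rows of the preview above by hand so that it runs on its own; in the notebook the same two calls could be made on `housing_df`. It assumes the categoric columns are stored as text, so that anything non-numeric counts as categoric.

```python
import pandas as pd

# First two rows of the preview above, typed in by hand so the sketch is self-contained
sample_df = pd.DataFrame(
    {
        "Address": ["78 Regent Street", "198 Oxford Street"],
        "Neighborhood": ["Notting Hill", "Westminster"],
        "Bedrooms": [2, 2],
        "Bathrooms": [3, 1],
        "Square Meters": [179, 123],
        "Building Age": [72, 34],
        "Garden": ["No", "Yes"],
        "Garage": ["No", "No"],
        "Floors": [3, 1],
        "Property Type": ["Semi-Detached", "Apartment"],
        "Heating Type": ["Electric Heating", "Central Heating"],
        "Balcony": ["High-level Balcony", "High-level Balcony"],
        "Interior Style": ["Industrial", "Industrial"],
        "View": ["Garden", "City"],
        "Materials": ["Marble", "Laminate Flooring"],
        "Building Status": ["Renovated", "Old"],
        "Price (£)": [2291200, 1476000],
    }
)

# Numeric columns are those with an integer or float dtype
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categoric here
categoric_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(f"{len(numeric_cols)} numeric: {numeric_cols}")
print(f"{len(categoric_cols)} categoric: {categoric_cols}")
```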

The EDA notebook contains a more thorough analysis of the data.

First, let's set aside 100 rows to be used for validating the model outputs later.

In [None]:
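# Sample 100 rows to hold out for validation; this does not yet remove them from housing_df
validation_set = housing_df.sample(n=100, random_state=24)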

## First attempt - only look at house area
It's often mentioned that the most important factor in determining house prices is the size.

Let's take a look at that idea by building a model that only uses the Square Meters data as a feature.

In [5]:
# Create sqm-only dataset
X_sqm = housing_df['Square Meters'].values.reshape(-1, 1)
y = housing_df['Price (£)'].values
X_train_sqm, X_test_sqm, y_train_sqm, y_test_sqm = train_test_split(X_sqm, y, test_size=0.3, random_state=24)

print(X_train_sqm.shape, X_test_sqm.shape, y_train_sqm.shape, y_test_sqm.shape)

(700, 1) (300, 1) (700,) (300,)
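
With the split in place, a simple next step is to fit a linear regression on the training portion and look at R2 on the test set and across a few cross-validation folds. The sketch below is a stand-alone illustration, not this notebook's results: it uses synthetic area and price data (an arbitrary linear relationship plus noise) in place of `housing_df`, so the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real data: 1000 areas (m^2) and noisy, roughly linear prices (£)
rng = np.random.default_rng(24)
X_sqm = rng.uniform(50, 250, size=(1000, 1))
y = 10_000 * X_sqm[:, 0] + rng.normal(0, 200_000, size=1000)

# Same split settings as the cell above
X_train_sqm, X_test_sqm, y_train_sqm, y_test_sqm = train_test_split(
    X_sqm, y, test_size=0.3, random_state=24
)

# Fit on the training portion and score on the held-out test portion
model = LinearRegression()
model.fit(X_train_sqm, y_train_sqm)
test_r2 = model.score(X_test_sqm, y_test_sqm)  # .score returns R2 for regressors

# 5-fold cross-validated R2 on the training data only (no hyperparameter tuning)
cv_r2 = cross_val_score(LinearRegression(), X_train_sqm, y_train_sqm, cv=5, scoring="r2")

print(f"Test R2: {test_r2:.3f}")
print(f"CV R2 per fold: {np.round(cv_r2, 3)}")
print(f"CV R2 mean: {cv_r2.mean():.3f}")
```

Running the cross-validation on the training portion only keeps the test set untouched until the final comparison between models.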
