# Optional: House Prices Data Analysis

![teaser](images/teaser.jpg)

When working with real-world data, there are various transformations you can apply to make the data easier for neural networks to process. In addition, pre-selecting and transforming the raw features can boost your performance.

This notebook is a starting point for analyzing the house prices data introduced in __4_HousePrices-Classification__. We will load the data and show off some useful pandas functions that can help you choose transformations to improve your regression performance in that notebook. Let's go!

In [None]:
# Installation of seaborn which is used here for visualisation
!pip install seaborn

In [None]:
# As usual, a bit of setup

import time
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
%load_ext autoreload
%autoreload 2

# House Price Data
## Exploration

Make sure to run the *download_datasets.sh* script before running the upcoming cell. Previously, we provided you with a data loading wrapper function to access the CIFAR10 data. This time, our input is a CSV file, which we load ourselves using [pandas](https://pandas.pydata.org) so that we can easily access and alter entries in our data matrix. Let's take a quick look at what the data looks like!

In [None]:
# Load the data
data = pd.read_csv("datasets/house_prices.csv", index_col=False)

In [None]:
# You can easily get an overview of our features using .info(). Note that not all features are actually numbers!
data.info()
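
Since not all features are numbers, it often helps to split the columns by type and check for missing values before deciding on transformations. The following is a minimal sketch on a small, made-up frame: `toy` and its values are hypothetical stand-ins and are not read from `house_prices.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the house prices data
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Columns with a numeric dtype vs. everything else (e.g. strings)
numeric_cols = toy.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = toy.select_dtypes(exclude=[np.number]).columns.tolist()

# Number of missing entries per column
missing_per_column = toy.isnull().sum()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print(missing_per_column[missing_per_column > 0])
```

The same calls should work on `data` once it is loaded; which columns end up in each list depends on the actual CSV.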

In [None]:
# Using the describe function, we can get an overview of the numerical ranges
data.describe()

In [None]:
# Our target variable is SalePrice, which we explore in detail here
data['SalePrice'].describe()
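
Price-like targets are often right-skewed, and one common option (not necessarily the one intended for this exercise) is to log-transform them. The sketch below uses made-up prices in a hypothetical `prices` series to compare the skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one very expensive house (not taken from the dataset)
prices = pd.Series(
    [100000, 120000, 130000, 150000, 160000, 180000, 200000, 750000],
    name="SalePrice",
)

# Skewness of the raw target; positive values indicate a long right tail
skew_raw = prices.skew()

# log1p compresses large values; expm1 inverts it when mapping predictions back
log_prices = np.log1p(prices)
skew_log = log_prices.skew()

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

If you train on a transformed target, remember to apply `np.expm1` to the network output before comparing it with real prices.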

In [None]:
# Relationship with the numerical features. We explore only two here as a sample
# GrLivArea
# TotalBsmtSF
var = 'GrLivArea'
relationship_in_df = pd.concat([data['SalePrice'], data[var]], axis=1)
relationship_in_df.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

In [None]:
# Scatter plot of TotalBsmtSF vs. SalePrice
var = 'TotalBsmtSF'
relationship_in_df = pd.concat([data['SalePrice'], data[var]], axis=1)
relationship_in_df.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

In [None]:
# General correlation matrix
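# Restrict to numeric columns: recent pandas versions raise an error on string columns
corrmat = data.select_dtypes(include=[np.number]).corr()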
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

In [None]:
# SalePrice correlation matrix:
# We look at the 10 variables most correlated with our target "SalePrice"
k = 10
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(data[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

This shows that certain attributes are much more strongly correlated with SalePrice than others. Next, we explore scatter plots of selected attributes.
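
The correlation with the target can also be used for a simple feature pre-selection. Below is a minimal sketch on synthetic data: the frame `toy`, the coefficients used to generate it, and the cut-off `threshold = 0.5` are all assumptions chosen for illustration, not values recommended by this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: two informative features and one pure-noise feature
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
quality = rng.integers(1, 11, n).astype(float)
noise_feature = rng.normal(0, 1, n)
price = 100 * area + 15000 * quality + rng.normal(0, 5000, n)

toy = pd.DataFrame({
    "GrLivArea": area,
    "OverallQual": quality,
    "Noise": noise_feature,
    "SalePrice": price,
})

# Pearson correlation of every feature with the target
corr_with_target = toy.corr()["SalePrice"].drop("SalePrice")

# Keep features whose absolute correlation exceeds the (arbitrary) threshold
threshold = 0.5
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()

print(corr_with_target.sort_values(ascending=False))
print("selected:", selected)
```

Keep in mind that Pearson correlation only captures linear relationships, so a feature with low correlation may still be useful to a neural network.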

In [None]:
# Scatterplots
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(data[cols], height=2.5)  # 'size' was renamed to 'height' in newer seaborn versions
plt.show();

## Follow up steps

You can use the rest of this notebook for your own data exploration. Use it to select useful features and to try out ideas for data transformations, which you can then apply in the main house prices notebook to improve your network performance!
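
As one possible starting point, the sketch below standardizes a numeric column and one-hot encodes a categorical one. The frame `toy`, its values, and the choice of these two transformations are assumptions for illustration; other choices may work better on the real data.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical feature
toy = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, 2000.0, 2500.0],
    "Neighborhood": ["A", "B", "A", "C"],
})

# Standardize numeric features to zero mean and unit (sample) standard deviation
numeric = toy[["GrLivArea"]]
standardized = (numeric - numeric.mean()) / numeric.std()

# One-hot encode the categorical feature as float columns
one_hot = pd.get_dummies(toy["Neighborhood"], prefix="Neighborhood", dtype=float)

# Combine both parts into a single feature matrix
features = pd.concat([standardized, one_hot], axis=1)
print(features)
```

When you apply such a transformation for training, compute the mean and standard deviation on the training split only and reuse them for the validation and test data.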