## House Prices Dataset Analysis

### Table of Contents
<ol>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ol>

<a id='intro'></a>
## 1. Introduction

[The Ames Housing dataset](http://jse.amstat.org/v19n3/decock.pdf) was compiled by Dean De Cock for use in data science education, and it is a great alternative to the Boston Housing dataset. It describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010. The dataset contains only residential sales within that period, and only the most recent sale of any given property.

The dataset contains 2919 observations separated in:
* training set, with 1460 observations
* testing set, with 1459 observations.

There are 80 features involved in assessing home values, with the target variable included. They focus on the quality and quantity of many physical attributes of the property. 

The features have the following structure:

#### #1 Categorical Variables:

They range from 2 to 28 classes. We should use label encoding for these categorical variables.

* 23 **nominal**: typically identify various types of dwellings, garages, materials, and environmental conditions 
* 23 **ordinal**: ordinal variables typically rate various items within the property. 
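
As a rough illustration of how the two kinds of categorical variables can be encoded, the sketch below maps an ordinal column to ranked integers and one-hot encodes a nominal column. The toy data frame is made up for illustration, and the quality ordering (Po < Fa < TA < Gd < Ex) is an assumption taken from the usual reading of the quality codes.

```python
import pandas as pd

# toy data; the values are made up for illustration
toy = pd.DataFrame({
    'ExterQual': ['TA', 'Gd', 'Ex', 'Fa', 'TA'],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave', 'Grvl'],
})

# ordinal: map each level to an integer that preserves the ranking
# (assumed order: Po < Fa < TA < Gd < Ex)
quality_order = {'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4}
toy['ExterQual_code'] = toy['ExterQual'].map(quality_order)

# nominal: one indicator column per class, since the classes have no order
street_dummies = pd.get_dummies(toy['Street'], prefix='Street')

encoded = pd.concat([toy[['ExterQual_code']], street_dummies], axis=1)
print(encoded)
```

Integer codes suit ordinal features because the order carries meaning; for nominal features, indicator columns avoid implying an order that does not exist.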

**PID**  and **NEIGHBORHOOD** are two features of special interest.

PID (Parcel Identification Number assigned to each property within the Ames Assessor’s system) 
* This number can be used in conjunction with the [Assessor’s Office](http://www.cityofames.org/assessor/) or [Beacon](http://beacon.schneidercorp.com/) websites to directly view the records of a particular observation.
* The typical record will indicate the values for characteristics commonly quoted on most home flyers and will include a picture of the property.
* I must say that PID number was especially useful when trying to fill the missing values.

#### #2 Numeric variables:
* 14 discrete: quantify the number of items occurring within the house:
    * the number of kitchens, bedrooms, and bathrooms (full and half) located in the basement and above grade (ground) living areas of the home, the garage capacity, construction/remodeling dates.
* 19 continuous (area dimensions)
    * typical lot size and total dwelling square footage and other more specific variables are quantified in the data set like: area measurements on the basement, main living area, porches are broken down into individual categories based on quality and type.
    
I have compiled a spreadsheet (features.ods) with all the features with the description for each type of variable: 

* numeric (continuous, discrete) 
* categorical (nominal, ordinal)

### Goal

The goal of this notebook is to understand the Ames Dataset in order to uncover meaningful patterns and insights and model the data to make accurate sale price predictions.
1. First, I will assess and clean the data by: 
    * Categorizing features 
    * Filling in missing values
    * Removing outliers 
2. Perform **Exploratory Data Analysis** to visualise how our variables are distributed and how they correlate to each other.
3. Fit the clean data to a simple **Linear Regression Model** to get a baseline for further improvements. Using only two variables I was able to build a simple model with a **Coefficient of Determination (R Squared)** of about 0.80. I first applied a log transformation to the target variable to make its distribution closer to normal, and then fitted the input variables to the linear model. The two variables used in the regression are the Total Square Footage (`TotalBsmtSF` + `GrLivArea`) and the `Neighborhood`, with one-hot encoding applied to the latter. The model was evaluated with the **Root Mean Squared Error (RMSE)**: about 0.17444 on the training set and 0.19363 on the testing set in the Kaggle House Prices competition. 
This is just a baseline model, with plenty of room for improvement and creative feature engineering. It uses only two of the 79 features in the dataset. Other models such as XGBoost, CatBoost, LightGBM and ElasticNet are also worth trying. Stacking the results of these models and tuning their hyperparameters are the next steps towards a second, more complex model with better predictions.
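
To make the baseline recipe concrete, here is a minimal sketch of the same idea on synthetic data: a log-scale target, one numeric area feature, and a one-hot encoded neighborhood, fitted by ordinary least squares with NumPy. The data, the neighborhood labels `A`/`B`/`C` and the coefficients are all made up, so the printed scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300

# synthetic stand-in for the real data (all values are made up)
toy = pd.DataFrame({
    'TotalSquareFootage': rng.uniform(800, 4000, n),
    'Neighborhood': rng.choice(['A', 'B', 'C'], n),
})
premium = toy['Neighborhood'].map({'A': 0.0, 'B': 0.2, 'C': 0.4})
log_price = 11 + 0.0004 * toy['TotalSquareFootage'] + premium + rng.normal(0, 0.1, n)

# design matrix: intercept, area, one-hot neighborhood (first class dropped)
dummies = pd.get_dummies(toy['Neighborhood'], drop_first=True).astype(float)
X = np.column_stack([np.ones(n),
                     toy['TotalSquareFootage'].to_numpy(),
                     dummies.to_numpy()])
y = log_price.to_numpy()

# ordinary least squares on the log target
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef

rmse = np.sqrt(np.mean((y - pred) ** 2))
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f'RMSE (log scale): {rmse:.3f}, R squared: {r2:.3f}')
```

Dropping the first indicator column avoids perfect collinearity with the intercept; the dropped class becomes the reference level.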

<a id='wrangling'></a>
## 2. Data Wrangling

Getting the data I need in three steps:
1. Gather 
2. Assess
3. Cleaning

### #2.1. Gather the Data

The data set can be found on Kaggle, the classic ["House Prices: Advanced Regression Techniques"](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview) competition.

In [None]:
import numpy as np # linear algebra
import pandas as pd 
import os
print(os.listdir("../input"))
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Graphics in SVG format are more sharp and legible
%config InlineBackend.figure_format = 'svg' 

# Increase the default plot size and set the color scheme
plt.rcParams['figure.figsize'] = 8, 5
plt.rcParams['image.cmap'] = 'viridis'

%matplotlib inline

In [None]:
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)

In [None]:
# load the dataset
PATH_TO_DATA = '../input'

df_train = pd.read_csv(os.path.join(PATH_TO_DATA, 
                                             'train.csv'), index_col='Id')
df_test = pd.read_csv(os.path.join(PATH_TO_DATA, 
                                             'test.csv'), index_col='Id')

### #2.2. Assessing Data

General properties of the training and testing sets:
* Number of samples in train/test dataset: 1460, 1459
* Number of columns in train/test dataset: 80, 79
* Duplicate rows in each dataset: 0
* Datatypes: float64(3), int64(35), object(43). So, at first look we have 37 numeric variables and 43 categorical variables. Actually, there are 19 continuous features (without the target variable), 14 discrete features, 23 nominal features and 23 ordinal features
* Features with missing values: there are 19 columns in the training set with missing values and 33 in the test set
* Number of non-null unique values for features in training set
* Use the `describe` function for the statistics of the dataset:
    * the count, mean, standard deviation and the 5 number summary for each variable.

In [None]:
# take a look at the first 5 rows of the dataset
df_train.head()

In [None]:
# take a look at the first 5 rows of the dataset
df_test.head()

In [None]:
# number of samples and columns
df_train.shape, df_test.shape

In [None]:
# check for duplicates
sum(df_train.duplicated()), sum(df_test.duplicated())

In [None]:
# check the datatypes
df_train.info()

So, at first look we have 37 numeric variables and 43 categorical variables. I'll pay some attention to this, as the variable types are important when modelling our data and making predictions.

#### Features

Let's group our features by variable type so that they are easier to identify: continuous and discrete (numeric), nominal and ordinal (categorical).

In [None]:
# column names
df_train.columns

#### #1.a Continuous Features

In [None]:
# continuous variables
continuous_features = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 
                       '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
                       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal']

df_train[continuous_features].head()

In [None]:
# the number of continuous features
df_test[continuous_features].shape[1]

#### #1.b Discrete Features

In [None]:
# discrete variables
discrete_features = ['YearBuilt', 'YearRemodAdd', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 
                     'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'MoSold', 'YrSold']
# check the filter
df_train[discrete_features].head()

In [None]:
# the number of discrete features
df_test[discrete_features].shape[1]

#### #2.a Nominal Features

In [None]:
# nominal variables
nominal_features = ['MSSubClass', 'MSZoning', 'Street', 'Alley', 'Utilities', 'LotConfig', 'Neighborhood', 'Condition1', 
                    'Condition2', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
                    'Foundation', 'Heating', 'CentralAir', 'Electrical', 'GarageType', 'PavedDrive', 'MiscFeature',
                    'SaleType', 'SaleCondition']

# check the filter
df_train[nominal_features].head()

In [None]:
# the number of nominal features
df_test[nominal_features].shape[1]

#### #2.b Ordinal Features

In [None]:
# ordinal variables
ordinal_features = ['LotShape', 'LandContour', 'LandSlope', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 
                    'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
                    'HeatingQC', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageFinish', 'GarageQual', 'GarageCond', 
                    'PoolQC', 'Fence']

# check the filter
df_train[ordinal_features].head()

In [None]:
# the number of ordinal features
df_test[ordinal_features].shape[1]

In [None]:
# test to check if all the columns are included
# (list.sort() sorts in place and returns None, so compare sorted() copies instead)
sorted(continuous_features + discrete_features + nominal_features + ordinal_features) == sorted(df_test.columns)
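
An equality check only says whether the two lists match. When they do not, it helps to see which names differ. The helper below, `compare_columns`, is a hypothetical sketch of such a report, shown on made-up column lists.

```python
def compare_columns(grouped, actual):
    """Return (missing, extra, duplicated) names between two column lists."""
    grouped_set, actual_set = set(grouped), set(actual)
    missing = sorted(actual_set - grouped_set)   # in the data, not in any group
    extra = sorted(grouped_set - actual_set)     # in a group, not in the data
    duplicated = sorted({c for c in grouped if grouped.count(c) > 1})
    return missing, extra, duplicated

# toy example with made-up column lists
groups = ['LotArea', 'YrSold', 'Street', 'Street']
columns = ['LotArea', 'YrSold', 'Alley']
missing, extra, duplicated = compare_columns(groups, columns)
print(missing, extra, duplicated)
```

In the notebook, the same call could take the concatenated feature lists and `list(df_test.columns)`.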

#### Nulls in training set

In [None]:
# check for missing values
df_train.isnull().sum()[df_train.isnull().sum() > 0]

In [None]:
# the number of columns with nulls
nuls_columns = list(df_train.isnull().sum()[df_train.isnull().sum() > 0].index)
len(nuls_columns)

There are 19 columns with nulls in the training set. Let's identify the variable types that contain nulls, because we have to figure out why these variables were left blank in order to fill them with appropriate values.

We can see that some columns have a really high number of nulls. For example, the `Alley` feature has 1369 nulls and only 91 non-null values. The same goes for `PoolQC`, `Fence`, `MiscFeature` and possibly `FireplaceQu`.

In [None]:
# nuls in continuous features
continuous_nuls = [nul_columns for nul_columns in nuls_columns if nul_columns in continuous_features]
continuous_nuls

In [None]:
# nuls in discrete features
discrete_nuls = [nul_columns for nul_columns in nuls_columns if nul_columns in discrete_features]
discrete_nuls

In [None]:
# nuls in nominal features
nominal_nuls = [nul_columns for nul_columns in nuls_columns if nul_columns in nominal_features]
nominal_nuls

In [None]:
# nuls in ordinal features
ordinal_nuls = [nul_columns for nul_columns in nuls_columns if nul_columns in ordinal_features]
ordinal_nuls

In the training set, we have nulls in 19 columns: 2 continuous, 1 discrete, 5 nominal and 11 ordinal columns. 
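
The four list comprehensions above can also be folded into one helper. The function `nulls_by_group` below is a hypothetical sketch, shown on a toy data frame with made-up values.

```python
import numpy as np
import pandas as pd

def nulls_by_group(df, groups):
    """Map each group name to the columns in it that contain nulls."""
    null_cols = df.columns[df.isnull().any()]
    return {name: [c for c in cols if c in null_cols]
            for name, cols in groups.items()}

# toy frame; the values are made up
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 80.0],
    'LotArea': [8450, 9600, 11250],
    'Alley': [np.nan, 'Grvl', np.nan],
})
toy_groups = {'continuous': ['LotFrontage', 'LotArea'], 'nominal': ['Alley']}
summary = nulls_by_group(toy, toy_groups)
print(summary)
```

Passing a dictionary built from the four feature lists would give the per-type breakdown for either dataset in a single call.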

#### Nulls in testing set

In [None]:
# check for missing values
df_test.isnull().sum()[df_test.isnull().sum() > 0]

The same holds for the testing set: the features `Alley`, `PoolQC`, `Fence`, `MiscFeature` and `FireplaceQu` have lots of nulls. I'll look at these variables to figure out the reason for this.

In [None]:
# the number of variables with nulls
nuls_columns = list(df_test.isnull().sum()[df_test.isnull().sum() > 0].index)
len(nuls_columns)

We can see that there are 14 more variables with nulls in the testing set. Let's identify them.

In [None]:
# nuls in continuous features
continuous_nuls = [nul_columns for nul_columns in nuls_columns if nul_columns in continuous_features]
continuous_nuls

In [None]:
# nuls in discrete features
discrete_nuls = [nul_columns for nul_columns in nuls_columns if nul_columns in discrete_features]
discrete_nuls

In [None]:
# nuls in nominal features
nominal_nuls = [nul_columns for nul_columns in nuls_columns if nul_columns in nominal_features]
nominal_nuls

In [None]:
# nuls in ordinal features
ordinal_nuls = [nul_columns for nul_columns in nuls_columns if nul_columns in ordinal_features]
ordinal_nuls

In the testing set, we have nulls in 33 columns: 7 continuous, 4 discrete, 9 nominal and 13 ordinal columns. 

#### Nulls in Categorical variables 

Looking at the description for each variable I found that 'NA' stands for:

* `Alley`: No alley access
* `GarageType`: No Garage
* `MiscFeature`: None
* `BsmtQual`, `BsmtCond`, `BsmtExposure`, `BsmtFinType1`, `BsmtFinType2`: No Basement
* `FireplaceQu`: No Fireplace
* `PoolQC` (pool quality): No Pool
* `GarageFinish`, `GarageQual`, `GarageCond`: No Garage
* `Fence`: No Fence

These nulls should be replaced with explicit values so that they are accounted for when encoding our features.
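
A small sketch of why these entries show up as nulls, assuming the CSV files store the literal string `NA`: by default `pd.read_csv` parses `NA` as a missing value, while `keep_default_na=False` keeps it as text. The CSV content below is made up for illustration.

```python
import io
import pandas as pd

# tiny made-up CSV with a literal "NA" entry
csv_text = "Id,Alley\n1,NA\n2,Grvl\n3,Pave\n"

# default behaviour: the string "NA" is parsed as a missing value
default = pd.read_csv(io.StringIO(csv_text), index_col='Id')

# keep_default_na=False keeps "NA" as an ordinary string
literal = pd.read_csv(io.StringIO(csv_text), index_col='Id', keep_default_na=False)

# either way, an explicit label makes the category visible to an encoder
filled = default['Alley'].fillna('NoAlley')
print(default['Alley'].isnull().sum(), literal['Alley'].tolist(), filled.tolist())
```

Turning off the default NA parsing for a whole file would also affect numeric columns, so filling the affected categorical columns after loading is usually the simpler route.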


Some other features with nulls require further investigation: `MSZoning`, `Utilities`, `Exterior1st`, `Exterior2nd`, `MasVnrType`, `SaleType`, `KitchenQual`, `Functional` and `Electrical`.

#### Non-null unique values

In [None]:
# let's filter only for discrete, nominal and ordinal features
unique_filter =  discrete_features + nominal_features + ordinal_features

In [None]:
# non-null unique values for ordinal features
df_train[ordinal_features].nunique()

In [None]:
# non-null unique values for nominal features
df_train[nominal_features].nunique()

In [None]:
# non-null unique values for discrete features
df_train[discrete_features].nunique()

In [None]:
df_train[discrete_features].head()

In [None]:
# non-null unique values differences between training and testing set
df_diff_features = df_train[unique_filter].nunique() - df_test[unique_filter].nunique()
df_diff_features = df_diff_features[df_diff_features != 0]
df_diff_features

In [None]:
df_diff_features.plot(kind='barh', figsize=(10, 10));
plt.title('Categorical Features Differences Training/Testing Set')
plt.show()

We can see there are some differences between the unique values from training set to testing set. It's important to assess the differences since we want our model to make predictions on similar data. It is tough to predict values if we don't have training examples.
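
The difference in counts does not say which classes are missing from which set. The hypothetical helper `category_mismatch` below sketches one way to list them, using toy frames with made-up values.

```python
import pandas as pd

def category_mismatch(train, test, feature):
    """Return (only_in_train, only_in_test) class labels for one feature."""
    train_classes = set(train[feature].dropna().unique())
    test_classes = set(test[feature].dropna().unique())
    return sorted(train_classes - test_classes), sorted(test_classes - train_classes)

# toy frames; the values are made up
toy_train = pd.DataFrame({'Utilities': ['AllPub', 'AllPub', 'NoSeWa']})
toy_test = pd.DataFrame({'Utilities': ['AllPub', 'AllPub', None]})
only_train, only_test = category_mismatch(toy_train, toy_test, 'Utilities')
print(only_train, only_test)
```

Classes that appear only in the testing set are the ones a model has never seen, so they deserve the closest look.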

#### Dataset Statistics

In [None]:
# describe the dataset
df_train[continuous_features + ['SalePrice']].describe().T

### #2.3. Cleaning the Data

* The dataset is pretty clean but we do need to fill in the missing values and remove potential outliers that can affect our Linear Model.

#### Fillna for categorical variables

According to the dataset description, a missing value in any of these categorical features means that the property does not have the corresponding item (`None`): 'Alley', 'GarageType', 'MiscFeature', 'FireplaceQu', 'Fence', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'PoolQC', 'GarageFinish', 'GarageQual', 'GarageCond'.

In [None]:
for feat in ['Alley', 'GarageType', 'MiscFeature', 'FireplaceQu', 'Fence']:
    # fill NaNs with an explicit label (assignment avoids chained inplace calls)
    df_train[feat] = df_train[feat].fillna(f'No{feat}')
    df_test[feat] = df_test[feat].fillna(f'No{feat}')
    print(f'{feat}...done')

In [None]:
# test for training set
df_train[['Alley', 'GarageType', 'MiscFeature', 'FireplaceQu', 'Fence']].isnull().sum()

In [None]:
# check for testing set 
df_test[['Alley', 'GarageType', 'MiscFeature', 'FireplaceQu', 'Fence']].isnull().sum()

In [None]:
# fill for no basement
for feat in ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']:
    # fill NaNs
    df_train[feat] = df_train[feat].fillna('NoBasement')
    df_test[feat] = df_test[feat].fillna('NoBasement')
    print(f'{feat}...done')

In [None]:
# test for train
df_train[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum()

In [None]:
# and testing set
df_test[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum()

In [None]:
# fill for no pool
df_train['PoolQC'] = df_train['PoolQC'].fillna('NoPool')
df_test['PoolQC'] = df_test['PoolQC'].fillna('NoPool')

In [None]:
df_train['PoolQC'].isnull().sum(), df_test['PoolQC'].isnull().sum()

In [None]:
# fill for no garage
for feat in ['GarageFinish', 'GarageQual', 'GarageCond']:
    # fill NaNs
    df_train[feat] = df_train[feat].fillna('NoGarage')
    df_test[feat] = df_test[feat].fillna('NoGarage')
    print(f'{feat}...done')

In [None]:
# test for train
df_train[['GarageFinish', 'GarageQual', 'GarageCond']].isnull().sum()

In [None]:
# and testing set
df_test[['GarageFinish', 'GarageQual', 'GarageCond']].isnull().sum()

In [None]:
# check again for missing values
df_test[nominal_features+ordinal_features].isnull().sum()[df_test.isnull().sum() > 0]

In [None]:
# check for missing values
df_train[nominal_features+ordinal_features].isnull().sum()[df_train.isnull().sum() > 0]

We are left now with the above variables to fill in missing values for categorical features. We can see there are more missing values in the testing set than in the training set. Let's see if we can figure out why they are not filled.

After looking online, I found the original version of the dataset [here](http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls), from which I could easily extract the information for `MSZoning` and the other features.
Let's make the changes manually.

#### MSZoning

In [None]:
# let's see the entries
df_test[df_test['MSZoning'].isnull()]

In [None]:
df_test.loc[df_test[df_test['MSZoning'].isnull()].index, 'MSZoning']

In [None]:
# index variable
MSZoning_null_index = list(df_test[df_test['MSZoning'].isnull()].index)

In [None]:
df_test.loc[MSZoning_null_index[0], 'MSZoning'] = 'I'
df_test.loc[MSZoning_null_index[1], 'MSZoning'] = 'A'
df_test.loc[MSZoning_null_index[2], 'MSZoning'] = 'A'
df_test.loc[MSZoning_null_index[3], 'MSZoning'] = 'I'

In [None]:
# test the changes
df_test['MSZoning'].isnull().sum()

In [None]:
df_train['MSZoning'].value_counts()

In [None]:
df_test['MSZoning'].value_counts()

Differences: because no other observation in the training or testing set is zoned Industrial (`I`) or Agricultural (`A`), I will reassign these to:
* I -> RH (Residential High Density)
* A -> RL (Residential Low Density)

In [None]:
# reassign values
df_test.loc[MSZoning_null_index[0], 'MSZoning'] = 'RH'
df_test.loc[MSZoning_null_index[1], 'MSZoning'] = 'RL'
df_test.loc[MSZoning_null_index[2], 'MSZoning'] = 'RL'
df_test.loc[MSZoning_null_index[3], 'MSZoning'] = 'RH'

In [None]:
# plot categorical feature differences between training and testing set
def plot_bar(feature):
    width = 0.35
    counts_test = df_test[feature].value_counts()
    counts_train = df_train[feature].value_counts()
    # align both sets on the same labels; a class absent from one set counts as 0
    labels = counts_test.index.union(counts_train.index)
    heights_test = counts_test.reindex(labels, fill_value=0).values
    heights_train = counts_train.reindex(labels, fill_value=0).values

    ind = np.arange(len(labels))
    locations = ind + width / 2 # xtick locations

    plt.bar(ind, heights_test, width, label='Test')
    plt.bar(ind + width, heights_train, width, label='Train')

    plt.title('{} Bar Chart'.format(feature))
    plt.xlabel('{}'.format(feature))
    plt.ylabel('Count')
    plt.xticks(locations, labels)

    plt.legend()
    plt.show()

In [None]:
plot_bar('MSZoning')

From the above bar chart we can see that the two distributions of the zoning classification are similar.

#### Utilities

In [None]:
df_test[df_test['Utilities'].isnull()]

In [None]:
# index variable
Utilities_null_index = list(df_test[df_test['Utilities'].isnull()].index)
Utilities_null_index

In [None]:
# assign the new values
df_test.loc[Utilities_null_index, 'Utilities'] = 'NoSewr'

In [None]:
# test the changes
df_test['Utilities'].isnull().sum()

In [None]:
plot_bar('Utilities')

In [None]:
df_test['Utilities'].value_counts()

In [None]:
df_train['Utilities'].value_counts()

Here there are some differences between the two sets, with two `NoSewr` values in the testing set and one `NoSeWa` in the training set.

#### Exterior1st

In [None]:
df_test[df_test['Exterior1st'].isnull()]

In [None]:
# index variable
Exterior1st_null_index = df_test[df_test['Exterior1st'].isnull()].index[0]

This observation has more missing values. Let's fill them all.

In [None]:
# reassign values
df_test.loc[Exterior1st_null_index, 'Exterior1st'] ='PreCast'
df_test.loc[Exterior1st_null_index, 'Exterior2nd'] ='PreCast'
df_test.loc[Exterior1st_null_index, 'GarageYrBlt'] ='NoGarage'

In [None]:
df_test[df_test['Exterior2nd'].isnull()].shape

#### SaleType

In [None]:
df_test[df_test['SaleType'].isnull()]

In [None]:
# index variable
SaleType_null_index = df_test[df_test['SaleType'].isnull()].index[0]

In [None]:
df_test.loc[SaleType_null_index, 'SaleType'] ='VWD'

In [None]:
df_test[df_test['SaleType'].isnull()]

In [None]:
df_test['SaleType'].value_counts()

In [None]:
df_train['SaleType'].value_counts()

In [None]:
# I'll put it in the Oth category in order to have similar structure
df_test.loc[SaleType_null_index, 'SaleType'] ='Oth'

In [None]:
plot_bar('SaleType')

#### KitchenQual

In [None]:
df_test[df_test['KitchenQual'].isnull()]

In [None]:
# index variable
KitchenQual_null_index = df_test[df_test['KitchenQual'].isnull()].index[0]

In [None]:
#df_test.loc[KitchenQual_null_index, 'KitchenQual'] = 'Po'
# reassign value to match distributions
df_test.loc[KitchenQual_null_index, 'KitchenQual'] = 'Fa'

In [None]:
df_test['KitchenQual'].value_counts()

In [None]:
df_train['KitchenQual'].value_counts()

#### Functional

In [None]:
df_test[df_test['Functional'].isnull()]

In [None]:
# index variable
Functional_null_index = list(df_test[df_test['Functional'].isnull()].index)

In [None]:
df_test.loc[Functional_null_index[0], 'Functional'] = 'Sev'
df_test.loc[Functional_null_index[1], 'Functional'] = 'Sev'

These were `Sal` (Salvage Only), but in order to keep the same structure as the training set I put them in `Sev` (Severely Damaged).

In [None]:
df_train['Functional'].value_counts()

In [None]:
df_test['Functional'].value_counts()

In [None]:
plot_bar('Functional')

#### Electrical

In [None]:
df_train[df_train['Electrical'].isnull()]

In [None]:
# index variable
Electrical_null_index = df_train[df_train['Electrical'].isnull()].index[0]

In [None]:
df_train.loc[Electrical_null_index, 'Electrical'] = 'SBrkr'

In [None]:
df_train['Electrical'].value_counts()

In [None]:
df_test['Electrical'].value_counts()

#### Nulls in Numeric variables

There are also nulls in continuous variables:
* `LotFrontage`, `MasVnrArea`, `BsmtFinSF1`, `BsmtFinSF2`, `BsmtUnfSF`, `TotalBsmtSF`, `GarageArea`: a candidate fill is the mean value, which leaves the column mean unchanged.

Discrete Null values:
* `BsmtFullBath`, `BsmtHalfBath`, `GarageYrBlt`, `GarageCars`
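
Before settling on a fill value, it can help to compare the usual candidates. The sketch below does this on a small made-up, right-skewed sample; it is an illustration of the trade-off, not the choice made for any particular column in this notebook.

```python
import numpy as np
import pandas as pd

# made-up right-skewed sample with two missing entries
s = pd.Series([50.0, 60.0, 60.0, 70.0, 80.0, 300.0, np.nan, np.nan])

fill_values = {
    'mean': s.mean(),        # pulled upward by the extreme value
    'median': s.median(),    # robust to the extreme value
    'mode': s.mode()[0],     # most frequent observed value
}
for name, value in fill_values.items():
    filled = s.fillna(value)
    print(f'{name:>6}: fill={value:.1f}, mean after fill={filled.mean():.1f}')
```

With a skewed column the mean sits well above the typical value, so the median or mode is often the more conservative fill.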

#### MasVnrType  and MasVnrArea

* MasVnrType - None
* MasVnrArea - 0

In [None]:
df_train[df_train['MasVnrType'].isnull()]

In [None]:
df_test[df_test['MasVnrType'].isnull()]

In [None]:
# index variable
MasVnrType_null_index = list(df_train[df_train['MasVnrType'].isnull()].index)
MasVnrType_null_index

Properties with a `MasVnrType` of `None` have a `MasVnrArea` of `0`.

In [None]:
# assign values
df_train.loc[MasVnrType_null_index, 'MasVnrArea'] = 0
df_train.loc[MasVnrType_null_index, 'MasVnrType'] = 'None'

In [None]:
MasVnrType_null_index = list(df_test[df_test['MasVnrType'].isnull()].index)

df_test.loc[MasVnrType_null_index, 'MasVnrArea'] = 0
df_test.loc[MasVnrType_null_index, 'MasVnrType'] = 'None'

In [None]:
plot_bar('MasVnrType')

#### Continuous Features Nulls

In [None]:
df_test[continuous_features + discrete_features].isnull().sum()[df_test[continuous_features + discrete_features].isnull().sum() > 0]

In [None]:
df_train[continuous_features + discrete_features].isnull().sum()[df_train[continuous_features + discrete_features].isnull().sum() > 0]

#### LotFrontage

In [None]:
# summary statistics for LotFrontage before choosing a fill value
df_train['LotFrontage'].describe()

In [None]:
# drop NaNs first: the histogram range cannot be computed with missing values
plt.hist(df_train['LotFrontage'].dropna(), bins=30, alpha=0.5, label='Train set')
plt.hist(df_test['LotFrontage'].dropna(), bins=30, alpha=0.5, label='Test set')

plt.title("LotFrontage Histogram Train/Test")
plt.xlabel('LotFrontage ($ft$)')
plt.ylabel('Frequency')

plt.legend()
plt.show()

In [None]:
# most frequent value (mode) of LotFrontage in the training set
LotFrontage_null_fill = df_train['LotFrontage'].mode()[0]
# fill nans for training and testing set
df_train['LotFrontage'] = df_train['LotFrontage'].fillna(LotFrontage_null_fill)
df_test['LotFrontage'] = df_test['LotFrontage'].fillna(LotFrontage_null_fill)

In [None]:
df_train['LotFrontage'].isnull().sum(), df_test['LotFrontage'].isnull().sum()

#### GarageYrBlt

In [None]:
df_train[df_train['GarageYrBlt'].isnull()][:10]

In [None]:
GarageYrBlt_null_vals_train = list(df_train[df_train['GarageYrBlt'].isnull()].index)
GarageYrBlt_null_vals_test = list(df_test[df_test['GarageYrBlt'].isnull()].index)

In [None]:
df_train.loc[GarageYrBlt_null_vals_train, 'GarageYrBlt'] = 'NoGarage'
df_test.loc[GarageYrBlt_null_vals_test, 'GarageYrBlt'] = 'NoGarage'

In [None]:
df_train['GarageYrBlt'].isnull().sum(), df_test['GarageYrBlt'].isnull().sum()

In [None]:
df_test[continuous_features + discrete_features].isnull().sum()[df_test[continuous_features + discrete_features].isnull().sum() > 0]

#### GarageArea & GarageCars

In [None]:
df_test[df_test['GarageArea'].isnull()]

In [None]:
GarageArea_null_index = df_test[df_test['GarageArea'].isnull()].index[0]

In [None]:
df_test.loc[GarageArea_null_index, 'GarageCars'] = 0
df_test.loc[GarageArea_null_index, 'GarageArea'] = 0

#### BsmtFinSF1, BsmtFinSF2, BsmtUnfSF,  TotalBsmtSF, BsmtFullBath, BsmtHalfBath

In [None]:
df_test[df_test['BsmtFullBath'].isnull()]

In [None]:
BsmtFullBath_nulls_index = list(df_test[df_test['BsmtFullBath'].isnull()].index)
mask_bsm = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',  'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath']
df_test.loc[BsmtFullBath_nulls_index, mask_bsm] = 0

In [None]:
df_test[mask_bsm].isnull().sum()

In [None]:
# check to see if we cleaned for nulls
df_train.isnull().sum()[df_train.isnull().sum() > 0], df_test.isnull().sum()[df_test.isnull().sum() > 0]

#### Remove Outliers

* Remove from the training set the observations with a `GrLivArea` > 4,000. It is important to delete these values as Linear Regression is sensitive to outliers. 
* We have to be aware that there is an extreme data point in the testing set as well, which we cannot remove because a prediction is required for every test observation.  
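
To see why a handful of extreme points matters for a linear fit, the sketch below compares the least-squares slope with and without two outliers. The data are synthetic and the outlier values are made up; only the qualitative effect is the point.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic area/price data following a clean linear trend (values are made up)
area = rng.uniform(500, 3000, 200)
price = 100 * area + rng.normal(0, 10_000, 200)

# add two very large, unusually cheap properties
area_out = np.append(area, [5000, 5500])
price_out = np.append(price, [150_000, 180_000])

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope_clean = np.polyfit(area, price, 1)[0]
slope_out = np.polyfit(area_out, price_out, 1)[0]
print(f'slope without outliers: {slope_clean:.1f}, with outliers: {slope_out:.1f}')
```

Two points out of about two hundred are enough to pull the slope down noticeably, because squared errors give distant points a large weight.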

In [None]:
# let's see the observations with GrLivArea greater than 4000 
# in the testing set
df_test[df_test['GrLivArea'] > 4000]

In [None]:
# let's see the four observations with GrLivArea greater than 4000 
# in the training set
df_train[df_train['GrLivArea'] > 4000]

In [None]:
# drop the rows with extreme values
df_train.drop(df_train[df_train['GrLivArea'] > 4000].index, inplace=True)

In [None]:
# test the change
df_train[df_train['GrLivArea'] > 4000]

In [None]:
# reset index
df_train = df_train.reset_index(drop=True)
# check the shape of our dataframe
df_train.shape

In [None]:
# rename the index column
df_train.index.names = ['Id'] 

In [None]:
df_train.head()

In [None]:
# save this for later
df_train.to_csv(os.path.join('train_clean.csv'), sep=',', index_label='Id')
df_test.to_csv(os.path.join('test_clean.csv'), sep=',', index_label='Id')

In [None]:
df_train.head()

In [None]:
df_train = pd.read_csv(os.path.join( 
                                             'train_clean.csv'), index_col='Id')
df_test = pd.read_csv(os.path.join( 
                                             'test_clean.csv'), index_col='Id')
df_train.head()

<a id='eda'></a>
## 3. Exploratory Data Analysis

Let’s visualize the information in our dataset by looking at how the variables are distributed and how they correlate with each other.

In [None]:
plt.hist(df_train['SalePrice'], bins=30)
plt.title("Sale Price Histogram")
plt.xlabel('SalePrice ($USD$)')
plt.ylabel('Frequency')

plt.show()

The Sale Price histogram is right skewed, ranging from 34,900 USD to 755,000 USD. The median sale price is 163,000 USD, far closer to the minimum than to the maximum, so the long right tail contains extreme values that can bias our model if we don't deal with them.

In [None]:
plt.hist(np.log(df_train['SalePrice']), bins=30)
plt.title("Sale Price (Log Transformation) Histogram")
plt.xlabel('$log(SalePrice)$ ($USD$)')
plt.ylabel('Frequency')

plt.show()

After applying a log transformation to the Sale Price, the distribution is much closer to normal.
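
The effect can also be checked numerically through the sample skewness. The sketch below uses a synthetic log-normal sample as a stand-in for a right-skewed price column; the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# log-normal sample as a stand-in for a right-skewed price column
prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1000))

skew_raw = prices.skew()
skew_log = np.log(prices).skew()
print(f'skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}')
```

A skewness near zero after the transformation indicates a roughly symmetric distribution; on the real data, the same two calls on `df_train['SalePrice']` would give the corresponding figures.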

In [None]:
plt.hist(df_train['GrLivArea'], bins=30)
plt.title("Above Ground Living Area Square Feet Histogram")
plt.xlabel('GrLivArea ($ft^{2}$)')
plt.ylabel('Frequency')

plt.show()

`GrLivArea` (above ground living area in square feet) ranges from 334 to a maximum of 3,627 $ft^{2}$. From the above histogram we can see that most properties have a living area between 1,128 and 1,775 $ft^{2}$, with a median of 1,458 $ft^{2}$.  

In [None]:
plt.scatter(df_train.GrLivArea, df_train.SalePrice)
plt.title("Above Ground Living Area Square Feet vs Sale Price (training set)")
plt.xlabel('GrLivArea')
plt.ylabel('Sale Price')
plt.show()

With the outliers removed, we can better see the strong linear relationship between Sale Price and Above Ground Living Area.

In [None]:
plt.hist(df_train['GrLivArea'], bins=30, alpha=0.5, label='Train set')
plt.hist(df_test['GrLivArea'], bins=30, alpha=0.5, label='Test set')

plt.title("Above Ground Living Area Square Feet Histogram Train/Test")
plt.xlabel('GrLivArea ($ft^{2}$)')
plt.ylabel('Frequency')

plt.legend()
plt.show()

From the above histogram, we can see that the training and testing set distributions of Above Ground Living Area are similar with the difference that in the testing set there is one extreme value.

#### Sale Price Over Time Period

In [None]:
# group data by year and month
df_time_price = df_train.groupby(['YrSold', 'MoSold'], as_index=False)['SalePrice'].mean()
# see the first rows
df_time_price.head()

We don't have data for the whole of 2010. Therefore, we need to create some placeholder data for visualization purposes and append it to our dataframe.  

In [None]:
mean_2010 = df_time_price[df_time_price['YrSold'] == 2010]['SalePrice'].mean()
mean_2010

In [None]:
# create placeholder rows filled with the 2010 mean value
new_dummy_df2010 = pd.DataFrame(np.array([[2010, 8, mean_2010], [2010, 9, mean_2010], [2010, 10, mean_2010], 
                       [2010, 11, mean_2010], [2010, 12, mean_2010]]),
                   columns=['YrSold', 'MoSold', 'SalePrice'], index=[55, 56, 57, 58, 59])
new_dummy_df2010

In [None]:
# append the new data (DataFrame.append was removed in pandas 2.0)
df_time_price = pd.concat([df_time_price, new_dummy_df2010])

In [None]:
years = list(range(2006,2011))
labels = list(range(1, 13))

In [None]:
plt.figure(figsize=(8, 6))

for year in years:
    plt.plot(labels, df_time_price[df_time_price['YrSold'] == year]['SalePrice'], label=year)

plt.title('Sale Price During Time Period')
plt.xticks(labels)
plt.xlabel('Months')
plt.ylabel('Sale Price')
plt.legend()
plt.show()

The above line plot shows the monthly average `SalePrice` from 2006 to 2010. We don't have data for months 8-12 of 2010, so they are filled with the mean value. The highest monthly average, more than 220,000 USD, was recorded in September 2006, while the lowest was in July 2010. We can see that, typically, the last four months of the year show higher average prices.
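
An alternative to filling the missing months with a constant is to leave them empty, so the plotted line simply stops where the data ends. The sketch below reindexes toy monthly averages (made-up values, a four-month "year" for brevity) onto a complete year by month grid.

```python
import numpy as np
import pandas as pd

# toy monthly averages; months 3 and 4 of 2010 are missing (values are made up)
monthly = pd.DataFrame({
    'YrSold': [2009, 2009, 2009, 2009, 2010, 2010],
    'MoSold': [1, 2, 3, 4, 1, 2],
    'SalePrice': [180_000.0, 175_000.0, 190_000.0, 185_000.0, 170_000.0, 160_000.0],
})

# build the complete year x month grid and reindex onto it
full_index = pd.MultiIndex.from_product([[2009, 2010], range(1, 5)],
                                        names=['YrSold', 'MoSold'])
grid = monthly.set_index(['YrSold', 'MoSold']).reindex(full_index)

# one column per year, one row per month; missing months stay NaN
wide = grid['SalePrice'].unstack('YrSold')
print(wide)
```

Matplotlib skips NaN points when drawing a line, so plotting each column of `wide` leaves a gap for the missing months and avoids suggesting sales that were never observed.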

#### Correlation Matrix

In [None]:
# calculate correlation matrix
corr = df_train[continuous_features + ['SalePrice']].corr()
#Plot figsize
fig, ax = plt.subplots(figsize=(12, 12))
#Generate Color Map
colormap = sns.diverging_palette(220, 10, as_cmap=True)
#Generate Heat Map, allow annotations and place floats in map
g = sns.heatmap(corr, cmap=colormap, annot=True, fmt=".2f")
#Apply xticks
# plt.xticks(range(len(corr.columns)), corr.columns);
# #Apply yticks
# plt.yticks(range(len(corr.columns)), corr.columns)
# #show plot
plt.show()

From the above correlation matrix, we can see that our target variable, `SalePrice`, has a strong positive linear relationship with `GrLivArea`. It also has a moderate positive linear relationship with:  
* `TotalBsmtSF`(0.65)
* `GarageArea`(0.64)
* `1stFlrSF` (0.63) + `2ndFlrSF` (0.3) + `LowQualFinSF` = `GrLivArea` (that's why they correlate with each other)
* `MasVnrArea`(0.47)
* `BsmtFinSF1` (0.4)
* `LotFrontage` (0.34)
* `OpenPorchSF` (0.33)
* `WoodDeckSF` (0.32)

BsmtFinSF2 + BsmtUnfSF = TotalBsmtSF
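
Area identities of this kind can be checked directly on the data. The snippet below is a small sketch on illustrative rows. `toy` and `check_identity` are hypothetical names, and it assumes that the basement total also includes `BsmtFinSF1`. On the real data one would pass `df_train` instead of `toy`.

```python
import pandas as pd

def check_identity(df, parts, total):
    """Return True if the columns in `parts` sum to the column `total` on every row."""
    return bool((df[parts].sum(axis=1) == df[total]).all())

# illustrative rows, not read from the actual file
toy = pd.DataFrame({
    '1stFlrSF': [856, 1262], '2ndFlrSF': [854, 0], 'LowQualFinSF': [0, 0],
    'GrLivArea': [1710, 1262],
    'BsmtFinSF1': [706, 978], 'BsmtFinSF2': [0, 0], 'BsmtUnfSF': [150, 284],
    'TotalBsmtSF': [856, 1262],
})

living_ok = check_identity(toy, ['1stFlrSF', '2ndFlrSF', 'LowQualFinSF'], 'GrLivArea')
# assumption: the basement total is the sum of all three basement components
bsmt_ok = check_identity(toy, ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'], 'TotalBsmtSF')
print(living_ok, bsmt_ok)
```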

In [None]:
imp_cont_features = ['GrLivArea', 'TotalBsmtSF', 'GarageArea', 'MasVnrArea', 'LotFrontage', 
                     'OpenPorchSF', 'WoodDeckSF']

In [None]:
%config InlineBackend.figure_format = 'png'
# named scatter_axes to avoid clashing with the statsmodels alias `sm` used later
scatter_axes = pd.plotting.scatter_matrix(df_train[imp_cont_features + ['SalePrice']],
                                          figsize=(30, 30), diagonal='kde')

for ax in scatter_axes.ravel():
    ax.set_xlabel(ax.get_xlabel(), fontsize=20, rotation=45)
    ax.set_ylabel(ax.get_ylabel(), fontsize=20, rotation=0)
    # offset the y label so the rotated text does not overlap the plot
    ax.yaxis.set_label_coords(-0.5, 0.5)
    # hide all ticks
    ax.set_xticks(())
    ax.set_yticks(())
plt.show()

From the above scatter matrix we can see the pairwise relationships between the features correlated with our target variable, as well as the density plot of each variable on the diagonal.

In [None]:
for feat in imp_cont_features:
    plt.scatter(df_train[feat], np.log(df_train.SalePrice))
    plt.xlabel(feat)
    plt.ylabel('$log(Sale Price)$')
    plt.show()

#### Baseline Model

Let's try to fit a Linear Regression by simply taking into consideration the neighborhood and total square footage:
* Neighborhood
* TotalBsmtSF + GrLivArea

#### TotalSquareFootage

In [None]:
df_train['TotalSquareFootage'] = df_train['GrLivArea'] + df_train['TotalBsmtSF']
df_test['TotalSquareFootage'] = df_test['GrLivArea'] + df_test['TotalBsmtSF']

In [None]:
plt.scatter(df_train['TotalSquareFootage'], np.log(df_train.SalePrice))
plt.xlabel('TotalSquareFootage')
plt.ylabel('$log(Sale Price)$')
plt.show()

In [None]:
df_train['logSalePrice'] = np.log(df_train.SalePrice)

In [None]:
# calculate correlation matrix
corr = df_train[['TotalSquareFootage', 'SalePrice']].corr()
# set the figure size
fig, ax = plt.subplots(figsize=(3, 3))
# diverging color map
colormap = sns.diverging_palette(220, 10, as_cmap=True)
# heat map with the coefficients annotated to two decimals
g = sns.heatmap(corr, cmap=colormap, annot=True, fmt=".2f", ax=ax)
plt.show()

We can see here that `SalePrice` has a strong positive linear relationship with `TotalSquareFootage`, with a correlation coefficient of 0.82. I will use this continuous variable in my first simple model.

### Fit a Simple Linear Model

In [None]:
import statsmodels.api as sm

#### #1 Create an intercept

In [None]:
df_train['intercept'] = 1

In [None]:
X = df_train[['intercept', 'TotalSquareFootage']]
y = df_train['SalePrice']

In [None]:
# fit an OLS model of SalePrice on the intercept and TotalSquareFootage
lm = sm.OLS(y, X)
results = lm.fit()
results.summary()

Based only on the total square footage we get an R-squared of 0.674. This means that 67.4% of the variability in `SalePrice` is explained by `TotalSquareFootage`.
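
For a regression with an intercept and a single predictor, R-squared equals the squared Pearson correlation, which is why a correlation of 0.82 and an R-squared of about 0.67 go together (0.82 squared is roughly 0.67). A small sketch on synthetic data, where every number is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(500, 4000, size=200)                   # synthetic "square footage"
y = 80 * x - 30000 + rng.normal(0, 40000, size=200)    # synthetic "price"

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# least-squares line with an intercept, then R^2 = 1 - SS_res / SS_tot
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(round(r ** 2, 6), round(r_squared, 6))
```

The equality no longer holds once more predictors are added, so it only applies to this first one-variable model.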

In [None]:
# coefficients from the normal equation (intercept first, then slope)
np.dot(np.dot(np.linalg.inv(np.dot(X.transpose(), X)) , X.transpose()), y)

$\hat{y} = 82.80489695x - 31594.19591877$
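
The cell above evaluates the normal equation $\hat{\beta} = (X^TX)^{-1}X^Ty$ by explicitly inverting $X^TX$. As a hedged alternative, `np.linalg.lstsq` solves the same least-squares problem without forming the inverse, which tends to be numerically safer. The data below is synthetic and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
sqft = rng.uniform(500, 4000, size=100)             # synthetic feature
X = np.column_stack([np.ones_like(sqft), sqft])     # intercept column + one feature
y = -30000 + 80 * sqft + rng.normal(0, 10000, size=100)

# normal equation with an explicit inverse
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# same problem through a least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv, beta_lstsq)
```

Both vectors hold the intercept first and the slope second, matching the column order of `X`.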

#### #2 Add Neighborhood Dummies

In [None]:
df_train['Neighborhood'].value_counts()

In [None]:
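# create neighborhood dummies (dtype=int keeps them numeric for statsmodels on newer pandas, where the default is bool)
neighborhood_dummies = pd.get_dummies(df_train['Neighborhood'], dtype=int)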
neighborhood_dummies.head()

### Fit a regression model with Bloomington Heights as Baseline

In [None]:
# select all the columns but the first
neighborhood_columns = list(neighborhood_dummies.columns[1:])
neighborhood_dummies[neighborhood_columns].head()

In [None]:
X = X.join(neighborhood_dummies)
X.head()

In [None]:
y = df_train['SalePrice']

In [None]:
lm2 = sm.OLS(y, X[['intercept', 'TotalSquareFootage'] + neighborhood_columns])
results2 = lm2.fit()
results2.summary()

In [None]:
'{0:.10f}'.format(-4.119e+04)

#### Conclusions for Neighborhood Blmngtn

1. 79.6% of the variability in price is explained by the linear model built on total square footage and neighborhood.
2. For each additional square foot of `TotalSquareFootage`, the price is expected to increase by 62 dollars, holding all other variables constant.
3. We expect a house in NridgHt to cost 75,310 dollars more than a house in Blmngtn, all else being equal.
4. We expect a house in SWISU to cost 41,190 dollars less than a house in Blmngtn, all else being equal.
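
The "more/less than Blmngtn" reading follows from leaving one dummy column out: the omitted category becomes the baseline absorbed by the intercept, and every other coefficient is an offset from it. The sketch below shows this on invented prices for three neighborhoods, using `drop_first=True` as a shortcut for dropping the first column by hand. All numbers and the frame `toy` are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['Blmngtn', 'Blmngtn', 'NridgHt', 'NridgHt', 'SWISU', 'SWISU'],
    'SalePrice': [190000.0, 200000.0, 300000.0, 320000.0, 140000.0, 150000.0],
})

# drop_first=True removes the alphabetically first category (Blmngtn), making it the baseline
dummies = pd.get_dummies(toy['Neighborhood'], drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])
coefs, *_ = np.linalg.lstsq(X, toy['SalePrice'].to_numpy(), rcond=None)

# intercept = baseline mean; the other coefficients are offsets from that mean
intercept, nridght_offset, swisu_offset = coefs
print(intercept, nridght_offset, swisu_offset)
```

With dummies as the only predictors, the intercept is the mean price of the baseline neighborhood and each offset is the difference between a neighborhood's mean and that baseline.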

#### sklearn

* let's use the sklearn Ordinary Least Squares Linear Regression
* fit the logarithm of SalePrice: `logSalePrice`

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [None]:
X.drop(columns=['intercept'], inplace=True)
y = df_train['logSalePrice']

In [None]:
reg = LinearRegression()
# fit training data
reg.fit(X, y)
# get the R^2
reg.score(X, y)

In [None]:
# get the coefficients
reg.coef_

In [None]:
# get the intercept
reg.intercept_

In [None]:
# make predictions
pred = reg.predict(X)

#### Evaluate our base model

Let's calculate the [Root-Mean-Squared-Error (RMSE)](https://en.wikipedia.org/wiki/Root-mean-square_deviation) between the logarithm of the predicted value and the logarithm of the observed sale price.

**Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.**
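
To see why, note that a difference of logs is the log of a ratio, so the metric depends on the relative error rather than on the dollar amount. A small sketch with invented prices (the helper name `rmse_log` is hypothetical):

```python
import numpy as np

def rmse_log(actual, predicted):
    """RMSE computed on log prices."""
    return np.sqrt(np.mean((np.log(predicted) - np.log(actual)) ** 2))

# both predictions are 10% too high, on very different price levels
cheap = rmse_log(np.array([100000.0]), np.array([110000.0]))
expensive = rmse_log(np.array([500000.0]), np.array([550000.0]))

print(cheap, expensive)
```

On the raw price scale the second error would be five times larger (50,000 against 10,000 dollars), while on the log scale the two are identical.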

In [None]:
# calculate RMSE
def rmse(y, pred):
    return np.sqrt(mean_squared_error(y, pred))

In [None]:
# error
rmse(y, pred)

#### Submit Predictions

In [None]:
# load the sample submission file
PATH_TO_DATA = '../input'

sub = pd.read_csv(os.path.join(PATH_TO_DATA, 'sample_submission.csv'), index_col='Id')

In [None]:
df_test.head()

In [None]:
# create neighborhood dummies
neighborhood_dummies_test = pd.get_dummies(df_test['Neighborhood'])
neighborhood_dummies_test.head()

In [None]:
X_test = df_test[['TotalSquareFootage']]
X_test = X_test.join(neighborhood_dummies_test)
X_test.head()
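
The fitted model expects the test matrix to have the same columns, in the same order, as the training matrix. That holds here only if both files contain the same set of neighborhoods. A defensive sketch, on made-up series with hypothetical names, is to reindex the test dummies to the training columns:

```python
import pandas as pd

train_nb = pd.Series(['Blmngtn', 'NridgHt', 'SWISU'])
test_nb = pd.Series(['SWISU', 'Blmngtn'])   # NridgHt never appears in this toy test set

train_dummies = pd.get_dummies(train_nb, dtype=int)
test_dummies = pd.get_dummies(test_nb, dtype=int)

# add missing categories as all-zero columns and enforce the training column order
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

A category that appears only in the test set would be dropped by this reindex, which is the safe default because the model has no coefficient for it.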

In [None]:
# make predictions
pred_test = reg.predict(X_test)
# exponentiate the results
pred_test = np.exp(pred_test)
pred_test[:10]

In [None]:
sub['SalePrice'] = pred_test

In [None]:
plt.hist(pred_test, bins=40);
plt.title('Distribution of SalePrice predictions');

In [None]:
# save the submission file
sub.to_csv('model1.csv')

# reload it to check the contents
model1_sub = pd.read_csv('model1.csv', index_col='Id')
model1_sub.head()  # Kaggle score: 0.19363

<a id="conclusions"></a>
## Conclusions

I fitted the cleaned data to a simple **Linear Regression Model** to obtain a baseline for further improvements. Using only two variables I was able to build a simple model with a **Coefficient of Determination (R Squared)** of about 0.80. I first applied a log transformation to the target variable to bring it closer to a normal distribution, and then fitted the input variables to the linear model. The two variables used in the regression are the total square footage (`TotalBsmtSF` + `GrLivArea`) and the `Neighborhood`, the latter one-hot encoded. The model was evaluated with the Root Mean Squared Error (RMSE) of the log prices, with a value of about 0.17444 on the training set and 0.19363 on the testing set of the Kaggle House Prices Competition.
This is just a baseline model, which leaves plenty of room for improvement and creativity in feature engineering. It uses only two features out of the 79 available in the dataset. Other models, such as XGBoost, CatBoost, LightGBM and ElasticNet, are also worth trying. Stacking the results of these models and tuning their hyperparameters are the next steps towards a second, more complex model with better predictions.