# Predicting House Sale Prices

In this project, we set up a pipeline of functions that lets us iterate quickly on different models. The dataset is the housing data for the city of Ames, Iowa, United States, covering sales from 2006 to 2010. It is available for download [here](http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt), and the accompanying data documentation, which describes the columns in the dataset, is available [here](https://s3.amazonaws.com/dq-content/307/data_description.txt).
The pipeline of functions looks something like this:
![Image](https://s3.amazonaws.com/dq-content/240/pipeline.svg)

# Introduction

In [1]:
# Importing the relevant libraries that are to be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

%matplotlib inline

In [2]:
# Reading the dataset into a dataframe
data = pd.read_csv('AmesHousing.tsv', sep='\t')

In [3]:
# Checking first few rows
data.head()

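| | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 189900 |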


In [4]:
# Checking overview of the columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
Order              2930 non-null int64
PID                2930 non-null int64
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Frontage       2440 non-null float64
Lot Area           2930 non-null int64
Street             2930 non-null object
Alley              198 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         29

From the overview, we can see that there are 2930 rows and 82 columns. The majority of the columns are of the object dtype, while the rest are numeric (either float or int dtypes).

# Creation of Functions in the Pipeline (Skeletal Base)

In [5]:
# Function to transform the feature columns
def transform_features(df):
    return df

In [6]:
# Function to select the appropriate feature columns for training
def select_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

In [7]:
# Function to split the dataset into a training set and a test set,
# instantiate a Linear Regression model, fit it, predict on the test set
# and return the RMSE of the predictions versus the actual labels
def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    # select_features() also returns `SalePrice`, so drop the target from the
    # feature matrices to avoid training the model on the value it predicts
    train_features = select_features(train).drop(columns='SalePrice')
    train_target = train['SalePrice']
    test_features = select_features(test).drop(columns='SalePrice')
    test_target = test['SalePrice']
    model = LinearRegression()
    model.fit(train_features, train_target)
    predictions = model.predict(test_features)
    mse = mean_squared_error(test_target, predictions)
    rmse = np.sqrt(mse)
    return rmse
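
For reference, the RMSE returned by `train_and_test` is the square root of the mean squared error over the test rows. In the notation below (symbols introduced here only for illustration), $n$ is the number of test rows, $y_i$ the actual sale price and $\hat{y}_i$ the predicted one:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

Because of the square root, the RMSE is expressed in the same units as `SalePrice`, which makes it easier to interpret than the raw MSE.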

# Feature Engineering

With the skeleton of the pipeline in place, we can now update each of the 3 functions as necessary. First up is feature engineering, which is handled by the `transform_features` function. Features with many missing values should be removed: with a cutoff of 25%, a column is dropped if more than 25% of its values are missing. Potential categorical features should be explored further as well, along with transformations of the text and numerical columns. Columns that leak information about the sale (e.g. the year the sale happened) have to be removed too. In general, the goal of this function is to:
* remove features that we don't want to use in the model, just based on the number of missing values or data leakage
* transform features into the proper format (numerical to categorical, scaling numerical, filling in missing values, etc)
* create new features by combining other features

## Exploration and Experimentation with Features

### Missing Values

First up, let's look at the missing values in all the columns.

In [8]:
null_values_count = data.isnull().sum()
null_values_count

Order                0
PID                  0
MS SubClass          0
MS Zoning            0
Lot Frontage       490
Lot Area             0
Street               0
Alley             2732
Lot Shape            0
Land Contour         0
Utilities            0
Lot Config           0
Land Slope           0
Neighborhood         0
Condition 1          0
Condition 2          0
Bldg Type            0
House Style          0
Overall Qual         0
Overall Cond         0
Year Built           0
Year Remod/Add       0
Roof Style           0
Roof Matl            0
Exterior 1st         0
Exterior 2nd         0
Mas Vnr Type        23
Mas Vnr Area        23
Exter Qual           0
Exter Cond           0
                  ... 
Bedroom AbvGr        0
Kitchen AbvGr        0
Kitchen Qual         0
TotRms AbvGrd        0
Functional           0
Fireplaces           0
Fireplace Qu      1422
Garage Type        157
Garage Yr Blt      159
Garage Finish      159
Garage Cars          1
Garage Area          1
Garage Qual

In [9]:
null_values_pct = null_values_count[null_values_count != 0]/len(data) * 100
null_values_to_drop = null_values_pct[null_values_pct > 25]
null_values_to_drop

Alley           93.242321
Fireplace Qu    48.532423
Pool QC         99.556314
Fence           80.477816
Misc Feature    96.382253
dtype: float64

The above 5 columns have more than 25% missing values and so, ought to be dropped.
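
As a compact illustration of the 25% cutoff, the sketch below applies the same rule to a made-up toy dataframe (the values are invented and are not taken from the Ames data). It uses `isnull().mean()`, which gives the fraction of missing values per column directly; this is an alternative formulation, not the exact code used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Ames data (values are made up for illustration)
toy = pd.DataFrame({
    "Lot Area": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],   # 75% missing
    "Lot Frontage": [65.0, np.nan, 68.0, 60.0],  # exactly 25% missing
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Keep only the columns strictly above the 25% cutoff
cols_over_cutoff = missing_pct[missing_pct > 25].index.tolist()
print(cols_over_cutoff)
```

Note that a column sitting exactly at 25% is kept, since the rule is "more than 25%".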

In [10]:
columns_to_drop = null_values_to_drop.index.tolist()
columns_to_drop

['Alley', 'Fireplace Qu', 'Pool QC', 'Fence', 'Misc Feature']

### Data Leakage

Next up, columns that leak information about the sale should also be dropped. Data leakage occurs when the data used to train a machine learning algorithm contains information about the target that would not actually be available at the point of prediction. Here, the features that would lead to data leakage are those describing the actual sale: `Mo Sold`, `Yr Sold`, `Sale Type` and `Sale Condition`.

In [11]:
columns_to_drop += ['Mo Sold', 'Yr Sold', 'Sale Type', 'Sale Condition']
columns_to_drop

['Alley',
 'Fireplace Qu',
 'Pool QC',
 'Fence',
 'Misc Feature',
 'Mo Sold',
 'Yr Sold',
 'Sale Type',
 'Sale Condition']

### Irrelevant Columns

Some columns are not useful for machine learning and so, they should be dropped. These columns are usually unique identifiers for each row which do not provide any useful information with regards to the sale price. They are the `Order` and `PID` columns.
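
One rough way to spot identifier-like columns is to look for columns in which every value is distinct. The sketch below does this on a made-up toy dataframe; the uniqueness check is only a heuristic assumed here for illustration (a genuine feature can also happen to be all-distinct), so its output is a list of candidates to review against the data documentation rather than a list to drop blindly.

```python
import pandas as pd

# Toy frame with two identifier columns and one real feature (values are made up)
toy = pd.DataFrame({
    "Order": [1, 2, 3, 4],
    "PID": [526301100, 526350040, 526351010, 526353030],
    "Overall Qual": [6, 5, 6, 7],
    "SalePrice": [215000, 105000, 172000, 244000],
})

# Exclude the target, then flag columns where every row has a distinct value
candidates = toy.drop(columns="SalePrice")
id_like = [col for col in candidates.columns
           if candidates[col].nunique() == len(candidates)]
print(id_like)
```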

In [12]:
columns_to_drop += ['Order', 'PID']
columns_to_drop

['Alley',
 'Fireplace Qu',
 'Pool QC',
 'Fence',
 'Misc Feature',
 'Mo Sold',
 'Yr Sold',
 'Sale Type',
 'Sale Condition',
 'Order',
 'PID']

### Handling Numeric and Text Columns Separately

Before engineering new features from the existing ones, the numeric and text columns need to be handled separately. Text columns are harder to deal with, so any text column with missing values is dropped for now. For the remaining numeric columns, the missing values are imputed with the mean of the column.
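
The two rules can be sketched on a small made-up dataframe as follows. The column values are invented, and the text columns are selected with `exclude="number"` as an assumption to keep the sketch independent of how a given pandas version stores strings.

```python
import numpy as np
import pandas as pd

# Toy frame: one text and one numeric column contain a gap (values are made up)
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "Garage Type": ["Attchd", None, "Detchd", "Attchd"],
    "Lot Frontage": [60.0, np.nan, 80.0, 100.0],
    "Lot Area": [8450, 9600, 11250, 9550],
})

# Text columns: drop any column that has at least one missing value
text_cols = toy.select_dtypes(exclude="number")
text_with_gaps = text_cols.columns[text_cols.isnull().any()]
cleaned = toy.drop(columns=text_with_gaps)

# Numeric columns: fill each gap with the mean of its own column
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())
print(cleaned)
```

Here `Garage Type` is dropped, and the gap in `Lot Frontage` is filled with the mean of the three observed values.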

In [13]:
# Dropping the columns from before first on a copy of the dataframe to avoid making perm changes
# to the original dataframe
copy = data.copy()
copy.drop(columns = columns_to_drop, inplace=True)

In [14]:
text_cols = copy.select_dtypes(include='object')
text_cols.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 36 columns):
MS Zoning         2930 non-null object
Street            2930 non-null object
Lot Shape         2930 non-null object
Land Contour      2930 non-null object
Utilities         2930 non-null object
Lot Config        2930 non-null object
Land Slope        2930 non-null object
Neighborhood      2930 non-null object
Condition 1       2930 non-null object
Condition 2       2930 non-null object
Bldg Type         2930 non-null object
House Style       2930 non-null object
Roof Style        2930 non-null object
Roof Matl         2930 non-null object
Exterior 1st      2930 non-null object
Exterior 2nd      2930 non-null object
Mas Vnr Type      2907 non-null object
Exter Qual        2930 non-null object
Exter Cond        2930 non-null object
Foundation        2930 non-null object
Bsmt Qual         2850 non-null object
Bsmt Cond         2850 non-null object
Bsmt Exposure     2847 non-null obj

In [15]:
# Determining the number of null values in the text columns
text_cols_mv = text_cols.isnull().sum()
text_cols_mv

MS Zoning           0
Street              0
Lot Shape           0
Land Contour        0
Utilities           0
Lot Config          0
Land Slope          0
Neighborhood        0
Condition 1         0
Condition 2         0
Bldg Type           0
House Style         0
Roof Style          0
Roof Matl           0
Exterior 1st        0
Exterior 2nd        0
Mas Vnr Type       23
Exter Qual          0
Exter Cond          0
Foundation          0
Bsmt Qual          80
Bsmt Cond          80
Bsmt Exposure      83
BsmtFin Type 1     80
BsmtFin Type 2     81
Heating             0
Heating QC          0
Central Air         0
Electrical          1
Kitchen Qual        0
Functional          0
Garage Type       157
Garage Finish     159
Garage Qual       159
Garage Cond       159
Paved Drive         0
dtype: int64

In [16]:
# Filtering for text columns with missing values and then dropping these columns
text_columns_to_drop = text_cols_mv[text_cols_mv > 0].index
text_columns_to_drop

Index(['Mas Vnr Type', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
       'BsmtFin Type 1', 'BsmtFin Type 2', 'Electrical', 'Garage Type',
       'Garage Finish', 'Garage Qual', 'Garage Cond'],
      dtype='object')

In [17]:
# Dropping these text columns with the copy of the dataframe
copy.drop(columns = text_columns_to_drop, inplace=True)
copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 60 columns):
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Frontage       2440 non-null float64
Lot Area           2930 non-null int64
Street             2930 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         2930 non-null object
Roof Matl          2930 non-null object
Exterior 1st       2930 non-null object
Exterior 2nd      

With the text columns out of the way, it's time to impute the missing values in the remaining numeric columns with the mean.

In [18]:
numeric_cols = copy.select_dtypes(include=['float', 'int'])
numeric_cols.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 35 columns):
MS SubClass        2930 non-null int64
Lot Frontage       2440 non-null float64
Lot Area           2930 non-null int64
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Mas Vnr Area       2907 non-null float64
BsmtFin SF 1       2929 non-null float64
BsmtFin SF 2       2929 non-null float64
Bsmt Unf SF        2929 non-null float64
Total Bsmt SF      2929 non-null float64
1st Flr SF         2930 non-null int64
2nd Flr SF         2930 non-null int64
Low Qual Fin SF    2930 non-null int64
Gr Liv Area        2930 non-null int64
Bsmt Full Bath     2928 non-null float64
Bsmt Half Bath     2928 non-null float64
Full Bath          2930 non-null int64
Half Bath          2930 non-null int64
Bedroom AbvGr      2930 non-null int64
Kitchen AbvGr      2930 non-null int64
TotRms AbvGrd      

In [19]:
numeric_cols.mean()

MS SubClass            57.387372
Lot Frontage           69.224590
Lot Area            10147.921843
Overall Qual            6.094881
Overall Cond            5.563140
Year Built           1971.356314
Year Remod/Add       1984.266553
Mas Vnr Area          101.896801
BsmtFin SF 1          442.629566
BsmtFin SF 2           49.722431
Bsmt Unf SF           559.262547
Total Bsmt SF        1051.614544
1st Flr SF           1159.557679
2nd Flr SF            335.455973
Low Qual Fin SF         4.676792
Gr Liv Area          1499.690444
Bsmt Full Bath          0.431352
Bsmt Half Bath          0.061134
Full Bath               1.566553
Half Bath               0.379522
Bedroom AbvGr           2.854266
Kitchen AbvGr           1.044369
TotRms AbvGrd           6.443003
Fireplaces              0.599317
Garage Yr Blt        1978.132443
Garage Cars             1.766815
Garage Area           472.819734
Wood Deck SF           93.751877
Open Porch SF          47.533447
Enclosed Porch         23.011604
3Ssn Porch

In [20]:
numeric_cols_imputed = numeric_cols.fillna(numeric_cols.mean())
numeric_cols_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 35 columns):
MS SubClass        2930 non-null int64
Lot Frontage       2930 non-null float64
Lot Area           2930 non-null int64
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Mas Vnr Area       2930 non-null float64
BsmtFin SF 1       2930 non-null float64
BsmtFin SF 2       2930 non-null float64
Bsmt Unf SF        2930 non-null float64
Total Bsmt SF      2930 non-null float64
1st Flr SF         2930 non-null int64
2nd Flr SF         2930 non-null int64
Low Qual Fin SF    2930 non-null int64
Gr Liv Area        2930 non-null int64
Bsmt Full Bath     2930 non-null float64
Bsmt Half Bath     2930 non-null float64
Full Bath          2930 non-null int64
Half Bath          2930 non-null int64
Bedroom AbvGr      2930 non-null int64
Kitchen AbvGr      2930 non-null int64
TotRms AbvGrd      

In [21]:
# Updating back the copy with the imputed values
copy[numeric_cols_imputed.columns] = numeric_cols_imputed

In [22]:
copy.head()

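| | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | ... | Garage Area | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | RL | 141.0 | 31770 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | ... | 528.0 | P | 210 | 62 | 0 | 0 | 0 | 0 | 0 | 215000 |
| 1 | 20 | RH | 80.0 | 11622 | Pave | Reg | Lvl | AllPub | Inside | Gtl | ... | 730.0 | Y | 140 | 0 | 0 | 0 | 120 | 0 | 0 | 105000 |
| 2 | 20 | RL | 81.0 | 14267 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | ... | 312.0 | Y | 393 | 36 | 0 | 0 | 0 | 0 | 12500 | 172000 |
| 3 | 20 | RL | 93.0 | 11160 | Pave | Reg | Lvl | AllPub | Corner | Gtl | ... | 522.0 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 244000 |
| 4 | 60 | RL | 74.0 | 13830 | Pave | IR1 | Lvl | AllPub | Inside | Gtl | ... | 482.0 | Y | 212 | 34 | 0 | 0 | 0 | 0 | 0 | 189900 |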


In [23]:
copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 60 columns):
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Frontage       2930 non-null float64
Lot Area           2930 non-null int64
Street             2930 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         2930 non-null object
Roof Matl          2930 non-null object
Exterior 1st       2930 non-null object
Exterior 2nd      

In [24]:
# Checking for null values after updating
copy.isnull().sum()

MS SubClass        0
MS Zoning          0
Lot Frontage       0
Lot Area           0
Street             0
Lot Shape          0
Land Contour       0
Utilities          0
Lot Config         0
Land Slope         0
Neighborhood       0
Condition 1        0
Condition 2        0
Bldg Type          0
House Style        0
Overall Qual       0
Overall Cond       0
Year Built         0
Year Remod/Add     0
Roof Style         0
Roof Matl          0
Exterior 1st       0
Exterior 2nd       0
Mas Vnr Area       0
Exter Qual         0
Exter Cond         0
Foundation         0
BsmtFin SF 1       0
BsmtFin SF 2       0
Bsmt Unf SF        0
Total Bsmt SF      0
Heating            0
Heating QC         0
Central Air        0
1st Flr SF         0
2nd Flr SF         0
Low Qual Fin SF    0
Gr Liv Area        0
Bsmt Full Bath     0
Bsmt Half Bath     0
Full Bath          0
Half Bath          0
Bedroom AbvGr      0
Kitchen AbvGr      0
Kitchen Qual       0
TotRms AbvGrd      0
Functional         0
Fireplaces   

All the null values in the remaining columns have been dealt with. Now, it's time to engineer some new features based on the existing features.

### Engineer New Features

One possible new feature is `Years Till Remod`, which is the difference between the `Year Remod/Add` and `Year Built` columns.
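
A minimal sketch of this feature on made-up years is shown below. It also illustrates why the result needs a sanity check: a remodel recorded before the construction year yields a negative value, which cannot be right, so such rows are filtered out here with a boolean mask.

```python
import pandas as pd

# Toy years (made up); the second row has a remodel year before the build year
toy = pd.DataFrame({
    "Year Built": [1960, 2002, 2001],
    "Year Remod/Add": [1960, 2001, 2005],
})

# Years elapsed between construction and remodelling
toy["Years Till Remod"] = toy["Year Remod/Add"] - toy["Year Built"]

# Keep only the rows with a non-negative difference
valid = toy[toy["Years Till Remod"] >= 0].reset_index(drop=True)
print(valid)
```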

In [25]:
years_till_remod = copy['Year Remod/Add'] - copy['Year Built']

In [26]:
# Dropping these 2 original columns and adding new column for this
copy['Years Till Remod'] = years_till_remod
copy.drop(columns=['Year Remod/Add', 'Year Built'], inplace=True)
copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 59 columns):
MS SubClass         2930 non-null int64
MS Zoning           2930 non-null object
Lot Frontage        2930 non-null float64
Lot Area            2930 non-null int64
Street              2930 non-null object
Lot Shape           2930 non-null object
Land Contour        2930 non-null object
Utilities           2930 non-null object
Lot Config          2930 non-null object
Land Slope          2930 non-null object
Neighborhood        2930 non-null object
Condition 1         2930 non-null object
Condition 2         2930 non-null object
Bldg Type           2930 non-null object
House Style         2930 non-null object
Overall Qual        2930 non-null int64
Overall Cond        2930 non-null int64
Roof Style          2930 non-null object
Roof Matl           2930 non-null object
Exterior 1st        2930 non-null object
Exterior 2nd        2930 non-null object
Mas Vnr Area        2930 non-null f

In [27]:
# Checking for validity of the new `Years Till Remod` to ensure no negative values,
# otherwise row would be dropped
copy[copy['Years Till Remod'] < 0]

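| | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | ... | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | SalePrice | Years Till Remod |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 850 | 20 | RL | 65.0 | 10739 | Pave | IR1 | Lvl | AllPub | Inside | Gtl | ... | Y | 144 | 40 | 0 | 0 | 0 | 0 | 0 | 203000 | -1 |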


In [28]:
copy.drop(copy[copy['Years Till Remod'] < 0].index, axis=0, inplace=True)
copy.reset_index(drop=True, inplace=True)
copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2929 entries, 0 to 2928
Data columns (total 59 columns):
MS SubClass         2929 non-null int64
MS Zoning           2929 non-null object
Lot Frontage        2929 non-null float64
Lot Area            2929 non-null int64
Street              2929 non-null object
Lot Shape           2929 non-null object
Land Contour        2929 non-null object
Utilities           2929 non-null object
Lot Config          2929 non-null object
Land Slope          2929 non-null object
Neighborhood        2929 non-null object
Condition 1         2929 non-null object
Condition 2         2929 non-null object
Bldg Type           2929 non-null object
House Style         2929 non-null object
Overall Qual        2929 non-null int64
Overall Cond        2929 non-null int64
Roof Style          2929 non-null object
Roof Matl           2929 non-null object
Exterior 1st        2929 non-null object
Exterior 2nd        2929 non-null object
Mas Vnr Area        2929 non-null f

## Updating `transform_features` Function

Now, the `transform_features` function can be updated to incorporate everything we've gone through above.

In [29]:
# Function to transform the feature columns
def transform_features(df):
    copy = df.copy()
    
    null_values_count = df.isnull().sum()
    null_values_pct = null_values_count[null_values_count != 0]/len(df) * 100
    null_values_to_drop = null_values_pct[null_values_pct > 25]
    columns_to_drop = null_values_to_drop.index.tolist()
    columns_to_drop += ['Mo Sold', 'Yr Sold', 'Sale Type', 'Sale Condition']
    columns_to_drop += ['Order', 'PID']
    copy.drop(columns = columns_to_drop, inplace=True)
    
    text_cols = copy.select_dtypes(include='object')
    text_cols_mv = text_cols.isnull().sum()
    text_columns_to_drop = text_cols_mv[text_cols_mv > 0].index
    copy.drop(columns = text_columns_to_drop, inplace=True)
    
    numeric_cols = copy.select_dtypes(include=['float', 'int'])
    numeric_cols_imputed = numeric_cols.fillna(numeric_cols.mean())
    copy[numeric_cols_imputed.columns] = numeric_cols_imputed
    
    years_till_remod = copy['Year Remod/Add'] - copy['Year Built']
    copy['Years Till Remod'] = years_till_remod
    copy.drop(columns=['Year Remod/Add', 'Year Built'], inplace=True)
    copy.drop(copy[copy['Years Till Remod'] < 0].index, axis=0, inplace=True)
    copy.reset_index(drop=True, inplace=True)
    
    return copy

# Feature Selection

With feature engineering out of the way, it's time to look at which features are appropriate to include in the linear regression model. First, let's apply the transformation function just created to the original dataframe to obtain a transformed copy to work with for the feature selection.

In [30]:
transformed_data = transform_features(data)
transformed_data.head()

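| | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | ... | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | SalePrice | Years Till Remod |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | RL | 141.0 | 31770 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | ... | P | 210 | 62 | 0 | 0 | 0 | 0 | 0 | 215000 | 0 |
| 1 | 20 | RH | 80.0 | 11622 | Pave | Reg | Lvl | AllPub | Inside | Gtl | ... | Y | 140 | 0 | 0 | 0 | 120 | 0 | 0 | 105000 | 0 |
| 2 | 20 | RL | 81.0 | 14267 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | ... | Y | 393 | 36 | 0 | 0 | 0 | 0 | 12500 | 172000 | 0 |
| 3 | 20 | RL | 93.0 | 11160 | Pave | Reg | Lvl | AllPub | Corner | Gtl | ... | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 244000 | 0 |
| 4 | 60 | RL | 74.0 | 13830 | Pave | IR1 | Lvl | AllPub | Inside | Gtl | ... | Y | 212 | 34 | 0 | 0 | 0 | 0 | 0 | 189900 | 1 |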


## Investigating Correlation of Numeric Columns with Target Column

It might be worthwhile to model only the numeric features that exhibit a strong linear correlation with the target `SalePrice`.

In [31]:
numeric = transformed_data.select_dtypes(include=['float', 'int'])
correlations_with_price = numeric.corr()['SalePrice']
abs_correlations_with_price = correlations_with_price.abs().sort_values(ascending=True)
abs_correlations_with_price

BsmtFin SF 2        0.005918
Misc Val            0.015683
3Ssn Porch          0.032235
Bsmt Half Bath      0.035792
Low Qual Fin SF     0.037651
Pool Area           0.068410
MS SubClass         0.085021
Overall Cond        0.101655
Screen Porch        0.112181
Kitchen AbvGr       0.119797
Enclosed Porch      0.128758
Bedroom AbvGr       0.143899
Bsmt Unf SF         0.182915
Years Till Remod    0.240129
Lot Area            0.266546
2nd Flr SF          0.269479
Bsmt Full Bath      0.275850
Half Bath           0.285159
Open Porch SF       0.312966
Wood Deck SF        0.327119
Lot Frontage        0.340777
BsmtFin SF 1        0.432867
Fireplaces          0.474722
TotRms AbvGrd       0.495514
Mas Vnr Area        0.505812
Garage Yr Blt       0.510681
Full Bath           0.545594
1st Flr SF          0.621671
Total Bsmt SF       0.632112
Garage Area         0.640374
Garage Cars         0.647851
Gr Liv Area         0.706801
Overall Qual        0.799268
SalePrice           1.000000
Name: SalePric

As a cutoff, columns with an absolute correlation of less than 0.4 with `SalePrice` are removed. Columns with such a relatively low correlation coefficient are likely to have poor predictive capability with respect to the sale price.
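
The cutoff can be illustrated on a tiny made-up dataframe in which one feature tracks the price closely and another is essentially unrelated to it. The values are invented for this sketch; only the 0.4 threshold comes from the text above.

```python
import pandas as pd

# Toy data (made up): living area tracks price, Misc Val does not
toy = pd.DataFrame({
    "Gr Liv Area": [1000, 1500, 2000, 2500, 3000],
    "Misc Val": [0, 500, 0, 500, 100],
    "SalePrice": [100000, 150000, 210000, 240000, 310000],
})

# Absolute Pearson correlation of each feature with the target
abs_corr = toy.corr()["SalePrice"].abs().drop("SalePrice")

# Features below the 0.4 cutoff are dropped
weak = abs_corr[abs_corr < 0.4].index.tolist()
kept = toy.drop(columns=weak)
print(weak)
```

Pearson correlation only measures linear association, so a feature with a strong but non-linear relationship to the price could still fall below the cutoff.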

In [32]:
transformed_data.drop(abs_correlations_with_price[abs_correlations_with_price < 0.4].index, axis=1, inplace=True)
transformed_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2929 entries, 0 to 2928
Data columns (total 38 columns):
MS Zoning        2929 non-null object
Street           2929 non-null object
Lot Shape        2929 non-null object
Land Contour     2929 non-null object
Utilities        2929 non-null object
Lot Config       2929 non-null object
Land Slope       2929 non-null object
Neighborhood     2929 non-null object
Condition 1      2929 non-null object
Condition 2      2929 non-null object
Bldg Type        2929 non-null object
House Style      2929 non-null object
Overall Qual     2929 non-null int64
Roof Style       2929 non-null object
Roof Matl        2929 non-null object
Exterior 1st     2929 non-null object
Exterior 2nd     2929 non-null object
Mas Vnr Area     2929 non-null float64
Exter Qual       2929 non-null object
Exter Cond       2929 non-null object
Foundation       2929 non-null object
BsmtFin SF 1     2929 non-null float64
Total Bsmt SF    2929 non-null float64
Heating          

## Categorical Columns

Columns that hold nominal variables are candidates for conversion to the categorical data type. Referencing the data documentation and the columns currently in `transformed_data`, a list of such candidates is created. The list also includes ordinal variables that are stored as text rather than in numeric form.

In [33]:
cat_candidates = ['MS Zoning', 'Street', 'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config',
                  'Land Slope', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style',
                  'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Exter Qual', 'Exter Cond',
                  'Foundation', 'Heating', 'Heating QC', 'Central Air', 'Kitchen Qual', 'Functional',
                  'Paved Drive']
len(cat_candidates)

25

In [34]:
## Calculating the number of unique values in each of the candidate for categorical columns
unique_val_counts = transformed_data[cat_candidates].apply(lambda col: len(col.unique())).sort_values()
unique_val_counts

Street           2
Central Air      2
Paved Drive      3
Utilities        3
Land Slope       3
Lot Shape        4
Land Contour     4
Exter Qual       4
Lot Config       5
Kitchen Qual     5
Heating QC       5
Bldg Type        5
Exter Cond       5
Heating          6
Foundation       6
Roof Style       6
MS Zoning        7
Roof Matl        8
Functional       8
House Style      8
Condition 2      8
Condition 1      9
Exterior 1st    16
Exterior 2nd    17
Neighborhood    28
dtype: int64

In [35]:
# Dropping the categorical columns that have more than 10 unique values to avoid creating
# too many dummy columns when applying pd.get_dummies()
to_drop = unique_val_counts[unique_val_counts > 10].index
transformed_data.drop(columns=to_drop, inplace=True)
transformed_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2929 entries, 0 to 2928
Data columns (total 35 columns):
MS Zoning        2929 non-null object
Street           2929 non-null object
Lot Shape        2929 non-null object
Land Contour     2929 non-null object
Utilities        2929 non-null object
Lot Config       2929 non-null object
Land Slope       2929 non-null object
Condition 1      2929 non-null object
Condition 2      2929 non-null object
Bldg Type        2929 non-null object
House Style      2929 non-null object
Overall Qual     2929 non-null int64
Roof Style       2929 non-null object
Roof Matl        2929 non-null object
Mas Vnr Area     2929 non-null float64
Exter Qual       2929 non-null object
Exter Cond       2929 non-null object
Foundation       2929 non-null object
BsmtFin SF 1     2929 non-null float64
Total Bsmt SF    2929 non-null float64
Heating          2929 non-null object
Heating QC       2929 non-null object
Central Air      2929 non-null object
1st Flr SF       

At a glance, there appear to be no remaining columns that are numerical but need to be encoded as categorical, i.e. columns whose numbers carry no semantic meaning. With that, we can now proceed to convert the text columns to categorical columns before creating the dummy columns.
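
To show what dummy coding produces, here is a small sketch on a made-up dataframe with two text columns and one numeric column. Selecting the text columns with `exclude="number"` is an assumption made for this toy example.

```python
import pandas as pd

# Toy frame (made up): two text columns and one numeric column
toy = pd.DataFrame({
    "Central Air": ["Y", "N", "Y"],
    "Paved Drive": ["Y", "P", "N"],
    "Overall Qual": [6, 5, 7],
})

# One indicator column is created per (column, category) pair
text_cols = toy.select_dtypes(exclude="number").columns
dummies = pd.get_dummies(toy[text_cols])

# Replace the original text columns with their dummy columns
encoded = pd.concat([toy.drop(columns=text_cols), dummies], axis=1)
print(encoded.columns.tolist())
```

Each category becomes its own column, which is why columns with many unique values were dropped beforehand: a column with 28 categories alone would add 28 dummy columns.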

In [36]:
## Select just the remaining text columns and convert to categorical
text_cols = transformed_data.select_dtypes(include=['object'])
for col in text_cols:
    transformed_data[col] = transformed_data[col].astype('category')
    
## Create dummy columns and add back to the dataframe
transformed_data = pd.concat([
    transformed_data, 
    pd.get_dummies(transformed_data.select_dtypes(include=['category']))
], axis=1).drop(text_cols,axis=1)

## Updating the `select_features` Function

With the experimentation above, the logic can now be incorporated into a single function for selecting the features to be trained with the Linear Regression model.

In [37]:
# Function to select the appropriate feature columns for training
def select_features(df):
    # Work on a copy so that the dataframe passed in is not modified
    df = df.copy()

    # Drop numeric columns that correlate weakly with the target
    numeric = df.select_dtypes(include=['float', 'int'])
    abs_correlations_with_price = numeric.corr()['SalePrice'].abs()
    df = df.drop(columns=abs_correlations_with_price[abs_correlations_with_price < 0.4].index)

    cat_candidates = ['MS Zoning', 'Street', 'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config',
                  'Land Slope', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style',
                  'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Exter Qual', 'Exter Cond',
                  'Foundation', 'Heating', 'Heating QC', 'Central Air', 'Kitchen Qual', 'Functional',
                  'Paved Drive']
    # Drop categorical candidates with more than 10 unique values
    unique_val_counts = df[cat_candidates].apply(lambda col: len(col.unique()))
    to_drop = unique_val_counts[unique_val_counts > 10].index
    df = df.drop(columns=to_drop)

    # Convert the remaining text columns to categorical and dummy code them
    text_cols = df.select_dtypes(include=['object']).columns
    for col in text_cols:
        df[col] = df[col].astype('category')

    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))],
                   axis=1).drop(columns=text_cols)

    return df

In [38]:
# Trying out the newly created function to see if the output dataframe consists of all numeric columns
try_df = transform_features(data)
try_df = select_features(try_df)
try_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2929 entries, 0 to 2928
Columns: 129 entries, Overall Qual to Paved Drive_Y
dtypes: float64(6), int64(7), uint8(116)
memory usage: 629.4 KB


# Training and Testing

## Updating the `train_and_test` function

Finally, the `train_and_test` function has to be updated to support different types of validation, with an input `k` controlling which one is used. If `k=0`, holdout validation is applied: the model is trained on the first 1460 rows and tested on the rest. If `k=1`, simple cross validation is applied: the rows are shuffled and split into two parts, the model is trained and tested twice with the two parts swapping roles, and the two RMSE values are averaged. If `k` is 2 or greater, K-Fold cross validation with `k` folds is applied.

In [39]:
def train_and_test(df, k=1):
    features = df.columns.drop("SalePrice")
    lr = LinearRegression()

    if k < 0:
        raise ValueError('parameter k can only take non-negative integers')

    if k == 0:
        # Holdout validation: train on the first 1460 rows, test on the rest
        train = df[:1460]
        test = df[1460:]

        lr.fit(train[features], train["SalePrice"])
        predictions = lr.predict(test[features])
        mse = mean_squared_error(test["SalePrice"], predictions)
        rmse = np.sqrt(mse)

        return rmse

    if k == 1:
        # Randomize *all* rows (frac=1) from `df` and split the shuffled copy
        shuffled_df = df.sample(frac=1)
        train = shuffled_df[:1460]
        test = shuffled_df[1460:]

        # Train on the first part, test on the second
        lr.fit(train[features], train["SalePrice"])
        predictions_one = lr.predict(test[features])

        mse_one = mean_squared_error(test["SalePrice"], predictions_one)
        rmse_one = np.sqrt(mse_one)

        # Swap the roles of the two parts
        lr.fit(test[features], test["SalePrice"])
        predictions_two = lr.predict(train[features])

        mse_two = mean_squared_error(train["SalePrice"], predictions_two)
        rmse_two = np.sqrt(mse_two)

        avg_rmse = np.mean([rmse_one, rmse_two])
        print(rmse_one)
        print(rmse_two)
        return avg_rmse
    else:
        # K-Fold cross validation with k shuffled folds
        kf = KFold(n_splits=k, shuffle=True)
        rmse_values = []
        for train_index, test_index in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            lr.fit(train[features], train["SalePrice"])
            predictions = lr.predict(test[features])
            mse = mean_squared_error(test["SalePrice"], predictions)
            rmse = np.sqrt(mse)
            rmse_values.append(rmse)
        print(rmse_values)
        avg_rmse = np.mean(rmse_values)
        return avg_rmse
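
The K-Fold branch above shuffles the rows without a fixed seed, so the reported RMSE changes slightly from run to run. The sketch below shows one way to make the folds reproducible by passing `random_state` to `KFold`. It runs on synthetic data generated just for this example, and the helper name `kfold_rmse` is hypothetical rather than part of the pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (made up): price grows linearly with living area, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(100, 1))
y = 100 * X[:, 0] + rng.normal(0, 10000, size=100)

def kfold_rmse(X, y, k, seed):
    """Average RMSE of a linear regression over k shuffled folds."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    rmses = []
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], preds)))
    return float(np.mean(rmses))

# The same seed gives the same folds, and therefore the same average RMSE
first = kfold_rmse(X, y, k=5, seed=1)
second = kfold_rmse(X, y, k=5, seed=1)
print(first == second)
```

With a fixed seed, differences in RMSE between two runs can be attributed to changes in the features rather than to a different random split.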

# Conclusion

With all 3 functions of the pipeline in place, we can put them to use on the dataset as follows:

In [40]:
transformed_df = transform_features(data)
selected_df = select_features(transformed_df)
rmse = train_and_test(selected_df, k=20)
rmse

[37166.19311978606, 25817.606110268895, 27110.59703648638, 54730.665840547, 25938.311728749475, 39544.94026158685, 29892.577885212704, 21462.328067954135, 24390.99889098194, 28175.434587323845, 23825.08998285168, 27048.781307713158, 29954.620541176482, 21644.842547397086, 26508.226927559084, 32194.971893092195, 25810.964487721085, 56369.59874995934, 19402.89370046144, 24614.852038269775]


30080.22478525493