# Introduction

We will be working with housing data for the city of Ames, Iowa, United States from 2006 to 2010.

The data set can be found [here](https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627) and the full data description can be found [here](https://s3.amazonaws.com/dq-content/307/data_description.txt).

In [1]:
# import necessary modules and read in data set
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

with open('AmesHousing.tsv') as f:
    df = pd.read_csv(f, delimiter='\t')

In [2]:
def transform_features(df):
    return df

def selected_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):

    # split train and test data sets
    train = df[:1460]
    test = df[1460:]
    
    # selecting only numerical columns
    numeric_train = train.select_dtypes(include=['int', 'float'])
    numeric_test = test.select_dtypes(include=['int', 'float'])
        
    # selecting features for model
    features = numeric_train.columns.drop('SalePrice')
    
    # instantiate the model class
    lr = LinearRegression()
    
    # train model using all numerical columns
    lr.fit(train[features], train['SalePrice'])
    
    # test model on test set
    predictions = lr.predict(test[features])
    
    # calculate and return RMSE of model
    mse = mean_squared_error(test['SalePrice'], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

transformed_df = transform_features(df)
filtered_df = selected_features(transformed_df)
rmse = train_and_test(filtered_df)

rmse

57088.25161263909
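
For reference, the RMSE returned by `train_and_test()` is the square root of the mean squared prediction error, so it is expressed in the same units as `SalePrice`. A minimal sketch with made-up prices (the values are hypothetical, not taken from the Ames data):

```python
import numpy as np

# Hypothetical sale prices and model predictions (illustrative values only)
actual = np.array([200000.0, 150000.0, 300000.0])
predicted = np.array([210000.0, 140000.0, 330000.0])

# Squared error per house, averaged, then square-rooted
errors = predicted - actual
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

print(rmse)
```

Because the errors are squared before averaging, the single 30,000 miss pulls the RMSE well above the two 10,000 misses.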

## Feature Engineering

We will now define a function to clean and prepare the data before we feed it into the model. The general goals of this function are to:
- remove features that we don't want to use in the model, based on the number of missing values (5% threshold) or on data leakage
- transform features into the proper format (numerical to categorical, scaling numerical, filling in missing values, etc.)
- create new features by combining other features

We will handle missing values by:
- all columns:
    - dropping columns with more than 5% missing values **for now**
- text columns:
    - dropping columns with 1 or more missing values **for now**
- numeric columns:
    - filling in missing values with column mode **for now**
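
The three rules above can be sketched on a small toy frame before applying them to the real data. The column names and values below are hypothetical and chosen only so that each rule fires once:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 40 rows (names and values are illustrative)
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 10 + [1.0] * 30,                  # 25% missing
    "text_with_gap": pd.Series([None] + ["a"] * 39, dtype=object),  # text, 1 missing
    "numeric_with_gap": [np.nan] + [0.0] * 30 + [5.0] * 9,          # 2.5% missing
    "complete": list(range(40)),
})

# 1. drop any column with more than 5% missing values
missing = toy.isnull().sum()
toy = toy.drop(missing[missing > len(toy) * 0.05].index, axis=1)

# 2. drop text columns that still have one or more missing values
text_missing = toy.select_dtypes(include="object").isnull().sum()
toy = toy.drop(text_missing[text_missing >= 1].index, axis=1)

# 3. fill the remaining numeric gaps with each column's mode
modes = toy.mode().iloc[0]
toy = toy.fillna(modes)

print(toy.columns.tolist())
print(int(toy.isnull().sum().sum()))
```

Only `numeric_with_gap` and `complete` survive, and the single gap is filled with the column's most frequent value.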

In [3]:
# creating series that counts sum of missing data in each column
num_missing = df.isnull().sum()

# creating filter to find columns with >5% missing values
dropped_missing_cols = num_missing[(num_missing > len(df)/20)]

# dropping the columns from data
df = df.drop(dropped_missing_cols.index, axis=1)

In [4]:
# creating series to identify text columns with missing values
text_cols = df.select_dtypes(include='object').isnull().sum()

# creating filter to identify columns with missing values
text_missing_cols = text_cols[(text_cols >= 1)]

# dropping columns that have 1 or more missing data
df = df.drop(text_missing_cols.index, axis=1)

In [5]:
# identifying numeric columns to be filled-in
num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
fixable_numeric_cols

BsmtFin SF 1       1
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
Garage Cars        1
Garage Area        1
Bsmt Full Bath     2
Bsmt Half Bath     2
Mas Vnr Area      23
dtype: int64

In [6]:
# calculating modes for missing values replacements
replacement_value_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
replacement_value_dict

{'Bsmt Full Bath': 0.0,
 'Bsmt Half Bath': 0.0,
 'Bsmt Unf SF': 0.0,
 'BsmtFin SF 1': 0.0,
 'BsmtFin SF 2': 0.0,
 'Garage Area': 0.0,
 'Garage Cars': 2.0,
 'Mas Vnr Area': 0.0,
 'Total Bsmt SF': 0.0}

In [7]:
# filling missing values
df = df.fillna(replacement_value_dict)

In [8]:
# verifying no more missing values
df.isnull().sum().value_counts()

0    64
dtype: int64

Now, we will create some new features that better capture the information in some of the existing features.

In [9]:
years_sold = df["Yr Sold"] - df["Year Built"]
years_sold[years_sold < 0]

2180   -1
dtype: int64

In [10]:
years_since_remod = df["Yr Sold"] - df["Year Remod/Add"]
years_since_remod[years_since_remod < 0]

1702   -1
2180   -2
2181   -1
dtype: int64

In [11]:
# creating new features
df['Years Before Sale'] = years_sold
df['Years Since Remod'] = years_since_remod

# dropping rows with negative value for new features
df = df.drop([1702, 2180, 2181], axis=0)

# no longer needing original columns, so remove to avoid collinearity
df = df.drop(['Year Built', 'Year Remod/Add'], axis=1)
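
The rows with negative values are dropped above by their index labels, which only works for this exact file. A possible alternative, shown here as a sketch on hypothetical toy values, is to filter with a boolean mask so that no labels need to be hard-coded:

```python
import pandas as pd

# Toy rows using the same column names as the data set (values are made up)
toy = pd.DataFrame({
    "Yr Sold": [2008, 2007, 2009],
    "Year Built": [1990, 2008, 2005],
    "Year Remod/Add": [2000, 2009, 2005],
})
toy["Years Before Sale"] = toy["Yr Sold"] - toy["Year Built"]
toy["Years Since Remod"] = toy["Yr Sold"] - toy["Year Remod/Add"]

# keep only rows where both derived features are non-negative
valid = (toy["Years Before Sale"] >= 0) & (toy["Years Since Remod"] >= 0)
toy = toy[valid]

print(toy.index.tolist())
```

The second toy row is sold before it is built, so it is the only one removed.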

We are also going to drop columns for the following reasons:
- columns that aren't useful for machine learning
- columns that leak data about the final sale

In [12]:
# dropping columns that aren't useful for machine learning
df = df.drop(['PID', 'Order'], axis=1)

# dropping columns that leak data about final sale
df = df.drop(['Yr Sold', 'Mo Sold', 'Sale Condition', 'Sale Type'], axis=1)

Compiling the steps taken for feature engineering, we will update the 'transform_features' function.

In [13]:
def transform_features(df):
    num_missing = df.isnull().sum()
    dropped_missing_cols = num_missing[(num_missing > len(df)/20)]
    df = df.drop(dropped_missing_cols.index, axis=1)
    
    text_cols = df.select_dtypes(include='object').isnull().sum()
    text_missing_cols = text_cols[(text_cols >= 1)]
    df = df.drop(text_missing_cols.index, axis=1)
    
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_value_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_value_dict)
    
    years_sold = df["Yr Sold"] - df["Year Built"]
    years_since_remod = df["Yr Sold"] - df["Year Remod/Add"]
    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod

    df = df.drop([1702, 2180, 2181], axis=0)
    df = df.drop(['Year Built', 'Year Remod/Add', 'PID', 'Order', 'Yr Sold', 'Mo Sold', 'Sale Condition', 'Sale Type'], axis=1)
    
    return df

def selected_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):

    # split train and test data sets
    train = df[:1460]
    test = df[1460:]
    
    # selecting only numerical columns
    numeric_train = train.select_dtypes(include=['int', 'float'])
    numeric_test = test.select_dtypes(include=['int', 'float'])
        
    # selecting features for model
    features = numeric_train.columns.drop('SalePrice')
    
    # instantiate the model class
    lr = LinearRegression()
    
    # train model using all numerical columns
    lr.fit(train[features], train['SalePrice'])
    
    # test model on test set
    predictions = lr.predict(test[features])
    
    # calculate and return RMSE of model
    mse = mean_squared_error(test['SalePrice'], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

df = pd.read_csv('AmesHousing.tsv', delimiter='\t') # note that dataset has to be read in again
transformed_df = transform_features(df)
filtered_df = selected_features(transformed_df)
rmse = train_and_test(filtered_df)

rmse

55275.36731241307

## Feature Selection

After cleaning and transforming many features in the data set, we will now try to improve the model by selecting specific numerical features.

In [14]:
numerical_df = transformed_df.select_dtypes(include=['int', 'float'])
numerical_df.head()

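| | MS SubClass | Lot Area | Overall Qual | Overall Cond | Mas Vnr Area | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | 1st Flr SF | ... | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | SalePrice | Years Before Sale | Years Since Remod |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | 31770 | 6 | 5 | 112.0 | 639.0 | 0.0 | 441.0 | 1080.0 | 1656 | ... | 210 | 62 | 0 | 0 | 0 | 0 | 0 | 215000 | 50 | 50 |
| 1 | 20 | 11622 | 5 | 6 | 0.0 | 468.0 | 144.0 | 270.0 | 882.0 | 896 | ... | 140 | 0 | 0 | 0 | 120 | 0 | 0 | 105000 | 49 | 49 |
| 2 | 20 | 14267 | 6 | 6 | 108.0 | 923.0 | 0.0 | 406.0 | 1329.0 | 1329 | ... | 393 | 36 | 0 | 0 | 0 | 0 | 12500 | 172000 | 52 | 52 |
| 3 | 20 | 11160 | 7 | 5 | 0.0 | 1065.0 | 0.0 | 1045.0 | 2110.0 | 2110 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 244000 | 42 | 42 |
| 4 | 60 | 13830 | 5 | 5 | 0.0 | 791.0 | 0.0 | 137.0 | 928.0 | 928 | ... | 212 | 34 | 0 | 0 | 0 | 0 | 0 | 189900 | 13 | 12 |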


In [15]:
# compute the absolute correlation of each numerical feature with SalePrice
abs_corr_coeff = numerical_df.corr()['SalePrice'].abs().sort_values()
abs_corr_coeff

BsmtFin SF 2         0.006127
Misc Val             0.019273
3Ssn Porch           0.032268
Bsmt Half Bath       0.035875
Low Qual Fin SF      0.037629
Pool Area            0.068438
MS SubClass          0.085128
Overall Cond         0.101540
Screen Porch         0.112280
Kitchen AbvGr        0.119760
Enclosed Porch       0.128685
Bedroom AbvGr        0.143916
Bsmt Unf SF          0.182751
Lot Area             0.267520
2nd Flr SF           0.269601
Bsmt Full Bath       0.276258
Half Bath            0.284871
Open Porch SF        0.316262
Wood Deck SF         0.328183
BsmtFin SF 1         0.439284
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.506983
Years Since Remod    0.534985
Full Bath            0.546118
Years Before Sale    0.558979
1st Flr SF           0.635185
Garage Area          0.641425
Total Bsmt SF        0.644012
Garage Cars          0.648361
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice            1.000000
Name: SalePrice, dtype: float64

We set an arbitrary threshold of 0.4 as our minimum correlation, and select only the features whose absolute correlation with `SalePrice` is above 0.4.
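
Taking the absolute value matters because a strongly negative correlation is just as useful to a linear model as a strongly positive one (an older house tends to sell for less, for example). A small sketch on hypothetical toy data, where the column names and numbers are invented for illustration:

```python
import pandas as pd

# Hypothetical toy data: price rises with area and falls with age
toy = pd.DataFrame({
    "area":  [1000, 1500, 2000, 2500, 3000],
    "age":   [50, 40, 30, 20, 10],
    "noise": [3, 1, 2, 1, 3],
    "price": [100, 150, 200, 250, 300],
})

corr = toy.corr()["price"]

# keep features whose absolute correlation with the target exceeds 0.4
strong = corr[corr.abs() > 0.4].drop("price")

print(strong.index.tolist())
```

Filtering on the raw (signed) correlation instead would discard `age`, even though it tracks the target as closely as `area` does.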

In [16]:
abs_corr_coeff[abs_corr_coeff > 0.4]

BsmtFin SF 1         0.439284
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.506983
Years Since Remod    0.534985
Full Bath            0.546118
Years Before Sale    0.558979
1st Flr SF           0.635185
Garage Area          0.641425
Total Bsmt SF        0.644012
Garage Cars          0.648361
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice            1.000000
Name: SalePrice, dtype: float64

In [17]:
transformed_df = transformed_df.drop(abs_corr_coeff[abs_corr_coeff < 0.4].index, axis=1)

We will now turn our attention to the categorical data that we have yet to pick out and process from the data set. All columns that can be categorized as nominal variables are possible candidates, but there are some other things to think about:
- if a categorical column has too many unique values, converting it to dummy variables will add a large number of dummy columns to the model, which might not be ideal.
- if a categorical column has the majority of its values in one category (e.g. >95%), it will cause issues similar to numerical data with low variance, in that there will be little to no variability in the data for the model to capture.
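
The cells that follow act only on the first point. For the second point, one possible check is sketched below, assuming a hypothetical helper `dominant_share` and made-up toy columns; it measures the fraction of rows that fall into the most frequent category:

```python
import pandas as pd

# Hypothetical toy columns (names and values are illustrative)
toy = pd.DataFrame({
    "street": ["Pave"] * 98 + ["Grvl"] * 2,             # 98% in one category
    "zoning": ["RL"] * 60 + ["RM"] * 30 + ["FV"] * 10,  # 60% in one category
})

def dominant_share(col):
    """Fraction of rows that fall in the most frequent category."""
    return col.value_counts(normalize=True).iloc[0]

shares = toy.apply(dominant_share)

# columns where a single category covers more than 95% of the rows
low_variance_cols = shares[shares > 0.95].index.tolist()

print(low_variance_cols)
```

Columns flagged this way could then be dropped in the same manner as the high-cardinality columns.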

In [18]:
# from the documentation, pick out the nominal variables
nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

In [19]:
# select categorical candidates from the transformed data set after correlation test
transform_cat_cols = []
for cols in nominal_features:
    if cols in transformed_df.columns:
        transform_cat_cols.append(cols)
        
# check number of categories each candidate include
uniqueness_count = transformed_df[transform_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()
uniqueness_count

Street           2
Central Air      2
Land Contour     4
Lot Config       5
Bldg Type        5
Roof Style       6
Foundation       6
Heating          6
MS Zoning        7
Condition 2      8
House Style      8
Roof Matl        8
Condition 1      9
Exterior 1st    16
Exterior 2nd    17
Neighborhood    28
dtype: int64

In [20]:
# select arbitrary cutoff at 10, drop features that have too many categories
drop_hetero_cols = uniqueness_count[uniqueness_count > 10].index
transformed_df = transformed_df.drop(drop_hetero_cols, axis=1)

In [21]:
remaining_cat = []
for cols in nominal_features:
    if cols in transformed_df.columns:
        remaining_cat.append(cols)
transformed_df[remaining_cat].dtypes

MS Zoning       object
Street          object
Land Contour    object
Lot Config      object
Condition 1     object
Condition 2     object
Bldg Type       object
House Style     object
Roof Style      object
Roof Matl       object
Foundation      object
Heating         object
Central Air     object
dtype: object

In [22]:
remaining_text = []
for cols in remaining_cat:
    if cols in transformed_df.select_dtypes(include='object').columns:
        remaining_text.append(cols)
remaining_text

['MS Zoning',
 'Street',
 'Land Contour',
 'Lot Config',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Foundation',
 'Heating',
 'Central Air']

In [23]:
# convert remaining text columns into categorical dtypes and add to transformed_df
for col in remaining_text:
    transformed_df[col] = transformed_df[col].astype('category')
    
# get dummies for categorical columns in the main dataframe
transformed_df = pd.concat([
    transformed_df,
    pd.get_dummies(transformed_df.select_dtypes(include=['category']))
], axis=1).drop(remaining_text, axis=1)
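
To make the effect of the dummy conversion concrete, here is a minimal sketch on a three-row toy frame (the values are hypothetical; only the column names come from the data set):

```python
import pandas as pd

# Toy frame with one text column and the target (values are made up)
toy = pd.DataFrame({
    "Central Air": ["Y", "N", "Y"],
    "SalePrice": [200000, 120000, 180000],
})
toy["Central Air"] = toy["Central Air"].astype("category")

# one indicator column is created per category, then the original is dropped
dummies = pd.get_dummies(toy.select_dtypes(include=["category"]))
encoded = pd.concat([toy, dummies], axis=1).drop("Central Air", axis=1)

print(encoded.columns.tolist())
```

A column with n categories becomes n indicator columns, which is why columns with many unique values were dropped first.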

Update selected_features()

In [24]:
def transform_features(df):
    num_missing = df.isnull().sum()
    dropped_missing_cols = num_missing[(num_missing > len(df)/20)]
    df = df.drop(dropped_missing_cols.index, axis=1)
    
    text_cols = df.select_dtypes(include='object').isnull().sum()
    text_missing_cols = text_cols[(text_cols >= 1)]
    df = df.drop(text_missing_cols.index, axis=1)
    
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_value_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_value_dict)
    
    years_sold = df["Yr Sold"] - df["Year Built"]
    years_since_remod = df["Yr Sold"] - df["Year Remod/Add"]
    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod

    df = df.drop([1702, 2180, 2181], axis=0)
    df = df.drop(['Year Built', 'Year Remod/Add', 'PID', 'Order', 'Yr Sold', 'Mo Sold', 'Sale Condition', 'Sale Type'], axis=1)
    
    return df

def selected_features(df, corr_coeff_threshold=0.4, uniqueness_threshold=10):
    numerical_df = df.select_dtypes(include=['int', 'float'])
    abs_corr_coeff = numerical_df.corr()['SalePrice'].abs().sort_values()
    df = df.drop(abs_corr_coeff[abs_corr_coeff < corr_coeff_threshold].index, axis=1)
    
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    transform_cat_cols = []
    for cols in nominal_features:
        if cols in df.columns:
            transform_cat_cols.append(cols)
        
    uniqueness_count = df[transform_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()
    drop_hetero_cols = uniqueness_count[uniqueness_count > uniqueness_threshold].index
    df = df.drop(drop_hetero_cols, axis=1)
    
    remaining_text = df.select_dtypes(include='object').columns
    for col in remaining_text:
        df[col] = df[col].astype('category')
    
    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(remaining_text, axis=1)
    return df

def train_and_test(df):

    # split train and test data sets
    train = df[:1460]
    test = df[1460:]
    
    # selecting only numerical columns
    numeric_train = train.select_dtypes(include=['int', 'float'])
    numeric_test = test.select_dtypes(include=['int', 'float'])
        
    # selecting features for model
    features = numeric_train.columns.drop('SalePrice')
    
    # instantiate the model class
    lr = LinearRegression()
    
    # train model using all numerical columns
    lr.fit(train[features], train['SalePrice'])
    
    # test model on test set
    predictions = lr.predict(test[features])
    
    # calculate and return RMSE of model
    mse = mean_squared_error(test['SalePrice'], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

df = pd.read_csv('AmesHousing.tsv', delimiter='\t') # note that dataset has to be read in again
transform_df = transform_features(df)
filtered_df = selected_features(transform_df)
rmse = train_and_test(filtered_df)

rmse

36623.53562910476

We find that the RMSE with the modified feature selection function is 36623.54, an improvement over the previous model (feature engineering only), which had an RMSE of 55275.37.

## Train and Test

Now we will explore how the resulting RMSE changes when we use k-fold cross validation during model testing, and how it varies with the number of folds (the k-value).

We will be using the dataframe that has been processed via feature engineering and selection.
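
As a reminder of what k-fold does, the sketch below splits eight stand-in rows into four folds with scikit-learn's `KFold`. Each row lands in the test set exactly once. The row count and the `random_state` seed are arbitrary choices made for this illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(8)  # stand-in for 8 rows of a data set
kf = KFold(n_splits=4, shuffle=True, random_state=0)  # seed is an arbitrary choice

test_rows_seen = []
fold_sizes = []
for train_index, test_index in kf.split(rows):
    # each fold holds out 2 rows for testing and trains on the other 6
    print("train:", train_index, "test:", test_index)
    test_rows_seen.extend(test_index.tolist())
    fold_sizes.append((len(train_index), len(test_index)))
```

The model is fitted once per fold and the per-fold RMSE values are averaged.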

In [25]:
filtered_df.shape

(2927, 130)

In [26]:
filtered_df.dtypes.value_counts()

uint8      116
int64        9
float64      5
dtype: int64

In [27]:
# training and testing the data when k=0 (i.e. holdout validation)

numeric_filtered = filtered_df.select_dtypes(include=["float", "int"])

train = numeric_filtered[:1460]
test = numeric_filtered[1460:]

features = train.columns.drop("SalePrice")

lr = LinearRegression()

lr.fit(train[features], train["SalePrice"])
predictions = lr.predict(test[features])
mse = mean_squared_error(test["SalePrice"], predictions)
rmse = np.sqrt(mse)
rmse

36623.53562910476

In [28]:
# set k=1 and perform simple cross validation:
# train on one half and test on the other, then swap the halves
# shuffle *all* rows (frac=1) of `filtered_df` before splitting
shuffled_df = filtered_df.sample(frac=1)

numeric_filtered = shuffled_df.select_dtypes(include=["float", "int"])

fold_one = numeric_filtered[:1460]
fold_two = numeric_filtered[1460:]

lr = LinearRegression()
lr.fit(fold_one[features], fold_one["SalePrice"])
predictions_one = lr.predict(fold_two[features])
mse_one = mean_squared_error(fold_two["SalePrice"], predictions_one)
rmse_one = np.sqrt(mse_one)

lr.fit(fold_two[features], fold_two["SalePrice"])
predictions_two = lr.predict(fold_one[features])
mse_two = mean_squared_error(fold_one["SalePrice"], predictions_two)
rmse_two = np.sqrt(mse_two)

average_rmse = np.mean([rmse_one, rmse_two])
average_rmse

34208.65606772392

In [29]:
# for k greater than 1, we will implement KFold cross validation
from sklearn.model_selection import KFold

numeric_filtered = filtered_df.select_dtypes(include=["float","int"])

kf = KFold(n_splits=4, shuffle=True)
# choose arbitrary number of folds k=4
rmse_values = []
for train_index, test_index in kf.split(numeric_filtered):
    train = numeric_filtered.iloc[train_index]
    test = numeric_filtered.iloc[test_index]
    lr.fit(train[features], train["SalePrice"])
    predictions = lr.predict(test[features])
    mse = mean_squared_error(test["SalePrice"], predictions)
    rmse = np.sqrt(mse)
    rmse_values.append(rmse)
avg_rmse = np.mean(rmse_values)
avg_rmse

32850.43443848478
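
Because `KFold` is created with `shuffle=True` and no seed, the folds differ on every run, so the average RMSE will vary slightly between executions. A minimal sketch of a reproducible variant follows; it uses synthetic data rather than the Ames data, and the helper `kfold_rmse` and the seed value are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (not the Ames data): y = 3x + noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=100)

def kfold_rmse(X, y, k, seed):
    """Average RMSE over k shuffled folds, using a fixed seed."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    rmses = []
    for train_index, test_index in kf.split(X):
        lr = LinearRegression()
        lr.fit(X[train_index], y[train_index])
        predictions = lr.predict(X[test_index])
        rmses.append(np.sqrt(mean_squared_error(y[test_index], predictions)))
    return np.mean(rmses)

# the same seed yields the same folds, hence the same average RMSE
first = kfold_rmse(X, y, k=4, seed=1)
second = kfold_rmse(X, y, k=4, seed=1)
print(first == second)
```

Fixing the seed makes it easier to compare different values of k, since any change in RMSE then comes from the number of folds and not from a different shuffle.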

Update train_and_test()

In [30]:
def transform_features(df):
    num_missing = df.isnull().sum()
    dropped_missing_cols = num_missing[(num_missing > len(df)/20)]
    df = df.drop(dropped_missing_cols.index, axis=1)
    
    text_cols = df.select_dtypes(include='object').isnull().sum()
    text_missing_cols = text_cols[(text_cols >= 1)]
    df = df.drop(text_missing_cols.index, axis=1)
    
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_value_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_value_dict)
    
    years_sold = df["Yr Sold"] - df["Year Built"]
    years_since_remod = df["Yr Sold"] - df["Year Remod/Add"]
    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod

    df = df.drop([1702, 2180, 2181], axis=0)
    df = df.drop(['Year Built', 'Year Remod/Add', 'PID', 'Order', 'Yr Sold', 'Mo Sold', 'Sale Condition', 'Sale Type'], axis=1)
    
    return df

def selected_features(df, corr_coeff_threshold=0.4, uniqueness_threshold=10):
    numerical_df = df.select_dtypes(include=['int', 'float'])
    abs_corr_coeff = numerical_df.corr()['SalePrice'].abs().sort_values()
    df = df.drop(abs_corr_coeff[abs_corr_coeff < corr_coeff_threshold].index, axis=1)
    
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    transform_cat_cols = []
    for cols in nominal_features:
        if cols in df.columns:
            transform_cat_cols.append(cols)
        
    uniqueness_count = df[transform_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()
    drop_hetero_cols = uniqueness_count[uniqueness_count > uniqueness_threshold].index
    df = df.drop(drop_hetero_cols, axis=1)
    
    remaining_text = df.select_dtypes(include='object').columns
    for col in remaining_text:
        df[col] = df[col].astype('category')
    
    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(remaining_text, axis=1)
    return df

def train_and_test(df, k=4):
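    # selecting only numerical columns; note that 'int' and 'float' do not
    # match uint8, so the dummy columns from selected_features() are left out
    numeric_df = df.select_dtypes(include=["int", "float"])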
    # instantiate the model class
    lr = LinearRegression()
    features = numeric_df.columns.drop("SalePrice")
    
    if k==0:
        train = numeric_df[:1460]
        test = numeric_df[1460:]
        
        lr.fit(train[features], train["SalePrice"])
        predictions = lr.predict(test[features])
        mse = mean_squared_error(test["SalePrice"], predictions)
        rmse = np.sqrt(mse)
        
        return rmse

    if k==1:
        shuffled_df = numeric_df.sample(frac=1)

        fold_one = shuffled_df[:1460]
        fold_two = shuffled_df[1460:]
        
        lr.fit(fold_one[features], fold_one["SalePrice"])
        predictions_one = lr.predict(fold_two[features])
        mse_one = mean_squared_error(fold_two["SalePrice"], predictions_one)
        rmse_one = np.sqrt(mse_one)

        lr.fit(fold_two[features], fold_two["SalePrice"])
        predictions_two = lr.predict(fold_one[features])
        mse_two = mean_squared_error(fold_one["SalePrice"], predictions_two)
        rmse_two = np.sqrt(mse_two)
        
        average_rmse = np.mean([rmse_one, rmse_two])
        return average_rmse
    
    else:
        from sklearn.model_selection import KFold

        kf = KFold(n_splits=k, shuffle=True)
        rmse_values = []
        for train_index, test_index in kf.split(numeric_df):
            # index into the function's own frame, not the global `numeric_filtered`
            train = numeric_df.iloc[train_index]
            test = numeric_df.iloc[test_index]
            
            lr.fit(train[features], train["SalePrice"])
            predictions = lr.predict(test[features])
            mse = mean_squared_error(test["SalePrice"], predictions)
            rmse = np.sqrt(mse)
            rmse_values.append(rmse)
        avg_rmse = np.mean(rmse_values)
        return avg_rmse

df = pd.read_csv('AmesHousing.tsv', delimiter='\t') # note that dataset has to be read in again
transform_df = transform_features(df)
filtered_df = selected_features(transform_df)
rmse = train_and_test(filtered_df)

rmse

33396.97370534905

## Conclusion

We constructed multiple linear regression models to estimate house sale prices.

As we refined the model, the root mean squared error (RMSE), which we used as the measure of model accuracy, kept falling: from 57088.25 for the baseline to 55275.37 after feature engineering and 36623.54 after feature selection.

Because linear models accept only numerical inputs, we focused on the numerical variables and applied transformations that aim to reduce the bias and error introduced to the model. Among the text variables, we considered only the nominal (categorical) ones and kept only those with few enough unique categories (at most 10) to be converted into dummy variables.

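Finally, we looked at how the model is validated. In this instance, we chose to conduct cross validation with the K-Fold technique, which averages the RMSE over several train/test splits. This gives a more reliable estimate of the error than a single holdout split; it changes how the model is evaluated rather than tuning the model itself.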