# Predicting House Sale Prices

## Introduction

In this project, we'll work with housing data for the city of Ames, Iowa, United States from 2006 to 2010. Information about the data collection can be found [here](https://doi.org/10.1080/10691898.2011.11889627), and information about the different columns in the data [here](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt).

Let's start by setting up a pipeline of functions that will let us quickly iterate on different models.

In [1]:
# Import the libraries
import pandas as pd
pd.options.display.max_columns = 999 # to avoid displaying truncated output
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

In [2]:
# Read in the data as DataFrame
df = pd.read_csv('AmesHousing.tsv', delimiter='\t')

In [3]:
def transform_features(df):
    return df

def select_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):
    train = df[0:1460]
    test = df[1460:]
    
    # Select only numerical columns
    numeric_train = train.select_dtypes(include=['int', 'float'])
    numeric_test = test.select_dtypes(include=['int', 'float'])
    
    # Assign features and target columns
    features = numeric_train.columns.drop('SalePrice')
    target = 'SalePrice'
    
    lr = LinearRegression()
    lr.fit(train[features], train[target])
    predictions = lr.predict(test[features])
    
    mse = mean_squared_error(test[target], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

In [4]:
# Run our initial functions
transformed_df = transform_features(df)
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)
rmse

57088.25161263909

## Feature Engineering

Let's now start removing features with many missing values, diving deeper into potential categorical features, and transforming text and numerical columns.

We'll handle missing values as follows:

1. `All columns`: Drop any with 5% or more missing values **for now**.
2. `Text columns`: Drop any with 1 or more missing values **for now**.
3. `Numerical columns`: Fill in the missing values using the most popular value for that column.

**Step 1:** Drop `all columns` with 5% or more missing values **for now**.

In [5]:
# Total number of missing values for each column
all_mv_counts = df.isnull().sum()

# Columns containing >5% missing values
drop_missing_cols = all_mv_counts[all_mv_counts > len(df)*0.05]

# Drop those columns from the df DataFrame
df = df.drop(drop_missing_cols.index, axis=1)

**Step 2:** Drop any `text column` with 1 or more missing values **for now**.

In [6]:
# Total number of missing values for text columns
text_mv_counts = df.select_dtypes(include=['object']).isnull().sum()

# Text columns containing any missing values
drop_missing_cols_2 = text_mv_counts[text_mv_counts > 0]

# Drop those columns from the df DataFrame
df = df.drop(drop_missing_cols_2.index, axis=1)

**Step 3:** Fill in the missing values in the `numerical columns` using the most popular value for that column.

In [7]:
# Total number of missing values for numerical columns
numeric_mv_counts = df.select_dtypes(include=['int', 'float']).isnull().sum()

# Numerical columns containing any missing values
fixable_numeric_cols = numeric_mv_counts[numeric_mv_counts > 0]

# Fill in 'NaN' values with the most common value for each numerical column
df = df.fillna(df[fixable_numeric_cols.index].mode().iloc[0])

In [8]:
# Check that every column has 0 missing values
df.isnull().sum().value_counts()

0    64
dtype: int64

The columns `Year Built` and `Year Remod/Add` represent the year of construction and the year of remodelling, respectively.


We'll use these columns to create new features to show how many years have passed since the construction and remodelling.
The data of these two new columns should be *positive* values.

In [9]:
years_sold = df['Yr Sold'] - df['Year Built']
years_sold[years_sold < 0]

2180   -1
dtype: int64

In [10]:
years_since_remod = df['Yr Sold'] - df['Year Remod/Add']
years_since_remod[years_since_remod < 0]

1702   -1
2180   -2
2181   -1
dtype: int64

Let's create the new features and drop the rows with negative values.

In [11]:
# Create new columns
df['Years Before Sale'] = years_sold
df['Years Since Remod'] = years_since_remod

# Drop rows with negative values for both these new columns
df = df.drop([1702, 2180, 2181], axis=0)

# Drop the columns used to create the new columns. No longer need them.
df = df.drop(['Year Built', 'Year Remod/Add'], axis=1)

We'll also remove columns that aren't useful for *Machine Learning*, and columns that leak data about the final sale.

In [12]:
# Drop columns that aren't useful for ML
df = df.drop(['PID', 'Order'], axis=1)

# Drop columns that leak info about the final sale
df = df.drop(['Mo Sold', 'Sale Condition', 'Sale Type', 'Yr Sold'], axis=1)

Now let's update `transform_features()` function.

In [13]:
def transform_features(df):
    
    # Drop any column with 5% or more missing values
    all_mv_counts = df.isnull().sum()
    drop_missing_cols = all_mv_counts[all_mv_counts > len(df)*0.05]
    df = df.drop(drop_missing_cols.index, axis=1)
    
    # Drop any text columns with 1 or more missing values
    text_mv_counts = df.select_dtypes(include=['object']).isnull().sum()
    drop_missing_cols_2 = text_mv_counts[text_mv_counts > 0]
    df = df.drop(drop_missing_cols_2.index, axis=1)
    
    # Fill in 'NaN' values with the most common value for each numerical column
    numeric_mv_counts = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = numeric_mv_counts[numeric_mv_counts > 0]
    df = df.fillna(df[fixable_numeric_cols.index].mode().iloc[0])
    
    # Create new features and drop rows with negative values for both columns
    df['Years Before Sale'] = df['Yr Sold'] - df['Year Built']
    df['Years Since Remod'] = df['Yr Sold'] - df['Year Remod/Add']
    df = df.drop([1702, 2180, 2181], axis=0)
    
    # Drop columns no longer needed, not useful for ML or that leak info about the final sale
    df = df.drop(['PID', 'Order', 'Mo Sold', 'Sale Condition', 'Sale Type', 'Year Built', 'Year Remod/Add'], axis=1)
    
    return df

In [14]:
# Run our functions after updating 'transform_features' function
df = pd.read_csv("AmesHousing.tsv", delimiter="\t")

transformed_df = transform_features(df)
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df)
rmse

55275.36731241307

## Feature Selection

Now that we have cleaned and transformed a lot of the features in the dataset, we'll move on to feature selection for numerical features.

In [15]:
# Select numerical columns only
numerical_df = transformed_df.select_dtypes(include=['int', 'float'])
numerical_df.head()

Unnamed: 0,MS SubClass,Lot Area,Overall Qual,Overall Cond,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Garage Cars,Garage Area,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Yr Sold,SalePrice,Years Before Sale,Years Since Remod
0,20,31770,6,5,112.0,639.0,0.0,441.0,1080.0,1656,0,0,1656,1.0,0.0,1,0,3,1,7,2,2.0,528.0,210,62,0,0,0,0,0,2010,215000,50,50
1,20,11622,5,6,0.0,468.0,144.0,270.0,882.0,896,0,0,896,0.0,0.0,1,0,2,1,5,0,1.0,730.0,140,0,0,0,120,0,0,2010,105000,49,49
2,20,14267,6,6,108.0,923.0,0.0,406.0,1329.0,1329,0,0,1329,0.0,0.0,1,1,3,1,6,0,1.0,312.0,393,36,0,0,0,0,12500,2010,172000,52,52
3,20,11160,7,5,0.0,1065.0,0.0,1045.0,2110.0,2110,0,0,2110,1.0,0.0,2,1,3,1,8,2,2.0,522.0,0,0,0,0,0,0,0,2010,244000,42,42
4,60,13830,5,5,0.0,791.0,0.0,137.0,928.0,928,701,0,1629,0.0,0.0,2,1,3,1,6,1,2.0,482.0,212,34,0,0,0,0,0,2010,189900,13,12


In [16]:
# Compute absolute values of correlation coefficients between features and target column
abs_corr_coeffs = numerical_df.corr()['SalePrice'].abs().sort_values()
abs_corr_coeffs

BsmtFin SF 2         0.006127
Misc Val             0.019273
Yr Sold              0.030358
3Ssn Porch           0.032268
Bsmt Half Bath       0.035875
Low Qual Fin SF      0.037629
Pool Area            0.068438
MS SubClass          0.085128
Overall Cond         0.101540
Screen Porch         0.112280
Kitchen AbvGr        0.119760
Enclosed Porch       0.128685
Bedroom AbvGr        0.143916
Bsmt Unf SF          0.182751
Lot Area             0.267520
2nd Flr SF           0.269601
Bsmt Full Bath       0.276258
Half Bath            0.284871
Open Porch SF        0.316262
Wood Deck SF         0.328183
BsmtFin SF 1         0.439284
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.506983
Years Since Remod    0.534985
Full Bath            0.546118
Years Before Sale    0.558979
1st Flr SF           0.635185
Garage Area          0.641425
Total Bsmt SF        0.644012
Garage Cars          0.648361
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice 

In [17]:
# Keep only columns with a correlation coefficient above 0.4
abs_corr_coeffs[abs_corr_coeffs > 0.4]

BsmtFin SF 1         0.439284
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.506983
Years Since Remod    0.534985
Full Bath            0.546118
Years Before Sale    0.558979
1st Flr SF           0.635185
Garage Area          0.641425
Total Bsmt SF        0.644012
Garage Cars          0.648361
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice            1.000000
Name: SalePrice, dtype: float64

In [18]:
# Drop columns with a correlation coefficient below 0.4
transformed_df = transformed_df.drop(abs_corr_coeffs[abs_corr_coeffs < 0.4].index, axis=1)

Columns that can be categorised as nominal variables are candidates for being converted to categorical. Let's check the [documentation](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) and look for nominal columns.

In [19]:
# Nominal columns to be converted to categorical 
nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

Next, let's find which columns are currently numerical but need to be encoded as categorical instead.

We'll also check how many unique values we have in each categorical column and keep those features with up to 10 unique values.

In [20]:
# Numerical columns that need to be enconded as categorical
transform_cat_cols = []
for col in nominal_features:
    if col in transformed_df.columns:
        transform_cat_cols.append(col)
        
# Check how many unique values in each categorical column
uniqueness_counts = transformed_df[transform_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()

# Keep categorical columns with up to 10 unique values
drop_cols = uniqueness_counts[uniqueness_counts > 10].index
transformed_df = transformed_df.drop(drop_cols, axis=1)

Let's select just the remaining text columns, convert them to categorical columns, dummy code these columns and add them back to the dataframe.

In [21]:
# Select text columns and convert to categorical
text_cols = transformed_df.select_dtypes(include=['object'])
for col in text_cols:
    transformed_df[col] = transformed_df[col].astype('category')
    
# Create dummy columns and add them back to the data frame
dummy_cols = pd.get_dummies(transformed_df.select_dtypes(include=['category']))
transformed_df = pd.concat([transformed_df, dummy_cols], axis=1)

# Drop the original text columns from the data frame
transformed_df = transformed_df.drop(text_cols, axis=1)

Now let's update `select_features()` function.

In [22]:
def select_features(df, coeff_threshold=0.4, uniq_threshold=10):
    
    # Select numerical columns only
    # Compute absolute values of correlation coefficients between features and target column
    # Drop columns with a correlation coefficient below 'coeff_threshold'
    numerical_df = df.select_dtypes(include=['int', 'float'])
    abs_corr_coeffs = numerical_df.corr()['SalePrice'].abs().sort_values()
    df = df.drop(abs_corr_coeffs[abs_corr_coeffs < coeff_threshold].index, axis=1)
    
    # Nominal columns to be converted to categorical 
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    
    # Numerical columns that need to be enconded as categorical
    transform_cat_cols = []
    for col in nominal_features:
        if col in df.columns:
            transform_cat_cols.append(col)
        
    # Check how many unique values in each categorical column
    uniqueness_counts = df[transform_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()

    # Keep categorical columns with up to 'uniq_threshold' unique values
    drop_cols = uniqueness_counts[uniqueness_counts > uniq_threshold].index
    df = df.drop(drop_cols, axis=1)
    
    # Select text columns and convert to categorical
    text_cols = df.select_dtypes(include=['object'])
    for col in text_cols:
        df[col] = df[col].astype('category')
    
    # Create dummy columns and add them back to the data frame
    dummy_cols = pd.get_dummies(df.select_dtypes(include=['category']))
    df = pd.concat([df, dummy_cols], axis=1)

    # Drop the original text columns from the data frame
    df = df.drop(text_cols, axis=1)
    
    return df

## Train and Test

Now for the final part of the pipeline, training and testing.

Let's add a parameter named **`k`** that controls the type of cross-validation that occurs.

- When **`k`** equals **`0`**, perform *holdout validation*.
- When **`k`** equals **`1`**, perform *simple cross-validation*.
- When **`k`** is greater than **`1`**, implement *k-fold cross-validation* using **`k`** folds.

In [23]:
def train_and_test(df, k=0):
    numeric_df = df.select_dtypes(include=['int', 'float'])
    features = numeric_df.columns.drop("SalePrice")
    target = 'SalePrice'
    lr = LinearRegression()
    
    if k == 0:
        train = df[0:1460]
        test = df[1460:]

        lr.fit(train[features], train[target])
        predictions = lr.predict(test[features])
        mse = mean_squared_error(test[target], predictions)
        rmse = np.sqrt(mse)

        return rmse
    
    if k == 1:
        # Randomize all rows (frac=1) from 'df' and return it
        shuffled_df = df.sample(frac=1)
        train = df[0:1460]
        test = df[1460:]
        
        lr.fit(train[features], train[target])
        predictions_one = lr.predict(test[features])        
        
        mse_one = mean_squared_error(test[target], predictions_one)
        rmse_one = np.sqrt(mse_one)
        
        lr.fit(test[features], test[target])
        predictions_two = lr.predict(train[features])        
       
        mse_two = mean_squared_error(train[target], predictions_two)
        rmse_two = np.sqrt(mse_two)
        
        avg_rmse = np.mean([rmse_one, rmse_two])
        print(rmse_one)
        print(rmse_two)
        return avg_rmse
    else:
        kf = KFold(n_splits=k, shuffle=True)
        rmse_values = []
        for train_index, test_index, in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            lr.fit(train[features], train["SalePrice"])
            predictions = lr.predict(test[features])
            mse = mean_squared_error(test["SalePrice"], predictions)
            rmse = np.sqrt(mse)
            rmse_values.append(rmse)
        print(rmse_values)
        avg_rmse = np.mean(rmse_values)
        return avg_rmse

Let's run our functions after all updates we've made using **`k`**=0, **`k`**=1 and **`k`**=4.

In [24]:
df = pd.read_csv("AmesHousing.tsv", delimiter="\t")

# k=0
transformed_df = transform_features(df)
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df, k=0)

rmse

36623.53562910476

In [25]:
# k=1
transformed_df = transform_features(df)
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df, k=1)

rmse

36623.53562910476
30924.751476605


33774.143552854875

In [26]:
# k=4
transformed_df = transform_features(df)
filtered_df = select_features(transformed_df)
rmse = train_and_test(filtered_df, k=4)

rmse

[30953.58447600925, 40132.48026274354, 29296.56101502131, 31050.756002694994]


32858.34543911727

## Conclusion

In this project, we used **Feature Engineering** and **Feature Selection** techniques to create and select the appropriate features for the Linear Regression Model. 

Then the model was evaluated using:
- holdout validation
- simple cross-validation
- k-fold cross-validation

The best result was obtained using *k-fold cross-validation* with **k** = 4.