# Predicting house sale prices

In this project, we will work with housing data for the city of Ames, Iowa, United States from 2006 to 2010. You can read more about why the data was collected [here](https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627). You can also read about the different columns in the data [here](https://s3.amazonaws.com/dq-content/307/data_description.txt). You can visit the [Kaggle kernels page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/kernels) for this dataset and to see approaches others took.

Let's start by setting up a pipeline of functions that will let us quickly iterate on different models.

# Import the data and builidng a pipeline of functions

In [1]:
import pandas as pd
pd.options.display.max_columns = 1000
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error


housing = pd.read_csv('AmesHousing.txt', delimiter='\t')
housing.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1960,1960,Hip,CompShg,BrkFace,Plywood,Stone,112.0,TA,TA,CBlock,TA,Gd,Gd,BLQ,639.0,Unf,0.0,441.0,1080.0,GasA,Fa,Y,SBrkr,1656,0,0,1656,1.0,0.0,1,0,3,1,TA,7,Typ,2,Gd,Attchd,1960.0,Fin,2.0,528.0,TA,TA,P,210,62,0,0,0,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,GasA,TA,Y,SBrkr,896,0,0,896,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,TA,TA,CBlock,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,GasA,TA,Y,SBrkr,1329,0,0,1329,0.0,0.0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,7,5,1968,1968,Hip,CompShg,BrkFace,BrkFace,,0.0,Gd,TA,CBlock,TA,TA,No,ALQ,1065.0,Unf,0.0,1045.0,2110.0,GasA,Ex,Y,SBrkr,2110,0,0,2110,1.0,0.0,2,1,3,1,Ex,8,Typ,2,TA,Attchd,1968.0,Fin,2.0,522.0,TA,TA,Y,0,0,0,0,0,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,GasA,Gd,Y,SBrkr,928,701,0,1629,0.0,0.0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997.0,Fin,2.0,482.0,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [2]:
def transform_features(df):
    return df

def select_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):
    
    train = df.iloc[:1460]
    test = df.iloc[1460:]
    
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    features = numeric_train.columns.drop('SalePrice')
    target = 'SalePrice'
    
    
    lr = LinearRegression()
    lr.fit(numeric_train[features], numeric_train[target])
    predictions = lr.predict(numeric_test[features])
    
    mse = mean_squared_error(test[target], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

transform_df = transform_features(housing)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

rmse

57088.25161263909

# Feature engineering

Let's now start removing features with many missing values, diving deeper into potential categorical features, and transforming text and numerical columns. **Update transform_features()** so that any column from the data frame with more than 25% (or another cutoff value) missing values is dropped. 

we will also remove any columns that leak information about the sale (e.g. like the year the sale happened). In general, the goal of this function is to:

- remove features that we don't want to use in the model, just based on the number of missing values or data leakage
- transform features into the proper format (numerical to categorical, scaling numerical, filling in missing values, etc)
- create new features by combining other features

Specifically, to handle missing values:
- All columns:
   - Drop any with 5% or more missing values for now.
- Text columns:
   - Drop any with 1 or more missing values for now.
- Numerical columns:
   - For columns with missing values, fill in with the most common value in that column

### 1. All columns: Drop any with 5% or more missing values for now.

In [3]:
num_missing = housing.isnull().sum()
drop_missing_cols = num_missing[(num_missing > len(housing)/20)].sort_values()
housing.drop(drop_missing_cols.index, axis=1, inplace=True)

### 2. Text columns: Drop any with 1 or more missing values for now.

In [4]:
text_mv_counts = housing.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=True)
drop_missing_cols_2 = text_mv_counts[text_mv_counts>0]
housing.drop(drop_missing_cols_2.index, axis=1, inplace=True)
housing.isnull().sum().sort_values(ascending=False)

Mas Vnr Area       23
Bsmt Full Bath      2
Bsmt Half Bath      2
Total Bsmt SF       1
Garage Area         1
Bsmt Unf SF         1
BsmtFin SF 1        1
Garage Cars         1
BsmtFin SF 2        1
Utilities           0
Lot Config          0
PID                 0
Foundation          0
Exter Cond          0
Exter Qual          0
MS SubClass         0
Exterior 2nd        0
Exterior 1st        0
Roof Matl           0
Roof Style          0
MS Zoning           0
Lot Area            0
Year Remod/Add      0
Year Built          0
Overall Cond        0
Overall Qual        0
House Style         0
Street              0
Condition 2         0
Condition 1         0
                   ..
Bldg Type           0
SalePrice           0
Sale Condition      0
Heating             0
Sale Type           0
Yr Sold             0
Mo Sold             0
Misc Val            0
Pool Area           0
Screen Porch        0
3Ssn Porch          0
Enclosed Porch      0
Open Porch SF       0
Wood Deck SF        0
Paved Driv

### 3. Numerical columns: For columns with missing values, fill in with the most common value in that column

In [5]:
num_missing = housing.select_dtypes(include=['integer', 'float']).isnull().sum()
fixable_numeric_cols = num_missing[(num_missing < len(housing)/20) & (num_missing > 0)].sort_values()
fixable_numeric_cols

BsmtFin SF 1       1
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
Garage Cars        1
Garage Area        1
Bsmt Full Bath     2
Bsmt Half Bath     2
Mas Vnr Area      23
dtype: int64

In [6]:
replacement_values_dict = housing[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
replacement_values_dict

{'Bsmt Full Bath': 0.0,
 'Bsmt Half Bath': 0.0,
 'Bsmt Unf SF': 0.0,
 'BsmtFin SF 1': 0.0,
 'BsmtFin SF 2': 0.0,
 'Garage Area': 0.0,
 'Garage Cars': 2.0,
 'Mas Vnr Area': 0.0,
 'Total Bsmt SF': 0.0}

In [7]:
housing = housing.fillna(replacement_values_dict)
housing.isnull().sum().value_counts()

0    64
dtype: int64

### Creat some new features in terms of years

In [8]:
years = [col for col in housing.columns if col.startswith('Y')]
years

['Year Built', 'Year Remod/Add', 'Yr Sold']

In [9]:
housing[years].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 3 columns):
Year Built        2930 non-null int64
Year Remod/Add    2930 non-null int64
Yr Sold           2930 non-null int64
dtypes: int64(3)
memory usage: 68.8 KB


In [10]:
years_sold = housing['Yr Sold'] - housing['Year Built']
years_sold[years_sold < 0]

2180   -1
dtype: int64

In [11]:
years_since_remod = housing['Yr Sold'] - housing['Year Remod/Add']
years_since_remod[years_since_remod < 0]

1702   -1
2180   -2
2181   -1
dtype: int64

In [12]:
# Create new columns
housing['Years Before Sale'] = years_sold
housing['Years Since Remod'] = years_since_remod

# Drop rows with negative values for both of these new features
housing.drop(years_since_remod[years_since_remod < 0].index, axis=0, inplace=True)

# No longer need original year columns
housing.drop(['Year Built', 'Year Remod/Add'], axis=1, inplace=True)

### Drop columns that aren't useful for machine learning or leak data about the final sale

In [13]:
# Drop columns that aren't useful for ML
housing.drop(['Order', 'PID'], axis=1, inplace=True)

# Drop columns that leak info about the final sale
housing.drop(["Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1, inplace=True)

### Update transform_features()

In [14]:
def transform_features(df):
    # All columns: Drop any with 5% or more missing values for now
    num_missing = housing.isnull().sum()
    drop_missing_cols = num_missing[(num_missing > len(df)/20)].sort_values()
    df.drop(drop_missing_cols.index, axis=1, inplace=True)
    
    # Text columns: Drop any with 1 or more missing values for now
    text_mv_counts = df.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=True)
    drop_missing_cols_2 = text_mv_counts[text_mv_counts > 0]
    df.drop(drop_missing_cols_2.index, axis=1, inplace=True)

    # Numerical columns: For columns with missing values, fill in with the most common value in that column
    num_missing = df.select_dtypes(include=['integer', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_values_dict)
    
    # Date engineering, year related columns
    years = [col for col in df.columns if col.startswith('Y')]
    years_sold = df['Yr Sold'] - df['Year Built']
    years_since_remod = df['Yr Sold'] - df['Year Remod/Add']

    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod
    df.drop(years_since_remod[years_since_remod < 0].index, axis=0, inplace=True)

    # No longer need original year columns
    df.drop(['Year Built', 'Year Remod/Add', "Yr Sold", 'Order', 'PID',"Mo Sold", "Sale Condition", "Sale Type"], axis=1, inplace=True)

    return df

def select_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):
    
    train = df.iloc[:1460]
    test = df.iloc[1460:]
    
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    features = numeric_train.columns.drop('SalePrice')
    target = 'SalePrice'
    
    
    lr = LinearRegression()
    lr.fit(numeric_train[features], numeric_train[target])
    predictions = lr.predict(numeric_test[features])
    
    mse = mean_squared_error(test[target], predictions)
    rmse = np.sqrt(mse)
    
    return rmse


housing = pd.read_csv("AmesHousing.txt", delimiter="\t")
transform_df = transform_features(housing)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

rmse

55275.367312413066

# Feature selection

### Numerical columns

In [15]:
numerical_df = transform_df.select_dtypes(include=['integer', 'float']) #use integer rather than int
abs_corr_coeffs = numerical_df.corr()['SalePrice'].abs().sort_values()

# Let's only keep columns with a correlation coefficient of larger than 0.4 (arbitrary, worth experimenting later!)
abs_corr_coeffs[abs_corr_coeffs > 0.4]

# Drop columns with less than 0.4 correlation with SalePrice
transform_df = transform_df.drop(abs_corr_coeffs[abs_corr_coeffs < 0.4].index, axis=1)

### Categorical columns

We need to decide which categorical columns should we keep?

Points need to be considered:

- Which columns in the data frame should be converted to the categorical data type? All of the columns that can be categorized as nominal variables are candidates for being converted to categorical. 
  - If a categorical column has hundreds of unique values (or categories), should we keep it? When we dummy code this column, hundreds of columns will need to be added back to the data frame.
  - Which categorical columns have a few unique values but more than 95% of the values in the column belong to a specific category? This would be similar to a low variance numerical feature (no variability in the data for the model to capture).


- Which columns are currently numerical but need to be encoded as categorical instead (because the numbers don't have any semantic meaning)?


- What are some ways we can explore which categorical columns "correlate" well with SalePrice?[This post](https://machinelearningmastery.com/feature-selection-machine-learning-python/) presents some potential strategy.


In [16]:
# Create a list of column names from documentation that are *meant* to be categorical
nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

# Which categorical columns have we still carried with us? We'll test these
transform_cat_cols = [col for col in nominal_features if col in transform_df.columns]

# How many unique values in each categorical column?
uniqueness_counts = transform_df[transform_cat_cols].apply(lambda col: len(col.value_counts()), axis=0).sort_values()

# Aribtrary cutoff of 10 unique values (worth experimenting)
drop_nonuniq_cols = uniqueness_counts[uniqueness_counts > 10].index
transform_df = transform_df.drop(drop_nonuniq_cols, axis=1)

# Select just the remaining text columns and convert to categorical
text_cols = transform_df.select_dtypes(include=['object'])

for col in text_cols:
    transform_df[col] = transform_df[col].astype('category')
    
# Create dummy columns and add back to the dataframe!
text_dummy = pd.get_dummies(transform_df.select_dtypes(include=['category']))
transform_df = pd.concat([transform_df, text_dummy], axis=1).drop(text_cols, axis=1)
transform_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Columns: 130 entries, Overall Qual to Paved Drive_Y
dtypes: float64(5), int64(9), uint8(116)
memory usage: 674.6 KB


# Update select_features()

In [17]:
def transform_features(df):
    # All columns: Drop any with 5% or more missing values for now
    num_missing = housing.isnull().sum()
    drop_missing_cols = num_missing[(num_missing > len(df)/20)].sort_values()
    df.drop(drop_missing_cols.index, axis=1, inplace=True)
    
    # Text columns: Drop any with 1 or more missing values for now
    text_mv_counts = df.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=True)
    drop_missing_cols_2 = text_mv_counts[text_mv_counts > 0]
    df.drop(drop_missing_cols_2.index, axis=1, inplace=True)

    # Numerical columns: For columns with missing values, fill in with the most common value in that column
    num_missing = df.select_dtypes(include=['integer', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_values_dict)
    
    # Date engineering, year related columns
    years = [col for col in df.columns if col.startswith('Y')]
    years_sold = df['Yr Sold'] - df['Year Built']
    years_since_remod = df['Yr Sold'] - df['Year Remod/Add']

    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod
    df.drop(years_since_remod[years_since_remod < 0].index, axis=0, inplace=True)

    # No longer need original year columns
    df.drop(['Year Built', 'Year Remod/Add', "Yr Sold", 'Order', 'PID',"Mo Sold", "Sale Condition", "Sale Type"], axis=1, inplace=True)

    return df

def select_features(df):
    numeric_df = df.select_dtypes(include=['integer', 'float']) #use integer rather than int
    abs_corr_coeffs = numeric_df.corr()['SalePrice'].abs().sort_values()

    # Let's only keep columns with a correlation coefficient of larger than 0.4 (arbitrary, worth experimenting later!)
    abs_corr_coeffs[abs_corr_coeffs > 0.4]

    # Drop columns with less than 0.4 correlation with SalePrice
    df = df.drop(abs_corr_coeffs[abs_corr_coeffs < 0.4].index, axis=1)

    # Create a list of column names from documentation that are *meant* to be categorical
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                        "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                        "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                        "Misc Feature", "Sale Type", "Sale Condition"]

    # Which categorical columns have we still carried with us? We'll test these
    transform_cat_cols = [col for col in nominal_features if col in df.columns]

    # How many unique values in each categorical column?
    uniqueness_counts = df[transform_cat_cols].apply(lambda col: len(col.value_counts()), axis=0).sort_values()

    # Aribtrary cutoff of 10 unique values (worth experimenting)
    drop_nonuniq_cols = uniqueness_counts[uniqueness_counts > 10].index
    df = df.drop(drop_nonuniq_cols, axis=1)

    # Select just the remaining text columns and convert to categorical
    text_cols = df.select_dtypes(include=['object'])

    for col in text_cols:
        df[col] = df[col].astype('category')

    # Create dummy columns and add back to the dataframe!
    text_dummy = pd.get_dummies(df.select_dtypes(include=['category']))
    transform_df = pd.concat([df, text_dummy], axis=1).drop(text_cols, axis=1)

    return df

def train_and_test(df, k=0):
    

    def train_test(train, test):
        numeric_train = train.select_dtypes(include=['integer', 'float'])
        numeric_test = test.select_dtypes(include=['integer', 'float'])
        features = numeric_train.columns.drop('SalePrice')
        target = 'SalePrice'

        lr = LinearRegression()
        lr.fit(numeric_train[features], numeric_train[target])
        predictions = lr.predict(numeric_test[features])

        mse = mean_squared_error(test[target], predictions)
        rmse = np.sqrt(mse)
        return rmse
        
    if k==0:
        train_0 = df.iloc[:1460]
        test_0 = df.iloc[1460:]
        return train_test(train_0, test_0)
    elif k==1:
        # Randomize *all* rows (frac=1) from `df` and return
        df = df.sample(frac=1, random_state=1)
#         df = df.loc[np.random.permutation(len(df))]
#         fold_one = df.iloc[:1460]
#         fold_two = rand_df.iloc[1460:]
#         rand_df = df.reindex(np.random.permutation(df.index))
#         fold_one = rand_df.iloc[:1460]    
#         fold_two = rand_df.iloc[1460:]
        fold_one = df.iloc[:1460]    
        fold_two = df.iloc[1460:]
        rmse_1 = train_test(fold_one, fold_two)
        rmse_2 = train_test(fold_two, fold_one)
        print(rmse_1, rmse_2)
        
        return np.mean([rmse_1, rmse_2])
    else:
        rmses = []
        kf = KFold(n_splits=k, shuffle=True, random_state=2)
        for train_index, test_index in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            rmses.append(train_test(train, test))
            
        # or
#         kf = KFold(n_splits=k, shuffle=True, random_state=1)
#         lr = LinearRegression()
#         numeric_df = df.select_dtypes(include=['integer', 'float'])
#         features = numeric_df.columns.drop('SalePrice')
#         target = 'SalePrice'
#         mses = cross_val_score(lr, numeric_df[features], numeric_df[target], scoring='neg_mean_squared_error', cv=kf)

#         rmses = np.sqrt(np.absolute(mses))

        return rmses


housing = pd.read_csv("AmesHousing.txt", delimiter="\t")
transform_df = transform_features(housing)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df, k=10)

rmse

[30565.596125460124,
 28392.961750328937,
 31459.467376321532,
 29526.874530256973,
 31764.71234483044,
 32944.912399625806,
 34290.49442708854,
 26540.318527779735,
 31860.664194983714,
 50556.436995797434]