# Kaggle House Prices Challenge

## House Prices: Advanced Regression Techniques

Predict sales prices and practice feature engineering, RFs, and gradient boosting

(Link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview)

First, we start by importing the libraries needed for this project.

In [0]:
import pandas as pd
import numpy as np

Loading the data into dataframes.

In [0]:
test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")

In [3]:
test.shape, train.shape

((1459, 80), (1460, 81))

In [0]:
test.columns, train.columns

(Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
        'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
        'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
        'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
        'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
        'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
        'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
        'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
        'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
        'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
        'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
        'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
        'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
        'GarageCond

We can see that the train dataset contains one more variable than the test dataset: "SalePrice". In our analysis this is the dependent variable we want to predict from the houses' characteristics.
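The extra column can also be confirmed by comparing the column sets directly. The following is a minimal sketch that uses two small stand-in dataframes (`train_demo` and `test_demo`, with made-up values) instead of the real files:

```python
import pandas as pd

# Tiny stand-ins for the real train/test frames (values are made up).
train_demo = pd.DataFrame(
    {"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]}
)
test_demo = pd.DataFrame({"Id": [1461, 1462], "LotArea": [11622, 14267]})

# Columns that appear in one frame but not in the other.
only_in_train = train_demo.columns.difference(test_demo.columns).tolist()
only_in_test = test_demo.columns.difference(train_demo.columns).tolist()
print(only_in_train, only_in_test)
```

Applied to the real dataframes, the same two lines should list "SalePrice" as the only column unique to the training data.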

In [0]:
train.SalePrice.describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

## Data preprocessing and cleaning

We'll start the data cleaning by checking whether the dependent variable in the train dataset contains any missing values.

In [0]:
train.SalePrice.isnull().sum()

0

All observations contain data for the target variable, therefore we can continue by taking a look at all the other variables contained in the train and test dataset.

In [0]:
miss_count_train = train.isnull().sum().sort_values(ascending=False)
perc_miss_train = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missings_train = pd.concat([miss_count_train, perc_miss_train], axis=1, keys=["Total", "Percent"])

miss_count_test = test.isnull().sum().sort_values(ascending=False)
perc_miss_test = (test.isnull().sum() / test.isnull().count()).sort_values(ascending=False)
missings_test = pd.concat([miss_count_test, perc_miss_test], axis=1, keys=["Total", "Percent"])

In [0]:
missings_train.head(20)

              Total   Percent
PoolQC         1453  0.995205
MiscFeature    1406  0.963014
Alley          1369  0.937671
Fence          1179  0.807534
FireplaceQu     690  0.472603
LotFrontage     259  0.177397
GarageCond       81  0.055479
GarageType       81  0.055479
GarageYrBlt      81  0.055479
GarageFinish     81  0.055479


In [0]:
missings_test.head(20)

              Total   Percent
PoolQC         1456  0.997944
MiscFeature    1408  0.965045
Alley          1352  0.926662
Fence          1169  0.801234
FireplaceQu     730  0.500343
LotFrontage     227  0.155586
GarageCond       78  0.053461
GarageQual       78  0.053461
GarageYrBlt      78  0.053461
GarageFinish     78  0.053461


As a rule of thumb we completely ignore columns that contain at least 15% missing values and will not try to impute the missing values with any kind of computation, e.g. using means. Therefore, we will delete the variables "PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu" and "LotFrontage".

In [0]:
train = train.drop(columns=["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "LotFrontage"])
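The list of columns to drop can also be derived from the threshold instead of being typed by hand. This is a minimal sketch on a made-up toy frame; the names `demo`, `threshold` and `cols_to_drop` are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): "Alley" is mostly missing, the others are complete.
demo = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
    "Electrical": ["SBrkr", "SBrkr", "SBrkr", "FuseA"],
})

threshold = 0.15  # the 15% rule of thumb
missing_share = demo.isnull().mean()  # fraction of missing values per column
cols_to_drop = missing_share[missing_share >= threshold].index.tolist()
demo_clean = demo.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Computing the list once on the training data and reusing it for the test data keeps both frames with the same columns.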

The variables "GarageCond", "GarageType", "GarageQual", "GarageYrBlt" and "GarageFinish" contain exactly the same number of missing values, which seems kind of odd. Therefore, we'll take a closer look at these variables.

In [0]:
for var in ["GarageCond", "GarageType", "GarageQual", "GarageYrBlt", "GarageFinish"]:
    print(pd.crosstab(index=train[var], columns="count"))

We can see that for "GarageCond" and "GarageQual" the most frequent value is "TA", which means that the condition and quality of the garages are average/typical. We will therefore replace the missing values of these two variables with "TA" as well. The variable "GarageYrBlt" refers to the year in which the garage was built. Since we also have the year in which the house itself was built, we can drop this variable without losing much explanatory information. In addition, we also drop the "GarageFinish" and "GarageType" variables.

In [0]:
train = train.drop(columns=["GarageYrBlt", "GarageFinish", "GarageType"])
train["GarageCond"] = train.GarageCond.fillna(value="TA")
train["GarageQual"] = train.GarageQual.fillna(value="TA")

In the same way as above we take a closer look at the "Bsmt*" variables.

In [0]:
for var in ["BsmtFinType2", "BsmtExposure", "BsmtCond", "BsmtFinType1", "BsmtQual"]:
    print(pd.crosstab(index=train[var], columns="count"))

col_0         count
BsmtFinType2       
ALQ              19
BLQ              33
GLQ              14
LwQ              46
Rec              54
Unf            1256
col_0         count
BsmtExposure       
Av              221
Gd              134
Mn              114
No              953
col_0     count
BsmtCond       
Fa           45
Gd           65
Po            2
TA         1311
col_0         count
BsmtFinType1       
ALQ             220
BLQ             148
GLQ             418
LwQ              74
Rec             133
Unf             430
col_0     count
BsmtQual       
Ex          121
Fa           35
Gd          618
TA          649


We delete the "BsmtFinType*" variables since these are highly subjective and do not add much information to our model. The missing values of "BsmtCond" will be imputed with the most common value "TA". The rows containing missing values for "BsmtQual" and "BsmtExposure" will be deleted from the dataset.

In [0]:
train = train.drop(columns=["BsmtFinType1", "BsmtFinType2"])
train.BsmtCond = train["BsmtCond"].fillna(value="TA")
for var in ["BsmtQual", "BsmtExposure"]:
    train = train.drop(train.loc[train[var].isnull()].index)

The variable "Electrical" contains only 1 missing value, therefore we only delete this specific row of data. We proceed in the same way with "MasVnrType" and "MasVnrArea".

In [0]:
for var in ["Electrical", "MasVnrType", "MasVnrArea"]:
    train = train.drop(train.loc[train[var].isnull()].index)
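The row-dropping loop can also be written as a single `dropna` call with the `subset` argument. The sketch below compares both versions on a made-up toy frame (`demo` is not part of the original data):

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing value in each of two columns.
demo = pd.DataFrame({
    "Electrical": ["SBrkr", np.nan, "FuseA", "SBrkr"],
    "MasVnrArea": [196.0, 0.0, np.nan, 350.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Loop-based version, mirroring the cell above.
looped = demo.copy()
for var in ["Electrical", "MasVnrArea"]:
    looped = looped.drop(looped.loc[looped[var].isnull()].index)

# Single call: drop every row with a missing value in any listed column.
compact = demo.dropna(subset=["Electrical", "MasVnrArea"])
print(compact)
```

Both versions keep the original row index, so the remaining rows can still be traced back to the raw data.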

Running the above code again to check if all missing values are deleted.

In [0]:
miss_count_train = train.isnull().sum().sort_values(ascending=False)
perc_miss_train = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missings_train = pd.concat([miss_count_train, perc_miss_train], axis=1, keys=["Total", "Percent"])
missings_train

This looks good.

We have now handled all the missing data in the training set. As the next step we will clean the test data. Since we want to evaluate our model on Kaggle after finishing the modeling, we cannot drop any observations, because we need a predicted house price for each row of the test data. Therefore, we will impute the missing data in the test dataset with the most frequent category for categorical features and the mean for numerical features.

In [0]:
test = test.drop(columns=["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "LotFrontage"])
test = test.drop(columns=["GarageYrBlt", "GarageFinish", "GarageType"])
test["GarageCond"] = test.GarageCond.fillna(value="TA")
test["GarageQual"] = test.GarageQual.fillna(value="TA")
test = test.drop(columns=["BsmtFinType1", "BsmtFinType2"])
test.BsmtCond = test["BsmtCond"].fillna(value="TA")


In [79]:
test.shape, train.shape

((1459, 69), (1413, 70))

In [0]:
# categorical:
for var in ["BsmtExposure", "BsmtQual", "MasVnrType", "MSZoning", "Utilities", "Functional", "SaleType", "Exterior2nd", "Exterior1st", "KitchenQual"]:
    test[var] = test[var].fillna(value=test[var].value_counts().index[0])

# numerical
for var in ["MasVnrArea", "BsmtHalfBath", "BsmtFullBath", "BsmtUnfSF", "GarageArea", "GarageCars", "BsmtFinSF1", "BsmtFinSF2", "TotalBsmtSF"]:
    test[var] = test[var].fillna(test[var].mean())
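Instead of listing the categorical and numerical columns by hand, the same imputation rule can be applied based on each column's dtype. The helper `impute_by_dtype` below is a hypothetical illustration, not part of the original notebook, and the toy values are made up:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_by_dtype(df):
    """Hypothetical helper: mean for numeric columns, most frequent value otherwise."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to impute in this column
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].value_counts().index[0])
    return out


demo = pd.DataFrame({
    "GarageCars": [2.0, np.nan, 1.0, 3.0],
    "MSZoning": ["RL", "RL", None, "RM"],
})
filled = impute_by_dtype(demo)
print(filled)
```

Note that numerically coded categories such as "MSSubClass" would be treated as numeric by this sketch, so the explicit lists used above remain the safer choice when such columns have missing values.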

In [0]:
miss_count_test = test.isnull().sum().sort_values(ascending=False)
perc_miss_test = (test.isnull().sum() / test.isnull().count()).sort_values(ascending=False)
missings_test = pd.concat([miss_count_test, perc_miss_test], axis=1, keys=["Total", "Percent"])
missings_test

In [81]:
test.shape, train.shape

((1459, 69), (1413, 70))

Perfect. We no longer have any missing values in our test set and have kept all observations for which we are going to predict the house sale price.

## Model selection 

We continue by building a machine learning pipeline using scikit-learn. A pipeline object sequentially applies a list of transformers and a final estimator.


We will play around with different algorithms, tune their hyperparameters using Cross Validation and pick the best performing one.

Before we can work with our data, we first need to create separate dataframes containing our feature variables and the target variable. This needs to be done only for our training data since it is our aim to predict SalePrice for the test data, which is why it is not contained in this data.

In [0]:
y = train.SalePrice.values
X = train.drop("SalePrice", axis=1)

The last step before we can fit models and make predictions is to convert all categorical features into numeric ones, so that scikit-learn can handle them. We do this with the pandas function get_dummies(). To make sure we end up with the same columns in both the training and the test dataset, we first concatenate both, then apply get_dummies() and then separate them again.

In [89]:
X.shape, test.shape
type(X), type(test)

(pandas.core.frame.DataFrame, pandas.core.frame.DataFrame)

In [0]:
# creating a dummy to distinguish between train and test data
X["train"] = 1
test["train"] = 0
# concatenating dataframes and creating dummies from categorical features
combined = pd.concat([X, test])
df = pd.get_dummies(combined, drop_first=True)



In [92]:
df.shape

(2872, 214)

In [0]:
X = df[df["train"]==1]
X = X.drop("train", axis=1)
test = df[df["train"]==0]
test = test.drop("train", axis=1)
X.shape, test.shape
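The reason for encoding the concatenated data becomes clear on a small example. The sketch below uses two made-up toy frames in which the category "Grvl" occurs only in the training data; all names ending in `_demo` or `_enc` are illustrative:

```python
import pandas as pd

# Toy frames (made-up values): "Grvl" only occurs in the training data.
train_demo = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"], "LotArea": [8450, 9600, 11250]})
test_demo = pd.DataFrame({"Street": ["Pave", "Pave"], "LotArea": [11622, 14267]})

# Encoding each frame separately yields different columns: the test frame has
# a single Street level, which drop_first removes, so no Street dummy remains.
separate_train = pd.get_dummies(train_demo, drop_first=True)
separate_test = pd.get_dummies(test_demo, drop_first=True)

# Encoding the concatenated frame and splitting afterwards keeps them aligned.
train_demo["train"] = 1
test_demo["train"] = 0
combined = pd.get_dummies(pd.concat([train_demo, test_demo]), drop_first=True)
train_enc = combined[combined["train"] == 1].drop("train", axis=1)
test_enc = combined[combined["train"] == 0].drop("train", axis=1)

print(list(separate_train.columns), list(separate_test.columns))
print(list(train_enc.columns), list(test_enc.columns))
```

A model fitted on `separate_train` could not predict on `separate_test`, because the dummy column for "Street" would be missing there.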

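Before training and tuning our models we split the labelled training data into a training part and a hold-out test part, which we use to evaluate our models. This hold-out set (X_test, y_test) is not the same as the Kaggle test data, for which no sale prices are available.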

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [96]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1130, 213), (283, 213), (1130,), (283,))
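The split sizes shown above can be reproduced by hand. The sketch below assumes that scikit-learn rounds a fractional test share up and assigns the remaining rows to the training part, which matches the shapes in the output:

```python
import math

n_samples = 1413  # rows left in the cleaned training data
test_size = 0.2

# Assumption: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```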

### k-nearest Neighbors (KNN)

We will start with a relatively simple algorithm - k-nearest Neighbor or KNN.

In [0]:
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

Looking at the data, we notice that the ranges of our feature variables differ substantially. This matters for KNN because it is based on distances between observations, so features with large ranges would dominate. Therefore we add a transformer to our pipeline which standardizes the data. Standardization centers each variable around zero with unit variance. This is done by subtracting the mean of each feature and dividing by its standard deviation.

After that we instantiate our KNN estimator, create a list containing the steps applied by the pipeline and then define the pipeline.

In [0]:
# instantiate the scaling transformer
scaler = preprocessing.StandardScaler()
# instantiate the KNN estimator
knn = KNeighborsRegressor(n_neighbors=10)
# creating a list containing the steps the pipeline is to apply
steps_knn =  [("scaler", scaler), ("knn", knn)]
# define the pipeline object
pipeline_knn = Pipeline(steps_knn)
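The standardization applied by the scaler step can be checked against a manual computation. This is a small sketch on made-up numbers (`X_demo` is not part of the original data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales (e.g. lot area and room count).
X_demo = np.array([[8450.0, 3.0], [9600.0, 4.0], [11250.0, 2.0], [9550.0, 3.0]])

# Manual standardization: subtract the column mean, divide by the column std.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# StandardScaler uses the population standard deviation (ddof=0), like np.std.
scaled = StandardScaler().fit_transform(X_demo)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Inside the pipeline, the scaler is fitted only on the data passed to `fit`, so during cross-validation the validation folds do not leak into the means and standard deviations.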

The KNN algorithm has one parameter that can and should be tuned, which is the number of neighbors that should be considered. We will therefore define a dictionary containing all hyperparameters that should be tuned and define the different values that should be tested.

In [0]:
neighbors = {"knn__n_neighbors":list(range(1,21))}
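The key "knn__n_neighbors" follows the pipeline's naming scheme: the step name, two underscores, then the parameter name. The sketch below builds a separate demonstration pipeline (`demo_pipeline`, not the one used in the analysis) to show where these names come from:

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())])

# Parameters of a step are addressed as "<step name>__<parameter name>".
knn_params = [name for name in demo_pipeline.get_params() if name.startswith("knn__")]
print(knn_params)

# set_params accepts the same names, which is what the grid search relies on.
demo_pipeline.set_params(knn__n_neighbors=7)
print(demo_pipeline.named_steps["knn"].n_neighbors)
```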

Next we set up our cross-validation (CV) object using 5-fold CV and fit it to our data.

In [0]:
cv_knn = GridSearchCV(pipeline_knn, neighbors, cv=5)

In [24]:
cv_knn.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('knn',
                                        KNeighborsRegressor(algorithm='auto',
                                                            leaf_size=30,
                                                            metric='minkowski',
                                                            metric_params=None,
                                                            n_jobs=None,
                                                            n_neighbors=10, p=2,
                                                            weights='uniform'))],
                                verbose=False),
 

In [25]:
cv_knn.best_params_

{'knn__n_neighbors': 12}

In [0]:
print(cv_knn.refit)

True


From the above output we conclude that the default of the argument "refit" is True. This means that, by default, our CV pipeline automatically refits the model on the entire training set using the best parameters found by CV, which is n_neighbors = 12 in our case. Therefore we can now directly use the CV object to make predictions for unseen data.

As in the Kaggle challenge, we evaluate our model with the root mean squared error (RMSE) between the logarithm of the predicted prices and the logarithm of the observed prices, computed on our hold-out test set. To do this we first need to predict the sale prices for the hold-out data.

In [26]:
from sklearn.metrics import mean_squared_error
y_pred = cv_knn.predict(X_test)
rmse_knn = np.sqrt(mean_squared_error(np.log(y_test), np.log(y_pred)))
print("The KNN algorithm with " + str(cv_knn.best_params_) + " yields a (log) RMSE of: " + str(rmse_knn))

The KNN algorithm with {'knn__n_neighbors': 12} yields a (log) RMSE of: 0.197561198068067
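The metric used above can be wrapped in a small function. The helper `log_rmse` is a hypothetical name introduced for illustration, and the prices in the example are made up:

```python
import numpy as np


def log_rmse(y_true, y_pred):
    """RMSE between the natural logs of observed and predicted prices."""
    log_diff = np.log(np.asarray(y_true, dtype=float)) - np.log(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Made-up prices: both predictions are 10% too high.
cheap = log_rmse([100_000], [110_000])
expensive = log_rmse([500_000], [550_000])
print(cheap, expensive)
```

Because the error is computed on logarithms, a 10% miss on a cheap house counts the same as a 10% miss on an expensive one, whereas a plain RMSE would be dominated by the expensive houses.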


### Random forest

After trying out the KNN algorithm, we continue with the random forest algorithm. Using similar steps as before, we build a pipeline object that applies the transformations and the estimation to our dataset in sequence.

In [0]:
from sklearn.ensemble import RandomForestRegressor

In [0]:
# instantiate the RandomForest Regressor
rf_reg = RandomForestRegressor(random_state=123)
# creating a list containing the steps the pipeline should apply
steps_rf = [("scaler", scaler), ("rf_reg", rf_reg)]
# create the pipeline object
pipeline_rf = Pipeline(steps_rf)

Random forests have a large number of hyperparameters that can be tuned. In this project we tune the number of trees in the forest [n_estimators], the number of features considered at every split [max_features], the maximum number of levels in a tree [max_depth], the minimum number of samples required to split a node [min_samples_split], the minimum number of observations required at each leaf node [min_samples_leaf] and whether bootstrap sampling is used when training each tree [bootstrap].

We first define the values to be considered for each hyperparameter in our CV tuning and then create our search grid from the lists created above.

In [0]:
n_estimators = list(np.arange(200, 2001, 200))
# note: "auto" was removed in scikit-learn 1.3; for regressors, 1.0 is the equivalent setting
max_features = ["auto", "sqrt", "log2"]
max_depth = list(np.arange(10, 101, 10))
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4, 8]
bootstrap = [True, False]
param_dist = {"rf_reg__n_estimators": n_estimators,
              "rf_reg__max_features": max_features,
              "rf_reg__max_depth": max_depth,
              "rf_reg__min_samples_split": min_samples_split,
              "rf_reg__min_samples_leaf": min_samples_leaf,
              "rf_reg__bootstrap": bootstrap}

Testing all possible combinations of hyperparameters would amount to testing $10*3*11*3*4*2 = 7,920$ combinations. Instead of testing all of these, we use RandomizedSearchCV, which randomly samples the combinations to be tested from the value lists defined above.

In [0]:
from sklearn.model_selection import RandomizedSearchCV
cv_random_rf = RandomizedSearchCV(pipeline_rf, param_dist, cv=3, n_iter=100)
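The size of the full grid can be checked by multiplying the number of candidate values per hyperparameter. The sketch below recounts the lists defined above in plain Python; the dictionary `grid_sizes` is introduced only for this illustration:

```python
import math

# Number of candidate values per hyperparameter, as defined above.
grid_sizes = {
    "n_estimators": len(range(200, 2001, 200)),  # 200, 400, ..., 2000
    "max_features": 3,
    "max_depth": len(range(10, 101, 10)) + 1,    # 10, 20, ..., 100 plus None
    "min_samples_split": 3,
    "min_samples_leaf": 4,
    "bootstrap": 2,
}
total = math.prod(grid_sizes.values())

n_iter = 100  # combinations sampled by the randomized search
cv_folds = 3
print(total, n_iter / total, n_iter * cv_folds)
```

With n_iter=100 the randomized search evaluates only a small share of the full grid, at the cost of 300 model fits for 3-fold CV.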

Just as before with the KNN algorithm we now can fit the pipeline to our data.

In [0]:
cv_random_rf.fit(X_train, y_train)

In [0]:
cv_random_rf.best_params_

{'rf_reg__bootstrap': False,
 'rf_reg__max_depth': 80,
 'rf_reg__max_features': 'sqrt',
 'rf_reg__min_samples_leaf': 1,
 'rf_reg__min_samples_split': 2,
 'rf_reg__n_estimators': 400}

Based on the chosen parameters from RandomizedSearchCV we can now manually decrease the range of the hyperparameters to be tested and use GridSearchCV as before to find the best parameters for our model.

In [99]:
n_estimators_2 = list(np.arange(300, 501, 100))
max_depth_2 = list(np.arange(70, 101, 10))
max_features_2 = ["sqrt"]
min_samples_split_2 = [2, 3]
min_samples_leaf_2 = [1, 2, 3]
bootstrap_2 = [True, False]
param_grid = {"rf_reg__n_estimators": n_estimators_2,
              "rf_reg__max_features": max_features_2,
              "rf_reg__max_depth": max_depth_2,
              "rf_reg__min_samples_split": min_samples_split_2,
              "rf_reg__min_samples_leaf": min_samples_leaf_2,
              "rf_reg__bootstrap": bootstrap_2}
param_grid

{'rf_reg__bootstrap': [True, False],
 'rf_reg__max_depth': [70, 80, 90, 100],
 'rf_reg__max_features': ['sqrt'],
 'rf_reg__min_samples_leaf': [1, 2, 3],
 'rf_reg__min_samples_split': [2, 3],
 'rf_reg__n_estimators': [300, 400, 500]}

We can now instantiate and then fit our GridSearchCV object as before.

In [100]:
cv_rf = GridSearchCV(pipeline_rf, param_grid, cv=3)
cv_rf.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('rf_reg',
                                        RandomForestRegressor(bootstrap=True,
                                                              criterion='mse',
                                                              max_depth=None,
                                                              max_features='auto',
                                                              max_leaf_nodes=None,
                                                              min_impurity_decrease=0.0,
                                                              min_impurity_split=None,
                 

In [101]:
cv_rf.best_params_

{'rf_reg__bootstrap': False,
 'rf_reg__max_depth': 70,
 'rf_reg__max_features': 'sqrt',
 'rf_reg__min_samples_leaf': 1,
 'rf_reg__min_samples_split': 2,
 'rf_reg__n_estimators': 400}

In [102]:
print(cv_rf.refit)

True


We see that the parameters chosen by RandomizedSearchCV were already good and needed little fine-tuning: only max_depth changed, from 80 to 70. Furthermore, GridSearchCV has already refit the random forest model on the whole training data using the parameters that were found to work best.

We can now create predictions for our hold-out test set and evaluate our model using the log RMSE as before.

In [103]:
from sklearn.metrics import mean_squared_error
y_pred_rf = cv_rf.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(np.log(y_test), np.log(y_pred_rf)))
print("The RandomForest algorithm with " + str(cv_rf.best_params_) + " yields a (log) RMSE of: " + str(rmse_rf))

The RandomForest algorithm with {'rf_reg__bootstrap': False, 'rf_reg__max_depth': 70, 'rf_reg__max_features': 'sqrt', 'rf_reg__min_samples_leaf': 1, 'rf_reg__min_samples_split': 2, 'rf_reg__n_estimators': 400} yields a (log) RMSE of: 0.14303960215709952


Great! We see a clear improvement in our evaluation metric: the log RMSE drops from about 0.198 for KNN to about 0.143 for the random forest regressor.

### Making predictions and creating our submission file

Since we want to compare our model's performance with other people's models, we need to make predictions for the test dataset provided by Kaggle and upload our results to Kaggle. According to the challenge's description, our submission should be a file containing each observation's Id and the predicted SalePrice. We first create a dataframe and then save it as a .csv file.

In [124]:
ident = list(test.Id)
print(ident)

[1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 1470, 1471, 1472, 1473, 1474, 1475, 1476, 1477, 1478, 1479, 1480, 1481, 1482, 1483, 1484, 1485, 1486, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499, 1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 1515, 1516, 1517, 1518, 1519, 1520, 1521, 1522, 1523, 1524, 1525, 1526, 1527, 1528, 1529, 1530, 1531, 1532, 1533, 1534, 1535, 1536, 1537, 1538, 1539, 1540, 1541, 1542, 1543, 1544, 1545, 1546, 1547, 1548, 1549, 1550, 1551, 1552, 1553, 1554, 1555, 1556, 1557, 1558, 1559, 1560, 1561, 1562, 1563, 1564, 1565, 1566, 1567, 1568, 1569, 1570, 1571, 1572, 1573, 1574, 1575, 1576, 1577, 1578, 1579, 1580, 1581, 1582, 1583, 1584, 1585, 1586, 1587, 1588, 1589, 1590, 1591, 1592, 1593, 1594, 1595, 1596, 1597, 1598, 1599, 1600, 1601, 1602, 1603, 1604, 1605, 1606, 1607, 1608, 1609, 1610, 1611, 1612, 1613, 1614, 1615, 1616, 1617, 1618, 1619, 1620, 1621, 1622, 1623, 1624, 1625, 1626, 162

In [0]:
test_pred = list(cv_rf.predict(test))

In [122]:
len(ident), len(test_pred)

(1459, 1459)

In [125]:
submiss = {"Id": ident, "SalePrice": test_pred}
print(submiss)

{'Id': [1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 1470, 1471, 1472, 1473, 1474, 1475, 1476, 1477, 1478, 1479, 1480, 1481, 1482, 1483, 1484, 1485, 1486, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499, 1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 1515, 1516, 1517, 1518, 1519, 1520, 1521, 1522, 1523, 1524, 1525, 1526, 1527, 1528, 1529, 1530, 1531, 1532, 1533, 1534, 1535, 1536, 1537, 1538, 1539, 1540, 1541, 1542, 1543, 1544, 1545, 1546, 1547, 1548, 1549, 1550, 1551, 1552, 1553, 1554, 1555, 1556, 1557, 1558, 1559, 1560, 1561, 1562, 1563, 1564, 1565, 1566, 1567, 1568, 1569, 1570, 1571, 1572, 1573, 1574, 1575, 1576, 1577, 1578, 1579, 1580, 1581, 1582, 1583, 1584, 1585, 1586, 1587, 1588, 1589, 1590, 1591, 1592, 1593, 1594, 1595, 1596, 1597, 1598, 1599, 1600, 1601, 1602, 1603, 1604, 1605, 1606, 1607, 1608, 1609, 1610, 1611, 1612, 1613, 1614, 1615, 1616, 1617, 1618, 1619, 1620, 1621, 1622, 1623, 1624, 1625, 16

In [137]:
submission = pd.DataFrame(submiss)
print(submission.head())
print(submission.tail())

     Id    SalePrice
0  1461  124888.4425
1  1462  156320.6625
2  1463  180089.4975
3  1464  191202.0350
4  1465  196634.6000
        Id    SalePrice
1454  2915   89004.0275
1455  2916   87991.6075
1456  2917  168285.7375
1457  2918  115401.2350
1458  2919  225721.2650


In [0]:
submission.to_csv('submission.csv', index=False)
# On Google Colab the file can then be downloaded with:
# from google.colab import files
# files.download('submission.csv')

We can now upload our submission file to Kaggle and check how well our model performs on the unseen test data.
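Before uploading, it can be worth running a few sanity checks on the file. The sketch below uses a made-up stand-in frame (`submission_demo`) and an in-memory buffer instead of the real file; the checks themselves are suggestions, not requirements stated in this notebook:

```python
import io

import pandas as pd

# Stand-in for the real submission frame (Ids and prices are made up).
submission_demo = pd.DataFrame(
    {"Id": [1461, 1462, 1463], "SalePrice": [124888.44, 156320.66, 180089.50]}
)

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
submission_demo.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

has_expected_columns = list(check.columns) == ["Id", "SalePrice"]
has_no_missing = not check.isnull().any().any()
has_unique_ids = check["Id"].is_unique
has_positive_prices = bool((check["SalePrice"] > 0).all())
print(has_expected_columns, has_no_missing, has_unique_ids, has_positive_prices)
```

For the real file, `pd.read_csv('submission.csv')` would take the place of the buffer, and the row count should equal the 1459 rows of the Kaggle test data.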