# Kaggle House Prices Challenge

## House Prices: Advanced Regression Techniques

Predict sales prices and practice feature engineering, RFs, and gradient boosting

(Link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview)

First, we import the libraries needed for this project.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Next, we load the data into DataFrames.

In [85]:
test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")

In [4]:
test.shape, train.shape

((1459, 80), (1460, 81))

In [5]:
test.columns, train.columns

(Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
        'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
        'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
        'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
        'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
        'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
        'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
        'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
        'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
        'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
        'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
        'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
        'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
        'GarageCond

We can see that the train dataset contains one more variable than the test dataset (81 columns versus 80): the "SalePrice" variable. In our analysis it serves as the dependent variable that we want to predict from the houses' characteristics.
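Rather than comparing the two column listings by eye, the difference can be computed directly. The following is a minimal sketch on two tiny stand-in frames (`train_demo` and `test_demo` are made-up names and values, not the real data); the same two calls should work on `train` and `test`.

```python
import pandas as pd

# Tiny stand-in frames with made-up values; the real ones come from train.csv / test.csv.
train_demo = pd.DataFrame({"Id": [1, 2], "LotArea": [100, 200], "SalePrice": [10, 20]})
test_demo = pd.DataFrame({"Id": [3, 4], "LotArea": [300, 400]})

# Columns present in one frame but absent from the other.
only_in_train = train_demo.columns.difference(test_demo.columns)
only_in_test = test_demo.columns.difference(train_demo.columns)

print(list(only_in_train))
print(list(only_in_test))
```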

In [92]:
train.SalePrice.describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

## Data Preprocessing and cleaning

We'll start the data cleaning by checking whether the dependent variable in the train dataset contains any missing values.

In [45]:
train.SalePrice.isnull().sum()

0

All observations contain data for the target variable, so we can move on to the other variables in the train and test datasets.

In [70]:
miss_count_train = train.isnull().sum().sort_values(ascending=False)
perc_miss_train = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missings_train = pd.concat([miss_count_train, perc_miss_train], axis=1, keys=["Total", "Percent"])

miss_count_test = test.isnull().sum().sort_values(ascending=False)
perc_miss_test = (test.isnull().sum() / test.isnull().count()).sort_values(ascending=False)
missings_test = pd.concat([miss_count_test, perc_miss_test], axis=1, keys=["Total", "Percent"])
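Because the same three lines are repeated for both datasets, they could be wrapped in a small helper. The sketch below is one possible way to do that; `missing_table` and the `demo` frame are hypothetical names introduced here for illustration, and `isnull().mean()` is used as a shorthand for dividing the missing count by the row count. As in the cell above, "Percent" holds a fraction between 0 and 1.

```python
import numpy as np
import pandas as pd

def missing_table(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, most-missing first."""
    is_missing = df.isnull()
    table = pd.DataFrame({"Total": is_missing.sum(), "Percent": is_missing.mean()})
    return table.sort_values("Total", ascending=False)

# Made-up example frame, not the competition data.
demo = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "LotArea": [1000, 2000, 3000, 4000],
})

missings_demo = missing_table(demo)
print(missings_demo)
```

With such a helper, the two tables would be `missing_table(train)` and `missing_table(test)`.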

In [81]:
missings_train.head(20)

| | Total | Percent |
|---|---|---|
| PoolQC | 1453 | 0.995205 |
| MiscFeature | 1406 | 0.963014 |
| Alley | 1369 | 0.937671 |
| Fence | 1179 | 0.807534 |
| FireplaceQu | 690 | 0.472603 |
| LotFrontage | 259 | 0.177397 |
| GarageCond | 81 | 0.055479 |
| GarageType | 81 | 0.055479 |
| GarageYrBlt | 81 | 0.055479 |
| GarageFinish | 81 | 0.055479 |


In [82]:
missings_test.head(20)

| | Total | Percent |
|---|---|---|
| PoolQC | 1456 | 0.997944 |
| MiscFeature | 1408 | 0.965045 |
| Alley | 1352 | 0.926662 |
| Fence | 1169 | 0.801234 |
| FireplaceQu | 730 | 0.500343 |
| LotFrontage | 227 | 0.155586 |
| GarageCond | 78 | 0.053461 |
| GarageQual | 78 | 0.053461 |
| GarageYrBlt | 78 | 0.053461 |
| GarageFinish | 78 | 0.053461 |


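As a rule of thumb, we drop columns in which at least 15% of the values are missing instead of imputing them with computed values such as means. In this step we delete the three variables with the highest share of missing values (more than 90% in both datasets): "PoolQC", "MiscFeature" and "Alley". "Fence", "FireplaceQu" and "LotFrontage" also exceed the 15% threshold but remain in the data for now.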

In [87]:
train = train.drop(columns=["PoolQC", "MiscFeature", "Alley"])
test = test.drop(columns=["PoolQC", "MiscFeature", "Alley"])

KeyError: "['PoolQC' 'MiscFeature' 'Alley'] not found in axis"
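By default, `DataFrame.drop` raises a KeyError when a label it is asked to remove is not present, which can happen when a drop cell is executed more than once. A re-runnable variant could select the columns by their share of missing values and pass `errors="ignore"`. The following is a minimal sketch on a made-up frame; the `demo` frame and the `threshold` value are assumptions chosen for this toy example, not the notebook's settings.

```python
import numpy as np
import pandas as pd

# Made-up example frame, not the competition data.
demo = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [1000, 2000, 3000, 4000],
})

threshold = 0.5  # hypothetical cut-off for this toy frame
missing_share = demo.isnull().mean()
to_drop = missing_share[missing_share >= threshold].index.tolist()

# errors="ignore" skips labels that are already absent, so the cell can be re-run.
demo = demo.drop(columns=to_drop, errors="ignore")
demo = demo.drop(columns=to_drop, errors="ignore")  # second run: no KeyError

print(to_drop)
print(list(demo.columns))
```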

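The KeyError indicates that the columns had already been dropped, most likely by an earlier run of this cell. Recomputing the missing-value tables confirms that the desired columns are gone.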

In [88]:
miss_count_train = train.isnull().sum().sort_values(ascending=False)
perc_miss_train = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missings_train = pd.concat([miss_count_train, perc_miss_train], axis=1, keys=["Total", "Percent"])

miss_count_test = test.isnull().sum().sort_values(ascending=False)
perc_miss_test = (test.isnull().sum() / test.isnull().count()).sort_values(ascending=False)
missings_test = pd.concat([miss_count_test, perc_miss_test], axis=1, keys=["Total", "Percent"])

In [90]:
missings_train

| | Total | Percent |
|---|---|---|
| Fence | 1179 | 0.807534 |
| FireplaceQu | 690 | 0.472603 |
| LotFrontage | 259 | 0.177397 |
| GarageCond | 81 | 0.055479 |
| GarageType | 81 | 0.055479 |
| GarageYrBlt | 81 | 0.055479 |
| GarageFinish | 81 | 0.055479 |
| GarageQual | 81 | 0.055479 |
| BsmtExposure | 38 | 0.026027 |
| BsmtFinType2 | 38 | 0.026027 |


In [91]:
missings_test

| | Total | Percent |
|---|---|---|
| Fence | 1169 | 0.801234 |
| FireplaceQu | 730 | 0.500343 |
| LotFrontage | 227 | 0.155586 |
| GarageQual | 78 | 0.053461 |
| GarageYrBlt | 78 | 0.053461 |
| GarageFinish | 78 | 0.053461 |
| GarageCond | 78 | 0.053461 |
| GarageType | 76 | 0.052090 |
| BsmtCond | 45 | 0.030843 |
| BsmtQual | 44 | 0.030158 |


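The variables "GarageCond", "GarageQual", "GarageYrBlt" and "GarageFinish" contain exactly the same number of missing values within each dataset (81 in train, 78 in test), whereas "GarageType" matches them only in train (81) and has 76 missing values in test. We'll therefore take a closer look at these variables.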
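Equal counts alone do not show that the same houses are affected. One way to check this is to compare row-wise masks of the missing values. The following is a minimal sketch on a made-up frame (the values in `demo` are illustrative and not taken from the dataset); applied to `train` or `test`, the same comparison would show whether the gaps line up.

```python
import numpy as np
import pandas as pd

garage_cols = ["GarageCond", "GarageQual", "GarageYrBlt", "GarageFinish"]

# Made-up example frame: rows 1 and 3 have no garage information at all.
demo = pd.DataFrame({
    "GarageCond": ["TA", np.nan, "TA", np.nan],
    "GarageQual": ["TA", np.nan, "Fa", np.nan],
    "GarageYrBlt": [2003.0, np.nan, 1976.0, np.nan],
    "GarageFinish": ["RFn", np.nan, "Unf", np.nan],
})

missing = demo[garage_cols].isnull()
rows_any = missing.any(axis=1)  # at least one garage field missing
rows_all = missing.all(axis=1)  # every garage field missing

# Equal masks mean the gaps line up row by row instead of merely matching in count.
same_rows = bool((rows_any == rows_all).all())
print(int(rows_all.sum()), same_rows)
```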