# Introduction

Estimating house sale prices is one of the basic projects to have on a Data Science CV. By the end of this Kernel, we will be able to predict a continuous variable using several types of regression algorithms. In this work, we walk through the steps of data analysis and build a house price prediction model in the simplest and most direct way, as follows:

+ Understand the problem: We will look at each variable and think about its meaning and importance for this problem.

+ Target variable: We'll focus on the 'SalePrice' variable and try to learn a little more about it, making the simplest adjustment needed to apply basic Machine Learning.

+ Independent variables: We will try to understand the relationship between the dependent variable and the independent variables.

+ Basic data cleaning: We will clean up the data set and handle missing data, outliers and categorical variables. Unlike most other Kagglers, we do NOT merge the train set and the test set. We process them separately and treat the test set as unknown; the test set is cleaned only so that predictions can be made on it.

+ Statistics: We will check whether our data meets the assumptions required by most multivariate techniques.

# Now, it's time to have fun!

We are going to break everything into logical steps that allow us to ensure the cleanest, most realistic data for our model to make accurate predictions from. The layout of the Notebook is summarized as below:

# Section I: DATA PREPROCESSING & EDA

1. Importing Data and Libraries
2. Data cleaning: dealing with NaN or Null or missing data
3. Data visualization, variable correlations: key variable parameters?
4. Statistical (if any)

# Section II: HOUSE PRICE MODEL

1. Feature Selection, data handling & data Split
2. Data Labeling
3. Data splitting: training and testing
4. Selected "Best Model"
5. Model's Parameters tuning
6. Submission
7. Conclusion

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Section I: DATA PREPROCESSING & EDA

# 1. Importing Data and Libraries

Using the ‘read_csv’ function provided by the Pandas package, we can import the data into our python environment. After importing the data, we can use the ‘head’ function to get a glimpse of our dataset.

In [None]:
d_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
d_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
d_train.head(2)

In [None]:
d_train.shape

In [None]:
d_test.shape

Based on the summary above, the training set has 1460 rows and 81 columns, and the test set has 80 columns (it does not contain the 'SalePrice' target). Before we deal with missing data, we will explore our dataset:

+ We have to check which variables have an impact on the target value.
+ But, wow! 80 columns, so we would love to show all columns and rows, because it's easier to follow & check, by setting the following ...

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Now, let's import some basic libraries we might use ...

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import matplotlib.style as style
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

# 2. Data cleaning: dealing with NaN or Null or missing data

In this step, we deal with the null/NaN values in our dataset. In Python, pandas provides the ‘dropna’ and ‘fillna’ functions for this. There are different approaches to dealing with missing data; in this work, we assume that:

+ We will drop all columns where the missing-data ratio is > 20%. We can observe below that "Alley", "FireplaceQu", "Fence", "MiscFeature" and "PoolQC" are probably the first 5 columns we will remove. "LotFrontage" could be the next column to consider.
+ Missing numerical data is replaced by a constant value (0 in the code below), and missing values in the object columns are replaced by the most frequent value of the column.
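
The 20% rule above can be sketched on a small toy frame. The values and the `toy`, `threshold` and `cols_to_drop` names are made up for illustration; on the real data the same lines would be applied to `d_train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for d_train (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Grvl", np.nan, np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0, 85.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
})

# Share of missing values per column, largest first
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Columns above the 20% threshold are candidates for removal
threshold = 0.20
cols_to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
toy_clean = toy.drop(columns=cols_to_drop)

print(missing_ratio)
print("Dropped:", cols_to_drop)
```

Sorting the ratios first makes it easy to see which columns sit just below the threshold, such as "LotFrontage" in the real data.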

Now, using the ‘describe’ function we can get a statistical view of the data like mean, median, standard deviation, and so on.

In [None]:
d_train.describe(include='all').T

Yes, we could confirm our observation above in this Table.

In [None]:
d_train.describe()

Now, a quick check on the test data set ...

In [None]:
d_test.isnull().sum().sort_values()

The missing values are very similar in both datasets. So, we decide to remove 5 columns in both datasets.

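Now, we fill all NaN values by applying fillna: the object columns with the most frequent value of the column and the numerical columns with 0. Do not forget to apply the same action to the test data set. First, 'YrSold' and 'MoSold' are converted to strings so that they are treated as categorical variables.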

In [None]:
d_train['YrSold'] = d_train['YrSold'].apply(str)
d_train['MoSold'] = d_train['MoSold'].apply(str)
d_test['YrSold'] = d_test['YrSold'].apply(str)
d_test['MoSold'] = d_test['MoSold'].apply(str)

In [None]:
d_test.columns

In [None]:
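# categorical data: fill missing values with the most frequent value (mode) of each column
cat_cols=np.array(d_test.columns[d_test.dtypes == object])

for feature in cat_cols:
    # assign the result back instead of using inplace=True on a column selection
    d_train[feature] = d_train[feature].fillna(d_train[feature].mode()[0])
    d_test[feature] = d_test[feature].fillna(d_test[feature].mode()[0])

# numerical data: fill the remaining missing values with 0
num_cols=np.array(d_test.columns[d_test.dtypes != object])
d_train = d_train.fillna(0)
d_test = d_test.fillna(0)

# safety net: any value still missing is labelled "Other"
d_train = d_train.fillna("Other")
d_test = d_test.fillna("Other")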
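
Since the test set is treated as unknown, one possible variant is to compute the fill values on the train set only and reuse them for the test set. This is a minimal sketch of that idea, not what the cell above does; the toy frames and the `fill_values` dictionary are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy train/test frames (hypothetical values)
train_toy = pd.DataFrame({"MSZoning": ["RL", "RL", "RM", None],
                          "GarageArea": [548.0, 460.0, np.nan, 608.0]})
test_toy = pd.DataFrame({"MSZoning": [None, "RM"],
                         "GarageArea": [np.nan, 730.0]})

# Learn one fill value per column from the train set only:
# the mean for numeric columns, the mode for the others
fill_values = {}
for col in train_toy.columns:
    if pd.api.types.is_numeric_dtype(train_toy[col]):
        fill_values[col] = train_toy[col].mean()
    else:
        fill_values[col] = train_toy[col].mode()[0]

# Apply the same values to both sets
train_filled = train_toy.fillna(fill_values)
test_filled = test_toy.fillna(fill_values)
print(test_filled)
```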

Now, we feel better with the current dataset and take a look over the data trend by using the subplots ...

In [None]:
d_train.plot(subplots=True, sharex = True, figsize=(20,50))

We can observe that 'LowQualFinSF', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'KitchenAbvGr', '3SsnPorch', 'PoolArea' and 'MiscVal' are VERY unbalanced, and we should find a way to handle these columns in the modeling section.

# 3. Data visualization, variable correlations: key variable parameters?

We now check the correlation between SalePrice and all other numerical variables by using the corr() function:

In [None]:
# corr() is restricted to the numeric columns; recent pandas versions raise an error on object columns
d_train.select_dtypes(include='number').corr()['SalePrice'].sort_values(ascending=False)

We can see that there are two main groups of correlated variables: POSITIVE and NEGATIVE.
Now, we want to see something more beautiful, such as graphics.

# Data Visualization

In this process, we are going to produce three different types of charts including heatmap, scatter plot, and a distribution plot. Heatmaps are very useful to find relations between two variables in a dataset. Heatmap can be easily produced using the ‘heatmap’ function provided by the seaborn package in python.

In [None]:
style.use('ggplot')
sns.set_style('whitegrid')
plt.subplots(figsize = (30,30))

## Plotting heatmap. Generate a mask for the upper triangle (taken from seaborn example gallery)
corr = d_train.select_dtypes(include='number').corr()  # numeric columns only, computed once
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy; use the builtin bool
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, cmap=sns.diverging_palette(20, 220, n=200), annot=True, mask=mask, center = 0, );
plt.title("Heatmap of all the Features of Train data set", fontsize = 25);

NOT BAD at ALL !

YEAH! Our first observation is that SALEPRICE seems to be strongly POSITIVELY correlated with:

+ OverallQual
+ TotalBsmtSF
+ 1stFlrSF
+ GrLivArea
+ GarageCars, and
+ GarageArea

which means that as one of these variables increases, the SalePrice value also increases. OK, let's stop here and select these variables to analyse; there are probably other variables that should be considered in more depth.
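
As a rough sketch of how such a shortlist could be produced automatically, the snippet below ranks columns by their correlation with the target and keeps those above a cut-off. The synthetic data, the 0.5 cut-off and the `selected` name are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric part of d_train (hypothetical values)
rng = np.random.default_rng(0)
n = 500
quality = rng.integers(1, 11, size=n)
area = rng.normal(1500, 400, size=n)
noise_col = rng.normal(0, 1, size=n)
price = 15000 * quality + 110 * area + rng.normal(0, 20000, size=n)

toy = pd.DataFrame({"OverallQual": quality, "GrLivArea": area,
                    "Noise": noise_col, "SalePrice": price})

# Correlation of every feature with the target, strongest first
corr_with_target = toy.corr()["SalePrice"].drop("SalePrice")
corr_with_target = corr_with_target.sort_values(ascending=False)

# Keep features whose absolute correlation exceeds the (assumed) cut-off
selected = corr_with_target[corr_with_target.abs() > 0.5].index.tolist()
print(corr_with_target)
print("Selected:", selected)
```

A correlation cut-off only captures linear, pairwise relationships, so it is a starting point for the visual checks that follow rather than a replacement for them.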

The main issue in this problem is choosing the right FEATURES, those related to the TARGET value, and NOT only defining complex relationships between them. We will discuss this in more depth in the next section.

# (a) SalePrice
- the property's sale price in dollars. This is the target variable that we are trying to predict.

Distribution plots are very useful to check how well a variable is distributed in the dataset. Let’s now produce a distribution plot using the ‘distplot’ combined with the 'boxplot' function to check the distribution of the ‘SalePrice’ variable in the dataset.

In [None]:
sns.set(style="ticks")
x = d_train['SalePrice']
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}, figsize=(12,7))

sns.boxplot(x=x, ax=ax_box)  # pass x by keyword; recent seaborn treats the first positional argument as data
sns.distplot(x, ax=ax_hist)
plt.axvline(x = x.mean(), c = 'red')
plt.axvline(x = x.median(), c = 'green')

ax_box.set(yticks=[])
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
plt.show()
print("Skewness: %f" % d_train['SalePrice'].skew())
print("Kurtosis: %f" % d_train['SalePrice'].kurt())

+ Red line in histogram indicates the mean of the SalePrice and the Green line indicates the median.
+ Looking at the kurtosis score, we can see that the distribution has a pronounced peak. However, looking at the skewness score, we can see that the SalePrice values deviate from the normal distribution.
+ We want our data to be as "normal" as possible. Many models, linear models in particular, tend to work better when the target is close to a NORMAL DISTRIBUTION.
+ In conclusion, this is a right-skewed distribution, also called a positively skewed distribution, because the tail is longer in the positive direction of the number line. A histogram is right skewed if its peak sits to the left and its tail stretches to the right.

Let's check the simplest way to correct the distribution of SalePrice: taking the logarithm of the value.

In [None]:
sns.set(style="ticks")

x = (np.log1p(d_train['SalePrice']))
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}, figsize=(12,7))

sns.boxplot(x=x, ax=ax_box)  # pass x by keyword; recent seaborn treats the first positional argument as data
sns.distplot(x, ax=ax_hist)
plt.axvline(x = x.mean(), c = 'red')
plt.axvline(x = x.median(), c = 'green')

ax_box.set(yticks=[])
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
plt.show()

We "feel" MUCH BETTER with the conversion of SalePrice to its LOGARITHM. It's NOT PERFECT yet, but this is one of the simplest ways to get closer to a NORMAL DISTRIBUTION, so we will apply it when we train the model in the modeling section.
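
To make the effect of the transformation concrete, here is a small sketch on synthetic right-skewed data (a log-normal sample standing in for SalePrice, which is an assumption). It compares the skewness before and after `np.log1p` and checks that `np.expm1` recovers the original values.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (hypothetical, not the real SalePrice)
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

# log1p(x) = log(1 + x) compresses the long right tail
log_prices = np.log1p(prices)

print("Skewness before:", prices.skew())
print("Skewness after: ", log_prices.skew())

# expm1 is the exact inverse of log1p, so predictions made on the
# log scale can be converted back to the original price scale
recovered = np.expm1(log_prices)
print("Round trip OK:", np.allclose(recovered, prices))
```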

# (b) OverallQual: 
Overall material and finish quality

In [None]:
fig = px.box(d_train, x="OverallQual", y="SalePrice")
fig.show()

As OverallQual increases, the price of the houses also increases. That makes sense.

In [None]:
yprop = 'SalePrice'
xprop = 'OverallQual'
h= 'LotArea'
px.scatter(d_train, x=xprop, y=yprop, color=h, marginal_y="violin", marginal_x="box", trendline="ols", template="simple_white")

# (c) OverallCond: 
Overall condition rating

In [None]:
yprop = 'SalePrice'
xprop = 'LotArea'
h= 'OverallCond'
px.scatter(d_train, x=xprop, y=yprop, color=h, marginal_y="violin", marginal_x="box", trendline="ols", template="simple_white")

In [None]:
d_train = d_train.drop(d_train[(d_train['SalePrice']>740000) & (d_train['SalePrice']<756000)].index).reset_index(drop=True)

# (d) TotalBsmtSF: 
Total square feet of basement area

Like a heatmap, a scatter plot is also used to observe relations between two variables in a dataset. In a scatter plot, the independent variable is usually marked on the x-axis and the dependent variable on the y-axis. In our case, the ‘SalePrice’ attribute is the dependent variable, and all the others are the independent variables.

In [None]:
fig = px.scatter(d_train, y="SalePrice", x="LotArea", size="SalePrice", color="TotalBsmtSF",
           hover_name="LotArea", log_x=True, log_y=True, size_max=20)
fig.show()

# (e) 1stFlrSF: 
First Floor square feet

In [None]:
fig = px.scatter(d_train, x="1stFlrSF", y="SalePrice", color="GarageCars", marginal_y="violin",
           marginal_x="box", trendline="ols", template="simple_white")
fig.show()

In [None]:
d_train = d_train.drop(d_train[(d_train['1stFlrSF']>4690) & (d_train['1stFlrSF']<4700)].index).reset_index(drop=True)

# (f) GrLivArea: 
Above grade (ground) living area square feet

In [None]:
sns.jointplot(data=d_train, x='GrLivArea', y='SalePrice', kind='reg', height=8)

As recommended by the author of the data, outliers in GrLivArea should be removed. The author stated that “I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these five unusual observations) before assigning it to students.” It makes sense that people would pay more for more living area. What doesn't make sense is the two data points in the bottom-right of the plot. We need to take care of this! We will remove these outliers manually.

In [None]:
d_train = d_train.drop(d_train[(d_train['GrLivArea']>4000) & (d_train['SalePrice']<250000)].index).reset_index(drop=True)

# (g) GarageCars: 
Size of garage in car capacity

In [None]:
fig = px.violin(d_train, y="SalePrice", x="GarageCars", color=None, box=True, points="all", hover_data=d_train.columns)
fig.show()

Surprising! 4-car garages result in a lower Sale Price? That doesn't make much sense. Let's remove these outliers.

In [None]:
d_train = d_train.drop(d_train[(d_train['GarageCars']>3) & (d_train['SalePrice']<290000)].index).reset_index(drop=True)

# (h) GarageArea: 
Size of garage in square feet

In [None]:
fig = px.scatter(d_train, x="GarageArea", y="SalePrice", color="OverallCond", marginal_y="violin",
           marginal_x="box", trendline="ols", template="simple_white")
fig.show()

Again, the two data points at the top and at the bottom stand out. Let's remove these outliers.

In [None]:
d_train = d_train.drop(d_train[(d_train['GarageArea']>1240) & (d_train['GarageArea']<1400)].index).reset_index(drop=True)

# (k) LotArea: 
Lot size in square feet and other variables.

In [None]:
plt.figure(figsize=[15,20])
features = ['LotArea','MSSubClass','OverallQual','OverallCond','ExterQual','ExterCond','BsmtQual','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC','KitchenQual']
n=1
# one boxplot of SalePrice per feature, arranged on a 6 x 2 grid
for f in features:
    plt.subplot(6,2,n)
    sns.boxplot(x=f,y='SalePrice',data = d_train)
    plt.title("Sale Price as a function of {}".format(f))
    n=n+1
plt.tight_layout()
plt.show()

+ As we can see from the plots above, many factors affect the price of a house: more square feet increases the price, and even the location influences it.
+ Now that we are familiar with these plots and can tell our own story, let us move on and create a model that predicts the price of a house based on the other factors.

# 4. Statistical (if any)

In this section, we will check some hypotheses about the influence of independent variables on the target parameter.
+ Each hypothesis is checked at a significance level of 5%.
+ Test statistics are calculated using ttest_ind from the scipy.stats library.
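
The code in this section calls ttest_ind with equal_var = False, which is Welch's two-sample t-test. For reference, the statistic and its approximate degrees of freedom can be written as below, where the symbols (sample means $\bar{x}_i$, sample variances $s_i^2$, sample sizes $n_i$) are notation introduced here:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
```

The null hypothesis of equal means is rejected at the 5% level when the two-sided p-value of $t$ under a Student distribution with $\nu$ degrees of freedom is below 0.05.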

In [None]:
from scipy.stats import ttest_ind

def Series_stats(var, category, prop1, prop2):
    # Step 1: State the null hypothesis (equal means) and the alternative hypothesis (different means),
    #         with a level of significance of 5% or 0.05
    # Step 2: Collect the data and calculate the test statistic (Welch's t-test, unequal variances)
    s1 = d_train[(d_train[category]==prop1)][var]
    s2 = d_train[(d_train[category]==prop2)][var]
    t, p = ttest_ind(s1,s2,equal_var = False)

    print("Two-sample t-test: t={}, p={}".format(round(t,5),p))
    # Step 3: Compare the p-value of the test statistic with the level of significance
    #         (p < 0.05 already implies |t| > 1.96, so one check is enough)
    if p < 0.05:
        print("\n REJECT the Null Hypothesis and state that: \n at 5% significance level, the mean {} of {}-{} and {}-{} are not equal.".format(var, prop1, category, prop2, category))
        print("\n YES, the {} of {}-{} differs significantly from {}-{} in the current dataset.".format(var, prop1, category, prop2, category))
    else:
        print("\n FAIL to Reject the Null Hypothesis and state that: \n at 5% significance level, we cannot conclude that the mean {} of {}-{} and {}-{} differ.".format(var, prop1, category, prop2, category))
        print("\n NO, the {} of {}-{} does NOT differ significantly from {}-{} in the current dataset.".format(var, prop1, category, prop2, category))
    print("\n The mean value of {} for {}-{} is {} and for {}-{} is {}".format(var, prop1, category, round(s1.mean(),2), prop2, category, round(s2.mean(),2)))

(a) Is the mean SalePrice of houses with an OverallQual of 1 equal to that of houses with an OverallQual of 10?

In [None]:
Series_stats('SalePrice','OverallQual',1,10)

(b) Is the mean SalePrice of houses with a LotArea (lot size in square feet) of 8450 equal to that of houses with a LotArea of 13175?

In [None]:
Series_stats('SalePrice','LotArea',8450,13175)

(c) Is the mean SalePrice equal for the two values of Street (type of road access), 'Pave' and 'Grvl'?

In [None]:
Series_stats('SalePrice','Street','Pave', 'Grvl')

# Section II: HOUSE PRICE MODEL

In this Kernel, we do not discuss the models' parameters in depth; we just apply the standard values or refer to previous recommendations. Let's copy the database.

# 1. Feature Selection, data handling

As we said before, in this process we are going to define the ‘X_train’ variable (independent variable) and the ‘y_train’ variable (dependent variable). After defining the variables, we will use them to split the data into a train set and test set. Splitting the data can be done using the ‘train_test_split’ function provided by scikit-learn in Python.

In [None]:
d_test.Functional.unique()

One of the most time-consuming operations in this Kernel is processing the data so that the house price prediction step can be performed on the test dataset. The test file contains many null or NaN values and categories in object variables that do not appear in the train file, and we got errors during label encoding or the final prediction step. That is why we proceed to the next step: checking the difference between the train set and the test set.

In [None]:
# For each object column, list the categories present in the test set but absent from the train set
cols=np.array(d_test.columns[d_test.dtypes == object])
rows = []
for fe in cols:
    missing_items = set(d_test[fe]).difference(d_train[fe])
    rows.append({'Feature':fe, 'Missing from Test to Train': len(missing_items), 'Items': missing_items})
# DataFrame.append was removed in pandas 2.0, so the table is built from a list of rows
Check = pd.DataFrame(rows, columns=['Feature','Missing from Test to Train', 'Items'])
Check

In the first step, all missing values in the object columns were replaced with the most common value in the column.

Now, check to confirm again, if there is any NaN or missing value in the datasets!

In [None]:
d_train.head(2)

In [None]:
d_train.isnull().sum()

In [None]:
d_test.isnull().sum()

Now, check again to make sure before going to the next step.

In [None]:
# Repeat the check: categories present in the test set but absent from the train set
cols=np.array(d_test.columns[d_test.dtypes == object])
rows = []
for fe in cols:
    missing_items = set(d_test[fe]).difference(d_train[fe])
    rows.append({'Feature':fe, 'Missing from Test to Train': len(missing_items), 'Items': missing_items})
# DataFrame.append was removed in pandas 2.0, so the table is built from a list of rows
Check = pd.DataFrame(rows, columns=['Feature','Missing from Test to Train', 'Items'])
Check

At this stage, we select 66 variables for the model. From the unbalanced columns noted earlier, 'LowQualFinSF', 'BsmtHalfBath', '3SsnPorch', 'PoolArea' and 'MiscVal' are removed, while 'BsmtFullBath', 'FullBath', 'HalfBath' and 'KitchenAbvGr' are kept in the lists below.

In [None]:
f_train = ['MSSubClass', 'MSZoning', 'LotArea', 'Street','LotShape', 'LandContour', 'Utilities', 'LotConfig',
           'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 
           'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
           'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
           'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 
           'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath', 
           'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType',
           'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 
           'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice']
f_test = ['MSSubClass', 'MSZoning', 'LotArea', 'Street','LotShape', 'LandContour', 'Utilities', 'LotConfig',
           'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 
           'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
           'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
           'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 
           'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath', 
           'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType',
           'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 
           'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']
df_train = pd.DataFrame(d_train, columns=f_train)
df_test = pd.DataFrame(d_test, columns=f_test)

In [None]:
from scipy.stats import norm, skew

numeric_feats = df_test.dtypes[df_test.dtypes != 'object'].index
skewed_feats = df_test[numeric_feats].apply(lambda x: skew(x)).sort_values(ascending=False)
high_skew = skewed_feats[abs(skewed_feats) > 1]
high_skew

In [None]:
for feature in high_skew.index:
    df_train[feature] = np.log1p(df_train[feature])
    df_test[feature] = np.log1p(df_test[feature])

Take a deepcopy on both full datasets, and then map all objects columns by applying the map(str) function.

In [None]:
import copy
train=copy.deepcopy(df_train)
test=copy.deepcopy(df_test)

cols=np.array(df_train.columns[df_train.dtypes != object])
for i in train.columns:
    if i not in cols:
        train[i]=train[i].map(str)
        test[i]=test[i].map(str)
train.drop(columns=cols,inplace=True)
test.drop(columns=np.delete(cols,len(cols)-1),inplace=True)

# 2. Data Labeling

As you might know by this step, we can’t have text in our data if we’re going to run any kind of model on it. So before we can run a model, we need to convert this categorical text data into model-understandable "numerical data". For this we use the LabelEncoder class.
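
The sketch below illustrates, on a made-up column, why categories that appear only in the test set cause errors with LabelEncoder. It also shows one possible alternative, scikit-learn's OrdinalEncoder with an explicit code for unknown categories; this is a swapped-in technique for illustration, not the approach used in this Kernel, and the toy values are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy categorical column (hypothetical values); "Sev" never appears in train
train_col = pd.Series(["Typ", "Min1", "Typ", "Mod"], name="Functional")
test_col = pd.Series(["Typ", "Sev"], name="Functional")

# LabelEncoder learns an integer code for every category seen during fit ...
le = LabelEncoder().fit(train_col)
print(dict(zip(le.classes_, le.transform(le.classes_))))

# ... and raises a ValueError on a category it has never seen
try:
    le.transform(test_col)
except ValueError as err:
    print("LabelEncoder failed:", err)

# OrdinalEncoder can map unseen categories to a reserved code instead
oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
oe.fit(train_col.to_frame())
encoded_test = oe.transform(test_col.to_frame()).ravel()
print(encoded_test)
```

Both encoders assign codes in sorted category order, so the integers carry no meaning beyond identifying the category.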

In [None]:
df_train.head(3)

In [None]:
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict

# build dictionary function
cols = np.array(df_train.columns[df_train.dtypes != object])
d    = defaultdict(LabelEncoder)

# only for categorical columns apply dictionary by calling fit_transform 
train = train.apply(lambda x: d[x.name].fit_transform(x))
test  = test.apply(lambda x: d[x.name].transform(x))
train[cols] = df_train[cols]
test[np.delete(cols,len(cols)-1)]=df_test[np.delete(cols,len(cols)-1)]

Now, let's see the final result of our data processing!

In [None]:
train.head(2)

In [None]:
test.head(2)

In [None]:
test['YrBltAndRemod']=test['YearBuilt']+test['YearRemodAdd']
test['TotalSF']=test['TotalBsmtSF'] + test['1stFlrSF'] + test['2ndFlrSF']
test['Total_sqr_footage'] = (test['BsmtFinSF1'] + test['1stFlrSF'] + test['2ndFlrSF'])
test['Total_Bathrooms'] = (test['FullBath'] + (0.5 * test['HalfBath']) +test['BsmtFullBath'] )
test['Total_porch_sf'] = (test['OpenPorchSF'] + test['EnclosedPorch'] +test['WoodDeckSF'])

train['YrBltAndRemod']=train['YearBuilt']+train['YearRemodAdd']
train['TotalSF']=train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']
train['Total_sqr_footage'] = (train['BsmtFinSF1']  +train['1stFlrSF'] + train['2ndFlrSF'])
train['Total_Bathrooms'] = (train['FullBath'] + (0.5 * train['HalfBath']) +train['BsmtFullBath'] )
train['Total_porch_sf'] = (train['OpenPorchSF'] + train['EnclosedPorch'] +train['WoodDeckSF'])

# 3. Data Splitting: Training and Testing

We split our dataset into training and testing data. At school we learned that this ratio should be 70:30 or 80:20; in the code below, the models are compared on a 75:25 split, and the final models are fitted on almost all of the data. The rows are picked at random, which keeps the training data and testing data balanced with respect to the whole dataset. Evaluating on held-out data lets us detect overfitting and judge generalization. Finally, the model is trained on the features selected above.

But we have to import the libraries for this section first!

In [None]:
from sklearn import metrics
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor, RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, mean_absolute_error
# plot_confusion_matrix is not imported: it was removed in scikit-learn 1.2 and is not needed for regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost.sklearn import XGBRegressor
from catboost import CatBoostRegressor
import xgboost as xgb
import lightgbm as lgb

Define the Errors function, which helps to calculate the scores of each model: the R² score returned by model.score on the training and testing sets, the RMSE and the MSE.

In [None]:
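def Errors(model, X_train, y_train, X_test, y_test):
    # model.score returns the coefficient of determination R^2 for regressors
    ATrS = model.score(X_train,y_train)
    ATeS = model.score(X_test,y_test)
    # predict inside the function so that the errors always belong to the model passed in
    y_pred = model.predict(X_test)
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    return ATrS, ATeS, RMSE, MSE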
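
As a quick check of what these quantities mean, the sketch below fits a simple linear model on synthetic data (an assumption, chosen only to keep the example small) and verifies that `score` is the R² value and that the RMSE is the square root of the MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X_toy, y_toy)
pred = model.predict(X_toy)

# For regressors, model.score is the coefficient of determination R^2
r2 = r2_score(y_toy, pred)
print("score == R^2:", np.isclose(model.score(X_toy, y_toy), r2))

# RMSE is the square root of MSE and is expressed in the units of the target
mse = mean_squared_error(y_toy, pred)
rmse = np.sqrt(mse)
print("MSE:", mse, "RMSE:", rmse)
```

Because the target in this Kernel is log-transformed, the RMSE reported by the models is an error on the log scale, not in dollars.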

# Baseline Models (Regressor)

We use train data and test data: train data to train our model and test data to see whether it has learnt the data well or not.

And DO NOT forget to fix the "skewed" features. Here, we transform the skewed data to be closer to normal so that our models will be more accurate when making predictions: HOPEFULLY :)

We also create a DataFrame to store all the calculation results, including model name and errors.

In [None]:
train.isnull().sum()

In [None]:
X = train.drop(columns=['SalePrice']).values
y = np.log1p(train["SalePrice"])
Z = test.values

scaler = preprocessing.StandardScaler().fit(X)
# transform returns a new array, so the result has to be assigned back
X = scaler.transform(X)
Z = scaler.transform(Z)
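
A small sketch on made-up numbers shows how StandardScaler behaves: `transform` returns a new, standardized array and leaves its input untouched, so its result has to be stored to have any effect.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (hypothetical values)
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

scaler = StandardScaler().fit(X_toy)

# transform returns a new array; X_toy itself is not modified
X_scaled = scaler.transform(X_toy)

print("Original is unchanged:", X_toy[0])
print("Column means after scaling:", X_scaled.mean(axis=0))
print("Column std after scaling:  ", X_scaled.std(axis=0))
```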

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.00001, random_state = 12)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size = 0.25, random_state = 12)
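
To see what the two `test_size` values above do, here is a sketch on a toy array of 100 rows (the array is an assumption; only the `train_test_split` arguments mirror the cell above).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features (hypothetical)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# test_size=0.25 holds out a quarter of the rows for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=12)
print(X_tr.shape, X_te.shape)

# A tiny test_size is rounded up to a single held-out row,
# so practically all rows end up in the training part
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.00001, random_state=12)
print(X_tr2.shape, X_te2.shape)
```

The second split is therefore a way of getting "almost the full data set" for the final fit, while the first one is the split used to compare models.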

In [None]:
regressors = [['DecisionTreeRegressor',DecisionTreeRegressor()],
              ['XGBRegressor', XGBRegressor()],
              ['CatBoostRegressor', CatBoostRegressor(verbose= False)],
              ['LGBMRegressor',lgb.LGBMRegressor()],
              ['GradientBoostingRegressor',GradientBoostingRegressor()],
              ['ExtraTreesRegressor',ExtraTreesRegressor()]]

results = []
for name, model in regressors:
    model.fit(X_train1,y_train1)
    y_pred = model.predict(X_test1)
    ATrS, ATeS, RMSE, MSE = Errors(model, X_train1, y_train1, X_test1, y_test1)
    results.append({'model':name, 'Root Mean Squared  Error': RMSE,'Accuracy on Traing set':ATrS,'Accuracy on Testing set':ATeS, 'Mean square error':MSE})

# DataFrame.append was removed in pandas 2.0, so the table is built from a list of rows
Acc = pd.DataFrame(results, columns=['model','Root Mean Squared  Error','Accuracy on Traing set','Accuracy on Testing set', 'Mean square error'])
Acc.sort_values(by='Mean square error')

So now we have train data and test data. After fitting different models to our data we can check their scores, and the prediction score is MUCH LOWER than our aim of 90%. So how do we reach that target?

In this Kernel, we use a different method, which is very important for weak prediction models such as these. It might seem a bit advanced, but once understood it is a really brilliant tool for better predictions.

For building a prediction model, many experts use Gradient Boosting regression, CatBoostRegressor, ... and we will check these models in the next section.

For illustration purposes, we define a function to compare the actual and predicted SalePrice on the same graph.

In [None]:
def Graph_prediction(n, y_actual, y_predicted):
    y = np.expm1(y_actual)  # inverse of the log1p applied to SalePrice
    y_total = np.expm1(y_predicted)
    number = n
    aa=[x for x in range(number)]
    plt.figure(figsize=(25,10)) 
    plt.plot(aa, y[:number], '.', label="actual")
    plt.plot(aa, y_total[:number], 'o', label="prediction")
    plt.xlabel('SalePrice prediction of first {} Houses'.format(number), size=15)
    plt.legend(fontsize=15)
    plt.show()

# 4. Model's Parameters tuning

Hyperparameter tuning has to do with setting the values of parameters that the algorithm cannot learn on its own. As such, these are constants that you set as the researcher. The problem is that you are not any better than the computer at knowing where to set these values. Therefore, the commonly used process is to have the algorithm try several combinations of values until it finds the ones that are best for the model. Having said this, there are several hyperparameters we need to tune, and they are as follows.

+ number of estimators: The number of estimators is how many trees to create. The more trees, the more likely the model is to overfit.
+ learning rate: The learning rate is the weight that each tree has on the final prediction.
+ subsample: Subsample is the proportion of the training sample used to fit each tree.
+ max depth: Max depth is the maximum depth of each individual tree.
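
The roles of the number of estimators and the learning rate can be read from the usual additive form of gradient boosting. The notation here is introduced for illustration: $F_m$ is the model after $m$ trees, $h_m$ is the $m$-th tree, $\nu$ is the learning rate and $M$ is the number of estimators.

```latex
F_0(x) = \text{constant initial prediction}, \qquad
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \quad m = 1, \dots, M, \quad 0 < \nu \le 1
```

A smaller $\nu$ shrinks the contribution of every tree, which is why a low learning rate is usually paired with a large $M$, as in the settings used below.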

What we will do now is make an instance of the GradientBoostingRegressor. Next, we will create our grid with the various values for the hyperparameters. We will then take this grid and place it inside GridSearchCV function so that we can prepare to run our model.

In [None]:
GBR = GradientBoostingRegressor(n_estimators=8000, learning_rate=0.003, max_depth=4, max_features='sqrt', min_samples_leaf=10,
                                min_samples_split=5, loss='huber', random_state =42)  

GBR.fit(X_train,y_train)
ATrS, ATeS, RMSE, MSE = Errors(GBR, X_train1, y_train1, X_test1, y_test1)
print("Root Mean Squared: {}, Accuracy Train set: {},Accuracy Test set: {}, Mean square error: {}".format(RMSE, ATrS, ATeS, MSE))
result1 = GBR.predict(Z)

In [None]:
from sklearn.model_selection import GridSearchCV
gbr = GradientBoostingRegressor()
params = {'loss': ['squared_error','huber'],  # 'ls' was renamed to 'squared_error' in scikit-learn 1.0
          'learning_rate': [0.01, 0.012, 0.015], 
          'max_depth': [2, 3, 4], 
          'min_samples_leaf' : [9, 10, 12],
          'min_samples_split' : [2, 3, 4]}
#gs = GridSearchCV(estimator = gbr, param_grid = params, scoring = 'explained_variance', cv = 10, n_jobs = -1)
#gs.fit(X_train,y_train)
#print("Best Score:", gs.best_score_)
#print("Best Parameters :",gs.best_params_)

In [None]:
from catboost import CatBoostRegressor
import numpy as np

train_data = X_train
train_labels = y_train

model = CatBoostRegressor()

grid = {'iterations': [7000, 8000],'learning_rate': [0.001, 0.0045, 0.01, 0.1],
        'depth': [2, 3, 4],'l2_leaf_reg': [1, 2],'random_seed': [12]}

#grid_search_result = model.grid_search(grid, X=train_data, y=train_labels, plot=True)

In [None]:
params = {'iterations': 12000,'learning_rate': 0.008,'depth': 6,'l2_leaf_reg': 2,'eval_metric':'RMSE',
          'verbose': False,'random_seed': 12}
         
CBR = CatBoostRegressor(**params)
CBR.fit(X_train,y_train)

ATrS, ATeS, RMSE, MSE = Errors(CBR, X_train1, y_train1, X_test1, y_test1)
print("Root Mean Squared: {}, Accuracy Train set: {},Accuracy Test set: {}, Mean square error: {}".format(RMSE, ATrS, ATeS, MSE))
result2 = CBR.predict(Z)

In [None]:
import xgboost as xgb
from bayes_opt import BayesianOptimization
from sklearn.metrics import mean_squared_error

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)

def xgb_evaluate(learning_rate, max_depth, gamma, subsample, colsample_bytree, reg_alpha):
    params = {'learning_rate':learning_rate,
              'max_depth': int(max_depth),
              'gamma': gamma,
              'subsample':subsample,
              'colsample_bytree': colsample_bytree,
              'reg_alpha':reg_alpha}
    cv_result = xgb.cv(params, dtrain, num_boost_round=100, nfold=3)    
    
    # Bayesian optimization only knows how to maximize, not minimize, so return the negative RMSE
    return -1.0 * cv_result['test-rmse-mean'].iloc[-1]

model = xgb.XGBRegressor()
optimizer = BayesianOptimization(xgb_evaluate, {'learning_rate':(0.005, 0.03),
                                                'max_depth':(2, 4),
                                                'gamma':(0., 0.3),
                                                'subsample':(0.5,1),
                                                'colsample_bytree':(0.3,0.8),
                                                'reg_alpha':(0.005, 0.02)})
# Use the expected improvement acquisition function to handle negative numbers
# Optimally needs quite a few more initiation points and number of iterations
#optimizer.maximize(init_points=5, n_iter=15)
#optimizer.max

In [None]:
XGBR = xgb.XGBRegressor(colsample_bytree=0.46, gamma=0.047, learning_rate=0.05, max_depth=4, min_child_weight=1.8, 
                 n_estimators=5000,reg_alpha=0.46, reg_lambda=0.85,subsample=0.52, random_state = 7, nthread = -1)

XGBR.fit(X_train,y_train)
    
ATrS, ATeS, RMSE, MSE = Errors(XGBR, X_train1, y_train1, X_test1, y_test1)
print("Root Mean Squared: {}, Accuracy Train set: {},Accuracy Test set: {}, Mean square error: {}".format(RMSE, ATrS, ATeS, MSE))
result3 = XGBR.predict(Z)

In [None]:
LGBMR = lgb.LGBMRegressor(objective='regression', num_leaves=5,learning_rate=0.01, n_estimators=4000,max_bin=200, 
                         bagging_fraction=0.8,bagging_freq=4, bagging_seed=8,feature_fraction=0.2,feature_fraction_seed=10,
                         min_sum_hessian_in_leaf = 15,verbose=-1,random_state=12)
LGBMR.fit(X_train,y_train)
    
ATrS, ATeS, RMSE, MSE = Errors(LGBMR, X_train1, y_train1, X_test1, y_test1)
print("Root Mean Squared: {}, Accuracy Train set: {},Accuracy Test set: {}, Mean square error: {}".format(RMSE, ATrS, ATeS, MSE))
result4 = LGBMR.predict(Z)

# 6. Submission

In [None]:
Graph_prediction(300, y_train, GBR.predict(X_train))

In [None]:
result = np.expm1((result1 + result2 + result3 + result4)/4)
sub = pd.DataFrame({'Id': d_test.Id, 'SalePrice': result})
sub.to_csv('submission.csv',index=False)
sub.head(3)
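
The blending step above averages the four predictions before applying `np.expm1`. Assuming the models were trained on `log1p(SalePrice)`, as the `np.expm1` call suggests, this amounts to a geometric-mean style blend of the prices rather than a plain arithmetic mean. The small sketch below checks this on made-up numbers; the values in `log_preds` are hypothetical and are not outputs of the models in this Kernel.

```python
import numpy as np

# Hypothetical log1p(price) predictions: 4 models (rows) x 2 houses (columns).
log_preds = np.array([
    [12.00, 11.50],
    [12.10, 11.40],
    [11.90, 11.60],
    [12.00, 11.50],
])

# Blend as in the submission cell: average in log space, then invert log1p.
blend_price = np.expm1(log_preds.mean(axis=0))

# Equivalent view: geometric mean of (price + 1) across models, minus 1.
prices = np.expm1(log_preds)
geo_blend = np.exp(np.log1p(prices).mean(axis=0)) - 1

# The arithmetic mean of the prices is never smaller than this blend.
arith_blend = prices.mean(axis=0)
print(blend_price, geo_blend, arith_blend)
```

Averaging in log space makes the blend less sensitive to a single model that predicts a very high price.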

# 7. Conclusion

This Kernel investigates different models for housing price prediction. Several Machine Learning methods, including GradientBoostingRegressor, CatBoostRegressor, XGBoost and LightGBM, and two tuning techniques (GridSearchCV and BayesianOptimization) are compared and analyzed for optimal solutions. Even though all of these methods achieved desirable results, each model has its own pros and cons.

The GradientBoostingRegressor is probably the best one and has been selected for this problem. The BayesianOptimization method is simple but performs a lot better than the three other available methods due to its generalization.

Finally, the CatBoostRegressor is the best choice when parameterization is the top priority.