# House Prices Analysis

Thanks for coming to my notebook. I will be very interested in your comments or corrections.

The notebook is divided into the following parts:
   1. Libraries and data
   
      We import the libraries used in this notebook and load the data. For data cleaning, the train and test sets are combined, keeping an auxiliary variable to tell them apart later.
      
   2. Data cleaning
   
      The longest stage of the investigation, focused on cleaning the data and understanding its peculiarities. We define numeric and categorical variables, and split the categorical ones into nominal and ordinal subgroups. We add one interaction and fill empty entries. Finally, we transform nominal variables into dummy variables.
      
   3. Data analysis
   
      We scale the numeric variables, then divide the data set back into train and test for the next stages. We analyse the moments and distribution of the target variable. Next, we run a correlation analysis and apply two feature selection methods: F statistic and LASSO. At this stage we assume linearity.
   
   4. Estimation
   
      Some basic methods are explored: Generalised Linear Model, Random Forest and Extreme Gradient Boosting. We figure out which method offers the best prediction at this stage and fine-tune it.

# **1. Libraries and data**

First, we define the basic libraries and the data source. We use a popular data set containing house price data. The set is small (<1 MB), but the data quality seems good, so it makes an attractive toy example.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing

from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.linear_model import LogisticRegression,TweedieRegressor
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from sklearn.ensemble import RandomForestRegressor

from scipy.stats import variation
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.preprocessing import minmax_scaling
import math
from xgboost import XGBRegressor

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
RandState = 100

Define the train and test data sets. The data set we import is already divided into these two parts.

In [None]:
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv', index_col='Id')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv', index_col='Id')

In [None]:
train.head()

In [None]:
test.head()

As we want to apply the same data cleaning and preprocessing to both, we temporarily combine them.

In [None]:
train['IsTrain']  = 1
test['IsTrain']  = 0

DataRaw = pd.concat([train, test])

In [None]:
DataRaw.head()
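The combine-then-split idea can be sketched on a tiny example. The frames and values below are hypothetical stand-ins, not the real data; the point is only that the "IsTrain" flag lets us recover both parts after shared cleaning.

```python
import pandas as pd

# Hypothetical miniature versions of the train and test frames
toy_train = pd.DataFrame(
    {"LotArea": [8450, 9600], "SalePrice": [208500, 181500]},
    index=pd.Index([1, 2], name="Id"),
)
toy_test = pd.DataFrame({"LotArea": [11622]}, index=pd.Index([3], name="Id"))

# Flag the origin of each row before stacking
toy_train["IsTrain"] = 1
toy_test["IsTrain"] = 0
combined = pd.concat([toy_train, toy_test])

# ... shared cleaning would happen here ...

# Recover the two parts from the flag
recovered_train = combined[combined["IsTrain"] == 1].drop(columns="IsTrain")
recovered_test = combined[combined["IsTrain"] == 0].drop(columns=["IsTrain", "SalePrice"])
```

Note that the target column is filled with NaN for the test rows after concatenation, which is why it has to be dropped again on the test side.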

We have 80 columns for test and 81 for training. "Id" is used as the index; we keep it but of course do not use it for estimation. The difference in the number of columns is the variable "SalePrice", our target.

First, we look at some global measures of our data set. Among them:
* count - number of non-empty entries
* mean - average value
* std - standard deviation
* min, max - lowest and highest value of the variable
* 25%, 50%, 75% - quartiles (the 50% one is the median)

In [None]:
DataRaw.describe()

In [None]:
print("Number of features not counting the target:" + str(len(DataRaw.columns) - 1 - 1 )) #First "-1" is our target; second "-1" is "IsTrain" binary factor

# **2. Data cleaning**

Define the variable categories. First we focus on numeric variables, then on categorical ones. The goal of this stage is to check the quality of the data: which values are missing, but also which values do not make sense (for example, a year of construction > 2020).

In [None]:
C = (DataRaw.dtypes == 'object')
CategoricalVariables = list(C[C].index)

print(CategoricalVariables)
print("")
print("The number of categorical variables:" + str(len(CategoricalVariables)))

In [None]:
Integer = (DataRaw.dtypes == 'int64') 
Float   = (DataRaw.dtypes == 'float64') 
NumericVariables = list(Integer[Integer].index) + list(Float[Float].index)

print(NumericVariables)
print("")
print("The number of numeric variables:" + str(len(NumericVariables)))

Ok, we see that we have 43 categorical variables and 36 numeric ones (the numeric list printed above also contains "SalePrice" and "IsTrain", which are not features). It makes sense, since we checked before that, not counting ID, we should find 79 of them.

Now, we will investigate the quality of the given data. First, let's check how many entries are missing.

In [None]:
Missing_Percentage = DataRaw.isnull().sum().sum()/np.prod(DataRaw.shape)*100 # np.prod: np.product was removed in NumPy 2.0

print("The percentage of missing entries: " + str(round(Missing_Percentage,2)) + " %")

About 6% is quite a good score, so most variables should be usable.

Let's look at missing values per variable, starting with numeric features, as they usually play a decisive role in modeling.

In [None]:
Numeric_NaN = DataRaw[NumericVariables].isnull().sum()
RowsCount = len(DataRaw.index)

print("The percentage number of missing entries per variable: ", format(round(Numeric_NaN/RowsCount * 100)) )
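As a small illustration of the two missing-value measures used above, here is a sketch on a hypothetical four-row frame. It uses `isnull().mean()`, which gives the share of missing entries directly and is equivalent to dividing the count by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one column with gaps, one complete
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Share of missing entries per column, in percent
missing_per_column = toy.isnull().mean() * 100

# Share of missing entries over the whole frame, in percent
missing_overall = toy.isnull().to_numpy().mean() * 100
```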

So for all of them the number of missing entries is reasonable. The biggest value is observed for LotFrontage (17%), which is still acceptable. Still, we don't want to leave these entries empty, so we apply imputation: a simple method which fills empty entries with 'imputed' values. Here we use the median for "GarageYrBlt" and "LotFrontage", and 0 for the remaining numeric variables. For "GarageYrBlt" some records have values around 2200; all these futuristic garages which came from the future will be replaced with the median.

In [None]:
CleanedNumeric = DataRaw[NumericVariables].copy() # copy, so that we do not modify a view of DataRaw

CleanedNumeric['GarageYrBlt']=CleanedNumeric['GarageYrBlt'].fillna(CleanedNumeric['GarageYrBlt'].median())
# .loc avoids chained assignment, which may silently fail to update the frame
CleanedNumeric.loc[CleanedNumeric['GarageYrBlt'] > 2020, 'GarageYrBlt']=CleanedNumeric['GarageYrBlt'].median()
CleanedNumeric['LotFrontage']=CleanedNumeric['LotFrontage'].fillna(CleanedNumeric['LotFrontage'].median())
CleanedNumeric=CleanedNumeric.fillna(0)

CleanedNumeric.head()
CleanedNumeric.describe()
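The median imputation and the outlier correction can be sketched on a toy column. This is a slight variation of the cell above, under the assumption that we prefer to compute the median from plausible years only, so that the outlier itself does not influence it; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value and one "futuristic" year
toy = pd.DataFrame({"GarageYrBlt": [1990.0, np.nan, 2005.0, 2207.0, 1975.0]})

# Median of the plausible values only (NaN compares as False and is skipped)
plausible = toy.loc[toy["GarageYrBlt"] <= 2020, "GarageYrBlt"]
median_year = plausible.median()

# Fill the gaps, then overwrite implausible years, both with the same median
toy["GarageYrBlt"] = toy["GarageYrBlt"].fillna(median_year)
toy.loc[toy["GarageYrBlt"] > 2020, "GarageYrBlt"] = median_year
```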

We will investigate which variables can be useful. Namely, we look at their variance: a variable whose whole exposure sits in just one level can't be very helpful for this analysis. Why? Let's assume that we have a variable "Apartment Type" in data from the center of Tokyo. Almost all the records will fall into the level 'one-storey flat', and the remaining categories like 'detached house', 'mansion' etc. will be nearly empty.

In the previous part, we used "DataRaw.describe()" to find the standard deviation of our variables. It is definitely useful information, but the disadvantage of the standard deviation is that it has to be read in proportion to the mean. Namely, it is very hard to say whether, for example, 'std = 1000' is large or not without analysing the particular variable. For this we will use another simple measure, the coefficient of variation (standard deviation divided by the mean), which also takes the scale of the variable into account.

In [None]:
CoefVar = pd.DataFrame(variation(CleanedNumeric),index=NumericVariables,columns=['CoefVar']).sort_values(by=['CoefVar'])

CoefVar
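To make the coefficient of variation concrete, here is a minimal sketch with two made-up arrays: one that is almost always zero and one that looks like a year column. The helper `coef_var` is a hypothetical name; it should agree with `scipy.stats.variation`, which uses the population standard deviation by default.

```python
import numpy as np
from scipy.stats import variation

# Hypothetical examples: a sparse variable and a year-like variable
values_sparse = np.array([0.0, 0.0, 0.0, 0.0, 100.0])
values_years = np.array([2006.0, 2007.0, 2008.0, 2009.0, 2010.0])

def coef_var(x):
    # Population standard deviation divided by the mean
    return np.std(x) / np.mean(x)

cv_sparse = coef_var(values_sparse)
cv_years = coef_var(values_years)
```

The sparse array has a mean close to zero compared with its spread, so its coefficient is large, while the year-like array has a tiny coefficient even though it may still be informative.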

First, we look at the variables with the lowest coefficient of variation, starting with "YrSold".

In [None]:
sns.histplot(CleanedNumeric['YrSold'], kde=False) # histplot replaces the deprecated distplot

In this case, obviously all dispersion measures are low, because a difference of 4-6 years is small in comparison to ~2000. However, the variable is of course useful, as we are not interested in the last 2000 years, but just the last 5. We repeat this practice for the next variables.

In [None]:
def PlotDist(NameOfVar):
    # histplot replaces the deprecated distplot
    sns.histplot(CleanedNumeric[NameOfVar], kde=False)

# All five histograms are drawn on the same axes
for NameOfVar in ['PoolArea', 'MiscVal', 'LowQualFinSF', '3SsnPorch', 'BsmtHalfBath']:
    PlotDist(NameOfVar)

The five variables above were listed starting from "PoolArea". They have a huge coefficient of variation: as we can see on the above graph, almost all values remain at 0, so the mean is close to 0, which leads to a huge value of this dispersion measure. Features with a coefficient of variation higher than, let's say, 2 are highly negligible. It is important information for future choices; for this analysis we keep them in scope.

Let's say that all this work should resolve the numeric problems. 
**Now, we will look at categorical variables.** The idea will be very similar, so we start with the NaNs.

In [None]:
Categorical_NaN = DataRaw[CategoricalVariables].isnull().sum()
RowsCount = len(DataRaw.index)

print("The percentage number of missing entries per variable: ", format(round(Categorical_NaN/RowsCount * 100)) )

The situation is interesting. For some variables many entries are missing. Investigating it a bit, we observe that the variables with a high number of missing entries describe features found mostly in more luxurious houses. For example: "Alley" gives the type of alley access to the property, "FireplaceQu" describes the fireplace quality, and "MiscFeature" covers other features which, according to the description, can be a tennis court, an elevator (!) etc. All in all, as these variables have a significant number of empty entries, we will drop them, but first use them to create an ordinal variable as an interaction between them.

In [None]:
LuxuriousCategoricalVariables = ['Alley','FireplaceQu','PoolQC','Fence','MiscFeature']

CategoricalVariables = [x for x in CategoricalVariables if x not in LuxuriousCategoricalVariables]

print(CategoricalVariables)
print(LuxuriousCategoricalVariables)

For all categorical variables which don't belong to the 'luxurious' class, we correct NAs by imputing the level "Unknown".

In [None]:
CleanedCategorical= DataRaw[CategoricalVariables].fillna('Unknown')

Let's define our 'luxurious' interaction. The idea behind it is simple: the more features you have from the list of 5 fancy thingies, the higher your score. With the formula below, a house with none of them receives 1; if you have only a fireplace but nothing else, you receive 2; if you have an elevator, a fireplace and an alley, you receive 4, etc.

In [None]:
LuxuriousCategorical = DataRaw[LuxuriousCategoricalVariables]

LuxuriousCategorical = pd.concat([LuxuriousCategorical, pd.DataFrame(DataRaw[LuxuriousCategoricalVariables].isnull().sum(axis = 1),
                                                                     columns=['Luxurious_Features'])], axis=1,sort=False)

#isnull() counted the NaNs (missing features), so we invert it to make it more intuitive: 1 = no features, 6 = all five
LuxuriousCategorical['Luxurious_Features']=-LuxuriousCategorical['Luxurious_Features']+6 

LuxuriousCategorical.head()
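The same score can be written by counting the features that are present instead of inverting the count of missing ones. The sketch below uses three made-up houses; `present` and `score` are hypothetical names, and the last assertion-style comparison shows that `present + 1` equals the `-NaN count + 6` formula used above.

```python
import numpy as np
import pandas as pd

lux_cols = ["Alley", "FireplaceQu", "PoolQC", "Fence", "MiscFeature"]

# Hypothetical houses: none, one and three luxurious features
toy = pd.DataFrame(
    [
        [np.nan, np.nan, np.nan, np.nan, np.nan],
        [np.nan, "Gd", np.nan, np.nan, np.nan],
        ["Pave", "TA", np.nan, np.nan, "Elev"],
    ],
    columns=lux_cols,
)

# Number of luxurious features actually present in each row
present = toy[lux_cols].notna().sum(axis=1)

# Shifted by one so that the lowest level is 1
score = present + 1

# The formula from the cell above, for comparison
score_from_nulls = -toy[lux_cols].isnull().sum(axis=1) + 6
```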

Let's now add the new interaction to our main categorical data set. In that way, we create an ordinal variable which is simply an interaction.

In [None]:
CleanedCategorical = pd.merge(CleanedCategorical,
                 LuxuriousCategorical['Luxurious_Features'],
                 on='Id')

CleanedCategorical.head()

Ok, now this set of categorical variables looks a bit cleaner. Let's do three things now:
    1. Investigate which variables can be transformed into ordinal ones
    2. Analyse the cardinality of the remaining categorical variables: simply check how many different levels they have. Categorical variables which can't be represented as an ordered list will be referred to as "nominal variables"
    3. Apply encoding to transform these categorical variables

It's a good idea to first investigate the variables which have "Qual" in their names, because this abbreviation refers to "Quality". In other words, we expect some levels to indicate lower quality and others higher quality, which enables us to order them. For this we print the unique levels as follows.

In [None]:
CleanedCategorical['ExterQual'].unique()


Indeed, our guide informs us that:
1. Ex - Excellent
2. Gd - Good
3. TA - Average/Typical
4. Fa - Fair
5. Po - Poor

Alright, on the basis of this we will make mappings. Honestly, these levels were not very intuitive; after mapping it should be simpler. For me, the higher the number, the better, so till the end of this notebook all positive levels will receive high numbers and bad ones low numbers (as below).

In [None]:
Quality_map  = {'Unknown':1, 'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5} # missing values were imputed as 'Unknown' above

CleanedCategorical['ExterQual'] = CleanedCategorical['ExterQual'].map(Quality_map)
CleanedCategorical['ExterCond'] = CleanedCategorical['ExterCond'].map(Quality_map)
CleanedCategorical['HeatingQC'] = CleanedCategorical['HeatingQC'].map(Quality_map)
CleanedCategorical['KitchenQual'] = CleanedCategorical['KitchenQual'].map(Quality_map)
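One thing worth keeping in mind with this kind of mapping: `Series.map` returns NaN for any label that is not a key of the dictionary. A minimal sketch on a made-up series shows how to list such labels before deciding how to fill them; sending them to level 1 here is an assumption that mirrors the lowest level used in this notebook.

```python
import pandas as pd

quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Hypothetical column containing one label that is not in the map
toy = pd.Series(["Gd", "TA", "Unknown", "Ex"])

mapped = toy.map(quality_map)

# Labels that were silently turned into NaN by the mapping
unmapped = toy[mapped.isna()].unique()

# Assumption: unmapped labels go to the lowest level
mapped_filled = mapped.fillna(1).astype(int)
```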

These were quite simple, let's look further. We notice the variable "BsmtQual", which corresponds to basement quality. And here the question arises: is it better to have no basement or a poor basement? For me it's still better to have any basement, even a poor one. In other words, in the rest of this mapping I will make some arbitrary decisions (btw I see that this basement topic is really a thing in this data).

In [None]:
Quality2_map  = {'Unknown':1,  'NA':1,'Po':2,'Fa':3,'TA':4,'Gd':5,'Ex':6}

CleanedCategorical['BsmtQual'] = CleanedCategorical['BsmtQual'].map(Quality2_map)
CleanedCategorical['BsmtCond'] = CleanedCategorical['BsmtCond'].map(Quality2_map)
CleanedCategorical['GarageQual'] = CleanedCategorical['GarageQual'].map(Quality2_map)
CleanedCategorical['GarageCond'] = CleanedCategorical['GarageCond'].map(Quality2_map)

Quality3_map  = {'Unknown':1, 'NA':1,'No':2,'Mn':3,'Av':4,'Gd':5}

CleanedCategorical['BsmtExposure'] = CleanedCategorical['BsmtExposure'].map(Quality3_map)

Quality4_map  = {'Unknown':1, 'NA':1,'Unf':2,'LwQ':3,'Rec':4,'BLQ':5,'ALQ':6,'GLQ':7} # ALQ sits between BLQ and GLQ

CleanedCategorical['BsmtFinType1'] = CleanedCategorical['BsmtFinType1'].map(Quality4_map)
CleanedCategorical['BsmtFinType2'] = CleanedCategorical['BsmtFinType2'].map(Quality4_map)

Quality5_map  = {'Unknown':1, 'Sal':1,'Sev':2,'Maj2':3,'Maj1':3,'Mod':4,'Min1':5,'Min2':5,'Typ':6}

CleanedCategorical['Functional'] = CleanedCategorical['Functional'].map(Quality5_map)

Quality6_map  = {'Unknown':1, 'NA':1,'Unf':2,'RFn':3,'Fin':4}

CleanedCategorical['GarageFinish'] = CleanedCategorical['GarageFinish'].map(Quality6_map)

OrdinalVariables = ['ExterQual','ExterCond','HeatingQC','KitchenQual','BsmtQual','BsmtCond',
                    'GarageQual','GarageCond','BsmtExposure','BsmtFinType1','BsmtFinType2',
                    'Functional','GarageFinish']

CleanedOrdinal = CleanedCategorical[OrdinalVariables]

#It's also the proper place to add our ordered interaction, the luxurious one
CleanedOrdinal = pd.merge(CleanedOrdinal,
                 LuxuriousCategorical['Luxurious_Features'],
                 on='Id')
OrdinalVariables = OrdinalVariables + ['Luxurious_Features']

#Any label missing from the maps above became NaN; send it to the lowest level
CleanedOrdinal= CleanedOrdinal[OrdinalVariables].fillna(1)

CleanedOrdinal.head()

In [None]:
print(CleanedOrdinal['BsmtQual'].loc[[18]])

Ordering is finished.

Now we will check cardinality.

In [None]:
NominalVariables = [x for x in CategoricalVariables if x not in OrdinalVariables]

AllLevelsPerVar = CleanedCategorical[NominalVariables].nunique()
AllLevels = CleanedCategorical[NominalVariables].nunique().sum()

print(AllLevelsPerVar)
print("Number of all levels coming from nominal variables: " + str(AllLevels))

We want to apply one-hot encoding, so we needed to check the number of levels in our data set. One-hot encoding produces so-called dummies: binary variables which each correspond to just one level of a particular variable. For example, for the variable "Apartment Type", a dummy feature equals 1 only if the record belongs to the given class. This method has a lot of advantages; for instance, you do not lose information. On the other hand, it may lead to a massive increase in the size of your data set.

In the case of this data set it is not a real danger, because we have a very small number of records and no variables with really a lot of levels.

In [None]:
CleanedCategoricalDummy = pd.get_dummies(CleanedCategorical[NominalVariables], columns=NominalVariables)

CleanedCategoricalDummy.head()
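A tiny sketch of one-hot encoding on the "Apartment Type" example from the text may help. The column and its levels are hypothetical; `dtype=int` is passed only so the dummies show up as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical nominal variable with three levels
toy = pd.DataFrame({"ApartmentType": ["flat", "detached", "flat", "mansion"]})

# One binary column per level
dummies = pd.get_dummies(toy, columns=["ApartmentType"], dtype=int)
```

Each row has exactly one dummy equal to 1, and the number of new columns equals the number of distinct levels, which is why the cardinality check above matters.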

Correct, 180 variables are present in the data set. Precisely, 180 dummy variables, in other words 180 binary features.

Let's make a big cross-over and combine:
* Numeric variables (38 variables, as at the beginning of the project)
* Ordinal variables (14 variables: 13 as transformations of 13 categorical ones + 1 interaction built from 5 features)
* Nominal variables (180 variables as the dummy transformation of 25 nominal variables)

In [None]:
CleanedTotal = pd.merge(CleanedNumeric,
                 CleanedOrdinal,
                 on='Id')

CleanedTotal = pd.merge(CleanedTotal,
                 CleanedCategoricalDummy,
                 on='Id')

CleanedTotal.head()

Alright, we cleaned the data. There are no missing values, ordinal changes were applied and dummy transformation was done for nominal variables. This is the end of the cleaning stage.

# **3. Data analysis**

In this part, we focus on the data analysis. In other words, we will check the dependencies between features, their one- and multi-dimensional behaviour. Furthermore, we will check which variables may be most useful and which ones are the least interesting.

An important topic when working with any data is scaling. This modification simply transforms numeric values into 'scaled' values; in other words, it puts them in the range from 0 to 1. If we plotted it, we would see that the shape of the distribution is unchanged, and only the numbers on the x axis differ.

Which variables should be scaled? Only the numeric ones, because nominal features are already binary (in their dummy form), and ordinal features have their own scale, which is acceptable.

In [None]:
Target = ['IsTrain','SalePrice']
AllVariables = list(CleanedTotal.columns) 
NumericVariablesNoTarget = [x for x in NumericVariables if x not in Target]
AllVariablesNoTarget = [x for x in AllVariables if x not in Target]

# Note: this is a reference to CleanedTotal, not a copy, so CleanedTotal gets scaled as well
ScaledCleanedTotal = CleanedTotal
ScaledCleanedTotal[NumericVariablesNoTarget] = minmax_scaling(CleanedTotal, columns=NumericVariablesNoTarget)

ScaledCleanedTotal.head()
#print(len(AllVariablesNoTarget)) = 230 = 232 - 2
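For reference, min-max scaling itself is just `(x - min) / (max - min)` per column. A minimal sketch in plain pandas, on made-up values, shows the transformation that a helper such as `minmax_scaling` is expected to perform:

```python
import pandas as pd

# Hypothetical numeric columns on very different scales
toy = pd.DataFrame({
    "GrLivArea": [800.0, 1500.0, 2200.0],
    "LotArea": [5000.0, 9000.0, 21000.0],
})

# Column-wise min-max scaling to the range [0, 1]
scaled = (toy - toy.min()) / (toy.max() - toy.min())
```

The ordering and relative spacing of the values inside each column are preserved; only the units change.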

It looks pretty.

Before we start anything, we have to split our data into two parts:
* training set
* test set

For this we will use the same split as defined in the original data.

In [None]:
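# Take the rows from the scaled frame and copy them, so later changes do not touch a view
DataTrain=ScaledCleanedTotal[ScaledCleanedTotal.IsTrain==1].copy()
DataTest=ScaledCleanedTotal[ScaledCleanedTotal.IsTrain==0].copy()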

All the data analysis will be done using only the train data. First, let's look at the sale price distribution.

In [None]:
sns.histplot(DataTrain['SalePrice'], kde=True, stat='density') # histplot replaces the deprecated distplot

Next, let's look at its moments. They give some information about our target. The first moment, the mean, is a raw moment and describes the central tendency, i.e. the expected value (EV). The second moment, the variance, is a central moment and remains the key dispersion measure (how much values differ from the EV). The third moment, skewness, *tells us to which side the distribution is skewed*. The fourth one, kurtosis, complements skewness with information about the tails.

In [None]:
TrainTargetMean = DataTrain['SalePrice'].mean()
TrainTargetVar = DataTrain['SalePrice'].var()
TrainTargetSkew = DataTrain['SalePrice'].skew()
TrainTargetKurt = DataTrain['SalePrice'].kurt()

print("Mean: " + str(round(TrainTargetMean)) + " with std: " + str(round(TrainTargetVar**(1/2))) + ", skewness: "
      + str(round(TrainTargetSkew,1))+ ", and kurtosis: "+ str(round(TrainTargetKurt,1)) +"."  )
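To get a feeling for how to read skewness and kurtosis, here is a sketch on synthetic data (not the real prices). It assumes a log-normal sample as a stand-in for a right-skewed price variable; note that pandas' `kurt()` reports excess kurtosis, which is around 0 for a normal distribution.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal), purely for illustration
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

# The logarithm of a log-normal sample is normally distributed
log_prices = np.log(prices)

skew_raw = prices.skew()        # clearly positive: long right tail
skew_log = log_prices.skew()    # close to 0: roughly symmetric
kurt_log = log_prices.kurt()    # excess kurtosis, close to 0 for a normal sample
```

A positive skewness, as for the raw sample here, means that the right tail (expensive houses) is longer than the left one.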

Having some information about the target, we choose 3 interesting numeric variables to analyse whether they are correlated in any way with our target.

In [None]:
plt.figure(figsize=(18, 3))

plt.subplot(131)
plt.scatter('OverallQual', 'SalePrice',  data=DataTrain)
plt.subplot(132)
plt.scatter('GrLivArea', 'SalePrice',  data=DataTrain)
plt.subplot(133)
plt.scatter('YrSold', 'SalePrice',  data=DataTrain)


We observe that the sale price is linearly correlated with the overall quality variable, which definitely makes sense. The same holds for the living area. What's interesting but not shocking, the sale price is not really correlated with the year of sale. Looking at all this, we are sure that we need a more global method to assess the correlation between the target and the data.

For this, let's use the most common method, the correlation matrix, which presents the strength of the linear relationship between factors.

In [None]:
CorrelationMatrix = DataTrain.corr()
fig, axe = plt.subplots(figsize=(15, 10))
sns.heatmap(CorrelationMatrix, vmax=.9, square=True);

Even though the graph is huge, it is not really informative. All in all, we have 232 variables.

In [None]:
VarNo = 15
TopCorrelatedColumns = CorrelationMatrix.nlargest(VarNo, 'SalePrice')['SalePrice'].index
Reduced = np.corrcoef(DataTrain[TopCorrelatedColumns].values.T)
fig, axe = plt.subplots(figsize=(15, 10))

sns.heatmap(Reduced, vmax=.9, square=True,yticklabels=TopCorrelatedColumns.values, xticklabels=TopCorrelatedColumns.values, annot=True, annot_kws={'size': 10});
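The cell above picks the largest signed correlations, so a strongly negative correlation would not be selected. A possible variation, sketched on synthetic data with made-up column names, is to rank features by the absolute value of their correlation with the target:

```python
import numpy as np
import pandas as pd

# Synthetic data: one informative feature and one pure-noise feature
rng = np.random.default_rng(1)
n = 200
quality = rng.integers(1, 11, size=n).astype(float)
toy = pd.DataFrame({
    "OverallQual": quality,
    "Noise": rng.normal(size=n),
    "SalePrice": 20000 * quality + rng.normal(scale=10000, size=n),
})

corr = toy.corr()

# Rank features by absolute correlation with the target, excluding the target itself
top = corr["SalePrice"].drop("SalePrice").abs().sort_values(ascending=False)
```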

The 15 strongest linear correlations between our target and the features are all positive (note that the selection above takes the largest signed values, so strong negative correlations would not appear here). In other words, growth in any of these features goes together with growth of our target. Does it make sense, looking at the variable names? For variables like 'overall quality', 'garage cars' or 'living area', definitely. Some of them, like 'Year Built', are a bit surprising, because we didn't really see it in our graph. The above matrix will be a relevant input for our further analysis.

We have cleaned and scaled data with the linear correlations computed. This is a good time for feature selection. The number of features to keep is set by the variable "VarNo".

In [None]:
# Number of features is coming from previous block (correlation matrix)
selector_F = SelectKBest(f_classif, k=VarNo)

# We do it on train data
Selected_F = selector_F.fit_transform(DataTrain[AllVariablesNoTarget], DataTrain['SalePrice'])

SelectedOrdered_F = pd.DataFrame(selector_F.inverse_transform(Selected_F), index=DataTrain.index, columns=AllVariablesNoTarget)

SelectedOrdered_F.head()

We can see that all dropped variables have entries set to 0. Let's focus only on non-zero ones.

In [None]:
SelectedVariables_F = list(SelectedOrdered_F.columns[SelectedOrdered_F.var() > 0])

# Get the valid dataset with the selected features.
DataTrain[SelectedVariables_F].head()
#print(DataTrain[SelectedVariables].shape) # 15 variables, 1460 records, alright

Ok, so using the F statistic we receive the 15 variables listed above. Reminder: this statistic assumes linearity, so the score might underestimate the relation between a feature and the target if the relationship is nonlinear.
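A side note on the scoring function: `f_classif` is designed for categorical targets, whereas scikit-learn also provides `f_regression` for continuous targets such as a price. The sketch below, on synthetic data with made-up column names, shows how the same `SelectKBest` selection would look with `f_regression`, and how `get_support()` returns the selected columns directly. It is an illustration only, not a rerun of the selection above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: two informative features and two noise features
rng = np.random.default_rng(2)
n = 300
X = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = 3.0 * X["signal_a"] - 2.0 * X["signal_b"] + rng.normal(scale=0.5, size=n)

# Univariate F test for a continuous target, keeping the 2 best features
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# Boolean mask of kept columns, turned into a list of names
selected = list(X.columns[selector.get_support()])
```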

What is the big disadvantage of the above method (besides the linearity assumption)? The F statistic takes into account only one feature at a time, so it doesn't find the globally best feature set. For this we have other methods, based on norm penalties, traditionally called regularizations. A great example of a very useful application of this mathematical concept is LASSO (L1 regularization), which allows finding an 'optimal' (linear) solution for the set of features. Let's investigate it.

In [None]:
L1_par = 0.22 # C is the inverse of the penalty strength: the lower the value, the bigger the penalty

#Define and fit the L1-penalised model
LogisReg = LogisticRegression(C=L1_par, penalty="l1", solver='liblinear', random_state=RandState).fit(DataTrain[AllVariablesNoTarget], DataTrain['SalePrice'])

#Wrap the fitted model in a selector
LASSO = SelectFromModel(LogisReg, prefit=True)

#Apply the selector to the data
LASSO_transform = LASSO.transform(DataTrain[AllVariablesNoTarget])

#Restructure the data
SelectedOrdered_LASSO = pd.DataFrame(LASSO.inverse_transform(LASSO_transform), index=DataTrain[AllVariablesNoTarget].index,columns=DataTrain[AllVariablesNoTarget].columns)

#Choose relevant columns
SelectedVariables_LASSO = list(SelectedOrdered_LASSO.columns[SelectedOrdered_LASSO.var() > 0])

#Get the valid dataset with the selected features.
DataTrain[SelectedVariables_LASSO].head()

We had some fun adjusting the LASSO penalty parameter. Basically, its level decides how strict the algorithm is regarding feature importance. We decided to set it very low to limit the number of variables. At the level 0.22 we receive only 15 variables (we should mention that the default level is 1, so 0.22 is quite low).
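The cell above applies the L1 penalty through `LogisticRegression`, which is a classifier. For a continuous target, scikit-learn also offers the `Lasso` regressor; the sketch below shows that alternative on synthetic data with made-up feature names. Note the opposite convention: for `Lasso`, a higher `alpha` means a bigger penalty, while for `C` a lower value means a bigger penalty. The value `alpha=0.5` is an arbitrary choice for this toy example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: only x0 and x1 influence the target
rng = np.random.default_rng(3)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"x{i}" for i in range(6)])
y = 4.0 * X["x0"] - 3.0 * X["x1"] + rng.normal(scale=0.5, size=n)

# L1-penalised linear regression; the penalty shrinks weak coefficients to exactly 0
selector = SelectFromModel(Lasso(alpha=0.5)).fit(X, y)

# Features whose coefficient survived the penalty
selected = list(X.columns[selector.get_support()])
```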

Are LASSO variables the same as for F statistic? Let's see

In [None]:
SelectedVariables = pd.DataFrame(SelectedVariables_F,columns=['F variables']).sort_values(by=['F variables'])
SelectedVariables['LASSO variables'] = SelectedVariables_LASSO

SelectedVariables

It's really interesting, because only "ExterQual" appears in both sets of variables. It's true that the LASSO variables were heavily limited by the big penalty, but still the algorithm produced completely different results than the F statistic. LASSO is really powerful, and we believe that it is superior to a univariate method like the F statistic, so we treat this list as the more reliable one (even assuming linearity).

Look: LASSO thinks that our luxurious interaction is useful, nice.

# **4. Estimation**

**Important remark:** This operation will be a bit surprising: we have to split our train data set again to get separate train and test parts for modeling. The so-called test set from the data doesn't contain the target, so we would not be able to evaluate on it.

In [None]:
Target= DataTrain['SalePrice']
DataTrainFinal = DataTrain.drop(['SalePrice','IsTrain'],axis=1)
DataTestFinal = DataTest.drop(['SalePrice','IsTrain'],axis=1)

x_train,x_test,y_train,y_test = train_test_split(DataTrainFinal,Target,test_size=0.2,random_state=0)

print("Train set contains: " + str(x_train.shape[1]) + " variables in " + str(x_train.shape[0]) + " rows.")
print("Test set contains: " + str(x_test.shape[1]) + " variables in " + str(x_test.shape[0]) + " rows.")

First, let's prepare the data set containing the predictions of our future models. The average will be the first one, as the simplest possible 'model'. We will treat it as a benchmark for our predictions.

In [None]:
ModelAverage = y_train.mean()
print(str(round(ModelAverage)))

In [None]:
Predictions = pd.DataFrame(y_test,columns=['SalePrice'])
Predictions['ModelAverage'] = ModelAverage

ScoreAverage = math.sqrt(metrics.mean_squared_error(y_test, Predictions['ModelAverage']))

print('Average: RMSE = ' + str(ScoreAverage))
Predictions.head()

As we can see, the average is very authentic and completely doesn't care: it always looks the same regardless of the circumstances.
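For clarity, the RMSE of this constant 'model' can be computed by hand. The numbers below are made up; the point is only to show the formula behind `mean_squared_error` followed by a square root.

```python
import numpy as np

# Hypothetical targets
y_train_toy = np.array([100.0, 200.0, 300.0])
y_test_toy = np.array([150.0, 250.0])

# The "model": always predict the training mean
baseline = y_train_toy.mean()

# Root of the mean squared difference between truth and prediction
rmse = np.sqrt(np.mean((y_test_toy - baseline) ** 2))
```

Any real model should achieve a lower RMSE than this baseline to be worth keeping.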

Alright, let's fit a GLM (Generalised Linear Model). This is a really useful model from the linear family, which introduces a link function and relaxes the normality requirement. However, one disadvantage is that we have to assume which distribution should be used. Here three classic distributions are tried: Gaussian, Poisson and Gamma.

In [None]:
NormalReg = TweedieRegressor(power=0, alpha=0, link='identity')
PoissonReg = TweedieRegressor(power=1, alpha=0, link='log')
GammaReg = TweedieRegressor(power=2, alpha=0, link='log')

NormalReg.fit(x_train[SelectedVariables_LASSO],y_train)
PoissonReg.fit(x_train[SelectedVariables_LASSO],y_train)
GammaReg.fit(x_train[SelectedVariables_LASSO],y_train)

PredictNormalReg = NormalReg.predict(x_test[SelectedVariables_LASSO])
PredictPoissonReg = PoissonReg.predict(x_test[SelectedVariables_LASSO])
PredictGammaReg = GammaReg.predict(x_test[SelectedVariables_LASSO])

In [None]:
print('Normal Dist: RMSE = ' + str(math.sqrt(metrics.mean_squared_error(y_test, PredictNormalReg))))
print('Poisson Dist: RMSE = ' + str(math.sqrt(metrics.mean_squared_error(y_test, PredictPoissonReg))))
print('Gamma Dist: RMSE = ' + str(math.sqrt(metrics.mean_squared_error(y_test, PredictGammaReg))))
print('Poisson wins')

ScoreGLM = math.sqrt(metrics.mean_squared_error(y_test, PredictPoissonReg))

# Store the Poisson predictions, matching the column name and ScoreGLM
Predictions['GLM Poisson'] = PredictPoissonReg
Predictions.head()

Let's check what random forest can do in this case.

In [None]:
RandomForest = RandomForestRegressor(random_state=RandState)
RandomForest.fit(x_train, y_train)
PredictRandomForest = RandomForest.predict(x_test)

ScoreRandomForest = math.sqrt(metrics.mean_squared_error(y_test, PredictRandomForest))

print('Random Forest: RMSE = ' + str(ScoreRandomForest))

In [None]:
Predictions['Random Forest'] = PredictRandomForest
Predictions.head()

Ok, not bad, already a great improvement. But we will go further; this can still be boosted. Literally: we can apply extreme gradient boosting.

In [None]:
# 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective;
# the obsolete 'silent' argument is dropped (verbosity controls logging)
XBoost_1 =XGBRegressor( booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.5, gamma=0,
             importance_type='gain', learning_rate=0.008, max_delta_step=0,
             max_depth=4, min_child_weight=1.5, n_estimators=4000, objective='reg:squarederror',
             reg_alpha=0.5, reg_lambda=0.5, scale_pos_weight=1, 
             subsample=0.8, verbosity=1)

XBoost_1.fit(x_train, y_train)

PredictXBoost = XBoost_1.predict(x_test)

print('Extreme boosting for first try: RMSE = ' + str(math.sqrt(metrics.mean_squared_error(y_test, PredictXBoost))))

With these parameters, we achieve an RMSE of 26,949. This is not bad, but we can still improve it.

In [None]:
# Same objective rename and removal of 'silent' as in the first try
XBoost_final =XGBRegressor( booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.5, gamma=0,
             importance_type='gain', learning_rate=0.0081, max_delta_step=0,
             max_depth=4, min_child_weight=1.8, n_estimators=4200, objective='reg:squarederror',
             reg_alpha=0.6, reg_lambda=0.51, scale_pos_weight=1, 
             subsample=0.8, verbosity=1)

XBoost_final.fit(x_train, y_train)

PredictXBoost_final = XBoost_final.predict(x_test)

ScoreXBoost = math.sqrt(metrics.mean_squared_error(y_test, PredictXBoost_final))

print('Final extreme boosting: RMSE = ' + str(ScoreXBoost))

We improved it just a bit. Let's continue with these results. It is not very surprising that extreme boosting performed better than the earlier methods; it is well known for its great predictive power.
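One possible refinement of this kind of fine-tuning is to compare parameter settings with cross-validation on the training part, so that the held-out test split is not used for choosing parameters. The sketch below is an illustration only: it uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`, and the two candidate settings are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with a nonlinear term
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 10.0 * X[:, 0] + 5.0 * X[:, 1] ** 2 + rng.normal(size=200)

# Hypothetical candidate settings to compare
candidates = {
    "shallow": dict(max_depth=2, n_estimators=100),
    "deeper": dict(max_depth=4, n_estimators=100),
}

cv_rmse = {}
for name, params in candidates.items():
    model = GradientBoostingRegressor(random_state=0, **params)
    # scikit-learn returns negated RMSE so that higher is better; flip the sign back
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()

# Setting with the lowest average RMSE across the folds
best = min(cv_rmse, key=cv_rmse.get)
```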

In [None]:
Predictions['Extreme boosting'] = PredictXBoost_final
Predictions.head()

**Comparing the methods used**

In [None]:
FinalRMSE = pd.DataFrame([[ScoreAverage],[ScoreGLM],[ScoreRandomForest],[ScoreXBoost]],columns=["RMSE"],index=['Expected value','GLM Poisson','Random Forest','Extreme boosting'])
FinalRMSE

Finally, we fit the model on all the training data we have. At the end, we make predictions on the whole test set.

In [None]:
XBoost_final.fit(DataTrainFinal, Target)

FinalPrediction = XBoost_final.predict(DataTestFinal)

The results have to be prepared with the matching order of the "Id" variable.

In [None]:
Submission = pd.DataFrame({'Id': test.index, 'SalePrice': FinalPrediction})

Submission.to_csv('Submission.csv', index=False)
Submission