# Housing Prices: Advanced Regression

After a deep dive into the data dictionary, rigorous exploratory data analysis (EDA) was performed on the dataset. The main parts of this notebook are:

1. EDA
2. Box Cox Transformation
3. Feature Engineering
4. Light GBM
5. Lasso
6. LassoCV
7. Ridge
8. RidgeCV
9. XGBoost
10. AdaBoost
11. Gradient Boosting
12. Random Forest
13. Stacking
14. Blending Best Models


All comments are welcome, and please upvote if you find it helpful. Thank you

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing, model_selection, ensemble, metrics, tree, linear_model, kernel_ridge
import xgboost as xgb
import lightgbm as lgb
%matplotlib inline
plt.style.use('fivethirtyeight')
import sys, time
from scipy import special, stats
from mlxtend import regressor

In [None]:
train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
train.head(5)

# Exploratory Data Analysis

To dig into the dataset, let's first understand its structure.

In [None]:
train.describe()

In [None]:
train.info()

We can observe that there are 81 columns in our dataset, which leaves room for simplification.

In particular, we will merge some features and create new ones along the way.

First, let's create working copies of the original training and test sets. These copies will hold our new features and the merged old features.

In [None]:
finalTrain = train.copy()
finalTest = test.copy()

Since SalePrice is the column to predict and there are 81 columns, there are many ways to continue our EDA. We start with outlier analysis.

### Outlier Analysis

For this project, let's consider the columns with the highest correlation to SalePrice. The cell below takes the 11 largest correlations, that is, SalePrice itself plus the ten features most correlated with it; these are the features with a correlation above 0.5.

In [None]:
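# Restrict to numeric columns first; recent pandas versions raise an error when corr() meets text columns
corr = finalTrain.select_dtypes('number').corr()['SalePrice'].sort_values(ascending=False)[:11]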
corr
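
If you would rather select features by the 0.5 threshold itself than by a fixed count, a minimal sketch could look like the one below. The frame `toy` and its columns `strong`, `weak` and `label` are synthetic stand-ins invented for illustration, not the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
target = rng.normal(size=300)

# Synthetic stand-in for finalTrain (hypothetical columns).
toy = pd.DataFrame({
    'SalePrice': target,
    'strong': target + rng.normal(scale=0.5, size=300),  # strongly related to the target
    'weak': target + rng.normal(scale=5.0, size=300),    # mostly noise
    'label': ['a', 'b', 'c'] * 100,                       # non-numeric column
})

# Correlation of every numeric feature with the target, excluding the target itself.
corr_to_target = toy.select_dtypes('number').corr()['SalePrice'].drop('SalePrice')

# Keep only the features above the threshold.
selected = corr_to_target[corr_to_target > 0.5].sort_values(ascending=False)
print(selected)
```

Selecting by threshold keeps the rule explicit, whereas a fixed slice such as `[:11]` depends on how many features happen to clear it.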

In [None]:
fig = plt.figure(figsize=[30,30])
for col,i in zip(corr.index,range(1,12)):
    axes = fig.add_subplot(6,2,i)
    sns.regplot(x=finalTrain[col], y=finalTrain.SalePrice, ax=axes)  # x and y as keywords, as recent seaborn requires
plt.show()

Have a look at each graph individually and try to determine whether there is an increasing or decreasing relationship between the feature and SalePrice.

From these plots we see that only **'GrLivArea'** has some outliers. Let's have a look at it individually.

In [None]:
fig = plt.figure(figsize=[12,6])
sns.regplot(x=finalTrain.GrLivArea, y=finalTrain.SalePrice).set_title('Ground Living Area vs Sale Price')

You can see an increasing relationship between GrLivArea and SalePrice, but there are two outliers where GrLivArea is very high while the price is low. It is safe to remove the two rows showing this anomaly.

In [None]:
finalTrain = finalTrain[finalTrain['GrLivArea']<4600]
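
Before dropping rows, it can help to look at exactly which ones the cutoff removes. The sketch below uses a toy frame with invented numbers; only the column names and the 4600 cutoff come from this notebook.

```python
import pandas as pd

# Toy stand-in for finalTrain; the values are invented for illustration.
toy = pd.DataFrame({
    'GrLivArea': [1500, 2100, 4700, 5600, 1800],
    'SalePrice': [180000, 250000, 185000, 160000, 210000],
})

# Boolean mask of the rows the cutoff would drop.
is_outlier = toy['GrLivArea'] >= 4600

dropped = toy[is_outlier]
kept = toy[~is_outlier]

print(dropped)
print(f'{len(dropped)} rows dropped, {len(kept)} rows kept')
```

Keeping the mask in its own variable makes it easy to inspect the dropped rows and to confirm that the count matches what the plot suggested.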

Now let's look at our regplot for GrLivArea vs SalePrice again.

In [None]:
fig = plt.figure(figsize=[12,6])
sns.regplot(x=finalTrain.GrLivArea, y=finalTrain.SalePrice).set_title('Ground Living Area vs Sale Price')

Much better: we now see a consistently increasing relationship between the two features.

### Skewness Check in the Column to be Predicted

For accurate predictions, the column to be predicted should not be heavily skewed. So let's check the distribution of SalePrice.

In [None]:
fig = plt.figure(figsize=[12,6])
sns.distplot(finalTrain.SalePrice, fit = stats.norm)

We see that SalePrice is skewed, so we apply a log transformation, log(1 + x), to reduce the skewness and bring it closer to a normal distribution.

In [None]:
finalTrain['SalePrice'] = np.log1p(finalTrain['SalePrice'].values)  # log(1 + x); np.expm1 is its inverse

fig = plt.figure(figsize=[12,6])
sns.distplot(finalTrain.SalePrice, fit = stats.norm)
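
To see what the transformation does numerically, here is a small sketch on synthetic, right-skewed "prices" (generated data, not the competition data). It also shows that `np.expm1` is the exact inverse of `np.log1p`, which matters when converting predictions back to prices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for SalePrice.
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

log_prices = np.log1p(prices)     # same as np.log(1 + prices)
recovered = np.expm1(log_prices)  # exp(x) - 1, the inverse of log1p

skew_before = stats.skew(prices)
skew_after = stats.skew(log_prices)
print(f'skew before: {skew_before:.2f}, after: {skew_after:.2f}')
```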

### Removing Irrelevant or Highly Correlated Columns

To simplify our analysis, we will perform the following steps:

- Remove the Id columns from the training and test sets (irrelevant for predictive modelling).
- Compute the correlations between features and, for each highly correlated pair, drop the redundant feature, taking each feature's correlation with SalePrice into account.
- Remove the SalePrice column from the training data and store it in a separate variable.
- Merge the training and test sets so the rest of the preprocessing is applied consistently to both. We will split them again before applying our regression models.

**Step 1**: Removing ID Columns

In [None]:

IDTrain = finalTrain['Id']
IDTest = finalTest['Id']

finalTrain.drop('Id', axis = 1, inplace = True)
finalTest.drop('Id', axis = 1, inplace = True)

**Step 2**: Performing Correlation Analysis as mentioned above.

In [None]:
fig = plt.figure(figsize = [30,20])
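corrMatrix = finalTrain.select_dtypes('number').corr()  # numeric columns only, computed once
mask = np.zeros_like(corrMatrix, dtype=bool)  # the np.bool alias was removed from NumPy; the builtin bool works
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corrMatrix, cmap=sns.diverging_palette(150, 275, s=80, l=55, n=9), mask = mask, annot=True, center = 0)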
plt.title("Correlation Matrix (HeatMap)", fontsize = 30)

In [None]:
unstackedCorrelations = pd.DataFrame({'Correlations' : finalTrain.select_dtypes('number').corr().abs().unstack()})
unstackedCorrelations = unstackedCorrelations.reset_index().sort_values(by='Correlations')
unstackedCorrelations[(unstackedCorrelations.Correlations > 0.8) & (unstackedCorrelations.Correlations < 1)]
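
The table above lists every pair twice, once as (A, B) and once as (B, A). One way to list each pair only once is to keep just the upper triangle of the correlation matrix. This is a sketch on synthetic data; the columns `a`, `a_noisy` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)

toy = pd.DataFrame({
    'a': a,
    'a_noisy': a + rng.normal(scale=0.1, size=200),  # nearly a copy of 'a'
    'b': rng.normal(size=200),                        # independent of 'a'
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (self-correlation of 1) is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = upper.stack().sort_values(ascending=False)
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```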

From the above table, we can draw the following conclusions:

- **GarageCars** and **GarageArea** are highly correlated with each other, and the heatmap shows that both are highly correlated with **SalePrice**. We therefore remove **GarageArea**, since it is redundant.
- **TotRmsAbvGrd** and **GrLivArea** are highly correlated with each other, and the outlier analysis above showed that **GrLivArea** is highly correlated with **SalePrice**. We therefore remove **TotRmsAbvGrd**, since it is redundant.
- The same reasoning applies to **GarageYrBlt**, so we remove it as well to avoid redundancy.
- **1stFlrSF** and **2ndFlrSF** are kept intact, as they will be used for feature engineering in the later sections of this notebook.
- **SalePrice** and **OverallQual** are highly correlated with each other; since SalePrice is the target, OverallQual is kept as it is.

In [None]:
finalTrain.drop('GarageArea', axis = 1, inplace = True)
finalTrain.drop('TotRmsAbvGrd', axis = 1, inplace = True)
finalTrain.drop('GarageYrBlt', axis = 1, inplace = True)

finalTest.drop('GarageArea', axis = 1, inplace = True)
finalTest.drop('TotRmsAbvGrd', axis = 1, inplace = True)
finalTest.drop('GarageYrBlt', axis = 1, inplace = True)

**Step 3**: Removing SalePrice column

In [None]:
SalePriceTrain = finalTrain['SalePrice']

finalTrain.drop('SalePrice', axis = 1, inplace = True)

**Step 4**: Merging finalTrain and finalTest

In [None]:
print(finalTrain.shape)
print(finalTest.shape)

finalData = pd.concat([finalTrain,finalTest])

print(finalData.shape)

### Missing Data Handling

For the next part, we will consider how to handle NaNs for this project.

In [None]:
DataMissing = finalData.isnull().sum()*100/len(finalData)
DataMissingByColumn = pd.DataFrame({'Percentage Nulls':DataMissing})
DataMissingByColumn.sort_values(by='Percentage Nulls',ascending=False,inplace=True)
DataMissingByColumn[DataMissingByColumn['Percentage Nulls']>0]

We can see that the columns 'PoolQC', 'MiscFeature', 'Alley' and 'Fence' are almost entirely null, so we remove them on the assumption that they add little to the predictions.

In [None]:
finalData.drop('PoolQC', axis = 1, inplace=True)
finalData.drop('MiscFeature', axis = 1, inplace=True)
finalData.drop('Alley', axis = 1, inplace=True)
finalData.drop('Fence', axis = 1, inplace=True)

To handle the remaining NAs, we replace the nulls in columns of datatype 'object' with the mode of the respective column, and the nulls in columns of datatype 'integer' or 'float' with the median of the respective column.

In [None]:
objData = finalData.select_dtypes("object")

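# Assign the filled column back: fillna(inplace=True) on a selected column
# may not update finalData in recent pandas versions
for i in objData:
    finalData[i] = finalData[i].fillna(finalData[i].mode()[0])

intData = finalData.select_dtypes(["int64","float64"])

for i in intData:
    finalData[i] = finalData[i].fillna(finalData[i].median())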
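
The same mode/median rule can be tried on a tiny frame to check that it behaves as expected. This is a sketch with invented values; the column `Zone` is hypothetical, and the type check uses `pd.api.types.is_numeric_dtype` rather than the dtype names used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Zone': ['RL', 'RL', None, 'RM'],            # categorical, one missing
    'LotFrontage': [60.0, np.nan, 80.0, 100.0],  # numeric, one missing
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        fill_value = toy[col].median()   # median is robust to outliers
    else:
        fill_value = toy[col].mode()[0]  # most frequent category
    # Assign the result back instead of relying on inplace=True.
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```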

Rechecking for nulls to make sure the imputation above worked.

In [None]:
DataMissing = finalData.isnull().sum()*100/len(finalData)
DataMissingByColumn = pd.DataFrame({'Percentage Nulls':DataMissing})
DataMissingByColumn.sort_values(by='Percentage Nulls',ascending=False,inplace=True)
DataMissingByColumn[DataMissingByColumn['Percentage Nulls']>0]

There are no nulls remaining, so we are in a good position to move to the next step in our analysis.

# Feature Engineering

Let's dive into our dataset again. We can see 4 columns related to bathrooms:

- BsmtFullBath
- BsmtHalfBath
- FullBath
- HalfBath

We can combine these 4 columns into a single new feature by counting each half bath as 0.5 of a full bath and adding it to the full bath counts.

In [None]:
finalData['TotalBaths'] = finalData['BsmtFullBath'] + 0.5 * finalData['BsmtHalfBath'] + finalData['FullBath'] + 0.5 * finalData['HalfBath']

We can create another feature that measures the number of years between the last time the house was remodelled and the year it was sold.

In [None]:
finalData['HouseAge'] = finalData['YrSold'] - finalData['YearRemodAdd']

After diving into our dataset more, another feature we can create is the total porch area (named TotalPorches in the code), obtained by summing the following columns, each of which is an area in square feet:

- OpenPorchSF
- EnclosedPorch
- 3SsnPorch
- ScreenPorch

In [None]:
finalData['TotalPorches'] = finalData['OpenPorchSF'] + finalData['EnclosedPorch'] + finalData['3SsnPorch'] + finalData['ScreenPorch']

The 4th feature we add is Season. The MoSold column gives the month in which the house was sold, but since it is an integer from 1 to 12, a model may interpret it as an ordered numeric level. Hence, we group those values as:

- 10,11,12,1,2,3 - Winter
- 4,5,6,7,8,9 - Summer

In [None]:
finalData['Season'] = np.where(finalData['MoSold'].isin([10,11,12,1,2,3]),'Winter','Summer')

The 5th feature is an overall rating of the house, computed as the arithmetic mean of **OverallQual** and **OverallCond**.

In [None]:
finalData['OverallRate'] = 0.5 * (finalData['OverallQual'] + finalData['OverallCond'])

Finally, we create one more feature: the total square footage of the house, which is the sum of the basement, 1st floor and 2nd floor areas.

In [None]:
finalData['TotalSF'] = finalData['TotalBsmtSF'] + finalData['1stFlrSF'] + finalData['2ndFlrSF']

In [None]:
print(finalData.shape)

### Skewness Check in all the columns


Next we check the skewness of all the numeric columns. Reducing the skewness brings the values closer to a normal distribution, which makes our analysis simpler. The Box-Cox transformation is one method of doing this; it reduces the skewness of the data and makes a broader range of tests applicable. More on this can be found [here](https://www.statisticshowto.com/box-cox-transformation/#:~:text=A%20Box%20Cox%20transformation%20is,a%20broader%20number%20of%20tests.)

In [None]:
nonObjectColList = finalData.dtypes[finalData.dtypes != 'object'].index

skewMeasure = finalData[nonObjectColList].apply(lambda x: stats.skew(x.dropna())).sort_values(ascending = False)

skewMeasure=pd.DataFrame({'skew':skewMeasure})

skewMeasure = skewMeasure[abs(skewMeasure)>0.5].dropna()

skewMeasure


Here we can see that there is skewness in many columns, so we apply the Box-Cox transformation (in its `boxcox1p` form, which transforms 1 + x) to all of these columns to reduce it. We use lambda = 0.15; a lambda of 0 would reduce to a plain log transformation.

In [None]:
for i in skewMeasure.index:
    finalData[i] = special.boxcox1p(finalData[i],0.15) #lambda = 0.15
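
To make the role of lambda concrete, the sketch below compares `special.boxcox1p` with the formula it computes, ((1 + x)^lambda - 1) / lambda, on a few arbitrary sample values. It also checks the lambda = 0 case and the inverse function `special.inv_boxcox1p`.

```python
import numpy as np
from scipy import special

x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])  # arbitrary sample values
lam = 0.15                                      # the value used in this notebook

# For lam != 0, boxcox1p(x, lam) = ((1 + x)**lam - 1) / lam
transformed = special.boxcox1p(x, lam)
manual = ((1.0 + x) ** lam - 1.0) / lam

# With lam = 0 the transform reduces to log(1 + x).
as_log = special.boxcox1p(x, 0.0)

# inv_boxcox1p undoes the transform.
restored = special.inv_boxcox1p(transformed, lam)

print(transformed)
```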

With the Box-Cox transformation applied to reduce the skewness in our data, we move on to the next step of our notebook.

### One Hot Encoding

Finally, we convert all the remaining categorical columns into dummy variables (one-hot encoding), so that every column is numeric and can be used for predictive modelling.

In [None]:
finalData = pd.get_dummies(finalData)

print(finalData.shape)
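
Encoding the merged data, rather than the training and test sets separately, is what keeps the dummy columns aligned. A small sketch with hypothetical toy frames:

```python
import pandas as pd

toy_train = pd.DataFrame({'Season': ['Winter', 'Summer', 'Winter']})
toy_test = pd.DataFrame({'Season': ['Summer', 'Summer']})  # no 'Winter' here

# Encoding the test set on its own misses the 'Winter' column.
separate_cols = list(pd.get_dummies(toy_test).columns)

# Encoding the combined frame gives both parts the same columns.
combined = pd.get_dummies(pd.concat([toy_train, toy_test]))
encoded_test = combined.iloc[len(toy_train):]

print(separate_cols)
print(list(encoded_test.columns))
```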

Next, we remove the columns whose values have a very low standard deviation (SD), since near-constant columns add complexity to the model without adding much information. We remove columns with an SD below 0.025.

In [None]:
stdDev = finalData.std().sort_values()
removeList = stdDev[stdDev < 0.025]

for i in removeList.index:
    finalData.drop(i, axis = 1, inplace = True)
print(finalData.shape)
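
For a 0/1 dummy column, an SD below 0.025 means the category appears in roughly fewer than 1 in 1,600 rows. The sketch below shows the same filter as a single selection on a toy frame; the columns `rare_flag` and `common_flag` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'rare_flag': [0.0] * 1999 + [1.0],  # set in 1 of 2000 rows
    'common_flag': [0.0, 1.0] * 1000,   # set in half of the rows
})

std = toy.std()

# Keep only the columns at or above the threshold.
keep = std[std >= 0.025].index
filtered = toy[keep]

print(std)
print(list(filtered.columns))
```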

With all columns converted into numerical values, we are ready to fit a few models on this training set.

# Model Training

## Train-Test Split

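First we separate the merged data back into the original training and test rows. Then we hold out 20% of the training rows for validation, giving an 80-20 split.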

In [None]:
# Training rows come first in finalData, so both slices split at len(finalTrain)
trainDF = finalData[:len(finalTrain)]
testDF = finalData[len(finalTrain):]
print(trainDF.shape)
print(testDF.shape)
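
The split above is purely positional: it relies on the training rows having been concatenated first. A minimal sketch with hypothetical toy frames shows the pattern:

```python
import pandas as pd

# Toy stand-ins for finalTrain and finalTest.
toy_train = pd.DataFrame({'x': [1, 2, 3], 'kind': ['a', 'b', 'a']})
toy_test = pd.DataFrame({'x': [4, 5], 'kind': ['b', 'c']})

combined = pd.concat([toy_train, toy_test])  # training rows first
n_train = len(toy_train)

# The first n_train rows are training rows, the rest are test rows.
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]

print(train_part.shape, test_part.shape)
```

Using the training length for both slices keeps the split correct even when the two sets differ in size by more than one row.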

In [None]:
xTrain, xTest, yTrain, yTest = model_selection.train_test_split(trainDF.to_numpy(),SalePriceTrain.to_numpy(),test_size=0.2,random_state=1010)
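
As a quick check of the proportions, `test_size=0.2` puts 20% of the rows in the held-out part. A sketch on a toy array of 100 rows:

```python
import numpy as np
from sklearn import model_selection

X = np.arange(200).reshape(100, 2)  # 100 toy rows, 2 features
y = np.arange(100)

X_tr, X_val, y_tr, y_val = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=1010)

print(X_tr.shape, X_val.shape)
```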

## LightGBM

**Hyperparameter Tuning**

In [None]:
lgbAttributes = lgb.LGBMRegressor(objective='regression', n_jobs=-1, random_state=1010)
                               

lgbParameters = {
    'max_depth' : [2,3,5],
    'learning_rate': [0.01,0.05, 0.1],
    # 'colsample_bytree' is left out: it is an alias of 'feature_fraction' below, and values above 1 are not valid
    'n_estimators' : [480,600,720],
    'num_leaves' : [4,5,6,7],
    'max_bin' : [50,100,150,200],
    'verbose' : [-1],
    'bagging_seed' : [7],
    'bagging_freq' : [3,5,7,9],
    'bagging_fraction' : [0.7,0.8,0.9],
    'feature_fraction' : [0.2319,0.25,0.27,0.3,0.33],
    'feature_fraction_seed' : [7],
    'min_data_in_leaf' : [2,3,4,5,6],
    'min_sum_hessian_in_leaf' : [15,16,18,20]
}


In [None]:
lgbModel = model_selection.RandomizedSearchCV(lgbAttributes, param_distributions = lgbParameters, cv = 5, random_state=1010)

start = time.time()
lgbModel.fit(xTrain,yTrain.flatten())
end = time.time()

print('Training took {:.2f} mins'.format((end-start)/60))

lgbPred = lgbModel.predict(xTest)

In [None]:
lgbModel.best_estimator_
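
Note that RandomizedSearchCV does not try every combination in the grid: it samples `n_iter` candidates (10 by default) and scores them with the estimator's default scorer unless `scoring` is given. The sketch below makes both explicit on synthetic data, with a Ridge model swapped in for LightGBM so that it runs quickly; the data and the grid are invented for illustration.

```python
import numpy as np
from sklearn import linear_model, model_selection

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=120)

param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

search = model_selection.RandomizedSearchCV(
    linear_model.Ridge(),
    param_distributions=param_grid,
    n_iter=3,                               # number of sampled candidates
    scoring='neg_root_mean_squared_error',  # higher (closer to 0) is better
    cv=5,
    random_state=1010,
)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```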

In [None]:
LGBMMetrics = pd.DataFrame({'Model': 'LightGBM', 
                            'MSE': metrics.mean_squared_error(yTest, lgbPred),
                            'RMSE' : np.sqrt(metrics.mean_squared_error(yTest, lgbPred)),
                            'MAE' : metrics.mean_absolute_error(yTest, lgbPred),
                            'MSLE' : metrics.mean_squared_log_error(yTest, lgbPred), 
                            'RMSLE' : np.sqrt(metrics.mean_squared_log_error(yTest, lgbPred)),
                            'R-Square' : metrics.r2_score(yTest, lgbPred)},index=[1])

LGBMMetrics
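
Since the same metrics table is rebuilt for every model below, a small helper can cut the repetition. The sketch that follows is one possible version; `regression_report` is a hypothetical name, and the prices are invented. It also illustrates a point worth keeping in mind when reading these tables: SalePrice was already transformed with log(1 + x), so the RMSE computed on the log-scale values is the RMSLE in terms of the original prices, while the MSLE and RMSLE columns apply a second logarithm on top.

```python
import numpy as np
import pandas as pd
from sklearn import metrics

def regression_report(name, y_true, y_pred):
    """Return a one-row DataFrame of common regression metrics."""
    mse = metrics.mean_squared_error(y_true, y_pred)
    return pd.DataFrame({
        'Model': [name],
        'MSE': [mse],
        'RMSE': [np.sqrt(mse)],
        'MAE': [metrics.mean_absolute_error(y_true, y_pred)],
        'R-Square': [metrics.r2_score(y_true, y_pred)],
    })

# Made-up prices, purely for illustration.
prices_true = np.array([100000.0, 200000.0, 300000.0])
prices_pred = np.array([110000.0, 190000.0, 310000.0])

# Metrics computed on the log(1 + x) scale, as in this notebook.
report = regression_report('toy', np.log1p(prices_true), np.log1p(prices_pred))

# RMSLE computed directly on the original prices.
rmsle_on_prices = np.sqrt(metrics.mean_squared_log_error(prices_true, prices_pred))

print(report)
print(rmsle_on_prices)
```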

In [None]:
fig = plt.figure(figsize=[12,6])
sns.regplot(x=lgbPred, y=yTest, truncate=False)
plt.title('Actual vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Actual')

#### Actual vs Predicted

In [None]:
# expm1 inverts the log(1 + x) transform applied to SalePrice
LGBMAvP = pd.DataFrame({'Actual': np.expm1(yTest), 'Predicted': np.expm1(lgbPred)})
LGBMAvP

## Lasso & LassoCV

In [None]:
lasAttributes = linear_model.Lasso(alpha=0.0005, max_iter = 2000, random_state=1010)
lasAttributesCV = linear_model.LassoCV(max_iter = 2000, cv=5, verbose=-1, random_state=1010, n_jobs=-1)

- **LASSO**

In [None]:
start = time.time()
lasAttributes.fit(xTrain,yTrain.flatten())
end = time.time()

print('Training for Lasso took {:.2f} mins'.format((end-start)/60))

lasPred = lasAttributes.predict(xTest)

In [None]:
LASMetrics = pd.DataFrame({'Model': 'Lasso', 
                            'MSE': metrics.mean_squared_error(yTest, lasPred),
                            'RMSE' : np.sqrt(metrics.mean_squared_error(yTest, lasPred)),
                            'MAE' : metrics.mean_absolute_error(yTest, lasPred),
                            'MSLE' : metrics.mean_squared_log_error(yTest, lasPred), 
                            'RMSLE' : np.sqrt(metrics.mean_squared_log_error(yTest, lasPred)),
                            'R-Square' : metrics.r2_score(yTest, lasPred)},index=[2])

LASMetrics

In [None]:
plt.figure(figsize=[12,6])
sns.regplot(x=lasPred, y=yTest, truncate=False)
plt.title('Actual vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# expm1 inverts the log(1 + x) transform applied to SalePrice
LASAvP = pd.DataFrame({'Actual': np.expm1(yTest), 'Predicted': np.expm1(lasPred)})
LASAvP

- **LASSOCV**

In [None]:
start = time.time()
lasAttributesCV.fit(xTrain,yTrain.flatten())
end = time.time()

print('Training for LassoCV took {:.2f} mins'.format((end-start)/60))

lasCVPred = lasAttributesCV.predict(xTest)

In [None]:
LASCVMetrics = pd.DataFrame({'Model': 'LassoCV', 
                            'MSE': metrics.mean_squared_error(yTest, lasCVPred),
                            'RMSE' : np.sqrt(metrics.mean_squared_error(yTest, lasCVPred)),
                            'MAE' : metrics.mean_absolute_error(yTest, lasCVPred),
                            'MSLE' : metrics.mean_squared_log_error(yTest, lasCVPred), 
                            'RMSLE' : np.sqrt(metrics.mean_squared_log_error(yTest, lasCVPred)),
                            'R-Square' : metrics.r2_score(yTest, lasCVPred)},index=[3])

LASCVMetrics

In [None]:
plt.figure(figsize=[12,6])
sns.regplot(x=lasCVPred, y=yTest, truncate=False)
plt.title('Actual vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# expm1 inverts the log(1 + x) transform applied to SalePrice
LASCVAvP = pd.DataFrame({'Actual': np.expm1(yTest), 'Predicted': np.expm1(lasCVPred)})
LASCVAvP

## Ridge & RidgeCV

In [None]:
ridAttributes = linear_model.Ridge(alpha=5, max_iter=2000, random_state=1010)
ridCVAttributes = linear_model.RidgeCV(alphas=(1,5,10), cv=5)

- **RIDGE**

In [None]:
start = time.time()
ridAttributes.fit(xTrain,yTrain.flatten())
end = time.time()

print('Training for Ridge took {:.2f} mins'.format((end-start)/60))

ridPred = ridAttributes.predict(xTest)

In [None]:
RIDMetrics = pd.DataFrame({'Model': 'Ridge', 
                            'MSE': metrics.mean_squared_error(yTest, ridPred),
                            'RMSE' : np.sqrt(metrics.mean_squared_error(yTest, ridPred)),
                            'MAE' : metrics.mean_absolute_error(yTest, ridPred),
                            'MSLE' : metrics.mean_squared_log_error(yTest, ridPred), 
                            'RMSLE' : np.sqrt(metrics.mean_squared_log_error(yTest, ridPred)),
                            'R-Square' : metrics.r2_score(yTest, ridPred)},index=[4])

RIDMetrics

In [None]:
plt.figure(figsize=[12,6])
sns.regplot(x=ridPred, y=yTest, truncate=False)
plt.title('Actual vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# expm1 inverts the log(1 + x) transform applied to SalePrice
RIDAvP = pd.DataFrame({'Actual': np.expm1(yTest), 'Predicted': np.expm1(ridPred)})
RIDAvP

- **RidgeCV**

In [None]:
start = time.time()
ridCVAttributes.fit(xTrain,yTrain.flatten())
end = time.time()

print('Training for RidgeCV took {:.2f} mins'.format((end-start)/60))

ridCVPred = ridCVAttributes.predict(xTest)
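
After fitting, RidgeCV exposes the regularization strength it selected through its `alpha_` attribute, which is worth checking against the candidate list. A sketch on synthetic data (the data is invented; the `alphas` tuple matches the one used above):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=100)

alphas = (1, 5, 10)
model = linear_model.RidgeCV(alphas=alphas, cv=5).fit(X, y)

print('selected alpha:', model.alpha_)
```

If the selected value sits at either end of the list, it may be worth widening the range of candidates.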

In [None]:
RIDCVMetrics = pd.DataFrame({'Model': 'RidgeCV', 
                            'MSE': metrics.mean_squared_error(yTest, ridCVPred),
                            'RMSE' : np.sqrt(metrics.mean_squared_error(yTest, ridCVPred)),
                            'MAE' : metrics.mean_absolute_error(yTest, ridCVPred),
                            'MSLE' : metrics.mean_squared_log_error(yTest, ridCVPred), 
                            'RMSLE' : np.sqrt(metrics.mean_squared_log_error(yTest, ridCVPred)),
                            'R-Square' : metrics.r2_score(yTest, ridCVPred)},index=[5])

RIDCVMetrics

In [None]:
plt.figure(figsize=[12,6])
sns.regplot(x=ridCVPred, y=yTest, truncate=False)
plt.title('Actual vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# expm1 inverts the log(1 + x) transform applied to SalePrice
RIDCVAvP = pd.DataFrame({'Actual': np.expm1(yTest), 'Predicted': np.expm1(ridCVPred)})
RIDCVAvP

## XGBoost

**Hyperparameter Tuning**

In [None]:
xgbAttributes = xgb.XGBRegressor(n_jobs=-1, random_state=1010)

xgbParameters = {
      
    'max_depth' : [2,3],
    'learning_rate': [0.01,0.05, 0.1],
    'colsample_bytree' : [0.4,0.6,0.8],
    'n_estimators' : [1000,2000],
    'gamma' : [0.15,0.3,0.5],
    'subsample': [0.6,0.7,0.8], #,0.9,1
    'min_child_weight': [3,4,5],#6,10
    'scale_pos_weight': [10,20],
    'reg_alpha' : [0.5,0.75],
    'reg_lambda' : [0.5,0.75],
    #'num_leaves' : [3,4],
    'max_bin' : [200],
}

xgbModel = model_selection.RandomizedSearchCV(xgbAttributes, param_distributions = xgbParameters, cv = 5, random_state=1010)

start = time.time()
xgbModel.fit(xTrain,yTrain.flatten())
end = time.time()

print('Training for XGBoost took {:.2f} mins'.format((end-start)/60))

xgbPred = xgbModel.predict(xTest)

In [None]:
XGBMetrics = pd.DataFrame({'Model': 'XGBoost', 
                            'MSE': metrics.mean_squared_error(yTest, xgbPred),
                            'RMSE' : np.sqrt(metrics.mean_squared_error(yTest, xgbPred)),
                            'MAE' : metrics.mean_absolute_error(yTest, xgbPred),
                            'MSLE' : metrics.mean_squared_log_error(yTest, xgbPred), 
                            'RMSLE' : np.sqrt(metrics.mean_squared_log_error(yTest, xgbPred)),
                            'R-Square' : metrics.r2_score(yTest, xgbPred)},index=[6])

XGBMetrics

In [None]:
plt.figure(figsize=[12,6])
sns.regplot(x=xgbPred, y=yTest, truncate=False)
plt.title('Actual vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# expm1 inverts the log(1 + x) transform applied to SalePrice
XGBAvP = pd.DataFrame({'Actual': np.expm1(yTest), 'Predicted': np.expm1(xgbPred)})
XGBAvP

## AdaBoost

**Hyperparameter Tuning**

In [None]:
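# The base learner is passed positionally: its keyword is base_estimator in older scikit-learn and estimator in newer releases
adaAttributes = ensemble.AdaBoostRegressor(tree.DecisionTreeRegressor(max_depth=5), random_state = 1010)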

adaParameters = {
    'learning_rate':[0.05,0.1],
    'n_estimators' : [800,1600]
}

adaModel = model_selection.RandomizedSearchCV(adaAttributes, param_distributions = adaParameters, cv = 5, random_state=1010)

start = time.time()
adaModel.fit(xTrain,yTrain.flatten())
end = time.time()

print('Training for AdaBoost took {:.2f} mins'.format((end-start)/60))

adaPred = adaModel.predict(xTest)

In [None]:
ADAMetrics = pd.DataFrame({'Model': 'AdaBoost', 
                            'MSE': metrics.mean_squared_error(yTest, adaPred),
                            'RMSE' : np.sqrt(metrics.mean_squared_error(yTest, adaPred)),
                            'MAE' : metrics.mean_absolute_error(yTest, adaPred),
                            'MSLE' : metrics.mean_squared_log_error(yTest, adaPred), 
                            'RMSLE' : np.sqrt(metrics.mean_squared_log_error(yTest, adaPred)),
                            'R-Square' : metrics.r2_score(yTest, adaPred)},index=[7])

ADAMetrics

In [None]:
plt.figure(figsize=[12,6])
sns.regplot(x=adaPred, y=yTest, truncate=False)
plt.title('Actual vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# expm1 inverts the log(1 + x) transform applied to SalePrice
AdaAvP = pd.DataFrame({'Actual': np.expm1(yTest), 'Predicted': np.expm1(adaPred)})
AdaAvP

## Gradient Boosting

**Hyperparameter Tuning**

In [None]:
grbAttributes = ensemble.GradientBoostingRegressor(random_state=1010)

grbParameters = {
    'n_estimators': [5000,6000],
    'max_depth' : [3,4,5],
    'learning_rate' : [0.01,0.05,0.1],
    'max_features' : ['sqrt'],
    'loss' : ['huber'],
    'min_samples_leaf' : [10,15],
    'min_samples_split' : [10,15]
}

grbModel = model_selection.RandomizedSearchCV(grbAttributes, param_distributions = grbParameters, cv=5, random_state=1010)

start = time.time()
grbModel.fit(xTrain,yTrain.flatten())
end = time.time()

print('Training for Gradient Boosting took {:.2f} mins'.format((end-start)/60))

grbPred = grbModel.predict(xTest)

In [None]:
GRBMetrics = pd.DataFrame({'Model': 'Gradient Boosting', 
                            'MSE': metrics.mean_squared_error(yTest, grbPred),
                            'RMSE' : np.sqrt(metrics.mean_squared_error(yTest, grbPred)),
                            'MAE' : metrics.mean_absolute_error(yTest, grbPred),
                            'MSLE' : metrics.mean_squared_log_error(yTest, grbPred), 
                            'RMSLE' : np.sqrt(metrics.mean_squared_log_error(yTest, grbPred)),
                            'R-Square' : metrics.r2_score(yTest, grbPred)},index=[8])

GRBMetrics

In [None]:
plt.figure(figsize=[12,6])
sns.regplot(x=grbPred, y=yTest, truncate=False)
plt.title('Actual vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# expm1 inverts the log(1 + x) transform applied to SalePrice
GrbAvP = pd.DataFrame({'Actual': np.expm1(yTest), 'Predicted': np.expm1(grbPred)})
GrbAvP

## Random Forest

**Hyperparameter Tuning**

In [None]:
rfAttributes = ensemble.RandomForestRegressor(n_jobs = -1, random_state=1010)

rfParameters = {
    'n_estimators' : [1000,1500],
    'max_depth' : [5,10,15],
    'min_samples_leaf' : [4,5],
    'min_samples_split' : [5,10],
    'oob_score' : [True]
}

rfModel = model_selection.RandomizedSearchCV(rfAttributes, param_distributions = rfParameters, cv=5, random_state = 1010)

start = time.time()
rfModel.fit(xTrain, yTrain.flatten())
end = time.time()

print('Training for Random Forest took {:.2f} mins'.format((end-start)/60))

rfPred = rfModel.predict(xTest)

In [None]:
RFMetrics = pd.DataFrame({'Model': 'Random Forest', 
                            'MSE': metrics.mean_squared_error(yTest, rfPred),
                            'RMSE' : np.sqrt(metrics.mean_squared_error(yTest, rfPred)),
                            'MAE' : metrics.mean_absolute_error(yTest, rfPred),
                            'MSLE' : metrics.mean_squared_log_error(yTest, rfPred), 
                            'RMSLE' : np.sqrt(metrics.mean_squared_log_error(yTest, rfPred)),
                            'R-Square' : metrics.r2_score(yTest, rfPred)},index=[9])

RFMetrics

In [None]:
plt.figure(figsize=[12,6])
sns.regplot(x=rfPred, y=yTest, truncate=False)
plt.title('Actual vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# expm1 inverts the log(1 + x) transform applied to SalePrice
rfAvP = pd.DataFrame({'Actual': np.expm1(yTest), 'Predicted': np.expm1(rfPred)})
rfAvP

## Stacking

In [None]:
stckAttributes = regressor.StackingCVRegressor(regressors = (lasAttributes, lgbModel, lasAttributesCV, ridAttributes, ridCVAttributes),
                                               random_state = 1010,
                                               meta_regressor = lgbModel, use_features_in_secondary = True)

start = time.time()
stckAttributes.fit(xTrain,yTrain.flatten())
end = time.time()

print('Training for Stacking took {:.2f} mins'.format((end-start)/60))

stckPred = stckAttributes.predict(xTest)
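
For readers without mlxtend, scikit-learn ships a comparable `StackingRegressor`. The sketch below is not the setup used above: it swaps in the scikit-learn class, uses only two linear base models with a Ridge meta-model, and runs on synthetic data. Its `passthrough=True` option plays a role similar to `use_features_in_secondary=True`.

```python
import numpy as np
from sklearn import ensemble, linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.0, -1.0, 2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=150)

stack = ensemble.StackingRegressor(
    estimators=[
        ('lasso', linear_model.Lasso(alpha=0.0005, max_iter=2000)),
        ('ridge', linear_model.Ridge(alpha=5)),
    ],
    final_estimator=linear_model.Ridge(alpha=1.0),
    cv=5,              # out-of-fold predictions feed the meta-model
    passthrough=True,  # also pass the original features to the meta-model
)
stack.fit(X, y)

stack_pred = stack.predict(X)
print(stack_pred[:3])
```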

In [None]:
STCKMetrics = pd.DataFrame({'Model': 'Stacking', 
                            'MSE': metrics.mean_squared_error(yTest, stckPred),
                            'RMSE' : np.sqrt(metrics.mean_squared_error(yTest, stckPred)),
                            'MAE' : metrics.mean_absolute_error(yTest, stckPred),
                            'MSLE' : metrics.mean_squared_log_error(yTest, stckPred), 
                            'RMSLE' : np.sqrt(metrics.mean_squared_log_error(yTest, stckPred)),
                            'R-Square' : metrics.r2_score(yTest, stckPred)},index=[10])

STCKMetrics

In [None]:
plt.figure(figsize=[12,6])
sns.regplot(x=stckPred, y=yTest, truncate=False)
plt.title('Actual vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
# expm1 inverts the log(1 + x) transform applied to SalePrice
stckAvP = pd.DataFrame({'Actual': np.expm1(yTest), 'Predicted': np.expm1(stckPred)})
stckAvP

In [None]:
frames = [LGBMMetrics,LASMetrics,LASCVMetrics,RIDMetrics,RIDCVMetrics,XGBMetrics,ADAMetrics,GRBMetrics,RFMetrics,STCKMetrics]
TrainingResult = pd.concat(frames)
TrainingResult

From the above table, we see that Lasso has the lowest RMSLE on the held-out split. However, when submitted, the stacking regressor scored better.

## Blending the Models

In [None]:
def blend(X):
    return((0.3*stckAttributes.predict(X)) + (0.5*lasAttributes.predict(X)) + (0.1*lgbModel.predict(X)) + (0.1*ridCVAttributes.predict(X)))

blendedPred = blend(xTest)
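
The blend is a weighted average of the models' predictions, with weights that sum to 1. A generic version can be sketched as follows; `weighted_blend` is a hypothetical helper, and the predictions are invented log-scale values for two houses.

```python
import numpy as np

def weighted_blend(predictions, weights):
    """Weighted average of equally shaped prediction arrays."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError('blend weights should sum to 1')
    stacked = np.vstack(predictions)  # shape: (n_models, n_samples)
    return weights @ stacked          # shape: (n_samples,)

# Hypothetical log-scale predictions from four models for two houses.
preds = [
    np.array([12.0, 11.5]),
    np.array([12.2, 11.7]),
    np.array([11.8, 11.3]),
    np.array([12.1, 11.6]),
]

blended = weighted_blend(preds, [0.3, 0.5, 0.1, 0.1])
print(blended)
```

Checking that the weights sum to 1 keeps the blended values on the same scale as the individual predictions.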

In [None]:
BlendMetrics = pd.DataFrame({'Model': 'Blend', 
                            'MSE': metrics.mean_squared_error(yTest, blendedPred),
                            'RMSE' : np.sqrt(metrics.mean_squared_error(yTest, blendedPred)),
                            'MAE' : metrics.mean_absolute_error(yTest, blendedPred),
                            'MSLE' : metrics.mean_squared_log_error(yTest, blendedPred), 
                            'RMSLE' : np.sqrt(metrics.mean_squared_log_error(yTest, blendedPred)),
                            'R-Square' : metrics.r2_score(yTest, blendedPred)},index=[11])

BlendMetrics

In [None]:
# Predictions are on the log(1 + price) scale, so invert with expm1
SalePrice = stckAttributes.predict(testDF.to_numpy())
SalePrice = np.expm1(SalePrice)

blendedPrice = blend(testDF.to_numpy())
blendedPrice = np.expm1(blendedPrice)

toSubmit = pd.DataFrame({'Id' : IDTest, 'SalePrice' : blendedPrice})

toSubmit.to_csv('submission.csv', index=False) 