## House Price Predictions
Many variables come into play when appraising a house, such as the number of bedrooms, square footage, and location. Our task here is to build a machine learning model that predicts house prices with reasonable accuracy, which could give people in real estate a clearer view of the market as a whole. Along the way, this notebook explains every step in a user-friendly manner, following LIME (Local Interpretable Model-agnostic Explanations) principles.

## Table of Contents
1. Environment set-up
    * Importing Libraries
    * Loading the data
2. Initial Diagnostics
    * Glimpse
    * Descriptive Statistics
    * Target Variable Analysis
    * Predictors Analysis
3. Data Cleaning
    * Missing Values
    * Simple Imputation
    * Grouped Imputation
4. Inquiry Exploration
    * Does bigger mean pricier houses?
    * Where is the real estate hotspot?
    * Which miscellaneous feature adds the most value?
5. Feature Engineering
    * Outliers - Feature Scaling
    * Categorical Encoding
    * Datetime Variables
    
6. Correlation Analysis

7. Machine Learning set-up
    * Train-test split
    * Cross-validation
    * Dimensionality Reduction

8. Machine Learning - Simple Models

9. Machine Learning - Ensemble Methods

10. Hyperparameter Tuning

11. Model Performance Evaluation
 
12. Final Submission

# 1. Environment Set-up

In [None]:
## Importing Libraries
import warnings
warnings.filterwarnings("ignore")

#Set seed
import random
random.seed(1234)

# Manipulating & Visualizing Data
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(16,10)})

# Feature Scaling
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Categorical Encoding
import category_encoders as ce

# Model Selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold

# Dimensionality Reduction
from sklearn.decomposition import PCA, TruncatedSVD

# ML Models
from sklearn.tree import DecisionTreeRegressor
from sklearn import linear_model 

# Ensemble Learning
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import StackingRegressor

# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Performance metrics
import sklearn.metrics as skm

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
## Loading the dataset
df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
df.head()

# 2. Initial Diagnostics

In [None]:
## Glimpse of the data
df.info()

**Takeaway:** From the glimpse above, we could already draw some observations. 
* Our dataset comprises 1460 rows and 80 columns, making it relatively small, so we would not expect the training process to be computationally intensive.
* Most columns appear to have no missing values, while for some variables null values make up 80% or more of the rows. This indicates that we should clean and tidy the data before doing any statistical analysis or machine learning.
* In terms of variable type, we have mostly int64, float64, and object. Since 'object' can indicate either text or categorical data, we will investigate further during feature engineering.

In [None]:
## Descriptive Statistics
df.describe()

**Takeaway:** For the numeric variables, the table above captures basic descriptive statistics such as mean, standard deviation, min, and max. Commenting on each variable would bring little value to our overall analysis, so we will zoom in on the target variable 'SalePrice'.

In [None]:
# Stats for the target variable
df['SalePrice'].describe()

**Takeaway:** The count indicates no null values in the column. The houses in the dataset vary from ~USD34.9k to ~USD755k, with a mean value of ~USD180k. With the standard deviation at ~USD79k, it appears that prices fluctuate pretty significantly, or we may potentially have houses with exorbitant prices (outliers) skewing the data. We will create a histogram to look at the distribution more closely.

In [None]:
## Target Variable Analysis
sns.histplot(data=df, x='SalePrice')
plt.xlabel("Dollar Amount ($)")
plt.ylabel("Frequency (Count)")
plt.title("Distribution of House Sale Price")
plt.show()

**Takeaway:** From the histogram above, we can deduce that house sale prices in this dataset have a right-skewed distribution with outliers on the upper end, indicating luxury houses with higher price points. However, most houses appear to fall between ~USD100k and ~USD300k, relatively consistent with real estate markets in the United States.
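
The right skew described above can also be quantified. The sketch below is only an illustration on synthetic log-normal draws (the `toy_prices` array and the `sample_skewness` helper are assumptions, not part of the Kaggle data or this notebook's pipeline); it shows that a positive skewness shrinks towards zero once the values are log-transformed.

```python
import numpy as np

def sample_skewness(values):
    """Return the (biased) Fisher-Pearson skewness: the mean of cubed z-scores."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.std(values) ** 3)

# Hypothetical right-skewed "prices": log-normal draws, not the Kaggle data
rng = np.random.default_rng(1234)
toy_prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

skew_raw = sample_skewness(toy_prices)
skew_log = sample_skewness(np.log(toy_prices))
print("skewness of raw values: {:.2f}".format(skew_raw))
print("skewness of log values: {:.2f}".format(skew_log))
```

On the real data, `df['SalePrice'].skew()` gives a comparable (bias-corrected) measure.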

# 3. Data Cleaning

In [None]:
# Visualize missing data
plt.figure(figsize=(10,6))
sns.heatmap(df.isna().transpose(),
            cmap="YlGnBu",
            cbar_kws={'label': 'Missing Data'})
plt.xlabel("Observations")
plt.ylabel("Features")
plt.show()

**Takeaway:** As the plot above shows, there are indeed null values, confirming our observation in the initial diagnostics. Since the variables differ in type and in their proportion of missing values, the cleaning process will attend to each column or group of similar columns.

**Definition:** In data science, we constantly deal with imperfect information, which muddies the waters on overall data quality. Missing values are one recurring issue, and handling them requires effective techniques. Imputation methods replace null values with statistical measures such as the mean, median, or mode. More information [here](https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/).
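
To make the definition concrete, here is a minimal sketch of simple imputation on a toy frame. The column names and values are hypothetical and chosen so that one outlier (300) pulls the mean well away from the median.

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle file
toy = pd.DataFrame({
    "frontage": [50.0, 60.0, None, 70.0, 300.0],   # 300 is an outlier
    "shape": ["Reg", "Reg", None, "IR1", "Reg"],
})

# Numeric column: fill with the mean or with the median
mean_fill = toy["frontage"].fillna(toy["frontage"].mean())
median_fill = toy["frontage"].fillna(toy["frontage"].median())

# Categorical column: fill with the most frequent value (mode)
mode_fill = toy["shape"].fillna(toy["shape"].mode().iloc[0])

print("mean fill:", mean_fill.iloc[2])
print("median fill:", median_fill.iloc[2])
print("mode fill:", mode_fill.iloc[2])
```

The mean-based fill lands at 120 while the median-based fill lands at 65, which is why the median is often preferred when outliers are present.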

In [None]:
## No. of null values
null_vals = df.isna().sum().sum()

# List of columns with missing values
null_cols = df.columns[df.isna().any()].tolist()

# Reporting back
print("We are missing {:2d} values in our data at given percentages in the following columns:" .format(null_vals))
for i in null_cols:
    col_null = df[i].isnull().sum()
    per_null = col_null / len(df[i])
    print("  - {}: {} ({:.2%})".format(i, col_null, per_null))

**LotFrontage:** As per the data dictionary, it is the linear feet of street connected to the property. A piece of land (lot) is often described by its frontage and depth. For instance, a lot can be 50 by 150, meaning 50 feet wide (frontage) and 150 feet deep. Read more about it [here](https://www.gimme-shelter.com/frontage-50043/). Given that 'LotFrontage' is a characteristic all houses have, the null values indicate missing information that cannot simply be set to 0. Since we cannot go back and fetch more data, we will use imputation methods for this column and for others that may require them.

**Note:** Before proceeding to the imputation, we would like to investigate possible differences in distribution grouped by Lot shape.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.suptitle('Boxplots of LotFrontage')
sns.boxplot(ax=ax1, data=df, y="LotFrontage", orient = "v")
sns.boxplot(ax=ax2, data=df, x="LotShape", y="LotFrontage", orient = "v")
plt.show()

In [None]:
print("For all houses' LotFrontage, the mean is {:.2f} and median is {:.2f}".format(df['LotFrontage'].mean(),
                                                               df['LotFrontage'].median()))

In [None]:
print("For houses that are: ")
for i in df["LotShape"].unique().tolist():
    df_i = df[df["LotShape"]==i]
    mean_frontage = df_i['LotFrontage'].mean()
    median_frontage = df_i['LotFrontage'].median()
    print(" -{}, mean LotFrontage = {:.2f} and median LotFrontage = {:.2f}".format(i,
                                                                            mean_frontage,
                                                                            median_frontage))

**Takeaway:** The boxplots indicate the presence of outliers, with both very wide and very narrow lots. When broken down by 'LotShape', we also observe a notable difference for lots categorized as IR3, in other words, of very irregular shape. In light of both the outliers and the category differences, we will use the median value grouped by LotShape for the imputation process to ensure consistency in the data.

In [None]:
# Imputation using group by
df['LotFrontage'] = df.groupby('LotShape').LotFrontage.transform(lambda x: x.fillna(x.median()))
df.LotFrontage = df.LotFrontage.round(2)
df['LotFrontage'].isnull().sum()
df.head()

**Alley:** As per the data dictionary, it refers to the type of alley access to property. Given the real estate market in question, it may affect the price more or less and so, the null values are indeed significant with NA indicating that there isn't one. To ensure that it is taken into account, we will rename the NA into the full phrase 'No alley access' and then proceed in encoding this categorical variable.

In [None]:
# Replacing the null values with a significant term
# (assigning back avoids relying on in-place modification of a column selection)
df['Alley'] = df['Alley'].fillna("No alley access")
df['Alley'].value_counts()

**Variable Grouping:** Investigating the missing values revealed that many of these nulls are in fact meaningful categories, or equal to 0, per the data dictionary. So, to be more efficient, we will make a list of those columns and the term/value we will use to replace the NA values.

In [None]:
for i in null_cols:
    # Grouping of variables dependent on the presence of a basement
    if 'Bsmt' in i:
        df[i] = df[i].fillna("No Basement")

    # Grouping of variables dependent on the presence of a garage
    elif 'Garage' in i:
        if i == 'GarageYrBlt':
            df[i] = df[i].fillna(0)
        else:
            df[i] = df[i].fillna("No Garage")

In [None]:
other_cols_imp = {
    'MasVnrType': 'No Veneer',
    'MasVnrArea': 0, 
    'FireplaceQu': 'No Fireplace', 
    'PoolQC': 'No Pool', 
    'Fence': 'No Fence', 
    'MiscFeature': 'No Misc'
   }

# Grouping of variables dependent on the presence of other amenities
for i, j in other_cols_imp.items():
    df[i] = df[i].fillna(j)

**Note:** Assuming all houses have an electrical system, we will drop the observation whose electrical system is null.

In [None]:
# Dropping the observation with a missing Electrical value
df.dropna(subset=['Electrical'], inplace=True)

In [None]:
## No. of null values
null_vals = df.isna().sum().sum()

# Reporting back
print("After imputation, we are missing {:d} values in our data.".format(null_vals))

# 4. Inquiry Exploration

In this section, we will ask a few questions to consolidate our understanding of the problem at hand. The answers will help us steer the machine learning process with the subject matter in mind.

**Question 1:** Do bigger houses always translate into higher prices?

In [None]:
## Scatterplot between LotArea and SalePrice
sns.scatterplot(data=df, x='LotArea', y='SalePrice')
plt.show()

**Takeaway:** From the scatterplot above, there is very little evidence that bigger lots are ultimately pricier. Note that 'LotArea' measures the lot rather than the living area, so it is only a proxy for house size. As noted in the diagnostics, the 80 initial variables show how multi-dimensional the house valuation process is.

**Question 2:** Where is the real estate hotspot?

In [None]:
# Which neighborhood registers the most sales?
# value_counts() sorts in descending order, so position 0 is the top neighborhood
neigh_counts = df['Neighborhood'].value_counts()
total = neigh_counts.iloc[0]
per = df['Neighborhood'].value_counts(normalize=True).iloc[0]
neigh_name = neigh_counts.index[0]
print("{} has the most house sales with {} making up {:.2%} of all sales.".format(neigh_name,
                                                                                 total, per))

In [None]:
# Which neighborhood registers the sales with the highest price tags?
df_grouped = pd.DataFrame(df.groupby('Neighborhood')['SalePrice'].sum())
df_sorted = df_grouped.sort_values('SalePrice', ascending=False)
df_sorted['per_total'] = (df_sorted['SalePrice'] / df_sorted['SalePrice'].sum())

neigh_name = df_sorted.index[0]
total = df_sorted['SalePrice'].iloc[0]
per = df_sorted['per_total'].iloc[0]
print("{} has the highest cumulative sales amount of ${:,} making up {:.2%} of all transactions.".format(
                                                                                                    neigh_name, 
                                                                                                    total, per))

**Note:** As per the data dictionary, NAmes refers to North Ames, a neighborhood of Ames, Iowa.

**Question 3:** Which miscellaneous feature adds the most value?

In [None]:
# Which miscellaneous feature is the most prevalent?
# Position 0 is the 'No Misc' placeholder, so the most prevalent real feature sits at position 1
misc_counts = df['MiscFeature'].value_counts()
total = misc_counts.iloc[1]
misc_name = misc_counts.index[1]
print("For houses with miscellaneous features, {} is the most prevalent in {} houses.".format(misc_name, total))

In [None]:
# Calculating the value added
misc = df[df['MiscFeature'] == 'Shed']['MiscVal']
sale = df[df['MiscFeature'] == 'Shed']['SalePrice']
avg_value_added = np.average(misc)
per_sale = np.average(misc/sale)
print("{} brings ${:.2f} of value added making {:.2%} of the house sale price on average.".format(
                                                                                                misc_name, 
                                                                                                avg_value_added,
                                                                                                per_sale))

# 5. Feature Engineering

**Feature Scaling:** Data comes in different types, each of which requires adapted pre-processing before applying any machine learning technique. In our context, we perform feature scaling to standardize only the continuous numerical variables. Read more [here](https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35).
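
As a small illustration of what standardization does, the sketch below computes z-scores by hand and compares them with `StandardScaler`. The `areas` values are hypothetical lot areas, used only for the demonstration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical lot areas in square feet (one feature, five houses)
areas = np.array([[8450.0], [9600.0], [11250.0], [9550.0], [14260.0]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
z_manual = (areas - areas.mean(axis=0)) / areas.std(axis=0)

# StandardScaler applies the same formula column by column
z_sklearn = StandardScaler().fit_transform(areas)

print("same result:", np.allclose(z_manual, z_sklearn))
print("mean after scaling: {:.2f}, std after scaling: {:.2f}".format(
    z_sklearn.mean(), z_sklearn.std()))
```

One design option worth considering is to fit the scaler on the training rows only and reuse the fitted object on any held-out rows, so that both are scaled with the same mean and standard deviation.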

In [None]:
# Filter numeric columns
num_vars = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1',
           'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
           'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
            'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal']

scaler = StandardScaler().fit(df[num_vars].values)
df[num_vars] = scaler.transform(df[num_vars].values)
df.head()

**Categorical feature encoding** ensures that variables with categories/groupings are transformed into numerical inputs for the predictive modeling phase. The categorical variables are also subdivided as:
- binary (two possible outcomes)
- nominal (no meaningful order), listed under `cardinal` in the code below
- ordinal (meaningful order) 

Read more [here](https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/).
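
The three cases can be sketched with plain pandas. This is an illustration on a hypothetical three-row frame and uses `map` and `pd.get_dummies` rather than the `category_encoders` package used in this notebook, so the resulting column names may differ from the ones produced by the actual pipeline.

```python
import pandas as pd

toy = pd.DataFrame({
    "CentralAir": ["Y", "N", "Y"],        # binary
    "Street": ["Pave", "Grvl", "Pave"],   # nominal
    "ExterQual": ["TA", "Ex", "Fa"],      # ordinal
})

# Binary: map the two outcomes to 0/1
toy["CentralAir"] = toy["CentralAir"].map({"N": 0, "Y": 1})

# Ordinal: map categories to integers that respect their order
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
toy["ExterQual"] = toy["ExterQual"].map(quality_order)

# Nominal: one indicator column per category, since no order exists
encoded = pd.get_dummies(toy, columns=["Street"], dtype=int)
print(encoded)
```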

In [None]:
# Encoding binary categorical variables
binary = ['CentralAir']

# Applying binary encoder
binenc = ce.BinaryEncoder(cols = binary, return_df = True)
bin_df = binenc.fit_transform(df)  
bin_df.head()

In [None]:
# List of nominal categorical variables
cardinal = ['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotConfig', 
            'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 
            'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 
            'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating', 'Electrical', 
            'Functional', 'GarageType', 'MiscFeature', 'SaleType', 
            'SaleCondition']


# Applying one-hot encoder 
ohe = ce.OneHotEncoder(cols = cardinal, use_cat_names=True, return_df = True)
df_card_enc = ohe.fit_transform(bin_df)  
df_card_enc.head()

In [None]:
# Encoding ordinal categorical variables
ordinal_cols_mapping = [ 
    {"col" : 'LotShape', "mapping": {'Reg':0, 'IR1': 1, 'IR2':2, 'IR3':3}},
    {"col" : 'LandContour', "mapping": {'Low':0, 'Lvl':1, 'Bnk':2, 'HLS':3}},
    {"col" : 'Utilities', "mapping": {'ELO':0, 'NoSeWa':1, 'NoSewr':2, 'AllPub':3}},
    {"col" : 'LandSlope', "mapping": {'Gtl': 0, 'Mod': 1, 'Sev':2}},
    {"col" : 'OverallQual', "mapping": {1: 0, 2: 1, 3:2, 4:3, 5:4, 6:5, 7:6, 8:7, 9:8, 10:9}},
    {"col" : 'OverallCond', "mapping": {1: 0, 2: 1, 3:2, 4:3, 5:4, 6:5, 7:6, 8:7, 9:8, 10:9}},
    {"col" : 'ExterQual', "mapping": {'Po':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4}},
    {"col" : 'ExterCond', "mapping": {'Po':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4}},
    {"col" : 'BsmtQual', "mapping": {'No Basement':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},
    {"col" : 'BsmtCond', "mapping": {'No Basement':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},   
    {"col" : 'BsmtExposure', "mapping": {'No Basement':0, 'No':1, 'Mn':2, 'Av':3, 'Gd':4}},
    {"col" : 'BsmtFinType1', "mapping": {'No Basement':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6}}, 
    {"col" : 'BsmtFinType2', "mapping": {'No Basement':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6}},  
    {"col" : 'HeatingQC', "mapping": {'Po':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4}},
    {"col" : 'KitchenQual', "mapping": {'Po':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4}},
    {"col" : 'FireplaceQu', "mapping": {'No Fireplace':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},
    {"col" : 'GarageFinish', "mapping": {'No Garage':0, 'Unf':1, 'RFn':2, 'Fin':3}},
    {"col" : 'GarageQual', "mapping": {'No Garage':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},
    {"col" : 'GarageCond', "mapping": {'No Garage':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},
    {"col" : 'PavedDrive', "mapping": {'N':0, 'P':1, 'Y':2}}, 
    {"col" : 'PoolQC', "mapping": {'No Pool':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},
    {"col" : 'Fence', "mapping":{'No Fence':0, 'MnWw':1, 'GdWo':2, 'MnPrv':3, 'GdPrv':4}}
]

# Applying ordinal encoder
ordenc = ce.OrdinalEncoder(mapping = ordinal_cols_mapping, return_df = True)
df_ord_enc = ordenc.fit_transform(df_card_enc)  
df_ord_enc.head()

**Datetime Variables:** There are variables denoting dates and thus, may hold significance and impact our target variable: the house's sale price. 

Based on research, we thought that the most sensible option would be to transform the datetime variables into ordinal categories in two ways:
 - Direct encoding of 'MoSold' and 'YrSold', which have 12 and 5 pre-defined categories: the 12 months and the 5 years during which the houses in the dataset were sold.
 - Binning of 'YearRemodAdd' and 'YearBuilt' into 6 equal-width categories, spanning roughly 10 and 23 years each, before proceeding to ordinal encoding as well. 'GarageYrBlt' is binned the same way, with an extra 'No Garage' category.
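
As a side illustration of binning followed by ordinal encoding, the sketch below passes explicit bin edges to `pd.cut` and asks for integer codes directly. The `years` series and the decade `edges` are hypothetical; this is an alternative formulation, not the exact approach taken in the cells that follow.

```python
import pandas as pd

# Hypothetical remodel years
years = pd.Series([1950, 1959, 1975, 1998, 2003, 2010], name="YearRemodAdd")

# Assumed decade edges; include_lowest keeps 1950 inside the first bin
edges = [1950, 1960, 1970, 1980, 1990, 2000, 2010]

# labels=False returns the integer bin index directly (0 = oldest decade),
# which combines the binning and the ordinal encoding in one step
decade_code = pd.cut(years, bins=edges, labels=False, include_lowest=True)
print(decade_code.tolist())
```

Working with integer codes avoids having to match the text form of the interval labels later on.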

In [None]:
# Binning date variables in time intervals
df_ord_enc['YearRemodAdd'] = pd.cut(df_ord_enc['YearRemodAdd'], bins=6, precision=0).astype(str)
df_ord_enc['YearRemodAdd'].value_counts()

In [None]:
df_ord_enc['YearBuilt'] = pd.cut(df_ord_enc['YearBuilt'], bins=6, precision=0).astype(str)
df_ord_enc['YearBuilt'].value_counts()

In [None]:
df_ord_enc['GarageYrBlt'] = pd.cut(df_ord_enc[df_ord_enc['GarageYrBlt'] != 0]['GarageYrBlt']
                                   , bins=6, precision=0).astype(str)
df_ord_enc['GarageYrBlt'] = df_ord_enc['GarageYrBlt'].fillna("No Garage")
df_ord_enc['GarageYrBlt'].value_counts()

In [None]:
# Datetime variable - ordinal encoding
date_cols_mapping = [ 
    {"col" : 'YrSold', "mapping": {2006:0, 2007: 1, 2008:2, 
                                   2009:3, 2010:4}},
    
    {"col" : 'MoSold', "mapping": {1: 0, 2: 1, 3:2, 4:3, 5:4, 6:5, 7:6, 
                                   8:7, 9:8, 10:9, 11:10, 12:11}},
    
    {"col" : 'YearRemodAdd', "mapping": {'(1950.0, 1960.0]':0, '(1960.0, 1970.0]':1,
                                         '(1970.0, 1980.0]':2, '(1980.0, 1990.0]':3,
                                         '(1990.0, 2000.0]':4, '(2000.0, 2010.0]':5}},
    
    {"col" : 'YearBuilt', "mapping": {'(1872.0, 1895.0]':0, '(1895.0, 1918.0]':1,
                                         '(1918.0, 1941.0]':2, '(1941.0, 1964.0]':3,
                                         '(1964.0, 1987.0]':4, '(1987.0, 2010.0]':5}},
    
    {"col" : 'GarageYrBlt', "mapping": {'No Garage':0, '(1900.0, 1918.0]':1,
                                        '(1918.0, 1937.0]':2, '(1937.0, 1955.0]':3,
                                        '(1955.0, 1973.0]':4, '(1973.0, 1992.0]':5,
                                        '(1992.0, 2010.0]':6}},
]

# Applying ordinal encoder
ordenc = ce.OrdinalEncoder(mapping = date_cols_mapping, return_df = True)
df_final = ordenc.fit_transform(df_ord_enc)  
df_final.head()

# 6. Correlation Analysis

**Note:** Given that we have 240 columns, it would be quite computationally intensive to display the entire correlation matrix and visualize it for user-friendly analysis. As a result, we will keep only the strongly correlated relationships, with an absolute coefficient between 0.7 and 1 (exclusive of 1, to avoid pairing a variable with itself).

In [None]:
## Strongest Relationships
matrix_corr = df_final.corr()
matrix_corr = np.round(matrix_corr.unstack(), 2)
strong_rel = matrix_corr[(abs(matrix_corr) >= 0.7) & (abs(matrix_corr) != 1.00)]
strong_rel

**Takeaway:** We detected 98 relationships, of which we assume 40 to be unique pairs, with an absolute correlation coefficient of at least 0.7. Given that our focus is on the sale of the house, we will now keep only the relationships involving variables with 'Sale' as a prefix.
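
Because an unstacked correlation matrix lists every pair twice (A with B, and B with A), one way to count unique pairs is to keep only the strict upper triangle. The sketch below shows the idea on a synthetic three-column frame; the columns `a`, `b`, and `c` are made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                     # unrelated noise
})

corr = toy.corr()

# Strict upper triangle (k=1): each pair appears once and the diagonal is dropped
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# NaN entries (if any remain) fail the comparison and are filtered out here
strong_pairs = pairs[pairs.abs() >= 0.7]
print(strong_pairs)
```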

In [None]:
# Focus on variables directly related to the sale
matrix_corr = df_final.corr()
matrix_sale = matrix_corr.filter(regex='^Sale',axis=1)
matrix_sale = np.round(matrix_sale.unstack(), 2)
strong_rel_sale = matrix_sale[(abs(matrix_sale) >= 0.7) & (abs(matrix_sale) != 1.00)]
strong_rel_sale

**Takeaway:** Among the two detected, 'OverallQual' appears to be a key variable that is highly correlated with the sale price of the house. This falls in line with our expectations of the valuation process; indeed, a house needs good to excellent quality to be desirable on the real estate market.

# 7. Machine Learning Set-Up

First off, we need to prepare the data to feed the machine learning models. To do so, we first separate the features from the target variable and then create training and testing sets for model training and performance evaluation.

In [None]:
# Splitting features & target variable
X = df_final.drop(['SalePrice'], axis=1).values
y = df_final['SalePrice'].values
y_log = np.log(y)

In [None]:
## Training Testing Split
X_train, X_test, y_train, y_test = train_test_split(X, y_log, 
                                                    test_size=1/3, 
                                                    random_state=0)

# 8. Machine Learning - Simple Models

This section will leverage the powerful scikit-learn package to build multiple models with little to no parameter tuning for comparison. We will only use the cross-validation error on our training dataset to avoid any data leakage.
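
For reference, the cross-validation error can also be obtained with `cross_val_score`, which handles the fold bookkeeping and always scores on the held-out fold. The sketch below runs on synthetic data; `X_toy` and `y_toy` are stand-ins rather than the notebook's `X_train` and `y_train`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 4))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=150)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Scores are negated RMSE values (higher is better), so flip the sign to report errors
neg_rmse = cross_val_score(Ridge(), X_toy, y_toy, cv=cv,
                           scoring="neg_root_mean_squared_error")
fold_rmse = -neg_rmse

print("RMSE per fold:", fold_rmse.round(3))
print("mean (std): {:.3f} ({:.3f})".format(fold_rmse.mean(), fold_rmse.std()))
```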

In [None]:
# Dictionary to store model structures
models = {'MLR': linear_model.LinearRegression(),
          'Ridge': linear_model.Ridge(),
          'Lasso': linear_model.Lasso(),
          'Elastic Net': linear_model.ElasticNet(),
          'Decision tree': DecisionTreeRegressor()
        }

**Note:** Our goal is to predict house prices, which are all positive values; however, our models may also produce some negative predictions. To mitigate this issue, we applied a logarithmic transformation to the target values and compute the error on the log scale accordingly. Read more [here](https://stats.stackexchange.com/questions/360399/how-to-constrain-gradient-boosting-predictions-to-be-non-negative.).
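
A short numeric sketch of why the log transformation helps: whatever real number a model outputs on the log scale, exponentiating it gives a positive price. The prices and predictions below are hypothetical values chosen for the illustration.

```python
import numpy as np

# Hypothetical sale prices and model outputs on the log scale
y_true = np.array([120000.0, 180000.0, 250000.0])
pred_log = np.array([11.6, 12.2, 12.3])

# exp() maps any real number to a positive one,
# so back-transformed prices can never be negative
pred_price = np.exp(pred_log)

# RMSE computed on logs penalizes relative errors rather than dollar errors
rmse_log = np.sqrt(np.mean((np.log(y_true) - pred_log) ** 2))

print("back-transformed prices:", pred_price.round(0))
print("RMSE of log values: {:.4f}".format(rmse_log))
```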

In [None]:
# Model Building & performance evaluation
kf = KFold(n_splits=5)

for name, model in models.items():
    model_errs = []
    for train_index, val_index in kf.split(X_train):
        # Fit on the training folds only
        X_train_k, y_train_k = X_train[train_index], y_train[train_index]
        # Score on the held-out fold, which the model has not seen
        X_val_k, y_val_k = X_train[val_index], y_train[val_index]
        model.fit(X_train_k, y_train_k)
        pred_log = model.predict(X_val_k)
        # RMSE on log prices; np.sqrt works across scikit-learn versions
        rmse = np.sqrt(skm.mean_squared_error(y_val_k, pred_log))
        model_errs.append(rmse)
    # report performance
    print('{} - RMSLE: {:.5f} ({:.5f})' .format(name, np.mean(model_errs),
                                               np.std(model_errs)))

# 9. Machine Learning - Ensemble Methods

This section will extend our work in machine learning to incorporate ensemble methods. We generated simple models and compared their scores, which appear satisfactory; however, we may want more stability and less variation in our predictive algorithm, and this is where ensemble techniques come in. Most often, they combine multiple models in various ways and thus bolster their predictive power. Further information [here](https://machinelearningmastery.com/tour-of-ensemble-learning-algorithms/).
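
The intuition behind combining models can be sketched with a toy simulation: if several predictors make independent errors, their average is closer to the truth than a typical individual predictor. The ten "models" below are hypothetical (the truth plus random noise), and real models usually have correlated errors, so the gain in practice tends to be smaller.

```python
import numpy as np

rng = np.random.default_rng(42)
truth = np.linspace(11.0, 13.0, 500)   # hypothetical log prices

# Ten hypothetical models: each equals the truth plus its own independent noise
predictions = truth + rng.normal(scale=0.2, size=(10, truth.size))

def rmse(pred):
    """Root mean squared error against the known truth."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

individual = [rmse(p) for p in predictions]
averaged = rmse(predictions.mean(axis=0))

print("mean individual RMSE: {:.3f}".format(np.mean(individual)))
print("RMSE of the averaged prediction: {:.3f}".format(averaged))
```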

In [None]:
# Stacking multiple models
estimators = [
    ('MLR', linear_model.LinearRegression()),
    ('Ridge', linear_model.Ridge())
]
# Dictionary to store ensemble model structures
ensemble_models = {
    'RF': RandomForestRegressor(),
    # Estimator passed positionally: the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
    'XGBoost': BaggingRegressor(xgb.XGBRegressor()),
    'Stacking': StackingRegressor(estimators=estimators,
                                   final_estimator=DecisionTreeRegressor())
}

In [None]:
# Model Building & performance evaluation
kf = KFold(n_splits=5)

for name, model in ensemble_models.items():
    model_errs = []
    for train_index, val_index in kf.split(X_train):
        # Fit on the training folds only
        X_train_k, y_train_k = X_train[train_index], y_train[train_index]
        # Score on the held-out fold, which the model has not seen
        X_val_k, y_val_k = X_train[val_index], y_train[val_index]
        model.fit(X_train_k, y_train_k)
        pred_log = model.predict(X_val_k)
        # RMSE on log prices; np.sqrt works across scikit-learn versions
        rmsle = np.sqrt(skm.mean_squared_error(y_val_k, pred_log))
        model_errs.append(rmsle)
    # report performance
    print('{} - RMSLE: {:.5f} ({:.5f})' .format(name, np.mean(model_errs),
                                               np.std(model_errs)))

**Takeaway:** In this instance, our best model is XGBoost, which has the lowest RMSE on the log values of house prices. We will proceed with hyperparameter tuning on the random forest, given that ensemble methods offer more robust and reliable results than the simple models.

# 10. Hyperparameter Tuning

This section will walk through a process to find the best possible model given a set of parameters. In machine learning, this is called hyperparameter tuning: the algorithm searches for the set of hyperparameters that drives the metric as high (or as low, depending on the scenario) as possible. Further information [here](https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/).
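
To give a sense of scale, the sketch below enumerates the random forest search space used later in this section with the standard library only. The `space` dictionary mirrors that search space; the budget of 10 draws matches the default `n_iter=10` of `RandomizedSearchCV`, and the sampling here is a simplified stand-in for what scikit-learn does internally.

```python
import itertools
import random

# Candidate values for each hyperparameter
space = {
    "n_estimators": list(range(20, 100, 20)),
    "max_depth": list(range(3, 15, 3)),
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

names = list(space)

# Grid search: every combination of every value
grid = [dict(zip(names, combo)) for combo in itertools.product(*space.values())]

# Random search: a fixed budget of combinations drawn without replacement
random.seed(1234)
sampled = random.sample(grid, k=10)

print(len(grid), "grid candidates;", len(sampled), "random candidates")
```

Grid search fits every candidate once per fold, while random search fits only its sampled subset, which is why it is cheaper at the cost of possibly missing the best combination.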

In [None]:
## Search set-up (shared by grid search and random search)
# Define the model
model = RandomForestRegressor()
# define search space
rf_space = {
   'n_estimators': range(20, 100, 20),
   'max_depth': range(3, 15, 3),
   'min_samples_split': [2, 5, 10], 
   'min_samples_leaf': [1, 2, 4]
}

In [None]:
search = GridSearchCV(estimator=model, param_grid=rf_space, 
                      cv=2, scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)
# best_score_ is the negated RMSE, so flip the sign before reporting
print('RF - RMSE of log values: {:.5f}' .format(-search.best_score_))
print('with best parameters: {}.\n' .format(search.best_params_))

In [None]:
search = RandomizedSearchCV(estimator=model, param_distributions=rf_space, 
                            cv=2, scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)
# best_score_ is the negated RMSE, so flip the sign before reporting
print('RF - RMSE of log values: {:.5f}' .format(-search.best_score_))
print('with best parameters: {}.\n' .format(search.best_params_))

# 11. Model Performance Evaluation

This section will build on everything we've done throughout this notebook and evaluate the best model using RMSE of log values.

In [None]:
# Running on testing set
tree = RandomForestRegressor(n_estimators=60, max_depth=12, 
                              min_samples_leaf=2, min_samples_split=5)
tree.fit(X_train, y_train)
y_pred_log = tree.predict(X_test)
rmsle = np.sqrt(skm.mean_squared_error(y_test, y_pred_log))
# report performance
print('RF - RMSE of log values: {:.5f}' .format(rmsle))

# 12. Final Submission

In [None]:
# Loading the dataset
test_df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")

In [None]:
# List of columns with missing values
null_cols = test_df.columns[test_df.isna().any()].tolist()

# Imputation using group by
test_df['LotFrontage'] = test_df.groupby('LotShape').LotFrontage.transform(lambda x: x.fillna(x.median()))
test_df.LotFrontage = test_df.LotFrontage.round(2)
test_df['LotFrontage'].isnull().sum()
test_df.head()

# Replacing the null values with a significant term
test_df['Alley'] = test_df['Alley'].fillna("No alley access")
test_df['Alley'].value_counts()

# Grouping of variables dependent on the presence of a basement or a garage
for i in null_cols:
    if 'Bsmt' in i or 'Garage' in i:
        if pd.api.types.is_numeric_dtype(test_df[i]):
            # Numeric columns (areas, counts, GarageYrBlt) get 0 rather than a text label
            test_df[i] = test_df[i].fillna(0)
        else:
            label = "No Basement" if 'Bsmt' in i else "No Garage"
            test_df[i] = test_df[i].fillna(label)
        
other_cols_imp = {
    'MasVnrType': 'No Veneer',
    'MasVnrArea': 0, 
    'FireplaceQu': 'No Fireplace', 
    'PoolQC': 'No Pool', 
    'Fence': 'No Fence', 
    'MiscFeature': 'No Misc'
   }

# Grouping of variables dependent on the presence of other amenities
for i, j in other_cols_imp.items():
    test_df[i] = test_df[i].fillna(j)

# Dropping rows with a missing Electrical value
test_df.dropna(subset=['Electrical'], inplace=True)

In [None]:
## No. of null values
null_vals = test_df.isna().sum().sum()

# Reporting back
print("We are missing {:2d} values in our data at given percentages in the following columns:" .format(null_vals))
for i in null_cols:
    col_null = test_df[i].isnull().sum()
    per_null = col_null / len(test_df[i])
    print("  - {}: {} ({:.2%})".format(i, col_null, per_null))

In [None]:
#Given how few null values are left, we will drop those 12 rows.
test_df.dropna(subset=null_cols, inplace=True)

## No. of null values
null_vals = test_df.isna().sum().sum()

# Reporting back
print("After dropping those rows, we are missing {:d} values in our data.".format(null_vals))

In [None]:
# # Filter numeric columns
# df = test_df
# num_vars = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1',
#            'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
#            'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
#             'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal']

# scaler = StandardScaler().fit(df[num_vars].values)
# df[num_vars] = scaler.transform(df[num_vars].values)
# df.head()

# # Encoding binary categorical variables
# binary = ['CentralAir']

# # Applying binary encoder
# binenc = ce.BinaryEncoder(cols = binary, return_df = True)
# bin_df = binenc.fit_transform(df)  
# bin_df.head()

# # List of nominal categorical/cardinal variables
# cardinal = ['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotConfig', 
#             'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 
#             'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 
#             'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating', 'Electrical', 
#             'Functional', 'GarageType', 'MiscFeature', 'SaleType', 
#             'SaleCondition']

# # Applying one-hot encoder 
# ohe = ce.OneHotEncoder(cols = cardinal, use_cat_names=True, return_df = True)
# df_card_enc = ohe.fit_transform(bin_df)  
# df_card_enc.head()

# # Encoding ordinal variables
# ordinal_cols_mapping = [ 
#     {"col" : 'LotShape', "mapping": {'Reg':0, 'IR1': 1, 'IR2':2, 'IR3':3}},
#     {"col" : 'LandContour', "mapping": {'Low':0, 'Lvl':1, 'Bnk':2, 'HLS':3}},
#     {"col" : 'Utilities', "mapping": {'ELO':0, 'NoSeWa':1, 'NoSewr':2, 'AllPub':3}},
#     {"col" : 'LandSlope', "mapping": {'Gtl': 0, 'Mod': 1, 'Sev':2}},
#     {"col" : 'OverallQual', "mapping": {1: 0, 2: 1, 3:2, 4:3, 5:4, 6:5, 7:6, 8:7, 9:8, 10:9}},
#     {"col" : 'OverallCond', "mapping": {1: 0, 2: 1, 3:2, 4:3, 5:4, 6:5, 7:6, 8:7, 9:8, 10:9}},
#     {"col" : 'ExterQual', "mapping": {'Po':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4}},
#     {"col" : 'ExterCond', "mapping": {'Po':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4}},
#     {"col" : 'BsmtQual', "mapping": {'No Basement':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},
#     {"col" : 'BsmtCond', "mapping": {'No Basement':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},   
#     {"col" : 'BsmtExposure', "mapping": {'No Basement':0, 'No':1, 'Mn':2, 'Av':3, 'Gd':4}},
#     {"col" : 'BsmtFinType1', "mapping": {'No Basement':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6}}, 
#     {"col" : 'BsmtFinType2', "mapping": {'No Basement':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6}},  
#     {"col" : 'HeatingQC', "mapping": {'Po':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4}},
#     {"col" : 'KitchenQual', "mapping": {'Po':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4}},
#     {"col" : 'FireplaceQu', "mapping": {'No Fireplace':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},
#     {"col" : 'GarageFinish', "mapping": {'No Garage':0, 'Unf':1, 'RFn':2, 'Fin':3}},
#     {"col" : 'GarageQual', "mapping": {'No Garage':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},
#     {"col" : 'GarageCond', "mapping": {'No Garage':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},
#     {"col" : 'PavedDrive', "mapping": {'N':0, 'P':1, 'Y':2}}, 
#     {"col" : 'PoolQC', "mapping": {'No Pool':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5}},
#     {"col" : 'Fence', "mapping":{'No Fence':0, 'MnWw':1, 'GdWo':2, 'MnPrv':3, 'GdPrv':4}}
# ]

# # Applying ordinal encoder
# ordenc = ce.OrdinalEncoder(mapping = ordinal_cols_mapping, return_df = True)
# df_ord_enc = ordenc.fit_transform(df_card_enc)  
# df_ord_enc.head()

# # Binning date variables in time intervals
# df_ord_enc['YearRemodAdd'] = pd.cut(df_ord_enc['YearRemodAdd'], bins=6, precision=0).astype(str)
# df_ord_enc['YearRemodAdd'].value_counts()

# df_ord_enc['YearBuilt'] = pd.cut(df_ord_enc['YearBuilt'], bins=6, precision=0).astype(str)
# df_ord_enc['YearBuilt'].value_counts()

# df_ord_enc['GarageYrBlt'] = pd.cut(df_ord_enc[df_ord_enc['GarageYrBlt'] != 0]['GarageYrBlt']
#                                    , bins=6, precision=0).astype(str)
# df_ord_enc['GarageYrBlt'].fillna("No Garage", inplace = True)
# df_ord_enc['GarageYrBlt'].value_counts()

# # Datetime variable - ordinal encoding
# date_cols_mapping = [ 
#     {"col" : 'YrSold', "mapping": {2006:0, 2007: 1, 2008:2, 
#                                    2009:3, 2010:4}},
    
#     {"col" : 'MoSold', "mapping": {1: 0, 2: 1, 3:2, 4:3, 5:4, 6:5, 7:6, 
#                                    8:7, 9:8, 10:9, 11:10, 12:11}},
    
#     {"col" : 'YearRemodAdd', "mapping": {'(1950.0, 1960.0]':0, '(1960.0, 1970.0]':1,
#                                          '(1970.0, 1980.0]':2, '(1980.0, 1990.0]':3,
#                                          '(1990.0, 2000.0]':4, '(2000.0, 2010.0]':5}},
    
#     {"col" : 'YearBuilt', "mapping": {'(1872.0, 1895.0]':0, '(1895.0, 1918.0]':1,
#                                          '(1918.0, 1941.0]':2, '(1941.0, 1964.0]':3,
#                                          '(1964.0, 1987.0]':4, '(1987.0, 2010.0]':5}},
    
#     {"col" : 'GarageYrBlt', "mapping": {'No Garage':0, '(1900.0, 1918.0]':1,
#                                         '(1918.0, 1937.0]':2, '(1937.0, 1955.0]':3,
#                                         '(1955.0, 1973.0]':4, '(1973.0, 1992.0]':5,
#                                         '(1992.0, 2010.0]':6}},
# ]

# # Applying ordinal encoder to the date variables
# ordenc = ce.OrdinalEncoder(mapping = date_cols_mapping, return_df = True)
# test_df_final = ordenc.fit_transform(df_ord_enc)  
# test_df_final.head()
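
The commented-out cell above would fit the scaler on the test set itself. A common alternative is to fit on the training data only and reuse the fitted object on the test data, so both sets share the same scale. The following is a minimal sketch under that assumption: `toy_train` and `toy_test` are hypothetical frames, and only two of the numeric columns are used.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_vars = ["LotArea", "GrLivArea"]

# Hypothetical stand-ins for the training and test frames
toy_train = pd.DataFrame({
    "LotArea": [8000.0, 10000.0, 12000.0],
    "GrLivArea": [1000.0, 1500.0, 2000.0],
})
toy_test = pd.DataFrame({
    "LotArea": [10000.0, 14000.0],
    "GrLivArea": [1500.0, 2500.0],
})

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(toy_train[num_vars])

# Apply the training statistics to the test data without refitting
test_scaled = toy_test.copy()
test_scaled[num_vars] = scaler.transform(toy_test[num_vars])
print(test_scaled)
```

The same fit-on-train, transform-on-test pattern would apply to the `category_encoders` objects, since they also expose `fit` and `transform`.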

In [None]:
# Defining our original model
# final_model = RandomForestRegressor(n_estimators=60, max_depth=12, 
#                               min_samples_leaf=2, min_samples_split=5)
# final_model.fit(X, y)
# pred_log = final_model.predict(test_df_final)
# pred = np.exp(pred_log)
# sub = pd.DataFrame({'ID': ids, 'target': pred})
# sub.to_csv('submission.csv', index=False)
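
The submission step above maps log-scale predictions back to prices with `np.exp`. The following is a minimal sketch of that round trip, assuming the target was transformed with `np.log` (if `np.log1p` had been used instead, `np.expm1` would be the matching inverse). The prices and `ids` are hypothetical, and `pred_log` stands in for the output of `final_model.predict(...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical sale prices and their log-transformed version
y = np.array([100000.0, 250000.0, 400000.0])
y_log = np.log(y)

# Stand-in for model output on the log scale (here, a perfect prediction)
pred_log = y_log

# Invert the log transform to return to the original price scale
pred = np.exp(pred_log)

# Hypothetical identifiers; they must line up row by row with the predictions
ids = [1461, 1462, 1463]
sub = pd.DataFrame({"ID": ids, "target": pred})
print(sub)
```

Because each identifier is paired with a prediction by position, `ids` should be taken from the test frame after any rows have been dropped, so the two have the same length and order.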