# House Prices: Advanced Regression Techniques


The [Ames Housing](http://jse.amstat.org/v19n3/decock.pdf) dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. <br><br>

I challenged myself not to copy or peek at any code from other data science notebooks for this competition (although I googled a heckload of questions and read many Kaggle discussion threads on this competition). I ended up spending a lot of hours trying to figure things out on my own. On the plus side, I learned a whole lot more than I originally expected, which has been very beneficial.

Importing libraries.

In [None]:
# This Python 3 environment is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
import missingno as msno
from sklearn.metrics import mean_squared_error
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

pd.set_option("display.max_rows", 2000)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 100)
pd.set_option("display.max_colwidth", 500)
pd.set_option("display.float_format", lambda x: "%.2f" % x)

from sklearn.inspection import permutation_importance
from IPython.display import display, HTML, display_html

display(HTML("<style>.container { width:100% !important; }</style>"))
import matplotlib as mpl
from matplotlib.ticker import MaxNLocator
import matplotlib.ticker as mticker
from lightgbm import LGBMRegressor
import optuna
from mlxtend.regressor import StackingCVRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PowerTransformer
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
import math
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import VarianceThreshold
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# The lines below print the current working directory

import os

cwd = os.getcwd()
print(cwd)


import gc

gc.enable()

from bokeh.io import output_notebook, show
from bokeh.models import (
    BasicTicker,
    ColorBar,
    ColumnDataSource,
    LinearColorMapper,
    PrintfTickFormatter,
)
from bokeh.plotting import figure
from bokeh.transform import transform


random_state = 55

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


I run the code below so that the notebook shows the entire output of a cell instead of collapsing it into a scroll box.
This helps when I am visualizing or analyzing a large number of plots in the same output.

In [None]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

Loading Train and Test data sets.

In [None]:
train = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")

A general overview of the data we need to work with. 
A few lines of code going from high level overview to lower/more detailed levels.

In [None]:
print(
    f"Train set contains {train.shape[0]} rows, {train.shape[1]} columns. \nTest set contains {test.shape[0]} rows, {test.shape[1]} columns.\n"
)
print(
    f"{set(train.columns) - set(test.columns)} are the fields that are IN TRAIN and NOT IN TEST.\n {set(test.columns) - set(train.columns)} are the fields that are IN TEST and NOT IN TRAIN. "
)


**Checking to see data types and potential missing values**:

In [None]:
train.info()

**Descriptive statistics:**

In [None]:
display(
    train.describe().iloc[:, 0:18].applymap("{:,g}".format),
    train.describe().iloc[:, 18:].applymap("{:,g}".format),
)


Let's look at a sample of records...

In [None]:
sample_count = 5

display(
    train.sample(sample_count, random_state=random_state)
    .iloc[:, :30]
    .style.hide_index(),
    train.sample(sample_count, random_state=random_state)
    .iloc[:, 30:60]
    .style.hide_index(),
    train.sample(sample_count, random_state=random_state)
    .iloc[:, 60:]
    .style.hide_index(),
)


Extracting the ID since we need to use it for submission later.<br>
I also concatenate train and test sets into one dataframe.

In [None]:
Id = "Id"

submission_ID = test.loc[:, Id]

train.drop(Id, axis=1, inplace=True)
test.drop(Id, axis=1, inplace=True)

# For identification purposes
train.loc[:, "Train"] = 1
test.loc[:, "Train"] = 0

test["SalePrice"] = 0

stacked_DF = pd.concat([train, test], ignore_index=True)


# Target Variable Distribution - Univariate

Let's look at the target variable distribution. I am creating an additional plot to show how log transformation impacts the variable since it's skewed.

In [None]:
params = {
    "axes.labelsize": 12,
    "axes.titlesize": 16,
    "xtick.labelsize": 11,
    "ytick.labelsize": 11,
}
plt.rcParams.update(params)

# The two panels are on different scales (dollars vs. log-dollars), so the x-axis is not shared.
(fig, ax) = plt.subplots(nrows=2, ncols=1, figsize=[8, 10])

ax[0].set_xlabel("SalePrice")
ax[0].set_ylabel("Count")
ax[0].xaxis.label.set_color('midnightblue')
ax[0].yaxis.label.set_color('midnightblue')
ax[0].title.set_color('midnightblue')
ax[0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
ax[1].set_xlabel("log(SalePrice)")
ax[1].set_ylabel("Count")
ax[1].xaxis.label.set_color('midnightblue')
ax[1].yaxis.label.set_color('midnightblue')
ax[1].title.set_color('midnightblue')

plot_X = stacked_DF.loc[stacked_DF["Train"] == 1, "SalePrice"]

plot = ax[0].hist(plot_X, bins=75)
# hist(log=True) would only put the count axis on a log scale,
# so the values themselves are log-transformed here.
plot = ax[1].hist(np.log(plot_X), bins=75)

ax[0].set_title("Sale Price")
ax[1].set_title("Sale Price (Log Transformed)")


It looks like a log transformation will be relatively successful at making the target variable more Gaussian.
Even after the log transformation, I see some outliers on the right-hand side of the distribution. However, I am not sure there's anything I can do about those.
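A quick numeric companion to the histograms is to compare skewness before and after the transform. The sketch below uses synthetic log-normal values as a stand-in for `SalePrice` (an assumption for illustration, not the competition data).

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (log-normal); a stand-in for SalePrice.
rng = np.random.default_rng(55)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=1_000)))

skew_raw = prices.skew()          # skewness on the dollar scale
skew_log = np.log(prices).skew()  # skewness after the log transform

print(f"Skewness before log: {skew_raw:.2f}")
print(f"Skewness after log:  {skew_log:.2f}")
```

A skewness close to zero after the transform suggests the log scale is a reasonable choice for the target.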



# Bivariate Analysis

I will first plot a few numeric variables against the target variable, then I will do the same for categorical variables.
What I am looking for is correlation, outliers and the distribution of the target variable with respect to the independent variables.<br>
A scatter plot works well for numeric variables. For categorical variables, I will use box plots.

In [None]:
params = {"axes.labelsize": 12, 
          "xtick.labelsize": 11, 
          "ytick.labelsize": 11}
plt.rcParams.update(params)

features_to_viz = [
    "GrLivArea",
    "YearBuilt",
    "WoodDeckSF",
    "LotArea",
    "GarageArea",
    "1stFlrSF",
    "2ndFlrSF",
    "TotalBsmtSF",
    "LotFrontage",
    "GarageYrBlt",
]

# Because there are a lot of variables to visualize,
# sorting them helps me keep track of which variable is where

features_to_viz = sorted(features_to_viz)

ncols = 2
nrows = math.ceil(len(features_to_viz) / ncols)
unused = nrows * ncols - len(features_to_viz)

figw = ncols * 5
figh = nrows * 4

(fig, ax) = plt.subplots(nrows, ncols, sharey=True, figsize=(figw, figh))
fig.subplots_adjust(hspace=0.3, wspace=0.1)
ax = ax.flatten()

for i in range(unused, 0, -1):
    fig.delaxes(ax[-i])

for (n, col) in enumerate(features_to_viz):
    if n % 2 != 0:
        ax[n].yaxis.label.set_visible(False)
    ax[n].set_xlabel(col)
    ax[n].set_ylabel("SalePrice")
    ax[n].xaxis.label.set_color('midnightblue')
    ax[n].yaxis.label.set_color('midnightblue')
    ax[n].yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
    sns.scatterplot(
        x=col,
        y="SalePrice",
        data=stacked_DF.loc[stacked_DF["Train"] == 1],
        legend=False,
        ax=ax[n],
    )

plt.show()

**Looking at the graphs above, I find a few things to note:**

* GrLivArea, YearBuilt, LotArea and a few others have a roughly linear relationship with the target variable.
* There are a few outliers. The data source's notes mention these outliers and recommend removing them. I, however, don't like the idea of removing records, as we're already working with a limited number of samples.
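One way to keep every record while still making unusual points visible is to flag them instead of dropping them. This is only a minimal sketch on made-up rows: the 4,000 sq ft cutoff and the `LargeAreaOutlier` column name are assumptions for illustration, and the rest of the notebook does not rely on them.

```python
import pandas as pd

# Toy rows; the GrLivArea and SalePrice values are made up for illustration.
toy = pd.DataFrame(
    {
        "GrLivArea": [1500, 2200, 4700, 5600, 1800],
        "SalePrice": [180_000, 250_000, 184_000, 160_000, 210_000],
    }
)

AREA_CUTOFF = 4000  # assumed threshold, not tuned on the real data

# Keep every row, but record which ones look like large-area outliers.
toy["LargeAreaOutlier"] = (toy["GrLivArea"] > AREA_CUTOFF).astype(int)
print(toy)
```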

Moving on with the categorical bivariate analysis.

In [None]:
features_to_viz = [
    "BsmtQual",
    "ExterQual",
    "FireplaceQu",
    "ExterCond",
    "KitchenQual",
    "LotShape",
    "OverallQual",
    "FullBath",
    "HalfBath",
    "TotRmsAbvGrd",
    "Fireplaces",
    "KitchenAbvGr",
    "Neighborhood",
]

ncols = 2
nrows = math.ceil(len(features_to_viz) / ncols)
unused = nrows * ncols - len(features_to_viz)

(figw, figh) = (ncols * 7, nrows * 6)

(fig, ax) = plt.subplots(nrows, ncols, sharey=False, figsize=(figw, figh))
fig.subplots_adjust(hspace=0.2, wspace=0.1)

ax = ax.flatten()
for i in range(unused, 0, -1):
    fig.delaxes(ax[-i])

for (n, col) in enumerate(features_to_viz):
    if(col=="Neighborhood"):
        ordering = (
        stacked_DF.loc[stacked_DF["Train"] == 1]
        .groupby(by=col)["SalePrice"]
        .median()
        .sort_values()
        .index
        )
        sns.boxplot(
        x="SalePrice",
        y=col,
        data=stacked_DF.loc[stacked_DF["Train"] == 1],
        order=ordering,
        ax=ax[n],
        orient="h")
        ax[n].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
        ax[n].xaxis.label.set_color('midnightblue')
        ax[n].yaxis.label.set_color('midnightblue')
    else:
        ordering = (
            stacked_DF.loc[stacked_DF["Train"] == 1]
            .groupby(by=col)["SalePrice"]
            .median()
            .sort_values()
            .index
        )
        if(n%2==1):
            visibility=False
        else:
            visibility=True
        ax[n].get_yaxis().set_visible(visibility)
        sns.boxplot(
            y="SalePrice",
            x=col,
            data=stacked_DF.loc[stacked_DF["Train"] == 1],
            order=ordering,
            ax=ax[n],
            orient="v",
        )
        ax[n].xaxis.label.set_color('midnightblue')
        ax[n].yaxis.label.set_color('midnightblue')
        ax[n].yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))

plt.show()

* OverallQual impacts SalePrice exponentially!
* Neighborhood matters. Features regarding quality also matter.
* **The more irregular the lot shape is, the higher the Sale Price seems**. This was a surprise to me. That being said, I can relate irregular lot shape to architectural originality, which costs money. Regular lot shape relates to just regular homes that more of us can afford. But this is only a guess.
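If the OverallQual relationship really is close to exponential, the log of the price should be roughly linear in the quality score. The sketch below checks that idea on synthetic data; the 20% growth per quality point and the base price are assumed numbers, not estimates from the Ames data.

```python
import numpy as np

rng = np.random.default_rng(55)

# Synthetic example: price grows about 20% per quality point (assumed rate).
quality = rng.integers(1, 11, size=500)
price = 50_000 * 1.2 ** quality * np.exp(rng.normal(scale=0.1, size=500))

# Fit a straight line to log(price) versus quality.
slope, intercept = np.polyfit(quality, np.log(price), deg=1)

# exp(slope) is the multiplicative change in price per quality point.
growth_per_point = np.exp(slope)
print(f"Estimated growth per quality point: {growth_per_point:.3f}")
```

This is also one reason a log-transformed target pairs well with linear models: an exponential effect in dollars becomes an additive effect on the log scale.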

# Missing Values

I will first look at the number of missing records for each feature. Then I will use a visualization to see relations between features with missing records. These relations may provide guidance on dealing with missing records. For example, if there are a lot of houses where all garage-related features are missing, this most likely indicates the absence of a garage for those houses.
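The garage example can also be checked numerically. Here is a small sketch on a toy frame (the values are made up; only the column names mirror the dataset): if the number of rows where all garage fields are missing equals the number of rows where any of them is missing, the fields are missing together.

```python
import numpy as np
import pandas as pd

# Toy frame: rows 1 and 3 have no garage, so every garage field is NaN there.
toy = pd.DataFrame(
    {
        "GarageType": ["Attchd", np.nan, "Detchd", np.nan],
        "GarageQual": ["TA", np.nan, "TA", np.nan],
        "GarageCond": ["TA", np.nan, "Fa", np.nan],
        "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    }
)

garage_cols = ["GarageType", "GarageQual", "GarageCond"]
missing = toy[garage_cols].isna()

all_missing = missing.all(axis=1).sum()  # every garage field is missing
any_missing = missing.any(axis=1).sum()  # at least one garage field is missing

print(f"Rows with all garage fields missing: {all_missing}")
print(f"Rows with any garage field missing:  {any_missing}")
```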

In [None]:
print("Missing Value Counts in Train DF")
stacked_DF[stacked_DF["Train"] == 1].isna().sum()[
    stacked_DF[stacked_DF["Train"] == 1].isna().sum() > 0
].sort_values(ascending=False)

In [None]:
print("Missing Values in Test DF")
stacked_DF[stacked_DF["Train"] == 0].isna().sum()[
    stacked_DF[stacked_DF["Train"] == 0].isna().sum() > 0
].sort_values(ascending=False)


In [None]:
# Check missing records across the combined train and test sets
na_cols = (stacked_DF.isna().sum()[stacked_DF.isna().sum() > 0]).index
mat = msno.matrix(
    stacked_DF.loc[:, na_cols], labels=True, figsize=(14, 10), fontsize=12, inline=False
)

# Imputing Missing Values and Feature Engineering

Below I will impute missing records and do some feature engineering to clean up the data before modeling. I will also establish new features based on some of the existing features. 

In [None]:
# Assuming Neighborhood and MSZoning are related.
lookup = (
    stacked_DF.loc[stacked_DF["Train"] == 1]
    .groupby(by="Neighborhood")["MSZoning"]
    .agg(pd.Series.mode)
)
stacked_DF["MSZoning"] = stacked_DF["MSZoning"].fillna(
    stacked_DF["Neighborhood"].map(lookup)
)

# Assuming KitchenQual and OverallQual are related.
lookup = (
    stacked_DF.loc[stacked_DF["Train"] == 1]
    .groupby(by="OverallQual")["KitchenQual"]
    .agg(pd.Series.mode)
)
stacked_DF["KitchenQual"] = stacked_DF["KitchenQual"].fillna(
    stacked_DF["OverallQual"].map(lookup)
)

# For these features I replace nan with a string indicator: "Missing"
cols_na_to_missing = {
    "Alley",
    "BsmtCond",
    "BsmtExposure",
    "BsmtFinType1",
    "BsmtQual",
    "Fence",
    "FireplaceQu",
    "GarageCond",
    "GarageFinish",
    "GarageQual",
    "GarageType",
    "MasVnrType",
    "MiscFeature",
    "PoolQC",
    "BsmtFinType2",
}

# For these features I replace nan with the integer 0
cols_na_to_zero = {
    # 'BsmtUnfSF',
    "GarageArea",
    "GarageCars",
    "TotalBsmtSF",
    "MasVnrArea",
    "BsmtFinSF1",
    "BsmtFinSF2",
    "BsmtFullBath",
    "BsmtHalfBath",
    "GarageYrBlt",
}

# For these features I replace nan with the mode of the feature the record is missing.
cols_na_to_mode = {
    "Functional",
    "Electrical",
    "Utilities",
    "Exterior1st",
    "Exterior2nd",
    "SaleType",
}

for col in cols_na_to_missing:
    stacked_DF[col] = stacked_DF[col].astype(object).fillna("Missing")

for col in cols_na_to_zero:
    # Fill in place without casting to object, so these columns stay numeric
    stacked_DF[col] = stacked_DF[col].fillna(0)

for col in cols_na_to_mode:
    stacked_DF[col] = (
        stacked_DF[col]
        .astype(object)
        .fillna(stacked_DF.loc[stacked_DF["Train"] == 1, col].mode()[0])
    )

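# Imputing remaining missing values with the help of iterative imputer.
# SalePrice is excluded: it is the target, and it is a placeholder 0 for every test row.
num_features = (
    stacked_DF.drop(columns=["Train", "SalePrice"]).select_dtypes("number").columns
)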
imputer = IterativeImputer(
    RandomForestRegressor(max_depth=8),
    n_nearest_features=10,
    max_iter=10,
    random_state=random_state,
)
stacked_DF.loc[stacked_DF["Train"] == 1, num_features] = imputer.fit_transform(
    stacked_DF.loc[stacked_DF["Train"] == 1, num_features].values
)
stacked_DF.loc[stacked_DF["Train"] == 0, num_features] = imputer.transform(
    stacked_DF.loc[stacked_DF["Train"] == 0, num_features].values
)


In [None]:
stacked_DF["WarmSeason"] = np.where(
    stacked_DF["MoSold"].isin([10, 11, 12, 1, 2, 3]), 0, 1
)
stacked_DF["SqFtPerRoom"] = stacked_DF["GrLivArea"] / (
    stacked_DF["TotRmsAbvGrd"]
    + stacked_DF["FullBath"]
    + stacked_DF["HalfBath"]
    + stacked_DF["KitchenAbvGr"]
)

# Converting MSSubClass to categorical
stacked_DF["MSSubClass"] = stacked_DF["MSSubClass"].astype(str)

# I combine underrepresented categories under one umbrella and/or with another category in the same field
ext2_map = {"AsphShn": "Oth1", "CBlock": "Oth1", "CmentBd": "Oth2", "Other": "Oth2"}
roofmatl_map = {
    "Roll": "Oth1",
    "ClyTile": "Oth1",
    "Metal": "Oth1",
    "CompShg": "Oth1",
    "Membran": "Oth2",
    "WdShake": "Oth2",
}

cond2_map = {"PosA": "Pos", "PosN": "Pos", "RRAe": "Norm", "RRAn": "Norm"}


stacked_DF["Exterior2nd"] = (
    stacked_DF["Exterior2nd"].map(ext2_map).fillna(stacked_DF["Exterior2nd"])
)
stacked_DF["RoofMatl"] = (
    stacked_DF["RoofMatl"].map(roofmatl_map).fillna(stacked_DF["RoofMatl"])
)
stacked_DF["Condition2"] = (
    stacked_DF["Condition2"].map(cond2_map).fillna(stacked_DF["Condition2"])
)

# Dropping a couple of features that add little beyond what other variables already capture.
# I did trial and error here based on the impact of removing features on RMSE.
stacked_DF.drop(columns=["GarageYrBlt", "Utilities"], inplace=True)

# Correlation Heatmap

Let's look at the correlation heatmap to see which independent variables are correlated.<br> In an ideal scenario, we wouldn't see high correlation **among independent variables**. I trust our models will pick the features that are important to them, without us needing to deal with multicollinearity. In fact, when I ran the model after picking and eliminating correlated features, the model accuracy decreased.<br>

As useful as they are, heatmaps get messy pretty quickly when dealing with a high number of variables. Bokeh provides interactive plotting, which means I can hover over a red square to find out which two features have high correlation.
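A plain-text complement to the heatmap is a sorted list of the most correlated pairs. The sketch below shows one way to build it on synthetic columns (`a`, `b` and `c` are made-up names; `b` is deliberately constructed from `a`).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(55)
base = rng.normal(size=200)

# Synthetic columns: "b" is built from "a", "c" is independent noise.
toy = pd.DataFrame(
    {
        "a": base,
        "b": base + rng.normal(scale=0.1, size=200),
        "c": rng.normal(size=200),
    }
)

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

print(pairs.head())
```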

In [None]:
output_notebook()

df_to_viz = stacked_DF[stacked_DF["Train"] == 1].drop(columns="Train")

# Correlation is only defined for numeric columns
xcorr = abs(df_to_viz.select_dtypes("number").corr())
xcorr.index.name = "Feature1"
xcorr.columns.name = "Feature2"

df = pd.DataFrame(xcorr.stack(), columns=["Corr"]).reset_index()

source = ColumnDataSource(df)

colors = [
    "#75968f",
    "#a5bab7",
    "#c9d9d3",
    "#e2e2e2",
    "#dfccce",
    "#ddb7b1",
    "#cc7878",
    "#933b41",
    "#550b1d",
]

mapper = LinearColorMapper(palette=colors, low=df.Corr.min(), high=df.Corr.max())

f1 = figure(
    plot_width=800,
    plot_height=800,
    title="Correlation Heat Map",
    x_range=list(sorted(xcorr.index)),
    y_range=list(reversed(sorted(xcorr.columns))),
    toolbar_location=None,
    tools="hover",
    x_axis_location="above",
)

f1.rect(
    x="Feature2",
    y="Feature1",
    width=1,
    height=1,
    source=source,
    line_color=None,
    fill_color=transform("Corr", mapper),
)

color_bar = ColorBar(
    color_mapper=mapper,
    location=(0, 0),
    ticker=BasicTicker(desired_num_ticks=len(colors)),
    formatter=PrintfTickFormatter(format="%.2f"),
)
f1.add_layout(color_bar, "right")

f1.hover.tooltips = [
    ("Feature1", "@{Feature1}"),
    ("Feature2", "@{Feature2}"),
    ("Corr", "@{Corr}{1.1111}"),
]

f1.axis.axis_line_color = None
f1.axis.major_tick_line_color = None
f1.axis.major_label_text_font_size = "12px"
f1.axis.major_label_standoff = 2
f1.xaxis.major_label_orientation = 1.0

show(f1)


# Final Preprocessing Steps

In the next two sections, I will apply one hot encoding to categoricals, and scaling on numericals (except boolean-like features). Once that's done, I will have all the transformed variables go through Variance Threshold. Variance Threshold will remove any feature that shows less variance than what I want.  
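To get a feel for what a variance threshold means for one hot encoded columns: a 0/1 column whose share of ones is p has variance p(1 - p). The sketch below works out which dummy columns a threshold of 0.01 would remove. The 1,460-row column with 10 ones is an assumed example, not a column from the data.

```python
import numpy as np

threshold = 0.01  # the variance cutoff discussed above


def dummy_variance(p):
    """Population variance of a 0/1 column whose share of ones is p."""
    return p * (1 - p)


# Smallest share of ones that still reaches the threshold: solve p(1 - p) = threshold.
p_min = (1 - np.sqrt(1 - 4 * threshold)) / 2
print(f"Dummies active in fewer than {p_min:.2%} of rows fall below the threshold")

# Check against an actual column: 10 ones out of 1,460 rows (assumed example).
column = np.zeros(1460)
column[:10] = 1
print(column.var(), column.var() > threshold)
```

Standardized numeric features have unit variance, so in practice a cutoff like this mostly affects rare categories.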

There are a few columns with a different set of categories in the train and test sets.
This causes an issue for the one hot encoder's "drop" rule.<br> Therefore, I am applying separate one hot encoding to two different subsets of categoricals.
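The sketch below illustrates what happens to a category that appears only in the test rows when the encoder is told to ignore unknown values. The category names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data: "Stone" shows up only in the test rows.
train_col = np.array([["Brick"], ["Wood"], ["Brick"]])
test_col = np.array([["Wood"], ["Stone"]])

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_col)

encoded_test = encoder.transform(test_col).toarray()
print(encoder.categories_)
print(encoded_test)
```

The unseen category simply contributes nothing to any dummy column, which is why these features can be encoded without dropping a reference level.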

In [None]:
# Obtaining a list of categorical, numerical, and boolean - like features.
bool_features = [
    col
    for col in stacked_DF.select_dtypes(include=["number"]).columns
    if np.array_equal(
        np.sort(stacked_DF[col].unique(), axis=0), np.sort([0, 1], axis=0)
    )
]

cat_features = [col for col in stacked_DF.select_dtypes(exclude=["number"]).columns]
num_features = [
    col
    for col in stacked_DF.select_dtypes(include=["number"]).columns
    if col not in (bool_features) and col != "SalePrice"
]

# Holding these two DF 's on the side.
# Will need to concatenate later with the preprocessed(scaled and oh encoded) DF.
bool_features.remove("Train")
X_train_bool = stacked_DF.loc[stacked_DF["Train"] == 1, bool_features]
X_test_bool = stacked_DF.loc[stacked_DF["Train"] == 0, bool_features]

# This list contains features that have the same set of values in train and test
ohe_cols_a = []

# This list contains features that have a different set of values in train and test
ohe_cols_b = []

for col in cat_features:
    if set(stacked_DF.loc[stacked_DF["Train"] == 1, col].unique()) != set(
        stacked_DF.loc[stacked_DF["Train"] == 0, col].unique()
    ):
        ohe_cols_b.append(col)

ohe_cols_a = list(set(cat_features) - set(ohe_cols_b))

In [None]:
X_train = stacked_DF.loc[stacked_DF["Train"] == 1].drop(
    labels=["SalePrice", "Train"], axis=1
)
# Applying log transformation to the target variable
y_train = stacked_DF.loc[stacked_DF["Train"] == 1, "SalePrice"].apply(np.log)
X_test = stacked_DF.loc[stacked_DF["Train"] == 0].drop(
    labels=["SalePrice", "Train"], axis=1
)


preprocessor = ColumnTransformer(
    transformers=[
        ("onehota", OneHotEncoder(sparse=False, drop="first"), ohe_cols_a),
        ("onehotb", OneHotEncoder(sparse=False, handle_unknown="ignore"), ohe_cols_b),
        ("scaler", StandardScaler(), num_features),
    ],
    remainder="drop",
)


pipeline = Pipeline(
    [("Preprocessor", preprocessor), ("VarThreshold", VarianceThreshold(0.01))]
)

X_train_preprocessed = pipeline.fit_transform(X_train)
X_test_preprocessed = pipeline.transform(X_test)

# Get the list of one hot encoded columns and combine them
oh_encoded_a = list(
    preprocessor.named_transformers_.onehota.get_feature_names(ohe_cols_a)
)
oh_encoded_b = list(
    preprocessor.named_transformers_.onehotb.get_feature_names(ohe_cols_b)
)
oh_encoded_cols = oh_encoded_a + oh_encoded_b

feature_names = np.array(oh_encoded_cols + num_features, order="K")

# Filtering out the features dropped by variance threshold
feature_names = feature_names[pipeline.named_steps.VarThreshold.get_support()]

# Putting back the column names to help with analysis
X_train_preprocessed = pd.DataFrame(data=X_train_preprocessed, columns=feature_names)
X_test_preprocessed = pd.DataFrame(
    data=X_test_preprocessed, columns=feature_names, index=X_test_bool.index
)

# Combine the DF's back together
X_train = pd.concat([X_train_bool, X_train_preprocessed], axis=1)
X_test = pd.concat([X_test_bool, X_test_preprocessed], axis=1)


# Feature Importance

Lasso is a good feature selection model for datasets like ours (high number of features, low sample size). <br>In addition to Lasso, I'll be using sklearn's permutation feature importance. The pitfall of permutation importance is that it doesn't deal well with correlated variables. This is something to keep in mind when we're interpreting the results.
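The correlated-variable pitfall can be seen in a small experiment. The sketch below implements permutation importance by hand with NumPy and uses ordinary least squares as a stand-in for Lasso (an assumption made to keep the example dependency-free; Lasso would tend to keep only one of two identical columns). The same informative feature looks far less important once an exact copy of it is present.

```python
import numpy as np

rng = np.random.default_rng(55)
n = 1_000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
y = 3.0 * signal + rng.normal(scale=0.1, size=n)


def r2(X, y, coef):
    """R^2 of the linear prediction X @ coef."""
    resid = y - X @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)


def permutation_drop(X, y, col, n_repeats=10):
    """Mean drop in R^2 after shuffling one column (the model is fit once)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    baseline = r2(X, y, coef)
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(baseline - r2(X_perm, y, coef))
    return float(np.mean(drops))


# Case 1: the informative feature appears once.
X_single = np.column_stack([signal, noise])
imp_single = permutation_drop(X_single, y, col=0)

# Case 2: the same feature appears twice (perfectly correlated copies).
X_dup = np.column_stack([signal, signal, noise])
imp_dup = permutation_drop(X_dup, y, col=0)

print(f"Importance when unique:     {imp_single:.2f}")
print(f"Importance when duplicated: {imp_dup:.2f}")
```

Because the model can lean on the untouched copy, shuffling one of the two columns hurts much less, so a low permutation score does not by itself mean a feature carries no information.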

In [None]:
model = Lasso(alpha=0.01)
model.fit(X_train, y_train)


feature_imp = permutation_importance(
    model, X_train, y_train, n_repeats=10, n_jobs=-1, random_state=random_state
)

perm_ft_imp_df = pd.DataFrame(
    data=feature_imp.importances_mean, columns=["FeatureImp"], index=X_train.columns
).sort_values(by="FeatureImp", ascending=False)
model_ft_imp_df = pd.DataFrame(
    data=model.coef_, columns=["FeatureImp"], index=X_train.columns
).sort_values(by="FeatureImp", ascending=False)

fig, ax = plt.subplots(2, 1, figsize=(10, 16))

perm_ft_imp_df_nonzero = perm_ft_imp_df[perm_ft_imp_df["FeatureImp"] != 0]
model_ft_imp_df_nonzero = model_ft_imp_df[model_ft_imp_df["FeatureImp"] != 0]

sns.barplot(
    x=perm_ft_imp_df_nonzero["FeatureImp"],
    y=perm_ft_imp_df_nonzero.index,
    ax=ax[0],
    palette="vlag",
)
sns.barplot(
    x=model_ft_imp_df_nonzero["FeatureImp"],
    y=model_ft_imp_df_nonzero.index,
    ax=ax[1],
    palette=sns.diverging_palette(10, 220, sep=2, n=80),
)

ax[0].set_title("Permutation Feature Importance")
ax[1].set_title("Lasso Feature Importance")

plt.show()

Features like OverallQual, Neighborhood, GrLivArea and YearBuilt rank high in our models, as one would expect.

# Hyperparameter optimizing with Optuna

Having completed all preprocessing steps, let's continue with model selection and hyperparameter optimization. <br>
<br> Because our data has many features and few samples, there's a great chance of overfitting. Models with a regularization mechanism, such as Lasso and Ridge, do well on such data. In addition to Lasso and Ridge, I'll use SVR, LGBM and RandomForest regressors.
<br><br>
I first used Optuna's search algorithm to get a starting point for the hyperparameters. Then I did some manual tweaking and testing to further increase the accuracy. <br> What I like about Optuna compared to grid search is that it lets me set a "search range" instead of having to declare the candidate values one by one.<br> It also works faster than GridSearch. <br><br> I am leaving the hyperparameter search section commented out, as I have already obtained the parameters I'll run the models with.

In [None]:
# def objective(trial):
#     _C = trial.suggest_float("C", 0.1, 0.5)
#     _epsilon = trial.suggest_float("epsilon", 0.01, 0.1)
#     _coef = trial.suggest_float("coef0", 0.5, 1)

#     svr = SVR(cache_size=5000, kernel="poly", C=_C, epsilon=_epsilon, coef0=_coef)

#     score = cross_val_score(
#         svr, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
#     ).mean()
#     return score


# optuna.logging.set_verbosity(0)

# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=100)

# svr_params = study.best_params
# svr_best_score = study.best_value
# print(f"Best score:{svr_best_score} \nOptimized parameters: {svr_params}")

In [None]:
# def objective(trial):

#     _alpha = trial.suggest_float("alpha", 0.5, 1)
#     _tol = trial.suggest_float("tol", 0.5, 0.9)

#     ridge = Ridge(alpha=_alpha, tol=_tol, random_state=random_state)

#     score = cross_val_score(
#         ridge, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
#     ).mean()
#     return score


# optuna.logging.set_verbosity(0)

# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=100)

# ridge_params = study.best_params
# ridge_best_score = study.best_value
# print(f"Best score:{ridge_best_score} \nOptimized parameters: {ridge_params}")


In [None]:
# def objective(trial):

#     _alpha = trial.suggest_float("alpha", 0.0001, 0.01)

#     lasso = Lasso(alpha=_alpha, random_state=random_state)

#     score = cross_val_score(
#         lasso, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
#     ).mean()
#     return score


# optuna.logging.set_verbosity(0)

# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=100)

# lasso_params = study.best_params
# lasso_best_score = study.best_value
# print(f"Best score:{lasso_best_score} \nOptimized parameters: {lasso_params}")


In [None]:
# def objective(trial):
#     _n_estimators = trial.suggest_int("n_estimators", 50, 200)
#     _max_depth = trial.suggest_int("max_depth", 5, 12)
#     _min_samp_split = trial.suggest_int("min_samples_split", 2, 8)
#     _min_samples_leaf = trial.suggest_int("min_samples_leaf", 3, 6)
#     _max_features = trial.suggest_int("max_features", 10, 50)

#     rf = RandomForestRegressor(
#         max_depth=_max_depth,
#         min_samples_split=_min_samp_split,
#         min_samples_leaf=_min_samples_leaf,
#         max_features=_max_features,
#         n_estimators=_n_estimators,
#         n_jobs=-1,
#         random_state=random_state,
#     )

#     score = cross_val_score(
#         rf, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
#     ).mean()
#     return score


# optuna.logging.set_verbosity(0)

# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=100)

# rf_params = study.best_params
# rf_best_score = study.best_value
# print(f"Best score:{rf_best_score} \nOptimized parameters: {rf_params}")


In [None]:
# def objective(trial):
#     _num_leaves = trial.suggest_int("num_leaves", 5, 20)
#     _max_depth = trial.suggest_int("max_depth", 3, 8)
#     _learning_rate = trial.suggest_float("learning_rate", 0.1, 0.4)
#     _n_estimators = trial.suggest_int("n_estimators", 50, 150)
#     _min_child_weight = trial.suggest_float("min_child_weight", 0.2, 0.6)

#     lgbm = LGBMRegressor(
#         num_leaves=_num_leaves,
#         max_depth=_max_depth,
#         learning_rate=_learning_rate,
#         n_estimators=_n_estimators,
#         min_child_weight=_min_child_weight,
#     )

#     score = cross_val_score(
#         lgbm, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
#     ).mean()
#     return score


# optuna.logging.set_verbosity(0)

# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=100)

# lgbm_params = study.best_params
# lgbm_best_score = study.best_value
# print(f"Best score:{lgbm_best_score} \nOptimized parameters: {lgbm_params}")


In [None]:
rf_params = {"max_depth": 8, "max_features": 40, "n_estimators": 132}
svr_params = {
    "kernel": "poly",
    "C": 0.053677105521141605,
    "epsilon": 0.03925943476562099,
    "coef0": 0.9486751042886584,
}
ridge_params = {
    "alpha": 0.9999189637151178,
    "tol": 0.8668539399622242,
    "solver": "cholesky",
}
lasso_params = {"alpha": 0.0004342843645993161, "selection": "random"}
lgbm_params = {
    "num_leaves": 16,
    "max_depth": 6,
    "learning_rate": 0.16060612646519587,
    "n_estimators": 64,
    "min_child_weight": 0.4453842422224686,
}

# Model Comparison

Let's use `cross_val_score` to see how the different models perform.
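For readers who want to see what a k-fold score is made of, here is a hand-rolled sketch with NumPy. It uses synthetic data and ordinary least squares as the model (both are assumptions for illustration), and reports one RMSE per held-out fold.

```python
import numpy as np

rng = np.random.default_rng(55)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

n_splits = 4
indices = rng.permutation(len(X))  # shuffle once, then cut into folds
folds = np.array_split(indices, n_splits)

fold_rmse = []
for k in range(n_splits):
    val_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])

    # Fit ordinary least squares on the training folds only.
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

    # Score on the held-out fold.
    resid = y[val_idx] - X[val_idx] @ coef
    fold_rmse.append(float(np.sqrt(np.mean(resid ** 2))))

print(np.round(fold_rmse, 3), "mean:", round(float(np.mean(fold_rmse)), 3))
```

The spread of the per-fold values is what the box plot of model scores summarizes for each model.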

In [None]:
# random_state only has an effect when the folds are shuffled
cv = KFold(n_splits=4, shuffle=True, random_state=random_state)

svr = SVR(**svr_params)
ridge = Ridge(**ridge_params, random_state=random_state)
lasso = Lasso(**lasso_params, random_state=random_state)
lgbm = LGBMRegressor(**lgbm_params, random_state=random_state)
rf = RandomForestRegressor(**rf_params, random_state=random_state)
stack = StackingCVRegressor(
    regressors=[svr, ridge, lasso, lgbm, rf],
    meta_regressor=LinearRegression(n_jobs=-1),
    random_state=random_state,
    cv=cv,
    n_jobs=-1,
)

# `scoring` selects the metric; `error_score` only sets the value recorded when a fit fails.
# R^2 (the regressors' default score) is used here, so higher is better.
svr_scores = cross_val_score(svr, X_train, y_train, cv=cv, n_jobs=-1, scoring="r2")
ridge_scores = cross_val_score(ridge, X_train, y_train, cv=cv, n_jobs=-1, scoring="r2")
lasso_scores = cross_val_score(lasso, X_train, y_train, cv=cv, n_jobs=-1, scoring="r2")
lgbm_scores = cross_val_score(lgbm, X_train, y_train, cv=cv, n_jobs=-1, scoring="r2")
rf_scores = cross_val_score(rf, X_train, y_train, cv=cv, n_jobs=-1, scoring="r2")
stack_scores = cross_val_score(stack, X_train, y_train, cv=cv, n_jobs=-1, scoring="r2")

scores = [svr_scores, ridge_scores, lasso_scores, lgbm_scores, rf_scores, stack_scores]
models = ["SVR", "RIDGE", "LASSO", "LGBM", "RF", "STACK"]
score_medians = [
    round(np.median([mean for mean in modelscore]), 5) for modelscore in scores
]


In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

vertical_offset = 0.001

ax.set_title("Model Score Comparison")
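# Wide-form input: one column of fold scores per model, in the same order as `models`
bp = sns.boxplot(data=pd.DataFrame(dict(zip(models, scores))), ax=ax)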


for xtick in bp.get_xticks():
    bp.text(
        xtick,
        score_medians[xtick] - vertical_offset,
        score_medians[xtick],
        horizontalalignment="center",
        verticalalignment="top",
        size=14,
        color="b",
    )

plt.show()


When I run the models, I see that the stacked model scores slightly better than SVR.<br> Even if one or two individual models had scored higher than the stacked model, I would still have picked the stacked model.<br> This is because I believe it will generalize better and reduce the impact of any overfitting compared to individual models. <br><br> The model trains quickly, and its cross-validated score (R², the default score for these regressors) is about 0.91, which I find quite impressive.
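The idea behind stacking can be sketched with plain NumPy: each base model produces out-of-fold predictions, and a simple linear meta-regressor learns how to combine them. This is a toy illustration on synthetic data with two deliberately weak least-squares models, not a description of how `StackingCVRegressor` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(55)
n = 400
X = rng.normal(size=(n, 4))
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(scale=0.3, size=n)


def fit_ols(X, y):
    """Least-squares coefficients for a linear model without intercept."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def rmse(pred):
    return float(np.sqrt(np.mean((y - pred) ** 2)))


# Two deliberately weak base models: each sees only half of the features.
feature_sets = [[0, 1], [2, 3]]

folds = np.array_split(rng.permutation(n), 4)
oof = np.zeros((n, len(feature_sets)))  # out-of-fold predictions

for k, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    for m, cols in enumerate(feature_sets):
        coef = fit_ols(X[np.ix_(train_idx, cols)], y[train_idx])
        oof[val_idx, m] = X[np.ix_(val_idx, cols)] @ coef

# Meta-regressor: a linear model fit on the out-of-fold predictions.
meta_coef = fit_ols(oof, y)
stacked_pred = oof @ meta_coef

base_rmse = [rmse(oof[:, m]) for m in range(len(feature_sets))]
print("Base RMSE:", np.round(base_rmse, 3))
print("Stacked RMSE:", round(rmse(stacked_pred), 3))
```

Because the base models make different mistakes, the combination can do better than either one alone, which is the intuition behind preferring the stacked model.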

# Final Submission

I will first inverse log-transform then submit the final predictions.<br><br>
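As a small worked example of the round trip (the prices below are made up): the models predict on the log scale, and `np.exp` brings the predictions back to dollars. It also helps to remember that an error on the log scale is a relative error in dollars.

```python
import numpy as np

prices = np.array([120_000.0, 185_500.0, 340_000.0])

log_prices = np.log(prices)     # the scale the models are trained on
recovered = np.exp(log_prices)  # the scale needed for the submission file

print(recovered)

# An error of 0.1 on the log scale is a relative error of about 10.5% in dollars.
relative_error = np.exp(0.1) - 1
print(f"{relative_error:.1%}")
```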

In [None]:
stack.fit(X_train.values, y_train.values)

predictions = stack.predict(X_test.values)
predictions = np.exp(predictions)

submission = pd.DataFrame({"Id": submission_ID, "SalePrice": predictions})
submission.to_csv("submission.csv", index=False)