# House Prices: Data preprocessing and StackingRegressor in a single pipeline.

The goal of this notebook, as you might have guessed from the title, is to construct a single pipeline containing data preprocessing and several stacked models (Ridge, Lasso, Gradient Boosting). This project is very light on feature engineering, and there is still a lot of further work to be done here.

**Why did I use a single pipeline?**

* Learning and practicing construction of sklearn pipelines, obviously.

* Doing preprocessing inside the pipeline prevents **train-test contamination**, which can be critical for cross-validation methods. For example, if we impute missing numerical values in the training and test data at the same time using the mean value, our training data becomes contaminated (it now contains information from the test data). The model may score very well on testing or validation, but its performance will drop when we deploy it to make decisions on new data. With cross-validation, it becomes very difficult to prevent train-test contamination by hand, because the train-test split happens inside the cv function. The solution is to use pipelines: they exclude the validation data from any type of fitting, including the fitting of preprocessing steps.
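
To make the point about leakage concrete, here is a minimal sketch on synthetic data (the array shapes, coefficients and step names are my own assumptions, not part of the competition data). Because the imputer is a step of the pipeline, `cross_val_score` refits it on the training part of every fold, so the validation part never influences the imputed means.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data (assumption: purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Knock out roughly 10% of the feature values
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer lives inside the pipeline, so it is fitted per fold
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', Ridge()),
])

scores = cross_val_score(pipe, X, y, cv=3, scoring='neg_mean_squared_error')
print('RMSE:', np.sqrt(-scores.mean()))
```

Imputing `X` once before calling `cross_val_score` would run without errors too, which is exactly why this kind of leakage is easy to miss.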

**Credit to:**

* [Regularized Linear Models](https://www.kaggle.com/apapiu/regularized-linear-models) - log transformation of skewed numeric variables and linear models

* [Hyperparameters tunning with Hyperopt](https://www.kaggle.com/ilialar/hyperparameters-tunning-with-hyperopt#Hyperopt)

In [None]:
pip install scikit-learn==0.24.1

# Loading and exploring data

To begin, let's load some essential libraries and the data from the [House Prices competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

In [None]:
# Essentials
import pandas as pd
import numpy as np

# Plots
import missingno as miss
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go

# Stats
from scipy.stats import skew

# Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, FunctionTransformer

# Models
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import StackingRegressor
from xgboost import XGBRegressor

# Hyperparameters tuning
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

# Misc
from sklearn.model_selection import cross_val_score
from sklearn import set_config

In [None]:
train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv', index_col='Id')
X_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv', index_col='Id')

In [None]:
display(train.head(5))
print()
train.info()  # info() prints its report and returns None, so display() is not needed

In [None]:
display(X_test.head(5))
print()
X_test.info()  # info() prints its report and returns None, so display() is not needed

We have 1460 rows in the training dataset and 1459 in the test dataset.

* Each row in the dataset describes the characteristics of a house - 79 features in total.

* Our goal is to predict the SalePrice, given these features.


# Exploring missing values

As we can see in the output of the `info()` method, there are missing values in the dataset. Let's explore them.

First, we'll write a function that counts the missing values in each column. The function displays the counts in a dataframe, along with the percentage of values missing and the data type of each column.

In [None]:
def missing_values_table(data):
    """Display per-column missing-value counts, percentages and dtypes."""
    missing_count = data.isna().sum()
    missing_percent = (missing_count / len(data)).apply('{0:.2%}'.format)
    data_types = data.dtypes

    miss_table = pd.concat([missing_count, missing_percent, data_types], axis=1)
    # keep only the columns that actually contain missing values
    miss_table = miss_table.loc[miss_table[0] > 0]
    miss_table.columns = ['Missing Values', '% of Total Values', 'Data Types']

    print(miss_table.shape[0], 'columns out of total',
          data.shape[1], 'columns in selected data have missing values.')
    print()
    display(miss_table.sort_values(by=['Data Types', 'Missing Values'], ascending=False))

We'll also plot a missing-value matrix from the `missingno` library.

In [None]:
missing_values_table(train)
miss.matrix(train)

In [None]:
missing_values_table(X_test)
miss.matrix(X_test)

There are a lot of missing values, mostly in categorical features. For most of them, an NA value means the absence of that feature for the house. For example:

* `Alley`: Type of alley access to property. NA - No alley access.

* `PoolQC`: Pool quality. NA - No Pool.

Let's compare mean house prices across the categories (including NaN) of six features with many missing values.

In [None]:
def na_compare_prices(feature):
    not_na_price = train[train[feature].notna()].groupby(by=feature).SalePrice.mean()
    na_price = train[train[feature].isna()].SalePrice.mean()
    
    colors = ['MediumSeaGreen',] * len(not_na_price)
    colors.append('crimson')

    fig = go.Figure([go.Bar(
        x=[*not_na_price.index, 'NaN'],
        y=[*not_na_price, na_price],
        marker_color=colors,
    )])
    
    fig.update_layout(
        title={'text': 'SalePrice by ' + feature, 'x':0.45},
        xaxis_title_text=feature, 
        yaxis_title_text='SalePrice',
        width=800, height=400,
    )
    fig.show()

In [None]:
for feature in ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageType']:  
    na_compare_prices(feature)

I think it would be reasonable to handle missing values thusly:

* For categorical features - impute them with a marker value, like `'missing'`.

* For numerical features - impute them using the mean value from k-Nearest Neighbors.
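
As a quick illustration of the k-Nearest Neighbors strategy for numerical features, here is a small sketch on a toy array (the numbers are made up for illustration). Each missing entry is replaced by the mean of that column over the two nearest rows that have a value there, with distances computed on the coordinates both rows share.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries (illustrative values only)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
# Row 0, column 2 becomes 4.0 (mean of 3 and 5);
# row 2, column 0 becomes 5.5 (mean of 3 and 8).
```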

## Data distribution and skewness.


First, let's plot the distribution of our target - SalePrice.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle('Target distribution: raw vs logarithmic')
sns.histplot(ax=axes[0], data=train.SalePrice, kde=True)
axes[0].set_title('SalePrice')
axes[0].text(520000, 150,
    'Skewness: {:.3f}\nKurtosis: {:>8.3f}'.format(train.SalePrice.skew(), train.SalePrice.kurt()),
    fontsize=12)
sns.histplot(ax=axes[1], data=np.log1p(train.SalePrice), kde=True)
axes[1].set_title('log(SalePrice + 1)')
axes[1].text(12.5, 128,
    'Skewness: {:.3f}\nKurtosis: {:>8.3f}'.format(np.log1p(train.SalePrice).skew(), np.log1p(train.SalePrice).kurt()),
    fontsize=12)
plt.show()

As we can see from the left plot, the SalePrice distribution is skewed to the right. Linear models in particular tend to fit better when the target and features are closer to normally distributed. So, in order to improve the performance of our models, we can make the target and other skewed numeric features more normal by taking $\log(feature + 1)$ - look at the plot on the right.

Let's plot the distributions of the other numerical features as well.
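
The effect of the transformation can also be checked numerically. The sketch below uses synthetic log-normal "prices" (an assumption standing in for `SalePrice`, with made-up parameters) and compares the skewness before and after `np.log1p`.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (assumption: log-normal, made-up parameters)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

log_prices = np.log1p(prices)

skew_raw = skew(prices)
skew_log = skew(log_prices)
print('Skewness raw: {:.3f}, after log1p: {:.3f}'.format(skew_raw, skew_log))
```

On data like this, the raw skewness is clearly positive, while the transformed values are close to symmetric.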

In [None]:
numerical_cols = [cname for cname in train.columns if 
                train[cname].dtype in ['int64', 'float64']]

n_rows = -(-len(numerical_cols) // 3)  # ceiling division, so no column is silently dropped
fig, axes = plt.subplots(n_rows, 3, figsize=(20, 60))
for column, ax in zip(numerical_cols, axes.flat):
    sns.histplot(ax=ax, data=train[column], kde=True)
    ax.annotate(
        'Skewness: {:.3f}\nKurtosis: {:>8.3f}'.format(train[column].skew(), train[column].kurt()),
        xy=(0.65, 0.8), xycoords='axes fraction', bbox=dict(boxstyle="round", fc="w"))

plt.show()

# Pipeline

Let's start building our pipeline.

First, let's split the training dataset into features and target and apply the `numpy.log1p` transformation to the target.

In [None]:
y_train = np.log1p(train.SalePrice)
X_train = train.drop(['SalePrice'], axis=1)

Then we'll make three lists of feature names for our preprocessing pipeline:

* categorical features with low cardinality

* numerical features

* skewed numerical features: features with `scipy.stats.skew` > 0.75

In [None]:
categorical_cols = [cname for cname in X_train.columns if
                        X_train[cname].nunique() < 15 and
                        X_train[cname].dtype == "object"]

numerical_cols = [cname for cname in X_train.columns if 
                X_train[cname].dtype in ['int64', 'float64']]

X_train = X_train[numerical_cols + categorical_cols]
X_test = X_test[numerical_cols + categorical_cols]

skewed_cols = X_train[numerical_cols].apply(lambda x: skew(x.dropna()))
skewed_cols = skewed_cols[skewed_cols > 0.75].index.tolist()

# Preprocessing pipeline

Set config so sklearn's Pipelines are displayed as neat diagrams.

In [None]:
set_config(display='diagram')

For categorical features, first we'll impute NA values with a constant marker value, and then encode them using One-Hot-Encoding.

In [None]:
categorical_transformer = Pipeline(steps=[
    ('SimpleImputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('OneHotEncoder', OneHotEncoder(handle_unknown='ignore')),
])
display(categorical_transformer)
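
To see what this transformer does to missing and unseen categories, here is a small sketch on a toy `Alley` column (the `'Brick'` value is a hypothetical category that does not occur during fitting). Missing values get their own `missing` column, while a category unseen at fit time is encoded as all zeros thanks to `handle_unknown='ignore'`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (illustrative); 'Brick' is a hypothetical unseen category
toy_train = pd.DataFrame({'Alley': ['Grvl', np.nan, 'Pave']}, dtype=object)
toy_new = pd.DataFrame({'Alley': ['Pave', np.nan, 'Brick']}, dtype=object)

toy_transformer = Pipeline(steps=[
    ('SimpleImputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('OneHotEncoder', OneHotEncoder(handle_unknown='ignore')),
])

toy_transformer.fit(toy_train)
# The encoder returns a sparse matrix by default; convert it for printing
encoded = toy_transformer.transform(toy_new).toarray()

# Learned columns are ordered as Grvl, Pave, missing
print(encoded)
# Rows: 'Pave' -> [0, 1, 0], NaN -> [0, 0, 1], 'Brick' -> [0, 0, 0]
```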

For numerical features, we'll first apply `numpy.log1p` only to the skewed columns using `ColumnTransformer`, with the remaining columns passed through unchanged, and then impute NA values using `KNNImputer`.

In [None]:
numerical_transformer = Pipeline(steps=[
    ('Regularization', ColumnTransformer(
        transformers=[
            ('Skewed: np.log1p', FunctionTransformer(np.log1p), skewed_cols),
    ], remainder='passthrough')),
    ('KNNImputer', KNNImputer(n_neighbors=2, add_indicator=False)),
])

display(numerical_transformer)
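
One detail worth knowing about `remainder='passthrough'`: the output puts the transformed columns first and the passed-through columns after them, so the column order changes. The sketch below shows this on a toy frame (column names `a`, `b`, `c` are made up). In this notebook the reordering is harmless, since the following steps do not rely on column positions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Toy frame (illustrative); only 'b' plays the role of a skewed column
toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [0.0, 9.0], 'c': [5.0, 6.0]})

ct = ColumnTransformer(
    transformers=[('log', FunctionTransformer(np.log1p), ['b'])],
    remainder='passthrough')

out = np.asarray(ct.fit_transform(toy))

# Column 0 is log1p(b); 'a' and 'c' follow in their original order
print(out)
```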

Now we'll combine numerical and categorical transformers with `ColumnTransformer` for parallel preprocessing.

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numerical_transformer, numerical_cols),
        ('categorical', categorical_transformer, categorical_cols)
])
display(preprocessor)

# Ridge

Let's try sklearn's regularized linear regression models: `Ridge()` with $L_2$ regularization and `Lasso()` with $L_1$ regularization.

First, we'll define a cross-validation function that returns the RMSE (root mean squared error), computed as the square root of the MSE averaged over 3 cross-validation folds.

In [None]:
def rmse_cv(pipe, X_train, y_train):
    return np.sqrt(-cross_val_score(pipe, X_train, y_train, scoring="neg_mean_squared_error", cv = 3, n_jobs=-1).mean())    
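
Note that `rmse_cv` takes the square root of the mean fold MSE, which is not the same number as the mean of the per-fold RMSEs. A small sketch with hypothetical fold scores shows the difference:

```python
import math

# Hypothetical per-fold MSE values (not results from this notebook)
fold_mse = [0.01, 0.04, 0.09]

# What rmse_cv computes: square root of the averaged MSE
rmse_of_mean_mse = math.sqrt(sum(fold_mse) / len(fold_mse))

# Alternative: average of the per-fold RMSE values
mean_of_fold_rmse = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)

print(round(rmse_of_mean_mse, 4), round(mean_of_fold_rmse, 4))
```

The first value is never smaller than the second, so scores from the two conventions should not be compared directly.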

`Ridge()` without hyperparameter tuning.

In [None]:
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', Ridge())
])

rmse_cv(pipe, X_train, y_train)

Let's tune the `alpha` hyperparameter using the Tree-structured Parzen Estimator algorithm from the **Hyperopt** library. It takes a Bayesian approach to optimization, which typically needs fewer evaluations than random search or grid search; this matters most for models that are slow to fit, such as gradient boosting.

We'll use hyperopt's `fmin` to find the combination of hyperparameters, within a given search space, that minimises the loss function of our pipeline.

In [None]:
def hyperopt_search(fn, space):
    trials = Trials()

    best_hyperparams=fmin(
        fn=fn, # function to optimize
        space=space, 
        algo=tpe.suggest, # optimization algorithm, hyperopt will select its parameters automatically
        max_evals=100, # maximum number of iterations
        trials=trials, # logging
        rstate=np.random.RandomState(123) # fixing random state for the reproducibility
    )

    print("The best hyperparameters are : ","\n")
    print(best_hyperparams)

Then we'll define function to optimise - RMSE of `Ridge()` model on cross-validation.

In [None]:
def ridge_rmse_cv(params, preprocessor=preprocessor, X_train=X_train, y_train=y_train, random_state=123):
    
    # the function gets a set of variable parameters in "params"
    params = {'alpha': params['alpha']}
    
    # we use these parameters to create a new pipeline
    model = Ridge(**params)    
    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    # and then conduct the cross validation
    return rmse_cv(pipe, X_train, y_train)

In [None]:
space={'alpha': hp.loguniform('alpha', -10, 4)}

hyperopt_search(ridge_rmse_cv, space)

There is a slight improvement in the model's performance after hyperparameter tuning.
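
A note on the search space: `hp.loguniform('alpha', -10, 4)` draws values whose logarithm is uniform between -10 and 4, so the bounds are exponents, not `alpha` values. The sketch below works out the actual range with the standard library only; the sampling loop imitates the distribution and is not hyperopt's own code.

```python
import math
import random

low, high = -10, 4  # bounds passed to hp.loguniform, on the log scale

alpha_min = math.exp(low)   # about 4.54e-05
alpha_max = math.exp(high)  # about 54.6
print('alpha range: {:.2e} to {:.1f}'.format(alpha_min, alpha_max))

# Imitation of log-uniform sampling: exp of a uniform draw
rng = random.Random(0)
samples = [math.exp(rng.uniform(low, high)) for _ in range(5)]
print(samples)
```

Sampling on the log scale spends as many trials between 0.0001 and 0.001 as between 1 and 10, which suits regularization strengths that can vary over several orders of magnitude.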

# Lasso

`Lasso()` without hyperparameter tuning.

In [None]:
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', Lasso())
])

rmse_cv(pipe, X_train, y_train)

`Lasso()` performed much worse out of the box. Let's see if we can improve the model by tuning `alpha` in a similar way.

In [None]:
def lasso_rmse_cv(params, preprocessor=preprocessor, X_train=X_train, y_train=y_train, random_state=123):
    
    params = {'alpha': params['alpha']}

    model = Lasso(**params)    
    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])

    return rmse_cv(pipe, X_train, y_train)

In [None]:
space={'alpha': hp.loguniform('alpha', -10, 4)}

hyperopt_search(lasso_rmse_cv, space)

Now it performs even better than `Ridge()`.

# XGBRegressor

Now let's try some gradient boosting: we'll use `XGBRegressor` model.

In [None]:
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', XGBRegressor(random_state=123))
])

rmse_cv(pipe, X_train, y_train)

Let's tune the `n_estimators`, `max_depth` and `learning_rate` parameters of the `XGBRegressor` model. (I did this step in Google Colab.)

```python
def gb_rmse_cv(params, preprocessor=preprocessor, X_train=X_train, y_train=y_train, random_state=123):
    
    params = {'n_estimators': int(params['n_estimators']), 
              'max_depth': int(params['max_depth']), 
              'learning_rate': params['learning_rate']}

    model = XGBRegressor(random_state=random_state, **params)
    
    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])

    return rmse_cv(pipe, X_train, y_train)

space={
    'n_estimators': hp.quniform('n_estimators', 100, 2000, 1),
    'max_depth' : hp.quniform('max_depth', 2, 20, 1),
    'learning_rate': hp.loguniform('learning_rate', -5, 0)
}

hyperopt_search(gb_rmse_cv, space)
```

---

100%|██████████| 100/100 [23:42<00:00, 14.23s/it, best loss: 0.12645916421481454]
The best hyperparameters are :  

{'learning_rate': 0.05779730159845562, 'max_depth': 2.0, 'n_estimators': 765.0}

After hyperparameter tuning, gradient boosting gave the best single-model result so far (RMSE of about 0.1265).
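
The values reported by hyperopt come back as floats (`max_depth` is `2.0`, `n_estimators` is `765.0`) because `hp.quniform` returns floats even with a step of 1. That is why `gb_rmse_cv` wraps them in `int()`. A minimal sketch of the same conversion applied to the reported result (the helper names `int_params` and `xgb_kwargs` are my own):

```python
# Best hyperparameters exactly as printed by the search above
best = {'learning_rate': 0.05779730159845562, 'max_depth': 2.0, 'n_estimators': 765.0}

# Parameters that must be integers for the model constructor
int_params = ('max_depth', 'n_estimators')

xgb_kwargs = {
    name: int(value) if name in int_params else value
    for name, value in best.items()
}

print(xgb_kwargs)
```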

# StackingRegressor

The idea behind stacking models is rather simple: we take the predictions of several models, ideally ones whose errors are not strongly correlated, and train a final model that weights those predictions to make the final prediction. This approach often gives better predictions than any single base model.
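
Here is a minimal sketch of that idea on synthetic data (the dataset, the decision tree base model and all parameter values are assumptions chosen for brevity, not the models used in this notebook). `StackingRegressor` fits the final estimator on out-of-fold predictions of the base models, so the final `Ridge` ends up with one coefficient per base model.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ('ridge', Ridge(alpha=1.0)),
        ('tree', DecisionTreeRegressor(max_depth=3, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=3,  # base-model predictions for the final estimator come from 3-fold CV
)
stack.fit(X, y)

# One weight per base model, learned by the final estimator
weights = stack.final_estimator_.coef_
preds = stack.predict(X[:5])
print(weights)
```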

Let's combine our preprocessor and models with tuned hyperparameters into pipelines.

In [None]:
ridge_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('Ridge',Ridge(alpha=3.72))
])

lasso_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('Lasso', Lasso(alpha=0.00034))
])

xgbregressor_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('XGBRegressor', XGBRegressor(random_state=123, learning_rate=0.0578, max_depth=2, n_estimators=765))
])

estimators = [
    ('Ridge', ridge_pipe),
    ('Lasso', lasso_pipe),
    ('Gradient Boosting', xgbregressor_pipe),
]

And combine them into sklearn's `StackingRegressor`, using a `Ridge()` model as our final estimator. (I tuned `alpha` in Colab.)

In [None]:
stacking_regressor = StackingRegressor(estimators=estimators, final_estimator=Ridge(alpha=0.189))
display(stacking_regressor)

In [None]:
rmse_cv(stacking_regressor, X_train, y_train)

# Model performance

Let's make a simple line graph to compare RMSE of our models.

In [None]:
fig = go.Figure(go.Scatter(
    x=['Ridge', 'Lasso', 'XGBRegressor', 'StackingRegressor'], 
    y=[0.13304, 0.13014, 0.12645, 0.12387],
    mode='lines+markers'))

fig.update_layout(
    title={'text': 'Model performance', 'x':0.5},
    xaxis_title_text='Model', 
    yaxis_title_text='RMSE',
    width=700, height=400,
)
    
fig.show()

# Making submissions

In order to invert the logarithmic transformation of our target (`SalePrice`), we'll apply `numpy.expm1` to the final model's predictions.
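
As a sanity check that `expm1` really undoes `log1p`, here is a tiny round trip using the standard library equivalents of the NumPy functions (the price is a made-up value):

```python
import math

sale_price = 208500.0  # hypothetical price, for illustration only

log_target = math.log1p(sale_price)   # what the models are trained on
recovered = math.expm1(log_target)    # what we submit

print(log_target, recovered)
```

The round trip recovers the original price up to floating-point error, so predictions made on the log scale map back to prices in the original units.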

In [None]:
stacking_regressor.fit(X_train, y_train)
preds_test = np.expm1(stacking_regressor.predict(X_test))

In [None]:
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
display(output.head())
output.to_csv('submission.csv', index=False)