<div style="display:fill; background-color:#000000;border-radius:5px;">
    <p style="font-size:300%; color:white;text-align:center;">Abstract</p>
</div>

The goal of this competition is to predict the sales price of houses from 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. The training set has 1460 entries and the test set has 1459 entries.

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.
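As a minimal sketch of the metric described above (the function name `rmsle` and the sample prices are illustrative assumptions, not competition code), the snippet below shows that the same relative error costs the same for a cheap and an expensive house:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between the logarithm of the predictions and the logarithm of the observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# A 10% over-prediction on a cheap house and on an expensive house
cheap = rmsle([100_000], [110_000])
expensive = rmsle([500_000], [550_000])
print(cheap, expensive)
```

Both values equal log(1.1), because the log turns a ratio between prices into a difference.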

In this notebook, I'm sharing my solution (CV: 0.1179 and LB: 0.1187) and the strategies that helped improve the score. When I started working on this dataset, I created a baseline model that scored 0.1659. It took quite a lot of work to move from this initial score to 0.1187: **creating new features, selecting the right model and tuning the hyperparameters really helped improve the score.**

**Feature engineering:**
* I tested One-Hot-Encoding and Target Encoding for categorical variables. The XGB model performed better with Target Encoding.
* I added new features using mathematical transforms (e.g. LowQualFinRatio=LowQualFinSF/GrLivArea), counts (e.g. number of NaNs per house) and group transforms (e.g. median of the LotArea per neighborhood). Adding these features really helped improve the score.
* Scaling the features was also necessary for the Linear models. I used the StandardScaler and did not test any other scaler.
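The three kinds of new features listed above can be sketched on a toy DataFrame. The column names follow the competition data, but the rows are made up for illustration. For brevity the group median is computed on the same frame, whereas inside a CV loop it would be learned on the training fold and mapped onto the validation fold.

```python
import numpy as np
import pandas as pd

# Toy rows (values are made up; only the column names come from the dataset)
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "CollgCr", "CollgCr"],
    "LotArea": [8000, 10000, 9000, 12000],
    "GrLivArea": [1000, 1500, 1800, 2000],
    "LowQualFinSF": [0, 150, 0, 0],
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
})

# Count: number of missing values per house (computed before any imputation)
toy["NbNAs"] = toy.isnull().sum(axis=1)

# Mathematical transform: share of low-quality finished area
toy["LowQualFinRatio"] = toy["LowQualFinSF"] / toy["GrLivArea"]

# Group transform: median lot area of the house's neighborhood
toy["MedNhbdArea_Ext"] = toy.groupby("Neighborhood")["LotArea"].transform("median")

print(toy[["NbNAs", "LowQualFinRatio", "MedNhbdArea_Ext"]])
```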

**Selecting the right model:**
* I started with XGB and LightGBM. The XGB model performed better than the LGBM one.
* I also tried linear models. On this dataset, Gradient Boosting Machines perform better than linear models, but Lasso (and Ridge) performed OK after clipping the predicted sales prices.

**Hyperparameter tuning:**
* I used Optuna to tune the hyperparameters. This was also a very useful step to improve the score.
* Using 5-fold CV helped, as the scores may differ quite significantly from one validation set to another.
* For GBM models, it was useful to let the tuning run a few hours.

**Stacking:**
* I blended my best XGB model with my best Linear model (Lasso). I used a basic Linear Regression to combine the predictions from the 2 models and generate the final prediction. Stacking did not really help: it slightly decreased the CV score and slightly improved the LB score. I kept it in this notebook but I wouldn't say it was useful.
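The blending step can be illustrated as follows. This is only a sketch under stated assumptions: the out-of-fold (OOF) predictions are synthetic stand-ins for two base models, and everything is kept on the log scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
y_log = rng.normal(12.0, 0.4, size=200)            # synthetic log sale prices
oof_a = y_log + rng.normal(0.0, 0.12, size=200)    # stand-in for the OOF predictions of model A
oof_b = y_log + rng.normal(0.0, 0.14, size=200)    # stand-in for the OOF predictions of model B

# Meta-model: a plain linear regression on the two columns of OOF predictions
meta = LinearRegression()
meta.fit(np.column_stack([oof_a, oof_b]), y_log)

def rmse(pred):
    return float(np.sqrt(np.mean((pred - y_log) ** 2)))

rmse_a, rmse_b = rmse(oof_a), rmse(oof_b)
rmse_blend = rmse(meta.predict(np.column_stack([oof_a, oof_b])))

# Combine the (hypothetical) test predictions of both models with the fitted weights
test_a = np.array([11.8, 12.3])
test_b = np.array([11.9, 12.2])
blend = meta.predict(np.column_stack([test_a, test_b]))
print(meta.coef_, meta.intercept_, rmse_a, rmse_b, rmse_blend)
```

The in-sample error of the blend can never be worse than that of either base model, which is why the gain should be checked on held-out data before trusting it.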

**cv_loop():**
* I created a function cv_loop() and it was helpful to test different scenarios quickly.
* cv_loop() implements an end-to-end ML pipeline: 
    * feature selection : it keeps only the features passed to the function (input parameter useful_features) 
    * missing value imputation : it imputes the missing values with None for categorical features and 0 for numerical features
    * feature creation : it creates the new features (input parameter new_features)
    * encoding of categorical features : it encodes the categorical features, either One-Hot-Encoding or Target Encoding (input parameter encoding)
    * feature scaling : it scales the features
    * model training and evaluation : it uses a 5-fold Cross-Validation scheme and stores the OOF predictions 
    * model prediction : generates the predictions for the test data 
* cv_loop() can be used during the hyperparameter tuning (input parameter tuning=True) or to train one model and predict the target (sale price) for the estates of the Test Set (input parameter tuning=False).
* cv_loop() takes care of avoiding data leakage: data preparation (encoding, scaling, ...) happens within each fold of your cross validation cycle.
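The leakage point can be shown with a minimal, self-contained sketch. It uses a synthetic feature matrix and only the scaling step; it is not the `cv_loop()` implementation itself. The scaler is fitted on the training rows of each fold and merely applied to the validation rows.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(20, 3))  # synthetic feature matrix

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_means = []
for train_idx, val_idx in kf.split(X):
    # Fit the scaler on the training rows of this fold only ...
    scaler = StandardScaler().fit(X[train_idx])
    X_tr = scaler.transform(X[train_idx])
    # ... and reuse its statistics on the validation rows (never refit here)
    X_val = scaler.transform(X[val_idx])
    fold_means.append(scaler.mean_)

# The learned means differ from fold to fold: each fold only sees its own training rows
print(np.round(np.vstack(fold_means), 2))
```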

**I would be happy to get your feedback and ideas for improvement.**


<div style="display:fill; background-color:#000000;border-radius:5px;">
    <p style="font-size:300%; color:white;text-align:center;">Preliminaries</p>
</div>

# Import Libraries

In [None]:
import os
import warnings

# the usual libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_selection import mutual_info_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import optuna

# Define Settings

In [None]:
# Pandas setting
pd.set_option("display.max_columns", None)

# Mute warnings
# warnings.filterwarnings('ignore')

# Set Random State 
SEED = 42

# Optuna Parameters
TUNING_LASSO = False # When True, hyperparameter tuning  with Optuna will be executed. False to skip the tuning step
TUNING_XGB = False
TRAIN_TIME = 4 * 60 * 60 
N_TRIALS = 50
STUDY_NAME = 'STUDY'

# Define functions

In [None]:
def clean(df):
    
    df.drop(['Id'], axis=1, inplace=True)
    
    # Create features for Shed, Gar2 and Othr
    miscf = ['Shed','Gar2', 'Othr']
    for f in miscf:
        # Keep MiscVal where MiscFeature matches, NaN elsewhere (np.NaN was removed in NumPy 2.0)
        df['MiscFeature'+f] = df['MiscVal'].where(df['MiscFeature'] == f)
    df.drop(['MiscFeature','MiscVal'], axis=1, inplace=True)
    
    # Fix typos
    df['Exterior1st'] = df['Exterior1st'].replace({"WdShing": "Wd Shng"})
    df["Exterior2nd"] = df["Exterior2nd"].replace({"CmentBd": "CemntBd"})
    df["Exterior2nd"] = df["Exterior2nd"].replace({"Brk Cmn": "BrkComm"})
    df["BldgType"] = df["BldgType"].replace({"Twnhs": "TwnhsI"})
    
    #Fix erroneous data
    df["GarageYrBlt"] = df["GarageYrBlt"].where(df.GarageYrBlt != 2207, 2007)
    
    return df

In [None]:
def load_data():
    
    train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
    X_train = train.copy()
    y_train = X_train['SalePrice'] #do not remove the target from X_train. We'll need it for target encoding.

    test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
    X_test = test.copy()

    X_train = clean(X_train)
    X_test = clean(X_test)

    all_features = ['MSSubClass','MSZoning','LotFrontage','LotArea','Street','Alley','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','YearRemodAdd','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','MasVnrArea','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinSF1','BsmtFinType2','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','Heating','HeatingQC','CentralAir','Electrical','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','KitchenQual','TotRmsAbvGrd','Functional','Fireplaces','FireplaceQu','GarageType','GarageYrBlt','GarageFinish','GarageCars','GarageArea','GarageQual','GarageCond','PavedDrive','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','PoolQC','Fence','MoSold','YrSold','SaleType','SaleCondition','SalePrice','MiscFeatureShed','MiscFeatureGar2','MiscFeatureOthr'] #do not remove the target. We'll need it for target encoding.
    num_features = ['BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageCars','LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscFeatureGar2','MiscFeatureOthr','MiscFeatureShed','MoSold','YearBuilt','YearRemodAdd','YrSold','GarageYrBlt']
    num_continuous_features = ['LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea', 'GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscFeatureGar2','MiscFeatureOthr','MiscFeatureShed','MoSold','YearBuilt','YearRemodAdd','YrSold','GarageYrBlt']
    num_discrete_features=['BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageCars']
    cat_features = ['LotShape','LandContour','OverallQual','OverallCond','Utilities','LandSlope','ExterQual','ExterCond','BsmtQual','BsmtCond','BsmtExposure','HeatingQC','KitchenQual','FireplaceQu','GarageFinish','GarageQual','GarageCond','PoolQC','Fence','MSSubClass','MSZoning','Street','Alley','LotConfig','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','Foundation','BsmtFinType1','BsmtFinType2','Heating','CentralAir','Electrical','Functional','GarageType','PavedDrive','SaleType','SaleCondition']
    cat_features_to_encode = cat_features.copy()
    cat_features_to_encode.remove('OverallQual') #qualitative feature already in numeric format and correctly ordered
    cat_features_to_encode.remove('OverallCond') #qualitative feature already in numeric format and correctly ordered
    
    return train, test, X_train, y_train, X_test, all_features, num_features, num_continuous_features, num_discrete_features, cat_features, cat_features_to_encode

In [None]:
def cv_loop(
        X_train, 
        y_train,
        X_test,
        model, 
        useful_features,
        num_features,
        cat_features,
        cat_features_to_encode, 
        encoding = 'ohe', 
        new_features=(),  # immutable default; only .count() is called on it
        scaling=False,
        clip=False, clipmin=np.log(34900), clipmax=np.log(755000),
        tuning=True,
        early_stopping=True
        ):
    
    y_train = np.log(y_train)
    
    num_features = np.intersect1d(num_features, useful_features)
    cat_features = np.intersect1d(cat_features, useful_features)
    cat_features_to_encode = np.intersect1d(cat_features_to_encode, useful_features)
    
    cum_rmse_val = 0
    iteration = 1
    
    N_SPLITS = 5
    
    kf = KFold(n_splits=N_SPLITS, shuffle=True, random_state=42)
    for train_index, val_index in kf.split(X_train, y_train): 
        X_train_, X_val_ = X_train.iloc[train_index], X_train.iloc[val_index]
        y_train_, y_val_ = y_train.iloc[train_index], y_train.iloc[val_index]  # positional indexing, like X_train above

        X_train__ = X_train_.reset_index(drop=True)
        X_val__ = X_val_.reset_index(drop=True)

        # Select Features
        X_train__ = X_train__.loc[:, useful_features]
        X_val__ = X_val__.loc[:, useful_features]
        if(tuning==False): 
            useful_features_ = useful_features.copy()
            useful_features_.remove('SalePrice')
            X_test__ = X_test.loc[:, useful_features_]
        
        # Impute missing values
        if(new_features.count('NbNAs')==1):
            X_train__['NbNAs'] = X_train__.isnull().sum(axis=1)
            X_val__['NbNAs'] = X_val__.isnull().sum(axis=1)
            if(tuning==False): X_test__['NbNAs'] = X_test__.isnull().sum(axis=1)
        for col in num_features:
            X_train__[col] = X_train__[col].fillna(0)
            X_val__[col] = X_val__[col].fillna(0)
            if(tuning==False): X_test__[col] = X_test__[col].fillna(0)
        for col in cat_features:
            X_train__[col] = X_train__[col].fillna("None")
            X_val__[col] = X_val__[col].fillna("None")
            if(tuning==False): X_test__[col] = X_test__[col].fillna("None")
        
        # Create new features
        #if(new_features.count('NbNAs')==1):
            # Do nothing, managed when imputing NAs
        if(new_features.count('LivLotRatio')==1):
            X_train__['LivLotRatio'] = X_train__['GrLivArea'] / X_train__['LotArea']
            X_val__['LivLotRatio'] = X_val__['GrLivArea'] / X_val__['LotArea']
            if(tuning==False): X_test__['LivLotRatio'] = X_test__['GrLivArea'] / X_test__['LotArea']
        if(new_features.count('Spaciousness')==1):
            X_train__['Spaciousness'] = X_train__['GrLivArea'] / X_train__['TotRmsAbvGrd']
            X_val__['Spaciousness'] = X_val__['GrLivArea'] / X_val__['TotRmsAbvGrd']
            if(tuning==False): X_test__['Spaciousness'] = X_test__['GrLivArea'] / X_test__['TotRmsAbvGrd']
        # How big/small is the house in its neighboorhood
        if(new_features.count('MedNhbdArea')==1): #MedNhbdArea = median of the GrLivArea in the neighboorhood
            feat = X_train__.groupby('Neighborhood')['GrLivArea'].median()
            feat = feat.to_dict()
            X_train__.loc[:,'MedNhbdArea'] = X_train__['Neighborhood'].map(feat)
            X_val__.loc[:,'MedNhbdArea'] = X_val__['Neighborhood'].map(feat)
            if(tuning==False): X_test__.loc[:,'MedNhbdArea'] = X_test__['Neighborhood'].map(feat)
        if(new_features.count('GrLivAreaInNbhd')==1): #GrLivAreaInNbhd = GrLivArea - MedNhbdArea
            feat = X_train__.groupby('Neighborhood')['GrLivArea'].median()
            feat = feat.to_dict()
            X_train__.loc[:,'GrLivAreaInNbhd'] = X_train__['Neighborhood'].map(feat)
            X_train__['GrLivAreaInNbhd'] = X_train__['GrLivArea'] - X_train__['GrLivAreaInNbhd']
            X_val__.loc[:,'GrLivAreaInNbhd'] = X_val__['Neighborhood'].map(feat)
            X_val__['GrLivAreaInNbhd'] = X_val__['GrLivArea'] - X_val__['GrLivAreaInNbhd']
            if(tuning==False): 
                X_test__.loc[:,'GrLivAreaInNbhd'] = X_test__['Neighborhood'].map(feat)
                X_test__['GrLivAreaInNbhd'] = X_test__['GrLivArea'] - X_test__['GrLivAreaInNbhd']
        # How big/small is the lot in its neighboorhood
        if(new_features.count('MedNhbdArea_Ext')==1): #MedNhbdArea = median of the LotArea in the neighboorhood
            feat = X_train__.groupby('Neighborhood')['LotArea'].median()
            feat = feat.to_dict()
            X_train__.loc[:,'MedNhbdArea_Ext'] = X_train__['Neighborhood'].map(feat)
            X_val__.loc[:,'MedNhbdArea_Ext'] = X_val__['Neighborhood'].map(feat)
            if(tuning==False): X_test__.loc[:,'MedNhbdArea_Ext'] = X_test__['Neighborhood'].map(feat)
        if(new_features.count('LotAreaInNbhd')==1): #LotAreaInNbhd = LotArea - MedNhbdArea_Ext
            feat = X_train__.groupby('Neighborhood')['LotArea'].median()
            feat = feat.to_dict()
            X_train__.loc[:,'LotAreaInNbhd'] = X_train__['Neighborhood'].map(feat)
            X_train__['LotAreaInNbhd'] = X_train__['LotArea'] - X_train__['LotAreaInNbhd']
            X_val__.loc[:,'LotAreaInNbhd'] = X_val__['Neighborhood'].map(feat)
            X_val__['LotAreaInNbhd'] = X_val__['LotArea'] - X_val__['LotAreaInNbhd']
            if(tuning==False): 
                X_test__.loc[:,'LotAreaInNbhd'] = X_test__['Neighborhood'].map(feat)
                X_test__['LotAreaInNbhd'] = X_test__['LotArea'] - X_test__['LotAreaInNbhd']
        if(new_features.count('OverallQualCondProduct')==1):
            X_train__['OverallQualCondProduct'] = X_train__['OverallQual'] * X_train__['OverallCond']
            X_val__['OverallQualCondProduct'] = X_val__['OverallQual'] * X_val__['OverallCond']
            if(tuning==False): X_test__['OverallQualCondProduct'] = X_test__['OverallQual'] * X_test__['OverallCond']
        #LowQualFinRatio
        if(new_features.count('LowQualFinRatio')==1):
            X_train__['LowQualFinRatio'] = X_train__['LowQualFinSF'] / X_train__['GrLivArea']
            X_val__['LowQualFinRatio'] = X_val__['LowQualFinSF'] / X_val__['GrLivArea']
            if(tuning==False): X_test__['LowQualFinRatio'] = X_test__['LowQualFinSF'] / X_test__['GrLivArea']
        
        # Encode categorical variables
        if(encoding=='ohe'):
            enc = OneHotEncoder(handle_unknown = 'ignore')
            X_train__enc = pd.DataFrame(enc.fit_transform(X_train__[cat_features_to_encode]).toarray())
            X_val__enc = pd.DataFrame(enc.transform(X_val__[cat_features_to_encode]).toarray())
            # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
            X_train__enc.columns = enc.get_feature_names_out(cat_features_to_encode)
            X_val__enc.columns = enc.get_feature_names_out(cat_features_to_encode)
            X_train__ = X_train__.join(X_train__enc)
            X_val__ = X_val__.join(X_val__enc)
            if(tuning==False):
                X_test__enc = pd.DataFrame(enc.transform(X_test__[cat_features_to_encode]).toarray())
                X_test__enc.columns = enc.get_feature_names_out(cat_features_to_encode)
                X_test__ = X_test__.join(X_test__enc)
        elif(encoding=='ord'):
            #enc = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value=-1)
            enc = OrdinalEncoder()
            X_train__enc = pd.DataFrame(enc.fit_transform(X_train__[cat_features_to_encode]))
            X_val__enc = pd.DataFrame(enc.transform(X_val__[cat_features_to_encode]))
            # Assign flat column names (wrapping the names in a list would build a MultiIndex)
            X_train__enc.columns = cat_features_to_encode
            X_val__enc.columns = cat_features_to_encode
            X_train__ = X_train__.join(X_train__enc, rsuffix='_ord_enc')
            X_val__ = X_val__.join(X_val__enc, rsuffix='_ord_enc')
            if(tuning==False):
                X_test__enc = pd.DataFrame(enc.transform(X_test__[cat_features_to_encode]))
                X_test__enc.columns = cat_features_to_encode
                X_test__ = X_test__.join(X_test__enc, rsuffix='_ord_enc')
        elif(encoding=='tar_enc'):
            for f in cat_features_to_encode:
                feat = X_train__.groupby(f)['SalePrice'].mean()
                feat = feat.to_dict()
                X_train__.loc[:,f"tar_enc_{f}"] = X_train__[f].map(feat)
                X_val__.loc[:,f"tar_enc_{f}"] = X_val__[f].map(feat)
                if(tuning==False): X_test__.loc[:,f"tar_enc_{f}"] = X_test__[f].map(feat)

        X_train__.drop(columns='SalePrice', inplace=True)
        X_train__.drop(columns=cat_features_to_encode, inplace=True)
        X_val__.drop(columns='SalePrice', inplace=True)
        X_val__.drop(columns=cat_features_to_encode, inplace=True)
        if(tuning==False): X_test__.drop(columns=cat_features_to_encode, inplace=True)
                
        # Save files for debugging purpose
        X_train__.to_csv(f'X_train__{iteration}.csv', index = False)
        X_val__.to_csv(f'X_val__{iteration}.csv', index = False)
        if(tuning==False): X_test__.to_csv(f'X_test__{iteration}.csv', index = False)
        
        # Scale the features
        if scaling:
            scaler = StandardScaler()
            X_train__ = pd.DataFrame(scaler.fit_transform(X_train__), columns=X_train__.columns)
            X_val__ = pd.DataFrame(scaler.transform(X_val__), columns=X_val__.columns)
            if(tuning==False): X_test__ = pd.DataFrame(scaler.transform(X_test__), columns=X_test__.columns)
            # Save files for debugging purpose
            X_train__.to_csv(f'X_train__{iteration}__scaled.csv', index = False)
            X_val__.to_csv(f'X_val__{iteration}__scaled.csv', index = False)
            if(tuning==False): X_test__.to_csv(f'X_test__{iteration}__scaled.csv', index = False)
                
        # Train and Predict
        if early_stopping:
            model.fit(X_train__, y_train_, eval_set=[(X_val__, y_val_)], early_stopping_rounds=100, verbose=False)
        else:
            model.fit(X_train__, y_train_) 
        
        y_val_preds = model.predict(X_val__)

        if(clip): y_val_preds = np.clip(y_val_preds,clipmin,clipmax)
        
        if(tuning==False):
            if iteration==1: 
                oof_preds = pd.Series(data=y_val_preds,index=val_index)
            else: 
                oof_preds = pd.concat([oof_preds, pd.Series(data=y_val_preds,index=val_index)])
        
        rmse_val = np.sqrt(mean_squared_error(y_val_, y_val_preds))  # RMSE without the deprecated squared=False argument
        print(str(iteration) + '/' + str(N_SPLITS) + ' KFold RMSLE: ' + str(rmse_val))
            
        cum_rmse_val = cum_rmse_val + rmse_val
        
        if(tuning==False):
            new_preds = model.predict(X_test__)
            if(clip): new_preds = np.clip(new_preds,clipmin,clipmax)
        
            if iteration==1: 
                preds = new_preds
            else: 
                preds = preds + new_preds
                
        iteration = iteration + 1

    if(tuning==False): preds = preds/N_SPLITS
    avg_rmse = cum_rmse_val/N_SPLITS
    print('Average RMSLE: ' + str(avg_rmse))
    
    if tuning:
        return avg_rmse, None, None
    else:
        return avg_rmse, np.exp(oof_preds.sort_index()), np.exp(preds)

<div style="display:fill; background-color:#000000;border-radius:5px;">
    <p style="font-size:300%; color:white;text-align:center;">EDA</p>
</div>

The training set has 1460 entries and the test set has 1459. The target is "SalePrice". Let's display the first 2 rows:

In [None]:
_, _, X_train, y_train, X_test, all_features, num_features, num_continuous_features, num_discrete_features, cat_features, cat_features_to_encode = load_data()
X_train.head(2)

There are different types of features: categorical (ordered and unordered) and numerical (continuous, discrete and time-related).

In [None]:
feature_types = pd.DataFrame(data=[num_features, cat_features])
feature_types.index=['Numerical','Categorical']
feature_types.style.set_table_styles([
    {'selector': 'thead', 'props': [('display', 'none')]}
])

* The categorical data needs encoding, with the exception of 'OverallQual' and 'OverallCond'. These 2 variables are already numeric and their encoding makes sense: 10 for 'Very Excellent', 9 for 'Excellent', 8 for 'Very Good'...
* We can use OneHotEncoding and TargetEncoding for the categorical variables that need encoding. We may also use the model's built-in capabilities to encode categorical variables when we use models like XGB or LGBM.
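To make the two encodings concrete, here is a small sketch on made-up rows. `pd.get_dummies` is used as a simple stand-in for a one-hot encoder, and the fold names are illustrative assumptions. The target-encoding map is learned on the training fold only and then applied to the validation fold.

```python
import pandas as pd

train_fold = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", "FV"],
    "SalePrice": [200_000, 180_000, 120_000, 250_000],
})
val_fold = pd.DataFrame({"MSZoning": ["RM", "RL", "C (all)"]})

# One-Hot-Encoding: one indicator column per category seen in the training fold
ohe = pd.get_dummies(train_fold["MSZoning"], prefix="MSZoning")

# Target Encoding: category -> mean SalePrice, learned on the training fold only
means = train_fold.groupby("MSZoning")["SalePrice"].mean()
val_fold["tar_enc_MSZoning"] = val_fold["MSZoning"].map(means)

print(ohe.columns.tolist())
print(val_fold)
```

A category that never appears in the training fold (here "C (all)") maps to NaN, so it has to be imputed or handled by a model that accepts missing values.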

# Target Distribution

In [None]:
fig=plt.figure(figsize=(20,4))
sns.histplot(y_train, kde=True, palette='pastel').set_title('Target Distribution', weight='bold')
plt.show()

# Feature Distribution

Let's not display all the distributions below but just a few to illustrate the conclusion:

In [None]:
full = pd.concat([X_train, X_test]).reset_index(drop=True)  # drop the old index instead of keeping it as a column
full['Source'] = 'train'
full.loc[full['SalePrice'].isna(), 'Source'] = 'test'  # test rows have no SalePrice

fig=plt.figure(figsize=(20, 7))
#fig.suptitle("Histograms of numeric continuous features")
rows = 2
cols = 5
i = 1
for f in num_continuous_features[0:10]:
    fig.add_subplot(rows, cols, i)
    sns.histplot(data=full, x=f, hue='Source', kde=True, palette='pastel')
    i = i+1
plt.show()

* The distributions of the features are similar between train and test sets. So if we find a model that performs well on the CV, we can expect this model to perform equally well on the test set.
* Most distributions are skewed and the value ranges differ. As long as we use tree-based models, skewness and variance are not an issue. But we'll have to consider scaling and normalizing the numeric features if we come to use models that perform well when the features look more or less like standard normally distributed data.
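As a rough illustration of the second point, the sketch below applies a `log1p` transform followed by `StandardScaler` to a synthetic, right-skewed feature. The choice of `log1p` is an assumption made for this example; the notebook itself only uses `StandardScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic right-skewed feature standing in for something like LotArea
lot_area = pd.Series(rng.lognormal(mean=9.0, sigma=0.8, size=1000), name="LotArea")

skew_raw = lot_area.skew()
log_area = np.log1p(lot_area)          # compress the long right tail
skew_log = log_area.skew()

# Standardize to zero mean and unit variance
scaled = StandardScaler().fit_transform(log_area.to_frame())

print(skew_raw, skew_log, scaled.mean(), scaled.std())
```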

# Missing values

In [None]:
f_with_na_train = X_train.isna().sum(axis=0)
f_with_na_train = f_with_na_train[f_with_na_train>0]
f_with_na_train.name='Nb of NaNs in train'
f_with_na_test = X_test.isna().sum(axis=0)
f_with_na_test = f_with_na_test[f_with_na_test>0]
f_with_na_test.name='Nb of NaNs in test'
f_with_na = pd.concat([f_with_na_train, f_with_na_test], axis=1)
f_with_na.fillna(0, inplace=True)
f_with_na = f_with_na[['Nb of NaNs in train','Nb of NaNs in test']]
f_with_na.sort_values(['Nb of NaNs in train', 'Nb of NaNs in test'], ascending=False)[['Nb of NaNs in train', 'Nb of NaNs in test']].plot(kind='bar', figsize=(20,4))
plt.show()

* There are a lot of features with missing values, either in the train or the test set but most of the time, in both.
* For this dataset, imputing 0 for missing numeric values and None for categorical value is an appropriate choice.
* A few features, namely 'MiscFeatureOthr', 'MiscFeatureGar2', 'PoolQC', 'MiscFeatureShed', 'Alley' and 'Fence', have a very high ratio of missing values. It may be a good choice to ignore these features in the model design as they may lead to overfitting.
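The imputation rule mentioned above (0 for numeric features, "None" for categorical ones) can be sketched on two made-up rows. The column lists are illustrative; in this dataset a missing value usually means that the house does not have the feature.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GarageArea": [480.0, np.nan],   # NaN: no garage
    "PoolQC": [np.nan, "Gd"],        # NaN: no pool
})
num_cols = ["GarageArea"]
cat_cols = ["PoolQC"]

toy[num_cols] = toy[num_cols].fillna(0)        # absent feature -> zero area
toy[cat_cols] = toy[cat_cols].fillna("None")   # absent feature -> its own category

print(toy)
```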

# Feature and Target correlation

We can use the mutual information (MI) between each feature and the target to quantify the "amount of information" obtained about the target by observing the features (univariate analysis). This method is better suited than corr() (Pearson's correlation coefficient) because it captures both linear and non-linear dependencies. It also lets us include the unordered categorical features in the analysis.

As with Pearson's correlation coefficient, an MI close to 0 does not mean that the variable is uninformative: its interaction with another variable may still be informative.
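A small synthetic example (not taken from the dataset) shows the difference between the two measures: for a purely quadratic relationship, Pearson's coefficient is close to 0 while the mutual information is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2 + rng.normal(0.0, 0.01, size=2000)   # non-linear, symmetric dependency

pearson = np.corrcoef(x, y)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(pearson, mi)
```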

In [None]:
X = X_train.copy()
y = X.pop("SalePrice")
# We are going to factorize the categorical features
cat_features_to_encode = cat_features.copy()
cat_features_to_encode.remove('OverallQual')
cat_features_to_encode.remove('OverallCond')
for colname in cat_features_to_encode:
    X[colname], _ = X[colname].factorize() #NaNs will be replaced by -1
# Need to remove the NaNs in the numerical features
X.fillna(0, inplace=True)
# Calculate and plot MI scores
mi_scores = mutual_info_regression(X, y, discrete_features='auto', random_state=0)
mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
mi_scores = mi_scores.sort_values(ascending=False)
mi_scores.plot(kind='bar', figsize=(20,4))  # Series.plot creates its own figure
plt.show()

* There are many features that have a pretty high correlation with the target. One should be able to set up a model that performs "well" in predicting the target on this dataset.
* We may drop the features with a very low MI as it is likely that they are uninformative (but this analysis is not sufficient to say this)
* It is also interesting to see that OverallQual, Neighborhood, GrLivArea and YearBuilt are the key drivers of the house price.

<div style="display:fill; background-color:#000000;border-radius:5px;">
    <p style="font-size:300%; color:white;text-align:center;">XGBRegressor</p>
</div>

In [None]:
_, _, X_train, y_train, X_test, all_features, num_features, _, _, cat_features, cat_features_to_encode = load_data()
useful_features = [e for e in all_features if e not in ('PoolArea','3SsnPorch','MoSold','YrSold','RoofMatl','Utilities','MiscFeatureGar2','PoolQC')]
new_features = ['NbNAs','LivLotRatio','Spaciousness','MedNhbdArea','GrLivAreaInNbhd','MedNhbdArea_Ext','LotAreaInNbhd','OverallQualCondProduct','LowQualFinRatio']
encoding = 'tar_enc'
scaling = True #useless for tree based algo
clip = False
early_stopping = True

# XGBRegressor - Hyperparameter tuning with Optuna

Optuna is an open source and SOTA hyperparameter optimization framework to automate hyperparameter search. I found it easy to use and it has plenty of plots to analyze the outputs of the study i.e. optimization task. I did not test its capability to prune unpromising trials for faster results though.

In [None]:
def objective(trial):
    
    param_grid = {
        'max_depth': trial.suggest_int('max_depth', 3, 15), # default = 6 range = [0,∞]
        'n_estimators': trial.suggest_int('n_estimators', 1000, 10000), # default = 100
        # suggest_float replaces the deprecated suggest_loguniform / suggest_discrete_uniform
        'eta': trial.suggest_float('eta', 0.001, 0.3, log=True), # default = 0.3 range = [0,1]
        'subsample': trial.suggest_float('subsample', 0.1, 1.0, step=0.1), # default = 1 range = (0, 1]
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.2, 1.0, step=0.1), # default = 1 range = (0, 1]
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.2, 1.0, step=0.1), # default = 1 range = (0, 1]
        'min_child_weight': trial.suggest_float('min_child_weight', 0.1, 10, log=True), # default = 1 range: [0,∞]
        'reg_lambda': trial.suggest_float('reg_lambda', 0.1, 100, log=True), # default = 1
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0001, 10, log=True), # default=0
        'gamma': trial.suggest_float('gamma', 0.001, 10, log=True), # default = 0 range: [0,∞]
    }  
    
    model = XGBRegressor(
        tree_method='gpu_hist',
        predictor='gpu_predictor',
        n_jobs=4,
        **param_grid
    )
    
    avg_rmse, _, _ = cv_loop(
        X_train=X_train, 
        y_train=y_train,
        X_test=X_test,
        model=model, 
        useful_features=all_features,
        num_features=num_features,
        cat_features=cat_features,
        cat_features_to_encode=cat_features_to_encode, 
        encoding = encoding, 
        new_features=new_features,
        scaling = scaling,
        clip = clip, clipmin=np.log(34900), clipmax=np.log(755000),
        tuning=True,
        early_stopping = early_stopping
    )

    return avg_rmse

In [None]:
if TUNING_XGB:
    study = optuna.create_study(direction='minimize', study_name=STUDY_NAME)
    study.optimize(objective, timeout=TRAIN_TIME)
    
    print('Number of finished trials: ', len(study.trials))
    print('Best trial:')
    trial = study.best_trial

    print('\tValue: {}'.format(trial.value))
    print('\tParams: ')
    for key, value in trial.params.items():
        print('\t\t{}: {}'.format(key, value))

In [None]:
if TUNING_XGB :
    fig = optuna.visualization.plot_contour(study, params=['max_depth','n_estimators'])
    fig.show()
    fig = optuna.visualization.plot_contour(study, params=['eta','subsample'])
    fig.show()
    fig = optuna.visualization.plot_contour(study, params=['min_child_weight','gamma'])
    fig.show()

# XGBRegressor - Generating predictions with the best model

In [None]:
if TUNING_XGB:
    model = XGBRegressor(
        tree_method='gpu_hist',
        predictor='gpu_predictor',
        n_jobs=4,
        **trial.params)
else: 
    params = {
        'max_depth': 5,
        'n_estimators': 7779,
        'eta': 0.0044144556312306175,
        'subsample': 0.30000000000000004,
        'colsample_bytree': 0.2,
        'colsample_bylevel': 0.4,
        'min_child_weight': 0.21792841014662054,
        'reg_lambda': 5.06808562586094,
        'reg_alpha': 0.036826697275635915,
        'gamma': 0.002452743312016066,    
    }

    model = XGBRegressor(
        tree_method='gpu_hist',
        predictor='gpu_predictor',
        n_jobs=4,
        **params)
    
avg_rmse, oof_preds, preds = cv_loop(
        X_train=X_train, 
        y_train=y_train,
        X_test=X_test,
        model=model, 
        useful_features=all_features,
        num_features=num_features,
        cat_features=cat_features,
        cat_features_to_encode=cat_features_to_encode, 
        encoding = encoding, 
        new_features=new_features,
        scaling = scaling,
        clip = clip, clipmin=np.log(34900), clipmax=np.log(755000),
        tuning=False,
        early_stopping = early_stopping
    )

In [None]:
# Save the preds file
submission = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
submission['SalePrice'] = preds
submission.to_csv('xgb_preds.csv', index = False)

# Save the oof preds file
oof_preds.to_csv('xgb_oof_preds.csv', header=False)

<div style="display:fill; background-color:#000000;border-radius:5px;">
    <p style="font-size:300%; color:white;text-align:center;">Lasso</p>
</div>

In [None]:
_, _, X_train, y_train, X_test, all_features, num_features, _, _, cat_features, cat_features_to_encode = load_data()
useful_features = [e for e in all_features if e not in ('PoolArea','3SsnPorch','MoSold','YrSold','RoofMatl','Utilities','MiscFeatureGar2','PoolQC')]
new_features = ['NbNAs','LivLotRatio','Spaciousness','MedNhbdArea','GrLivAreaInNbhd','MedNhbdArea_Ext','LotAreaInNbhd','OverallQualCondProduct','LowQualFinRatio']
encoding = 'ohe'
scaling = True
clip = True # with LR models, better results when clipping
early_stopping = False
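
The `clip` flag and the `clipmin` / `clipmax` arguments are handled inside `cv_loop`. As a minimal sketch of the idea, assuming the clipping is applied to log-price predictions (the bounds are passed as `np.log(...)`), the hypothetical helper `clip_log_preds` below clamps made-up predictions to that range before converting them back to prices:

```python
import numpy as np

# Bounds copied from the clipmin / clipmax arguments used in this notebook
CLIP_MIN = np.log(34900)
CLIP_MAX = np.log(755000)

def clip_log_preds(log_preds, lo=CLIP_MIN, hi=CLIP_MAX):
    """Clamp log-price predictions to [lo, hi] (hypothetical helper, not part of cv_loop)."""
    return np.clip(np.asarray(log_preds, dtype=float), lo, hi)

# Made-up log-price predictions: one too low, one plausible, one too high
raw_log_preds = np.array([8.0, 12.0, 15.0])
clipped = clip_log_preds(raw_log_preds)

# Back to the price scale expected in the submission file
clipped_prices = np.exp(clipped)
print(clipped_prices)
```

A linear model can extrapolate far outside the range of the training targets for an unusual house, and a single such prediction can dominate the RMSE, which is probably why clipping helps Lasso more than the tree-based model.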

# Lasso - Hyperparameter tuning with Optuna

In [None]:
def objective(trial):
    
    param_grid = {
        'alpha': trial.suggest_loguniform('alpha', 0.0001, 10000), # default = 1
        #'max_iter': trial.suggest_discrete_uniform('max_iter', 10000, 50000), # default = 1000
        'max_iter': trial.suggest_int('max_iter', 1000, 900000, log=True), # default = 1000; Lasso expects an integer
        'random_state' : 42
    } 
    
    model = Lasso(
        **param_grid
    )
    
    avg_rmse, _, _ = cv_loop(
        X_train = X_train, 
        y_train = y_train,
        X_test = X_test,
        model = model, 
        useful_features = useful_features,
        num_features = num_features,
        cat_features = cat_features,
        cat_features_to_encode = cat_features_to_encode, 
        encoding = encoding, 
        new_features = new_features,
        scaling = scaling,
        clip = clip, clipmin=np.log(34900), clipmax=np.log(755000), #necessary to avoid exception?
        tuning = True,
        early_stopping = early_stopping
    )

    return avg_rmse

In [None]:
if TUNING_LASSO:
    study = optuna.create_study(direction='minimize', study_name=STUDY_NAME)
    #study.optimize(objective, timeout=TRAIN_TIME)
    study.optimize(objective, n_trials=50)
    
    print('Number of finished trials: ', len(study.trials))
    print('Best trial:')
    trial = study.best_trial

    print('\tValue: {}'.format(trial.value))
    print('\tParams: ')
    for key, value in trial.params.items():
        print('\t\t{}: {}'.format(key, value))

In [None]:
if TUNING_LASSO:
    fig = optuna.visualization.plot_contour(study, params=['max_iter','alpha'])
    fig.show()

# Lasso - Generating predictions with the best model

In [None]:
if TUNING_LASSO:
    model = Lasso(**trial.params, random_state=42)
else:
    params = { 
        'alpha': 0.0018185000964940012,
        'max_iter': 21098,
        'random_state' : 42
    }

    model = Lasso(**params)
    
avg_rmse, oof_preds, preds = cv_loop(
        X_train = X_train, 
        y_train = y_train,
        X_test = X_test,
        model = model, 
        useful_features = useful_features,
        num_features = num_features,
        cat_features = cat_features,
        cat_features_to_encode = cat_features_to_encode, 
        encoding = encoding, 
        new_features = new_features,
        scaling = scaling,
        clip = clip, clipmin = np.log(34900), clipmax = np.log(755000),
        tuning = False,
        early_stopping = early_stopping
    )

In [None]:
# Save the preds file
submission = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
submission['SalePrice'] = preds
submission.to_csv('lasso_preds.csv', index = False)

# Save the oof preds file
oof_preds.to_csv('lasso_oof_preds.csv', header=False)

<div style="display:fill; background-color:#000000;border-radius:5px;">
    <p style="font-size:300%; color:white;text-align:center";>Displaying predictions</p>
</div>

In this section, I'm using the out-of-fold (OOF) predictions. The plots below show the OOF predictions vs the Ground Truth for the 2 models (XGB and Lasso):

In [None]:
xgb_oof_preds = np.log(pd.read_csv('xgb_oof_preds.csv',header=None).iloc[:,1])
lasso_oof_preds = np.log(pd.read_csv('lasso_oof_preds.csv',header=None).iloc[:,1])
oof_preds = [xgb_oof_preds, lasso_oof_preds]
plot_titles = ['xgb_oof_preds', 'lasso_oof_preds']

fig, ax = plt.subplots(1, 2, figsize=(20,5))
i = 0
for oof_pred in oof_preds:
    ax[i].scatter(x=oof_pred, y=np.log(y_train))
    
    lims = [
        np.min([ax[i].get_xlim(), ax[i].get_ylim()]),  # min of both axes
        np.max([ax[i].get_xlim(), ax[i].get_ylim()]),  # max of both axes
    ]

    # now plot both limits against each other
    ax[i].plot(lims, lims, 'k-', alpha=0.75, zorder=0)
    ax[i].set_aspect('equal')
    ax[i].set_xlim(lims)
    ax[i].set_ylim(lims)
    ax[i].set_title(plot_titles[i])
    ax[i].set_xlabel('Log(Sale Price) - Prediction')
    ax[i].set_ylabel('Log(Sale Price) - Ground Truth')
    i = i+1
plt.show()
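
The OOF predictions used above come from `cv_loop`. As a rough sketch of how such predictions are built in general (this is not the notebook's `cv_loop`), the code below runs a manual 5-fold split on synthetic data with a plain least-squares model: each row is predicted by a model that never saw it during fitting. The data, `fit_linear` and `predict_linear` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the features and the log sale price
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 12.0 + rng.normal(scale=0.1, size=n)

def fit_linear(X, y):
    """Ordinary least squares with an intercept (hypothetical helper)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply the coefficients returned by fit_linear."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

n_splits = 5
folds = np.array_split(rng.permutation(n), n_splits)  # shuffled row positions

oof = np.full(n, np.nan)
for val_idx in folds:
    train_mask = np.ones(n, dtype=bool)
    train_mask[val_idx] = False           # hold this fold out
    coef = fit_linear(X[train_mask], y[train_mask])
    oof[val_idx] = predict_linear(coef, X[val_idx])

# Every row now has exactly one prediction, made by a model that did not train on it
oof_rmse = np.sqrt(np.mean((oof - y) ** 2))
print(oof_rmse)
```

Because no row is predicted by a model that was fitted on it, the OOF predictions can be compared with the ground truth, and later used as training features for a metamodel, without leaking the target.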

The plot below displays the Ground Truth in green, the Lasso predictions in red and the XGB predictions in blue. 

In [None]:
fig=plt.figure(figsize=(20,8))
sns.scatterplot(x=lasso_oof_preds.index, y=np.log(y_train), color='green')
sns.scatterplot(x=lasso_oof_preds.index, y=lasso_oof_preds, color='red')
sns.scatterplot(x=lasso_oof_preds.index, y=xgb_oof_preds, color='blue')
plt.show()

In the heatmap below, we can see that the XGB predictions and the Lasso predictions are highly correlated (98%). This is not ideal for stacking: stacking works best when several models all have skill on a dataset but in different ways, i.e. when their predictions (or their prediction errors) are uncorrelated or only weakly correlated. We will still test it in the next section and check whether it brings any improvement.

In [None]:
oof_preds = pd.concat([xgb_oof_preds, lasso_oof_preds, np.log(y_train)], axis=1)
oof_preds.columns = ['xgb_oof_preds','lasso_oof_preds','y_train']
sns.heatmap(oof_preds.corr(), annot=True)
plt.show()
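
The heatmap compares the predictions themselves, which are bound to be highly correlated as soon as both models track the target well. A complementary check is to correlate the prediction errors. The sketch below uses synthetic data only (the noise levels are arbitrary assumptions, not values measured in this notebook) to show that two models can have predictions correlated at about 0.98 while their errors are noticeably less correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic log prices and two synthetic models that share part of their error
y_true = rng.normal(loc=12.0, scale=0.4, size=n)
shared_error = rng.normal(scale=0.08, size=n)
pred_a = y_true + shared_error + rng.normal(scale=0.06, size=n)
pred_b = y_true + shared_error + rng.normal(scale=0.06, size=n)

# Correlation between the predictions (what the heatmap shows)
pred_corr = np.corrcoef(pred_a, pred_b)[0, 1]

# Correlation between the residuals (what matters for stacking)
resid_corr = np.corrcoef(pred_a - y_true, pred_b - y_true)[0, 1]

print(pred_corr, resid_corr)
```

With the notebook's data, the same check would amount to subtracting `y_train` from the two OOF columns of `oof_preds` before calling `.corr()`.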

<div style="display:fill; background-color:#000000;border-radius:5px;">
    <p style="font-size:300%; color:white;text-align:center";>Stacking</p>
</div>

In [None]:
X_train, y_train = oof_preds[['xgb_oof_preds','lasso_oof_preds']], oof_preds['y_train']
xgb_preds = pd.read_csv('xgb_preds.csv').iloc[:,1]
lasso_preds = pd.read_csv('lasso_preds.csv').iloc[:,1]
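X_test = pd.concat([np.log(xgb_preds), np.log(lasso_preds)], axis=1)
X_test.columns = X_train.columns # same feature names as at fit time; recent scikit-learn versions reject a mismatch in predict()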

# Blending the 2 models (XGB and Lasso) with a simple Linear Regression

In [None]:
metamodel = LinearRegression()
n_splits = 5
cum_rmse_val = 0
preds = np.zeros(len(X_test))
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
for train_index, val_index in kf.split(X_train, y_train):
    # kf.split returns row positions, so use .iloc for both X and y
    X_train_, X_val_ = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_, y_val_ = y_train.iloc[train_index], y_train.iloc[val_index]
    metamodel.fit(X_train_, y_train_)
    y_val_preds = metamodel.predict(X_val_)
    rmse_val = np.sqrt(mean_squared_error(y_val_, y_val_preds)) # RMSE on log prices
    print(rmse_val)
    cum_rmse_val = cum_rmse_val + rmse_val
    
    # accumulate the test predictions of this fold's metamodel
    preds = preds + metamodel.predict(X_test)

print(cum_rmse_val / n_splits) # average validation RMSE
preds = preds / n_splits # average test prediction over the folds
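
To illustrate what the linear metamodel learns, here is a small sketch on synthetic data. It uses NumPy's least squares instead of `LinearRegression` (the fit is the same: an intercept plus one weight per base model), and the noise levels are arbitrary assumptions. The weights are fitted on one half of the rows and the blend is scored on the other half:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic log prices and two base models with partly independent errors
y = rng.normal(loc=12.0, scale=0.4, size=n)
shared_error = rng.normal(scale=0.05, size=n)
pred_a = y + shared_error + rng.normal(scale=0.10, size=n)
pred_b = y + shared_error + rng.normal(scale=0.10, size=n)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Design matrix: intercept column plus the two base-model predictions
A = np.column_stack([np.ones(n), pred_a, pred_b])
fit_rows, eval_rows = np.arange(n // 2), np.arange(n // 2, n)

# Least-squares fit of the blend on the first half
coef, *_ = np.linalg.lstsq(A[fit_rows], y[fit_rows], rcond=None)
intercept, w_a, w_b = coef

# Score each base model and the blend on the held-out half
rmse_a = rmse(y[eval_rows], pred_a[eval_rows])
rmse_b = rmse(y[eval_rows], pred_b[eval_rows])
rmse_blend = rmse(y[eval_rows], A[eval_rows] @ coef)

print(w_a, w_b, rmse_a, rmse_b, rmse_blend)
```

In this toy setup the blend beats both base models because a large share of their error is independent. The more the errors overlap, as the 98% correlation above suggests for XGB and Lasso, the smaller the gain to expect from stacking.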

In [None]:
# Save the preds file
submission = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
submission['SalePrice'] = np.exp(preds)
submission.to_csv('stacking_preds.csv', index = False)