# XGBoost & Hyperparameter tuning


* [1. Loading and Inspecting Data](#1.-Loading-and-Inspecting-Data)
* [2. Data preprocessing](#2.-Data-preprocessing)
* [2.1 Fill NaN values](#2.1-Fill-NaN-values)
* [2.2 Encoding ordinal features](#2.2-Encoding-ordinal-features)
* [2.3 Encode nominal features](#2.3-Encode-nominal-features)
* [3. Feature Engineering](#3.-Feature-Engineering)
* [4. Split dataframe](#4.-Split-dataframe)
* [5. Outlier Detection](#5.-Outlier-Detection)
* [6. Fit Models](#6.-Fit-Models)
* [6.1 XGBoost](#6.1-XGBoost)
    * [Parameters](#Parameters)
    * [Best Params](#Best-Params)
    * [Best Fit](#Best-Fit)
* [7. Join models](#7.-Join-models)
* [8. Plot Results](#8.-Plot-Results)
* [9. Predict Test & Submission](#9.-Predict-Test-&-Submission)





<b>References:</b>
<br>https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
<br>https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
<br>https://scikit-learn.org/stable/modules/grid_search.html#multimetric-grid-search
<br>https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

# 1. Loading and Inspecting Data

In [None]:
#Importing packages
import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import os

import warnings
warnings.filterwarnings("ignore")

In [None]:
#Load dataset
train = pd.read_csv("../input/train.csv")
test  = pd.read_csv("../input/test.csv")

In [None]:
#Dataset shape
print('Train %s\nTest %s' % (train.shape, test.shape))

In [None]:
#Feature to predict
ft_pred = list(set(train.columns) - set(test.columns))
ft_pred

In [None]:
train[ft_pred].describe()

In [None]:
#Plot GrLivArea vs SalePrice
plt.scatter(train['GrLivArea'], train['SalePrice'], color='blue', alpha=0.5)
plt.title("GrLivArea vs SalePrice")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

# 2. Data preprocessing

<ul>
    <li>First I'll replace the missing values (NaNs) with 0 in the numeric columns and with the string 'None' in the non-numeric columns.</li>
    <li>Encode the categorical features: integer codes for the ordinal features and dummy variables for the nominal ones.</li>
    <li>Transform the skewed numeric features by taking log(feature + 1), which makes their distributions closer to normal.</li>
</ul>
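
The log(feature + 1) transform from the last bullet can be sketched as follows. This is only an illustration on a made-up DataFrame: the helper name `log_transform_skewed` and the skewness threshold of 0.75 are assumptions chosen for the example, and the transform is only valid for non-negative features.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df, threshold=0.75):
    """Return a copy of df with log1p applied to the numeric columns
    whose skewness exceeds threshold, plus the list of those columns."""
    out = df.copy()
    skewness = out.select_dtypes(include=[np.number]).skew()
    skewed_cols = list(skewness[skewness > threshold].index)
    if skewed_cols:
        # log1p(x) = log(1 + x), so zeros stay at zero
        out[skewed_cols] = np.log1p(out[skewed_cols])
    return out, skewed_cols

# Made-up data: 'area' has a long right tail, 'rooms' is symmetric
demo = pd.DataFrame({
    "area":  [50.0, 60.0, 55.0, 65.0, 70.0, 1000.0],
    "rooms": [2.0, 3.0, 4.0, 3.0, 2.0, 4.0],
})

demo_log, transformed = log_transform_skewed(demo)
print(transformed)
print(demo_log)
```

Only the heavily skewed column is changed; roughly symmetric columns are left as they are.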

## 2.1 Fill NaN values

In [None]:
#save and drop id
train_id = train["Id"]
train.drop(columns='Id',inplace=True)

test_id = test["Id"]
test.drop(columns='Id',inplace=True)

#select object columns
obj_col = train.columns[train.dtypes == 'object'].values

#select non object columns
num_col = train.columns[train.dtypes != 'object'].values
num_col_test = test.columns[test.dtypes != 'object'].values

#replace null value in obj columns with None
train[obj_col] = train[obj_col].fillna('None')
test[obj_col] = test[obj_col].fillna('None')

#replace null value in numeric columns with 0
train[num_col] = train[num_col].fillna(0)
test[num_col_test] = test[num_col_test].fillna(0)

train_001 = train
test_001 = test

## 2.2 Encoding ordinal features

In [None]:
import category_encoders as ce

#Ordinal features
ordinal_features = ["ExterQual","ExterCond","BsmtQual","BsmtCond","BsmtExposure", "BsmtFinType1","BsmtFinType2",
                    "HeatingQC","Electrical","KitchenQual", "FireplaceQu","GarageQual","GarageCond","PoolQC"]

#Split X,y
train_002_X = train_001.drop(ft_pred, axis=1)
train_002_y = train_001[ft_pred]

#OrdinalEncoder replaces each category of the listed columns with an integer code
ce_ordinal = ce.OrdinalEncoder(cols = ordinal_features)

train_003 = pd.concat([ce_ordinal.fit_transform(train_002_X), train_002_y], axis=1, sort=False)
test_003  = ce_ordinal.transform(test_001)
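
Without an explicit mapping, the integer codes are not guaranteed to follow the quality ranking. A possible alternative, sketched here with plain pandas on made-up rows, is to spell the order out. The ranking in `quality_order` is an assumption based on the usual Po/Fa/TA/Gd/Ex scale, with 'None' standing for the filled missing values; it only applies to the columns that use that scale, not to every entry of `ordinal_features`.

```python
import pandas as pd

# Assumed ranking, worst to best; 'None' is the placeholder for missing values
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Made-up rows standing in for two of the quality columns
demo = pd.DataFrame({
    "ExterQual":   ["TA", "Gd", "Ex", "Fa"],
    "KitchenQual": ["Gd", "None", "TA", "Po"],
})

encoded = demo.copy()
for col in ["ExterQual", "KitchenQual"]:
    # map() looks each label up in the dictionary; unknown labels would become NaN
    encoded[col] = demo[col].map(quality_order)

print(encoded)
```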


## 2.3 Encode nominal features

In [None]:
#Nominal features
nominal_features = [x for x in obj_col if x not in ordinal_features]

#Transfer object to int
from sklearn.preprocessing import LabelEncoder

#for loop nominal feature column
for i in nominal_features:
    #fit one encoder per column on the train and test values together,
    #so the same category gets the same integer code in both sets
    labelencoder = LabelEncoder()
    labelencoder.fit(pd.concat([train_003[i], test_003[i]]))
    train_003[i] = labelencoder.transform(train_003[i])
    test_003[i] = labelencoder.transform(test_003[i])
    
#Get dummy variable for nominal features
train_005 = pd.get_dummies(train_003,columns=nominal_features,drop_first=True)
test_005 = pd.get_dummies(test_003,columns=nominal_features,drop_first=True)

In [None]:
#Check if any null values remain in either set
print(train_005.isnull().any().sum())
print(test_005.isnull().any().sum())

#Align the test set columns with the training set
#Get the columns present in the training set but missing from the test set
missing_cols = set(train_005.drop(columns="SalePrice").columns) - set(test_005.columns)

#Add each missing column to the test set with a default value of 0
for cols in missing_cols:
    test_005[cols] = 0
    
#Ensure the columns in the test set are in the same order as in the train set
test_005 = test_005[train_005.drop(columns="SalePrice").columns]
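
The same alignment can also be written with `DataFrame.reindex`, which adds the missing columns, drops the extra ones and fixes the column order in a single call. A small sketch on made-up frames (the names `train_demo` and `test_demo` are only for illustration):

```python
import pandas as pd

train_demo = pd.DataFrame({"a": [1, 2], "b_1": [0, 1], "b_2": [1, 0], "SalePrice": [100, 200]})
# The test frame lacks 'b_1', has an extra 'b_3' and a different column order
test_demo = pd.DataFrame({"b_2": [1, 1], "a": [3, 4], "b_3": [0, 1]})

feature_cols = train_demo.drop(columns="SalePrice").columns

# Columns missing from test_demo are created and filled with 0
test_aligned = test_demo.reindex(columns=feature_cols, fill_value=0)
print(test_aligned)
```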

# 3. Feature Engineering

In [None]:


#TotalBath
train_005['TotalBath'] = (train_005['FullBath'] + train_005['HalfBath'] + train_005['BsmtFullBath'] + train_005['BsmtHalfBath'])
test_005['TotalBath']  = (test_005['FullBath']  + test_005['HalfBath']  + test_005['BsmtFullBath']  + test_005['BsmtHalfBath'])

#TotalPorch
train_005['TotalPorch'] = (train_005['OpenPorchSF'] + train_005['3SsnPorch'] + train_005['EnclosedPorch'] + train_005['ScreenPorch'] + train_005['WoodDeckSF'])
test_005['TotalPorch']  = (test_005['OpenPorchSF']  + test_005['3SsnPorch']  + test_005['EnclosedPorch']  + test_005['ScreenPorch']    + test_005['WoodDeckSF'])

#Remodeling happened in the sale year
train_005["RecentRemodel"] = (train_005["YearRemodAdd"] == train_005["YrSold"]) * 1
test_005["RecentRemodel"]  = (test_005["YearRemodAdd"]  == test_005["YrSold"]) * 1

#House sold in the year it was built
train_005["NewHouse"] = (train_005["YearBuilt"] == train_005["YrSold"]) * 1
test_005["NewHouse"]  = (test_005["YearBuilt"]  == test_005["YrSold"]) * 1

#YrBltAndRemod
train_005["YrBltAndRemod"] = train_005["YearBuilt"] + train_005["YearRemodAdd"]
test_005["YrBltAndRemod"]  = test_005["YearBuilt"]  + test_005["YearRemodAdd"]

#Total_sqr_footage
train_005["Total_sqr_footage"] = train_005["BsmtFinSF1"] + train_005["BsmtFinSF2"] + train_005["1stFlrSF"] + train_005["2ndFlrSF"]
test_005["Total_sqr_footage"]  = test_005["BsmtFinSF1"]  + test_005["BsmtFinSF2"]  + test_005["1stFlrSF"]  + test_005["2ndFlrSF"]

#HasPool
train_005['HasPool'] = train_005['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
test_005['HasPool']  = test_005['PoolArea'].apply(lambda x: 1 if x > 0 else 0)

#HasFireplaces
train_005['HasFirePlace'] = train_005['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)
test_005['HasFirePlace']  = test_005['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)

#Has2ndFloor
train_005['Has2ndFloor'] = train_005['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
test_005['Has2ndFloor']  = test_005['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)

#HasGarage
train_005['HasGarage'] = train_005['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
test_005['HasGarage']  = test_005['GarageArea'].apply(lambda x: 1 if x > 0 else 0)

#HasBsmnt
train_005['HasBsmnt'] = train_005['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
test_005['HasBsmnt']  = test_005['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)

# 4. Split dataframe

In [None]:
#Importing packages
from sklearn.model_selection import train_test_split

X = train_005.drop(columns="SalePrice")
y = train_005["SalePrice"]

#Split the original train data set into train and validation (val) sets
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.25,random_state=0)

In [None]:
X_train.shape, X_val.shape

# 5. Outlier Detection

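Perhaps the most important hyperparameter of the outlier detector is the "contamination" argument, which sets the expected proportion of outliers in the dataset. For EllipticEnvelope, the detector applied here, it is a value between 0.0 and 0.5 that defaults to 0.1; this notebook sets it to 0.01, so about 1% of the training rows are flagged.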
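
To see what the argument does, here is a minimal sketch on made-up Gaussian data (not the housing data). It fits `EllipticEnvelope` with two contamination values and counts the rows labelled -1; the number of flagged training rows should track the contamination value closely.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 2))  # made-up data, 1000 rows

counts = {}
for contamination in (0.01, 0.1):
    detector = EllipticEnvelope(contamination=contamination, random_state=0)
    # fit_predict returns 1 for inliers and -1 for outliers
    labels = detector.fit_predict(X_demo)
    counts[contamination] = int((labels == -1).sum())
    print(f"contamination={contamination}: {counts[contamination]} of {len(X_demo)} rows flagged")
```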

In [None]:
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

#Isolation Forest

# identify outliers in the training dataset
#iso = IsolationForest(contamination=0.01)
#yhat = iso.fit_predict(X_train)

#Minimum Covariance Determinant

# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)

#Local Outlier Factor

# identify outliers in the training dataset
#lof = LocalOutlierFactor()
#yhat = lof.fit_predict(X_train)

#One-Class SVM

# identify outliers in the training dataset
#ee = OneClassSVM(nu=0.01)
#yhat = ee.fit_predict(X_train)

# select all rows that are not outliers
mask = yhat != -1
X_train_001, y_train_001 = X_train[mask], y_train[mask]

# select all rows that are outliers
masko = yhat == -1
X_train_o, y_train_o = X_train[masko], y_train[masko]

# summarize the shape of the updated training dataset
print(X_train_001.shape, y_train_001.shape)

In [None]:
#Plot GrLivArea vs SalePrice
plt.scatter(X_train_001['GrLivArea'], y_train_001, color='blue', alpha=0.5)
plt.scatter(X_train_o['GrLivArea'], y_train_o, color='red', alpha=0.5, label='outlier')
plt.legend(loc="upper left")
plt.title("GrLivArea vs SalePrice")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

# 6. Fit Models

## 6.1 XGBoost

In [None]:
#Importing Packages
import matplotlib.pyplot as plt
import numpy as np

import xgboost as xgb
from xgboost import XGBRegressor
from xgboost import XGBRFRegressor
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

#from sklearn.preprocessing import Imputer#

### Parameters

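<b>Default parameters</b> (as listed for the XGBoost release this notebook was written against; defaults differ between releases)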
<br>max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, silent=None, objective='reg:squarederror', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, importance_type='gain'

**GridSearchCV params:**
* **estimator:** estimator object
* **param_grid :** dict or list of dictionaries
* **scoring:** A single string or a callable to evaluate the predictions on the test set of each split. If None, the estimator’s score method is used (R² for a regressor such as XGBRegressor).
    * https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
* **n_jobs:** Number of jobs to run in parallel. None means 1 (unless in a joblib parallel_backend context); -1 means using all processors.
* **cv:** Cross-validation splitting strategy. None uses the default 5-fold cross validation (3-fold before scikit-learn 0.22); an integer specifies the number of folds in a (Stratified)KFold.
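
The parameters above fit together as in the following minimal sketch. It uses made-up regression data and scikit-learn's `GradientBoostingRegressor` as a stand-in estimator so that it runs without a GPU or XGBoost; the tiny grid is an assumption chosen to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Made-up data: 200 rows, 5 features
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_grid={"max_depth": [2, 3], "n_estimators": [50, 100]},
    scoring="neg_mean_absolute_error",  # negated MAE, so higher is better
    cv=3,
    n_jobs=1,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Cross-validated MAE of the best candidate:", -search.best_score_)
```

Because scikit-learn scorers follow a "higher is better" convention, the error metrics are negated; flipping the sign of `best_score_` gives the MAE back.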

In [None]:
#XGBoost hyper-parameter tuning
def hyperParameterTuning(X_train_001, y_train_001):
    param_tuning = {
        'objective': ['reg:squarederror'],
        'colsample_bytree': [0.2, 0.5, 1],
        'subsample': [0.7, 1],
        'learning_rate': [0.05, 0.1, 0.3],
        'max_depth': [3, 6, 8],
        'min_child_weight': [0, 1, 10],
        'n_estimators' : [700, 1000, 5000]
    }

    xgb_model = XGBRegressor(tree_method='gpu_hist')

    gsearch = GridSearchCV(estimator = xgb_model,
                           param_grid = param_tuning,                        
                           #scoring = 'neg_mean_absolute_error', #MAE
                           #scoring = 'neg_mean_squared_error',  #MSE
                           cv = 3,
                           n_jobs = -1,
                           verbose = 10)

    gsearch.fit(X_train_001,y_train_001)

    return gsearch.best_params_
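
Before launching the search it helps to know how many models it will train. A short sketch using scikit-learn's `ParameterGrid` on the same grid as in `hyperParameterTuning`:

```python
from sklearn.model_selection import ParameterGrid

param_tuning = {
    'objective': ['reg:squarederror'],
    'colsample_bytree': [0.2, 0.5, 1],
    'subsample': [0.7, 1],
    'learning_rate': [0.05, 0.1, 0.3],
    'max_depth': [3, 6, 8],
    'min_child_weight': [0, 1, 10],
    'n_estimators': [700, 1000, 5000],
}

n_candidates = len(ParameterGrid(param_tuning))  # one candidate per combination
n_fits = n_candidates * 3                        # cv = 3 folds per candidate

print(n_candidates, n_fits)  # 486 1458
```

With up to 5000 trees per model, 1458 fits take a long time, which is why the search is meant to be run once and its result reused.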

In [None]:
#The grid search is slow: run it once, then reuse the best parameters recorded below.
#hyperParameterTuning(X_train, y_train)

### Best Params
{'colsample_bytree': 0.7,
 'learning_rate': 0.01,
 'max_depth': 10,
 'min_child_weight': 5,
 'n_estimators': 500,
 'subsample': 0.5}
 <br>
 {'colsample_bytree': 1,
 'learning_rate': 0.05,
 'max_depth': 8,
 'min_child_weight': 0,
 'n_estimators': 700,
 'objective': 'reg:squarederror',
 'subsample': 0.7}
 <br>
 {'learning_rate': 0.01,
 'n_estimators': 3460,
 'max_depth': 3,
 'min_child_weight': 0,
 'gamma': 0,
 'subsample': 0.7,
 'colsample_bytree': 0.7,
 'objective': 'reg:linear',
 'nthread': -1,
 'scale_pos_weight': 1,
 'seed': 27,
 'reg_alpha': 0.00006}

### Best Fit

In [None]:
XGBReg_def = XGBRegressor(objective = 'reg:squarederror', 
                        tree_method='gpu_hist')

XGBReg_t01 =  XGBRegressor(objective = 'reg:squarederror', 
                        colsample_bytree = 0.7, 
                        learning_rate = 0.01, 
                        max_depth = 10, 
                        min_child_weight = 5, 
                        n_estimators = 500, 
                        subsample = 0.5,
                        seed=27,
                        tree_method='gpu_hist')

XGBReg_t02 =  XGBRegressor(learning_rate=0.01,
                           n_estimators=3000,
                           max_depth=5, 
                           min_child_weight=0,
                           gamma=0, 
                           subsample=0.7,                                
                           colsample_bytree=0.7,                                     
                           objective='reg:squarederror',                                
                           scale_pos_weight=1, 
                           seed=27,                                     
                           reg_alpha=0.00006,
                           tree_method='gpu_hist')

In [None]:
xgb_model_t01 = XGBReg_t01
xgb_model_t02 = XGBReg_t02

%time xgb_model_t01.fit(X_train_001, y_train_001, early_stopping_rounds=10, eval_set=[(X_val, y_val)], verbose=False)
%time xgb_model_t02.fit(X_train_001, y_train_001, early_stopping_rounds=10, eval_set=[(X_val, y_val)], verbose=False)

y_pred_xgb_t01 = xgb_model_t01.predict(X_val)
y_pred_xgb_t02 = xgb_model_t02.predict(X_val)

mae_xgb_t01 = mean_absolute_error(y_val, y_pred_xgb_t01)
mae_xgb_t02 = mean_absolute_error(y_val, y_pred_xgb_t02)

print("MAE t01: ", mae_xgb_t01)
print("MAE t02: ", mae_xgb_t02)

# 7. Join models

In [None]:
y_pred = 0.5*y_pred_xgb_t01 + 0.5*y_pred_xgb_t02

mae_xgb = mean_absolute_error(y_val, y_pred)

In [None]:
print(mae_xgb)

# 8. Plot Results

In [None]:
#Plot Real vs Predict
plt.scatter(X_val['GrLivArea'], y_val,          color='blue', label='Real',    alpha=0.5)
plt.scatter(X_val['GrLivArea'], y_pred,  color='red' , label='Predict', alpha=0.5)
plt.title("Real vs Predict")
plt.legend(loc='best')
plt.show()

In [None]:
#Feature importance 
for model in [xgb_model_t01, xgb_model_t02]:
    xgb.plot_importance(model, max_num_features=20)
    plt.title("xgboost.plot_importance(model)")
    plt.show()

# 9. Predict Test & Submission

In [None]:
X_test = test_005

# Use the model to make predictions
y_pred_test_t01 = xgb_model_t01.predict(X_test)
y_pred_test_t02 = xgb_model_t02.predict(X_test)

y_pred_test = 0.4*y_pred_test_t01 + 0.6*y_pred_test_t02

submission = pd.DataFrame({'Id':test_id,'SalePrice':y_pred_test})

# Save results
submission.to_csv("submission.csv",index=False)