# Milestone 4 - Additional Algorithm: CatBoost

This notebook corresponds to the fourth and final stage of the Machine Learning final project, as part of the Copernicus Master in Digital Earth, in the Data Science track at University of South Brittany, Vannes, France by Candela Sol PELLIZA & Rajeswari PARASA.

In this milestone we present a new algorithm, CatBoost...
### COMPLETE INTRO


## 1. Presenting the new algorithm: CatBoost

### 1.a. Literal Description

For this final step of the project we introduce the CatBoost algorithm as a promising technique for predicting house prices. 

CatBoost (an acronym derived from "Categorical Boosting") is a gradient boost decision trees algorithm, which has been specifically designed to effectively handle categorical variables, through a technique called "ordered boosting".

A gradient-boosted trees algorithm means mainly three things: 

**Boosting:** Attempts to predict a target accurately by building a sequence of weak and simple models, improving on each model based on the error comitted on the previous one.

**Trees:** Uses binary decision trees as base predictors

**Gradient:** The improvements between one and the next model are calculated using the gradient descent method.

However, while CatBoost shares these bases with other well-known algorithms, such as XGBoost or LighGBM, it differs from those models by two main factors:

**Ordered Boosting:** While some other similar algorithms have the ability to handle categrical variables, they do it generally through a technique called Target Statistics (or variations of it), which has been demostrated to have a prediction shift due to target leakage. CatBoost presents an alternative that addresses this issue by means of "ordered boosting". This technique relies on the ordering principle, introducing a random permutation of the training samples at the different steps of gardient boosting (Prokhorenkova et al., 2017).

**Tree Symmetry:** Contrary to some other algorithms, in CatBoost trees are split consistently (using the same condition) across all nodes at the same depth of the tree (Wong, 2022). 

### 1.b. Why are we choosing CatBoost?
After this introduction to CatBoost algorithm, we can affirm that one of its main highlights is the efficient categorical variables encoding, which has been shown to outperform some other similar algorithms (Prokhorenkova et al., 2017). This characteristic is the main reason that led us to choose CatBoost as our 4th Machine Learnig model.

After building and analyzing the results of the previous models, in which we used one-hot encoding (OHE) technique for handling categorical variables, we realized that this pre-processing step was causing some issues. As it was explained in the previous notebooks, OHE transforms each category inside a variable into a new individual variable, where the rows labelled with that category has a value of 1 and the rest 0. While this is one of the most extended techniques for dealing with categorical variables, it presents the issue of highly increasing the amount of variables, potentially converting highly significant variables into multiple low-importance ones. 

This issue was clearly observed in our project, where some categorical variables that showed a high importance during EDA, where converted into multiple independent variables with really low importance in the final models. This issue leads us to think that CatBoost can present an efficient alternative to handle this issue, expecting significantly better results.

### 1.c. Python Implementation
The CatBoost algorithm is implemented in python using the library provided by the official [CatBoost project](https://catboost.ai/en/docs/concepts/parameter-tuning). The CatBoost library can be insalled by uncommenting the code line below:

In [None]:
#%pip install catboost

### 1.d. Hyperparameters

According to the CatBoost documentation, the main hyperparameters of the model are:

**Number of trees:**  maximum number of trees that the model can contain. (`iterations`)

**Learning rate:** Steps size in gradient descent. (`learning_rate`)

**Tree Depth:** Maximum depth of trees. (`depth`)

**L2 regularization:** Coefficient of the regularzation term of the cost function. (`l2_leaf_reg`)

**Random strength:** Amount of randomness for scoring splits when the tree structure is selected. Used in the case the model is presenting overfitting. (`random_strength`)

## 2. Data Loading and Preprocessing

To start the project, we will upload the original raw dataset. Different from the previous notebooks, in which we used directly the preprocessed and split training and test datasets, here the preprocessing step will be carried on again. This is because, as it was said before, one of the main advantadges of CatBoost algorithm is the fact that it can deal with categorical variables. Therefore, we need to re-apply the data processing steps to our dataset, leaving out the encoding of categorical variables. 

Moreover, some other modifications to the porcessing steps are applied according to the feedback of Milestone 2, such as data KFold stratification, which are discussed in detail in the corresponding section.

### 2.1. Loading Libraries & Importing Original Dataset

In [3]:
#Importing Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder
import matplotlib.pyplot as plt
import warnings
import catboost

In [4]:
#Setting pandas to show all the columns
pd.set_option('display.max_columns', None)

In [5]:
#Importing  and visualizing the dataset
data = pd.read_csv('OpenData/Ames.csv')
data.head()

Unnamed: 0,Order,PID,area,price,MS.SubClass,MS.Zoning,Lot.Frontage,Lot.Area,Street,Alley,Lot.Shape,Land.Contour,Utilities,Lot.Config,Land.Slope,Neighborhood,Condition.1,Condition.2,Bldg.Type,House.Style,Overall.Qual,Overall.Cond,Year.Built,Year.Remod.Add,Roof.Style,Roof.Matl,Exterior.1st,Exterior.2nd,Mas.Vnr.Type,Mas.Vnr.Area,Exter.Qual,Exter.Cond,Foundation,Bsmt.Qual,Bsmt.Cond,Bsmt.Exposure,BsmtFin.Type.1,BsmtFin.SF.1,BsmtFin.Type.2,BsmtFin.SF.2,Bsmt.Unf.SF,Total.Bsmt.SF,Heating,Heating.QC,Central.Air,Electrical,X1st.Flr.SF,X2nd.Flr.SF,Low.Qual.Fin.SF,Bsmt.Full.Bath,Bsmt.Half.Bath,Full.Bath,Half.Bath,Bedroom.AbvGr,Kitchen.AbvGr,Kitchen.Qual,TotRms.AbvGrd,Functional,Fireplaces,Fireplace.Qu,Garage.Type,Garage.Yr.Blt,Garage.Finish,Garage.Cars,Garage.Area,Garage.Qual,Garage.Cond,Paved.Drive,Wood.Deck.SF,Open.Porch.SF,Enclosed.Porch,X3Ssn.Porch,Screen.Porch,Pool.Area,Pool.QC,Fence,Misc.Feature,Misc.Val,Mo.Sold,Yr.Sold,Sale.Type,Sale.Condition
0,1,526301100,1656,215000,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1960,1960,Hip,CompShg,BrkFace,Plywood,Stone,112.0,TA,TA,CBlock,TA,Gd,Gd,BLQ,639.0,Unf,0.0,441.0,1080.0,GasA,Fa,Y,SBrkr,1656,0,0,1.0,0.0,1,0,3,1,TA,7,Typ,2,Gd,Attchd,1960.0,Fin,2.0,528.0,TA,TA,P,210,62,0,0,0,0,,,,0,5,2010,WD,Normal
1,2,526350040,896,105000,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,GasA,TA,Y,SBrkr,896,0,0,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal
2,3,526351010,1329,172000,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,TA,TA,CBlock,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,GasA,TA,Y,SBrkr,1329,0,0,0.0,0.0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal
3,4,526353030,2110,244000,20,RL,93.0,11160,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,7,5,1968,1968,Hip,CompShg,BrkFace,BrkFace,,0.0,Gd,TA,CBlock,TA,TA,No,ALQ,1065.0,Unf,0.0,1045.0,2110.0,GasA,Ex,Y,SBrkr,2110,0,0,1.0,0.0,2,1,3,1,Ex,8,Typ,2,TA,Attchd,1968.0,Fin,2.0,522.0,TA,TA,Y,0,0,0,0,0,0,,,,0,4,2010,WD,Normal
4,5,527105010,1629,189900,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,GasA,Gd,Y,SBrkr,928,701,0,0.0,0.0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997.0,Fin,2.0,482.0,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal


### 2.2. First Preprocessing Steps

#### 2.2.a. Renaming variables
In order to get a better and uniform handling of the variables, the columns of the original dataset are renamed, adopting the Pascal case convention (capitalizing the first letter of every word, including the first one). The abreviations for long words are kept the same as in the original dataset.

In [6]:
#Create a dictionary with the old and new variable's names
RenameMapping = {
    'area': 'BldgArea',
    'price': 'SoldPrice',
    'MS.SubClass': 'MSSubClass',
    'MS.Zoning': 'MSZoning',
    'Lot.Frontage': 'LotFrontage',
    'Lot.Area': 'LotArea',
    'Lot.Shape': 'LotShape',
    'Land.Contour': 'LandContour',
    'Lot.Config': 'LotConfig',
    'Land.Slope': 'LandSlope',
    'Condition.1': 'Condition1',
    'Condition.2': 'Condition2',
    'Bldg.Type': 'BldgType',
    'House.Style': 'HouseStyle',
    'Overall.Qual': 'OverallQual',
    'Overall.Cond': 'OverallCond',
    'Year.Built': 'YearBuilt',
    'Year.Remod.Add': 'YearRemodAdd',
    'Roof.Style': 'RoofStyle',
    'Roof.Matl': 'RoofMatl',
    'Exterior.1st': 'Exterior1st',
    'Exterior.2nd': 'Exterior2nd',
    'Mas.Vnr.Type': 'MasVnrType',
    'Mas.Vnr.Area': 'MasVnrArea',
    'Exter.Qual': 'ExterQual',
    'Exter.Cond': 'ExterCond',
    'Bsmt.Qual': 'BsmtQual',
    'Bsmt.Cond': 'BsmtCond',
    'Bsmt.Exposure': 'BsmtExposure',
    'BsmtFin.Type.1': 'BsmtFinType1',
    'BsmtFin.SF.1': 'BsmtFinSF1',
    'BsmtFin.Type.2': 'BsmtFinType2',
    'BsmtFin.SF.2': 'BsmtFinSF2',
    'Bsmt.Unf.SF': 'BsmtUnfSF',
    'Total.Bsmt.SF': 'TotalBsmtSF',
    'Heating.QC': 'HeatingQual',
    'Central.Air': 'CentralAir',
    '1st.Flr.SF': '1stFlrSF',
    '2nd.Flr.SF': '2ndFlrSF',
    'Low.Qual.Fin.SF': 'LowQualFinSF',
    'Bsmt.Full.Bath': 'BsmtFullBath',
    'Bsmt.Half.Bath': 'BsmtHalfBath',
    'Full.Bath': 'FullBath',
    'Half.Bath': 'HalfBath',
    'Kitchen.Qual': 'KitchenQual',
    'TotRms.AbvGrd': 'TotRmsAbvGrd',
    'Fireplaces': 'Fireplaces',
    'Fireplace.Qu': 'FireplaceQu',
    'Garage.Type': 'GarageType',
    'Garage.Yr.Blt': 'GarageYrBlt',
    'Garage.Finish': 'GarageFinish',
    'Garage.Cars': 'GarageCars',
    'Garage.Area': 'GarageArea',
    'Garage.Qual': 'GarageQual',
    'Garage.Cond': 'GarageCond',
    'Paved.Drive': 'PavedDrive',
    'Wood.Deck.SF': 'WoodDeckSF',
    'Open.Porch.SF': 'OpenPorchSF',
    'Enclosed.Porch': 'EnclosedPorchSF',
    '3Ssn.Porch': '3SsnPorchSF',
    'Screen.Porch': 'ScreenPorchSF',
    'Pool.Area': 'PoolArea',
    'Pool.QC': 'PoolQual',
    'Misc.Feature': 'MiscFeature',
    'Misc.Val': 'MiscVal',
    'Mo.Sold': 'MoSold',
    'Yr.Sold': 'YrSold',
    'Sale.Type': 'SaleType',
    'Sale.Condition': 'SaleCondition',
    'X1st.Flr.SF': 'X1FloorSF',
    'X2nd.Flr.SF': 'X2FloorSF',
    'X3Ssn.Porch': '3SsnPorchSF',
    'Kitchen.AbvGr': 'KitchenAbvGr',
    'Bedroom.AbvGr': 'BedroomAbvGr',
    }

#Applying the name change
data.rename(columns=RenameMapping, inplace=True)

Moreover, we will also remove the "order" column, considering that it is just an index column withou any significative meaning, and the dataset has a meaningful identificator given by the column "PID"

In [7]:
#Dropping 'Order' column
data = data.drop('Order', axis=1)

#### 2.2.b. Encoding Ordinal and Binary Variables
While CatBoost algorithm can weffectively handle categorical variables, for the case of ordinal and binary varibles we decided to rely on a controlled process. For these cases, we will follow the same process explained in Milestone 1 for converting ordinal and binary "string" variables into numerical ones. For more detailed explanations on each case, refer to Milestone 1 notebook.

In [8]:
# Lists of variables by type
ordinal = ['LotShape', 'Utilities', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQual', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQual', 'Fence', 'PavedDrive' ]
nominal = ['MSSubClass', 'MSZoning', 'LandContour', 'LotConfig', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating', 'Electrical', 'GarageType', 'MiscFeature', 'SaleType', 'SaleCondition']
binary = ['Street', 'CentralAir']
other = ['Alley']

##### 2.2.b.1. Encoding Ordinal Variables

In [9]:
# Mapping dictionary
variable_mappings = {
    'LotShape': {'Reg': 4, 'IR1': 3, 'IR2': 2, 'IR3': 1},
    'Utilities': {'AllPub': 4, 'NoSewr': 3, 'NoSeWa': 2, 'ELO': 1},
    'LandSlope': {'Gtl': 1, 'Mod': 2, 'Sev': 3},
    'ExterQual': {'Ex': 5, 'Gd': 4, 'Ta': 3, 'Fa': 2, 'Po': 1},
    'ExterCond': {'Ex': 5, 'Gd': 4, 'Ta': 3, 'Fa': 2, 'Po': 1},
    'BsmtQual': {'Ex': 6, 'Gd': 5, 'Ta': 4, 'Fa': 3, 'Po': 1, 'NA': 0},
    'BsmtCond': {'Ex': 6, 'Gd': 5, 'Ta': 4, 'Fa': 3, 'Po': 1, 'NA': 0},
    'BsmtExposure': {'Ex': 5, 'Gd': 4, 'Ta': 3, 'Fa': 2, 'Po': 1},
    'BsmtFinType1': {'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'Lwq': 2, 'Unf': 1, 'Na': 0},
    'BsmtFinType2': {'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'Lwq': 2, 'Unf': 1, 'Na': 0},
    'HeatingQual': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1},
    'KitchenQual': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1},
    'Functional': {'Typ': 8, 'Min1': 7, 'Min2': 6, 'Mod': 5, 'Maj1': 4, 'Maj2': 3, 'Sev': 2, 'Sal': 1},
    'FireplaceQu': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NA': 0},
    'GarageFinish': {'Fin': 3, 'RFn': 2, 'Unf': 1, 'NA': 0},
    'GarageQual': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NA': 0},
    'GarageCond': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NA': 0},
    'PoolQual': {'Ex': 4, 'Gd': 3, 'TA': 2, 'Fa': 1, 'Na': 0},
    'Fence': {'GdPrv': 4, 'MnPrv': 3, 'GdWo': 2, 'MnWw': 1, 'NA': 0},
    'PavedDrive': {'N': 0, 'P': 1, 'Y': 2}
}

# List of columns to map
columns_to_map = variable_mappings.keys()

def apply_mappings(data):
    for column in columns_to_map:
        data[column] = data[column].map(variable_mappings[column])

# Applying changes to dataset
apply_mappings(data)

# Iterate through the columns in the 'ordinal' list and encode NA values as 0
for column in ordinal:
    data[column].fillna(0, inplace=True)

##### 2.2.b.2. Encoding Binary Variables

In [10]:
#Convert binary variables into numerical
Street = {'Grvl': 0, 'Pave': 1}
CentralAir = {'N': 0, 'Y': 1}

#Applying changes to ataset
data['Street'] = data['Street'].map(Street)
data['CentralAir'] = data['CentralAir'].map(CentralAir)


For the case of the 'Alley' variable, we also follow what was already discussed in Milestone 1. Given that the variable has 3 categories, 2 indicating different types of alley material (which a small amunt of positive rows each one), and the third type indicating that the house doesn't have an alley, the variable is also coverted to a binary variable, indicating the existence or not of an alley.

In [11]:
#Encode
Alley = {'Grvl': 1, 'Pave': 1, 'NA': 0}

#Applying changes to dataset
data['Alley'] = data['Alley'].map(Alley)

#Encode NA values as 0
data['Alley'].fillna(0, inplace=True)


### 2.3. Data Split into Train and Test

Following the generally agreed good practices on Machine Learning models treatment, the remaining preprocessing steps, related to NA values handling, are performed after the data splitting is done. This workflow assures that there is no data leakage occuring between the training and test sets in the case that the values of existing rows are used to fill missing values (ex: if filling NA with column mean).

The data splitting is done following the same workflow already explained in Milestone 2, in which we demonstrated the importance of permorfiming a neighborhood-based splitting, due to an unbalanced spatial distribution. We also apply the same rows dropping based on the small number of samples in certain neighbirhoods. For more details on this regard, please refer to the mentioned notebook.

In [12]:
# Drop the lines for Landmrk and GrnHill neighborhoods
neighb_todrop = ['Landmrk', 'GrnHill']
data = data[~data['Neighborhood'].isin(neighb_todrop)]

# Divide data into train and test with stratified split
data_train, data_test= train_test_split(data, test_size=0.2, random_state=33, stratify=data['Neighborhood'])

### 2.4. Handling NA Values
The workflow for NA values handling also follows the same process already analyzed, explained and established for each variable in Milestone 1. for more details, please refer to the mentioned notebook.

In [13]:
#Filling NAs on training
data_train['MiscFeature'].fillna('None', inplace=True)
data_train['GarageType'].fillna('None', inplace=True)

#Filling NAs on test
data_test['MiscFeature'].fillna('None', inplace=True)
data_test['GarageType'].fillna('None', inplace=True)

In [14]:
columns_to_check = ['MasVnrType' , 'BsmtHalfBath', 'BsmtFullBath', 'GarageCars', 'Electrical', 'GarageArea']

#  Drop rows with NaN values in specific columns in training
data_train.dropna(subset=columns_to_check, inplace=True)

# Drop rows with NaN values in specific columns in test
data_test.dropna(subset=columns_to_check, inplace=True)

In [15]:
## LotFrontage VARIABLE

#Fill Na values with column mean - training
mean_LotFrontage_train = data_train['LotFrontage'].mean()
data_train['LotFrontage'].fillna(mean_LotFrontage_train, inplace=True)

#Fill Na values with column mean - test
mean_LotFrontage_test = data_test['LotFrontage'].mean()
data_test['LotFrontage'].fillna(mean_LotFrontage_test, inplace=True)

In [16]:
## GarageYrBlt VARIABLE

#Filling GarageYrBlt NA values with YearBuilt in training
data_train['GarageYrBlt'].fillna(data_train['YearBuilt'], inplace=True)

#Filling GarageYrBlt NA values with YearBuilt in test
data_test['GarageYrBlt'].fillna(data_test['YearBuilt'], inplace=True)

Finally, we check that there is not any NA value in both train and test datasets and we visualize how the datasets looks like after preprocessing.

In [17]:
#Check NAs in test
data_test.isna().any().any()

False

In [18]:
#Check NAs in test
data_test.isna().any().any()

False

In [19]:
# Previsualize train dataset
data_train.head()

Unnamed: 0,PID,BldgArea,SoldPrice,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQual,CentralAir,Electrical,X1FloorSF,X2FloorSF,LowQualFinSF,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorchSF,3SsnPorchSF,ScreenPorchSF,PoolArea,PoolQual,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
174,902125080,1605,107400,50,RM,60.0,5790,1,0.0,4,Lvl,4,Corner,1,OldTown,Norm,Norm,1Fam,2Story,3,6,1915,1950,Gambrel,CompShg,VinylSd,VinylSd,,0.0,4.0,4.0,CBlock,3.0,0.0,0.0,1.0,0.0,1.0,0.0,840.0,840.0,GasA,4,0,SBrkr,840,765,0,0.0,0.0,2,0,3,2,3,8,8,0,0.0,Detchd,1915.0,1.0,1.0,379.0,3.0,3.0,2,0,0,202,0,0,0,0.0,0.0,,0,5,2010,WD,Normal
1765,528344020,2582,322500,60,RL,74.0,11002,1,0.0,3,Lvl,4,Inside,1,NoRidge,Norm,Norm,1Fam,2Story,8,5,1998,1999,Gable,CompShg,VinylSd,VinylSd,,0.0,4.0,0.0,PConc,5.0,0.0,0.0,6.0,1048.0,1.0,0.0,341.0,1389.0,GasA,5,1,SBrkr,1411,1171,0,1.0,0.0,2,1,4,1,4,9,8,1,3.0,Attchd,1998.0,3.0,3.0,758.0,3.0,3.0,2,286,60,0,0,0,0,0.0,0.0,,0,1,2007,WD,Normal
2891,916225130,2519,335000,60,RL,42.0,26178,1,0.0,3,Lvl,4,Inside,2,Timber,Norm,Norm,1Fam,2Story,7,5,1989,1990,Hip,CompShg,MetalSd,MetalSd,BrkFace,293.0,4.0,0.0,PConc,5.0,0.0,4.0,6.0,965.0,1.0,0.0,245.0,1210.0,GasA,5,1,SBrkr,1238,1281,0,1.0,0.0,2,1,4,1,4,9,8,2,4.0,Attchd,1989.0,2.0,2.0,628.0,3.0,3.0,2,320,27,0,0,0,0,0.0,0.0,,0,4,2006,WD,Normal
125,534427010,1728,84900,90,RL,98.0,13260,1,0.0,3,Lvl,4,Inside,1,NAmes,Norm,Norm,Duplex,1Story,5,6,1962,2001,Hip,CompShg,HdBoard,HdBoard,BrkFace,144.0,0.0,0.0,CBlock,0.0,0.0,0.0,4.0,1500.0,1.0,0.0,228.0,1728.0,GasA,3,1,SBrkr,1728,0,0,2.0,0.0,2,0,6,2,3,10,8,0,0.0,,1962.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,0,0,0,0.0,0.0,,0,1,2010,Oth,Abnorml
1234,535150210,1098,135000,20,RL,69.099331,7390,1,0.0,3,Lvl,4,Inside,1,NAmes,Norm,Norm,1Fam,1Story,5,7,1955,1955,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,151.0,0.0,0.0,CBlock,0.0,0.0,0.0,5.0,902.0,1.0,0.0,196.0,1098.0,GasA,3,1,SBrkr,1098,0,0,1.0,0.0,1,0,3,1,3,6,8,0,0.0,Attchd,1955.0,1.0,1.0,260.0,3.0,3.0,2,0,0,0,0,0,0,0.0,0.0,,0,7,2008,WD,Normal


In [20]:
# Previsualize test dataset
data_test.head()

Unnamed: 0,PID,BldgArea,SoldPrice,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQual,CentralAir,Electrical,X1FloorSF,X2FloorSF,LowQualFinSF,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorchSF,3SsnPorchSF,ScreenPorchSF,PoolArea,PoolQual,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1604,923277030,816,130500,20,RL,60.0,6600,1,0.0,4,Lvl,4,Inside,1,Mitchel,Norm,Norm,1Fam,1Story,5,9,1982,2008,Gable,CompShg,VinylSd,VinylSd,,0.0,4.0,4.0,CBlock,0.0,0.0,0.0,5.0,641.0,1.0,0.0,175.0,816.0,GasA,5,1,SBrkr,816,0,0,0.0,1.0,1,0,3,1,4,5,8,1,5.0,Attchd,1982.0,1.0,1.0,264.0,3.0,3.0,2,0,0,0,0,0,0,0.0,3.0,,0,10,2008,WD,Normal
2510,533221080,1524,166000,160,FV,69.392857,2998,1,0.0,4,Lvl,4,Inside,1,Somerst,Norm,Norm,TwnhsE,2Story,6,5,2000,2000,Gable,CompShg,MetalSd,MetalSd,BrkFace,513.0,4.0,0.0,PConc,5.0,0.0,0.0,6.0,353.0,1.0,0.0,403.0,756.0,GasA,5,1,SBrkr,768,756,0,0.0,0.0,2,1,2,1,4,4,8,0,0.0,Detchd,2000.0,1.0,2.0,440.0,3.0,3.0,2,0,32,0,0,0,0,0.0,0.0,,0,6,2006,WD,Normal
430,528108140,2020,402861,20,RL,94.0,12220,1,0.0,4,Lvl,4,Inside,1,NridgHt,Norm,Norm,1Fam,1Story,10,5,2009,2009,Hip,CompShg,CemntBd,CmentBd,BrkFace,305.0,5.0,0.0,CBlock,6.0,0.0,0.0,6.0,1436.0,1.0,0.0,570.0,2006.0,GasA,5,1,SBrkr,2020,0,0,1.0,0.0,2,1,3,1,5,9,8,1,4.0,Attchd,2009.0,3.0,3.0,900.0,3.0,3.0,2,156,54,0,0,0,0,0.0,0.0,,0,9,2009,New,Partial
2900,916477010,1960,320000,20,RL,95.0,13618,1,0.0,4,Lvl,4,Corner,1,Timber,Norm,Norm,1Fam,1Story,8,5,2005,2006,Gable,CompShg,VinylSd,VinylSd,Stone,198.0,4.0,0.0,PConc,6.0,5.0,0.0,6.0,1350.0,1.0,0.0,378.0,1728.0,GasA,5,1,SBrkr,1960,0,0,1.0,0.0,2,0,3,1,4,8,8,2,4.0,Attchd,2005.0,3.0,3.0,714.0,3.0,3.0,2,172,38,0,0,0,0,0.0,0.0,,0,11,2006,New,Partial
665,535383120,725,78500,30,RL,60.0,10800,1,1.0,4,Lvl,4,Corner,1,OldTown,Norm,Norm,1Fam,1Story,3,5,1890,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,0.0,0.0,BrkTil,0.0,0.0,0.0,1.0,0.0,1.0,0.0,630.0,630.0,GasA,3,1,FuseA,725,0,0,0.0,0.0,1,1,1,1,3,4,8,0,0.0,Detchd,1959.0,1.0,1.0,320.0,3.0,3.0,2,0,30,0,0,0,0,0.0,0.0,,0,11,2009,WD,Normal


### Defining a catboost model - checking accuracy cross validation score (without tuning)

In [21]:
# import utilities
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import StratifiedKFold

### Hyperparameter tuning

##### CV object creation
In order to be able to perform GridSearchCV on training data, we should pass the model, the parameters and the cross validation object to the `GridSearchCV` function. For this, we create a cross validation object with 5 folds. This cannot be directly created using the instantiation of the StratifiedKFold object because such an object will only be useful in a classification problem. Where stratification is carried out on the target variable. Since we need to stratify based on an independent variable, `neighborhood` , we need to create the splits accordingly and then convert them into a CV object which can be passed to GridSearchCV function. 

In [None]:
#inner loop - cross validation
kf = StratifiedKFold(n_splits=5, random_state=33, shuffle=True)

#we want each of the folds to have the same distribution of neighborhoods
splits = kf.split(data_train, data_train['Neighborhood'])
print(splits)
# creating a nested list of train and validation indices for each fold
train_indices, val_indices = [list(trainval) for trainval in zip(*splits)]

cv_object = [*zip(train_indices, val_indices)] 
# use this cv object in grid search

In [None]:
# use this loop to tune - in case grisearchcv cant be implemented
for fold, (train_index, val_index) in enumerate(splits):
    print(f"Fold {fold}")
    print(" ")
    train_data = data_train.iloc[train_index]
    val_data = data_train.iloc[val_index]

    # print neighborhood wise distribution as percentages in descending order in train and val
    print("Train")
    print(train_data['Neighborhood'].value_counts(normalize=True).sort_values(ascending=False))
    print("Val")
    print(val_data['Neighborhood'].value_counts(normalize=True).sort_values(ascending=False))

# References
- Notebook GitHub: https://github.com/anantgupta129/CatBoost-in-Python-ML/tree/master
- Catboost paper: https://arxiv.org/pdf/1706.09516.pdf
- Catboost paper 2: http://learningsys.org/nips17/assets/papers/paper_11.pdf
- Catboost library website: https://catboost.ai/
- Comprehensive notebook on catboost feature encoding paramters tunning: https://github.com/catboost/catboost/blob/master/catboost/tutorials/categorical_features/categorical_features_parameters.ipynb

https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

https://towardsdatascience.com/what-is-stratified-cross-validation-in-machine-learning-8844f3e7ae8e 

https://www.kaggle.com/alexisbcook/cross-validation