Gradient Boosted trees have become one of the most powerful algorithms for training on tabular data. Over the recent past, we’ve been fortunate to have may implementations of boosted trees – each with their own unique characteristics.
In this notebook, I will implement LightGBM, XGBoost and CatBoost to tackle this Kaggle problem.

**What is  Boosting**

Boosting is a sequential technique which works on the principle of an ensemble. How Boosting Algorithm Works?

The basic principle behind the working of the boosting algorithm is to generate multiple weak learners and combine their predictions to form one strong rule. These weak rules are generated by applying base Machine Learning algorithms on different distributions of the data set. These algorithms generate weak rules for each iteration. After multiple iterations, the weak learners are combined to form a strong learner that will predict a more accurate outcome.
 Note that a weak learner is one which is slightly better than random guessing. For example, a decision tree whose predictions are slightly better than 50%.

 **Gradient Boosting** works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

**Here’s how the algorithm works**:

**Step 1**: The base algorithm reads the data and assigns equal weight to each sample observation.

**Step 2**: False predictions made by the base learner are identified. In the next iteration, these false predictions are assigned to the next base learner with a higher weightage on these incorrect predictions.

**Step 3**: Repeat step 2 until the algorithm can correctly classify the output.

Therefore, the main aim of Boosting is to focus more on miss-classified predictions.

![img](https://i.imgur.com/OpP7D0X.png)

[Source](https://catboost.ai/news/catboost-enables-fast-gradient-boosting-on-decision-trees-using-gpus)

These techniques are used to build ensemble models in an iterative way. On the first iteration, the algorithm learns the first tree to reduce the training error, shown on left-hand image above. The right-hand image above, shows the second iteration, in which the algorithm learns one more tree to reduce the error made by the first tree. The algorithm repeats this procedure until it builds a decent quality mode.

![img](https://i.imgur.com/OpP7D0X.png)

The common approach for classification uses Logloss while regression optimizes using root mean square error. Ranking tasks commonly implements some variation of LambdaRank.

In [22]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import preprocessing, model_selection, metrics
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor

from IPython.display import display # Allows the use of display() for DataFrames

import warnings
warnings.filterwarnings('ignore')

In [23]:
train_df = pd.read_csv('../input/santander-value-prediction-challenge/train.csv')

# Given the test.csv file is huge, reading which each time during development
# takes couple of minutes - making my development process slower.
# I am reading only first 100 rows during development.
test_df = pd.read_csv('../input/santander-value-prediction-challenge/test.csv', nrows=100)

# But in Kaggle Kernel, and before final submission
# comment-out the above line and un-comment below line to read the full train.csv
# test_df = pd.read_csv('../input/santander-value-prediction-challenge/test.csv')

train_df.head()

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
1,000fbd867,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
2,0027d6b71,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
3,0028cbf45,2000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4,002a68644,14400000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0


In [24]:
test_df.head()

Unnamed: 0,ID,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,20aa07010,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000137c73,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,00021489f,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0004d7953,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,00056a333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,00056d8eb,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Columns: 4993 entries, ID to 9fc776466
dtypes: float64(1845), int64(3147), object(1)
memory usage: 169.9+ MB


Initial Observations looking at the above data

- Column name does not mean anything now, as they are all anonymized
- The dataframe is full of zero values.
- The dataset is a sparse tabular one refer [this](https://www.kaggle.com/c/santander-value-prediction-challenge/discussion/59128)

Target Variable:

First doing some scatter plot of the target variable to check for visible outliers.

In [26]:
print('Train rows and columns: ', train_df.shape)

# Keeping below line commented out as its huge 49,342 row file with 1gb size and so take longer to run each time
print('Test rows and columns: ', test_df.shape)

Train rows and columns:  (4459, 4993)
Test rows and columns:  (49342, 4992)


In [27]:
# Keeping below lines commented out during development

# plt.figure(figsize=(8,6))
# plt.scatter(range(train_df.shape[0]), np.sort(train_df['target'].values))
# plt.xlabel('index', fontsize=12)
# plt.ylabel('Target', fontsize=12)
# plt.title('Distribution of Target', fontsize=14)
# plt.show()

TO-DO - So there's not too much of outliers (visibly) but the distribution range is high. Now want to do a histogram

## Checking for missing / null values in data

In [28]:
print("All Features in Train data with NaN Values =", str(train_df.columns[train_df.isnull().sum() != 0].size) )
# print("All Features in Test data with NaN Values =", str(test_df.columns[train_df.isnull().sum() != 0].size) )

All Features in Train data with NaN Values = 0


## Remove constant columns from data

In [29]:
const_columns_to_remove = []
for col in train_df.columns:
    if col != 'ID' and col != 'target':
        if train_df[col].std() == 0:
            const_columns_to_remove.append(col)

# Now remove that array of const columns from the data
train_df.drop(const_columns_to_remove, axis=1, inplace=True)
test_df.drop(const_columns_to_remove, axis=1, inplace=True)

# Print to see the reduction of columns
print('Train rows and columns after removing constant columns: ', train_df.shape)

print('Following `{}` Constant Column\n are removed'.format(len(const_columns_to_remove)))
print(const_columns_to_remove)

Train rows and columns after removing constant columns:  (4459, 4737)
Following `256` Constant Column
 are removed
['d5308d8bc', 'c330f1a67', 'eeac16933', '7df8788e8', '5b91580ee', '6f29fbbc7', '46dafc868', 'ae41a98b6', 'f416800e9', '6d07828ca', '7ac332a1d', '70ee7950a', '833b35a7c', '2f9969eab', '8b1372217', '68322788b', '2288ac1a6', 'dc7f76962', '467044c26', '39ebfbfd9', '9a5ff8c23', 'f6fac27c8', '664e2800e', 'ae28689a2', 'd87dcac58', '4065efbb6', 'f944d9d43', 'c2c4491d5', 'a4346e2e2', '1af366d4f', 'cfff5b7c8', 'da215e99e', '5acd26139', '9be9c6cef', '1210d0271', '21b0a54cb', 'da35e792b', '754c502dd', '0b346adbd', '0f196b049', 'b603ed95d', '2a50e001c', '1e81432e7', '10350ea43', '3c7c7e24c', '7585fce2a', '64d036163', 'f25d9935c', 'd98484125', '95c85e227', '9a5273600', '746cdb817', '6377a6293', '7d944fb0c', '87eb21c50', '5ea313a8c', '0987a65a1', '2fb7c2443', 'f5dde409b', '1ae50d4c3', '2b21cd7d8', '0db8a9272', '804d8b55b', '76f135fa6', '7d7182143', 'f88e61ae6', '378ed28e0', 'ca4ba131e', 

## Remove Duplicate Columns

**I will be using the duplicated() function of pandas - here's how it works:**

Suppose the columns of the data frame are `['alpha','beta','alpha']`

`df.columns.duplicated()` returns a boolean array: a `True` or `False` for each column. If it is `False` then the column name is unique up to that point, if it is `True` then the column name is duplicated earlier. For example, using the given example, the returned value would be `[False,False,True]`. 

`Pandas` allows one to index using boolean values whereby it selects only the `True` values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (ie `[True, True, False] = ~[False,False,True]`)

Finally, `df.loc[:,[True,True,False]]` selects only the non-duplicated columns using the aforementioned indexing capability. 

**Note**: the above only checks columns names, *not* column values.

In [30]:
train_df = train_df.loc[:,~train_df.columns.duplicated()]
print('Train rows and columns after removing duplicate columns: ', train_df.shape)

Train rows and columns after removing duplicate columns:  (4459, 4737)


## Handling Sparse data

**What is Sparse data**

As an example, let's say that we are collecting data from a device which has 12 sensors. And you have collected data for 10 days.

The data you have collected is as follows:
[![enter image description here][1]][1]

The above is an example of sparse data because most of the sensor outputs are zero. Which means those sensors are functioning properly but the actual reading is zero. Although this matrix has high dimensional data (12 axises) it can be said that it contains less information.

So basically, sparse data means that there are many gaps present in the data being recorded. For example, in the case of the sensor mentioned above, the sensor may send a signal only when the state changes, like when there is a movement of the door in a room. This data will be obtained intermittently because the door is not always moving. Hence, this is sparse data.

  [1]: https://i.stack.imgur.com/Af5IH.png


First lets have a look at or train_df data again, that how much of sparse data is there. And as we can see there are plenty of '0'

In [31]:
train_df.head()

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
1,000fbd867,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
2,0027d6b71,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
3,0028cbf45,2000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4,002a68644,14400000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0


## Check and handle total memory of data

`get_dummies` pandas function converts categorical variables into indicator variables.

In [32]:
def print_memory_usage_of_df(df):
    bytes_per_mb = 0.000001
    memory_usage = round(df.memory_usage().sum() * bytes_per_mb, 3)
    print('Memory usage is ', str(memory_usage) + " MB")

print_memory_usage_of_df(train_df)
print(train_df.shape)

Memory usage is  168.978 MB
(4459, 4737)


In [33]:
dummy_encoded_train_df = pd.get_dummies(train_df)
dummy_encoded_train_df.shape

(4459, 9195)

In [34]:
print_memory_usage_of_df(dummy_encoded_train_df)

Memory usage is  188.825 MB


We see that the memory usage of the dummy_encoded_train_df data  frame is larger compared to the original, because now the number of columns have increased in the data frame.

##### So lets apply `sparse=True` if it reduces the memory-usages to some extent.

This parameter `sparse` defaults to False. If True the encoded columns are returned as **SparseArray**. By setting `sparse=True` we create a sparse data frame directly

In [35]:
dummy_encoded_sparse_train_df = pd.get_dummies(train_df, sparse=True)
dummy_encoded_sparse_train_df.shape

(4459, 9195)

In [36]:
print_memory_usage_of_df(dummy_encoded_sparse_train_df)


Memory usage is  168.965 MB


But looks like in this case the reduction in memory_size was not a huge amount. So lets try some other alternative

## [Pandas Sparse Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures)

Pandas provides data structures for efficient storage of sparse data. In these structures, zero values (or any other specified value) are not actually stored in the array. Rather, you can view these objects as being “compressed” where any data matching a specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.

Storing only the non-zero values and their positions is a common technique in storing sparse data sets.

This hugely reduces the memory usage of our data set and “compress” the data frame.

In our example, we will convert the one-hot encoded columns into SparseArrays, which are 1-d arrays where only non-zero values are stored.

In [37]:
def convert_df_to_sparse_array(df, exclude_columns=[]):
    df = df.copy()
    exclude_columns = set(exclude_columns)

    for (column_name, column_data) in df.iteritems():
        if column_name in exclude_columns:
            continue
        df[column_name] = pd.SparseArray(column_data.values, dtype='uint8')

    return df

# Now convert our earlier dummy_encoded_train_df with above function and check memory_size

# train_data_post_conversion_to_sparse_array = convert_df_to_sparse_array(dummy_encoded_train_df)
# print('Sparse Array Train_DF rows and columns: ', train_data_post_conversion_to_sparse_array.shape)
# print_memory_usage_of_df(train_data_post_conversion_to_sparse_array)

# Commenting the above out - for running the Notebook faster during my development

**We see the that the memory_usage is substantially reduced now**

### A warning on using df.iteritems()

The df.iteritems() iterates over columns and not rows. Generally iteration over dataframes is an anti-pattern, and something we should avoid, unless you want to get used to a lot of waiting.

## Sparse Data Removal (Following simpler plain-vanilla technique)

### For this notebook, I will go with the easier approach to handle sparse data - which is just to drop it from the dataframe
like below code, I will do this for the sake of running this notebook faster
And then later check back which approach gives me better predictions

In [38]:
def drop_parse_from_df(df):
    column_list_to_drop_data_from = [i for i in df.columns if not i in ['ID', 'target'] ]
    for column in column_list_to_drop_data_from:
        if len(np.unique(df[column])) < 2:
            df.drop(column, axis=1, inplace=True)
            df.drop(column, axis=1, inplace=True)
    return df

# The same above function, if I wanted to do for 2 dataframes together.
# def drop_parse_from_df(df_1, df_2):
#     column_list_to_drop_data_from = [i for i in df_1.columns if not i in ['ID', 'target'] ]
#     for column in column_list_to_drop_data_from:
#         if len(np.unique(df_1[column])) < 2:
#             df_1.drop(column, axis=1, inplace=True)
#             df_2.drop(column, axis=1, inplace=True)
#     return df_1, df_2

train_df = drop_parse_from_df(train_df)
print('Rows and Columns in train_df after removing sparse ', format(train_df.shape))


Rows and Columns in train_df after removing sparse  (4459, 4737)


### Split data into Train and Test for Model Training

In [39]:
X_train = train_df.drop(['ID', 'target'], axis=1)

y_train = np.log1p(train_df['target'].values)

X_test = test_df.drop('ID', axis=1)

X_train_split, X_validation, y_train_split, y_validation = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

## LightGBM Model Training

### Fundamentals of LightGBM Model

It is a **gradient boosting** model that makes use of tree based learning algorithms. It is considered to be a fast processing algorithm.

While other algorithms trees grow horizontally, LightGBM algorithm grows vertically, meaning it grows leaf-wise and other algorithms grow level-wise. LightGBM chooses the leaf with large loss to grow. It can lower down more loss than a level wise algorithm when growing the same leaf.

![img](https://i.imgur.com/pzOP2Lb.png)

[Source of Image](https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/)

Light GBM is prefixed as Light because of its high speed. Light GBM can handle the large size of data and takes lower memory to run.

Another reason why Light GBM is so popular is because it focuses on accuracy of results. LGBM also supports GPU learning and thus data scientists are widely using LGBM for data science application development.

**Leaf growth technique in LightGBM**

LightGBM uses leaf-wise (best-first) tree growth. It chooses to grow the leaf that minimizes the loss, allowing a growth of an imbalanced tree. Because it doesn’t grow level-wise, but leaf-wise, over-fitting can happen when data is small. In these cases, it is important to control the tree depth.

### When to use LightGBM ?

LightGBM is not preferred for a small volume of datasets as it can easily overfit small data due to its sensitivity. Hence, it generally advised for data having more than 10,000+ rows, though there is no fixed threshold that helps in deciding the usage of LightGBM.

## What are LightGBM Parameters?

While, LightGBM has more than 100 parameters that are given in the [documentation of LightGBM](https://github.com/microsoft/LightGBM), let's checkout the most important ones.

#### Control Parameters

**Max depth**: It gives the depth of the tree and also controls the overfitting of the model. If you feel your model is getting overfitted lower down the max depth.

**Min_data_in_leaf**: Leaf minimum number of records also used for controlling overfitting of the model.

**Feature_fraction**: It decides the randomly chosen parameter in every iteration for building trees. If it is 0.7 then it means 70% of the parameter would be used.

**Bagging_fraction**: It checks for the data fraction that will be used in every iteration. Often, used to increase the training speed and avoid overfitting.

**Early_stopping_round**: If the metric of the validation data does show any improvement in last early_stopping_round rounds. It will lower the imprudent iterations.

**Lambda**: It states regularization. Its values range from 0 to 1.

**Min_gain_to_split**: Used to control the number of splits in the tree.

### Core Parameters

**Task**: It tells about the task that is to be performed on the data. It can either train on the data or prediction on the data.

**Application**: This parameter specifies whether to do regression or classification. LightGBM default parameter for application is regression.

**Binary**: It is used for binary classification.

**Multiclass**: It is used for multiclass classification problems.

**Regression**: It is used for doing regression.

**Boosting**: It specifies the algorithm type.

**rf** :  Used for Random Forest.

**Goss**: Gradient-based One Side Sampling.

**Num_boost_round**: It tells about the boosting iterations.

**Learning_rate**: The role of learning rate is to power the magnitude of the changes in the approximate that gets updated from each tree’s output. It has values : 0.1,0.001,0.003.

**Num_leaves**: It gives the total number of leaves that would be present in a full tree, default value: 31

### Metric Parameter

It takes care of the loss while building the model. Some of them are stated below for classification as well as regression.

**Mae**: Mean absolute error.

**Mse**: Mean squared error.

**Binary_logloss**: Binary Classification loss.

**Multi_logloss**: Multi Classification loss.


### Parameter Tuning

Parameter Tuning is an important part that is usually done by data scientists to achieve a good accuracy, fast result and to deal with overfitting. Let us see quickly some of the parameter tuning you can do for better results.

**num_leaves**: This parameter is responsible for the complexity of the model. Its values should be ideally less than or equal to 2. If its value is more it would result in overfitting of the model.

**Min_data_in_leaf**: Assigning bigger value to this parameter can result in underfitting of the model. Giving it a value of 100 or 1000 is sufficient for a large dataset.

**Max_depth**: To limit the depth of the tree max_depth is used.

In [40]:
def run_light_gbm(train_x, train_y, validation_x, validation_y, test_x):
    params = {
        "objective" : "regression",
        "metric" : "rmse",
        "num_leaves" : 40,
        "learning_rate" : 0.004,
        "bagging_fraction" : 0.6,
        "feature_fraction" : 0.6,
        "bagging_frequency" : 6,
        "bagging_seed" : 42,
        "verbosity" : -1,
        "seed": 42
    }

    lg_train = lgb.Dataset(train_x, label=train_y)
    lg_validation = lgb.Dataset(validation_x, label=validation_y)
    evals_result = {}

    model = lgb.train(params, lg_train, 5000,
                      valid_sets=[lg_train, lg_validation],
                      early_stopping_rounds=100,
                      verbose_eval=150,
                      evals_result=evals_result )

    pred_test_y = np.expm1(model.predict(test_x, num_iteration=model.best_iteration ))

    return pred_test_y, model, evals_result

In [41]:
# Training and output of LightGBM Model
predictions_test, model, evals_result = run_light_gbm(X_train_split, y_train_split, X_validation, y_validation, X_test)
print('Output of LightGBM Model training..')



Training until validation scores don't improve for 100 rounds
[150]	training's rmse: 1.5082	valid_1's rmse: 1.53919
[300]	training's rmse: 1.34436	valid_1's rmse: 1.46593
[450]	training's rmse: 1.23324	valid_1's rmse: 1.43393
[600]	training's rmse: 1.14931	valid_1's rmse: 1.41848
[750]	training's rmse: 1.08371	valid_1's rmse: 1.41315
[900]	training's rmse: 1.03011	valid_1's rmse: 1.41131
Early stopping, best iteration is:
[934]	training's rmse: 1.01913	valid_1's rmse: 1.41118
Output of LightGBM Model training..


## Differences in LightGBM & XGBoost

LightGBM uses a novel technique of Gradient-based One-Side Sampling (GOSS) to filter out the data instances for finding a split value while XGBoost uses pre-sorted algorithm & Histogram-based algorithm for computing the best split. Here instances mean observations/samples.

Let's see how pre-sorting splitting works-

- For each node, enumerate over all features

- For each feature, sort the instances by feature value

- Use a linear scan to decide the best split along that feature basis information gain

- Take the best split solution along all the features

In simple terms, Histogram-based algorithm splits all the data points for a feature into discrete bins and uses these bins to find the split value of histogram. While, it is efficient than pre-sorted algorithm in training speed which enumerates all possible split points on the pre-sorted feature values, it is still behind GOSS in terms of speed.

## CatBoost Model Training

CatBoost is another competitor to XGBoost, LightGBM and H2O. “CatBoost” name comes from two words “Category” and “Boosting”.

The library works well with multiple Categories of data, such as audio, text, image including historical data.

The CatBoost library can be used to solve both classification and regression challenge. For classification, you can use **“CatBoostClassifier”** and for regression, **“CatBoostRegressor“**.

**[Yandex](https://yandex.com/)** is relying heavily on Catboost for ranking, forecasting and recommendations. This model is serving more than 70 million users each month.

"CatBoost is an algorithm for gradient boosting on decision trees. Developed by Yandex researchers and engineers, it is the successor of the MatrixNet algorithm that is widely used within the company for ranking tasks, forecasting and making recommendations. It is universal and can be applied across a wide range of areas and to a variety of problems."

Overall some of the algorithmic enhancements that **Catboost** brought:

- 1. For data with **categorical** features the accuracy of CatBoost would be better compared to other algorithms.

2. Better overfitting handling: - CatBoost uses the implementation of ordered boosting, an alternative to the classic boosting algorithm, which will be specially significant on small datasets

3. GPU-training: - The versions of CatBoost available from pip install (pip install catboost) and conda install (conda install catboost) have GPU support out-of-the-box. You just need to specify that you want to train your model on GPU in the corresponding HP (will be shown below).

**Categorical features handling in CatBoost Algorithm**

[The below is taken from this paper](http://learningsys.org/nips17/assets/papers/paper_11.pdf)

Categorical features have a discrete set of values called categories which are not necessary comparable with each other; thus, such features cannot be used in binary decision trees directly. A common practice for dealing with categorical features is converting them to numbers at the preprocessing time, i.e., each category for each example is substituted with one or several numerical values. The most widely used technique which is usually applied to low-cardinality categorical features is one-hot encoding: the original feature is removed and a new binary variable is added for each category [14]. One-hot encoding can be done during the preprocessing phase or during training, the latter can be implemented more efficiently in terms of training time and is implemented in CatBoost.

For further details on this red [CatBoost's documentation](https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html)

**Leaf growth algorithm in CatBoost**

Catboost grows a balanced tree. In each level of such a tree, the feature-split pair that brings to the lowest loss (according to a penalty function) is selected and is used for all the level’s nodes. It is possible to change its policy using the grow-policy parameter.

## CatBoost Training Parameters

Let’s look at the common parameters in CatBoost:

**loss_function** alias as **objective** — Metric used for training. These are regression metrics such as root mean squared error for regression and logloss for classification.

**eval_metric** — Metric used for detecting over-fitting.

**iterations** — The maximum number of trees to be built, defaults to 1000. It aliases are num_boost_round, n_estimators, and num_trees.

**learning_rate** alias **eta** — The learning rate that determines how fast or slow the model will learn. The default is usually varies between 0.01 to 0.03.

**random_seed** alias **random_state** — The random seed used for training.

**l2_leaf_reg** alias **reg_lambda** — Coefficient at the L2 regularization term of the cost function. The default is 3.0.
bootstrap_type — Determines the sampling method for the weights of the objects, e.g Bayesian, Bernoulli, MVS, and Poisson.
depth —The depth of the tree.
grow_policy — Determines how the greedy search algorithm will be applied. It can be either SymmetricTree, Depthwise, or Lossguide. SymmetricTree is the default. In SymmetricTree, the tree is built level-by-level until the depth is attained. In every step, leaves from the previous tree are split with the same condition. When Depthwise is chosen, a tree is built step-by-step until the specified depth is achieved. On each step, all non-terminal leaves from the last tree level are split. The leaves are split using the condition that leads to the best loss improvement. In Lossguide, the tree is built leaf-by-leaf until the specified number of leaves is attained. On each step, the non-terminal leaf with the best loss improvement is split
min_data_in_leaf alias min_child_samples — This is the minimum number of training samples in a leaf. This parameter is only used with the Lossguide and Depthwise growing policies.
max_leaves alias num_leaves — This parameter is used only with the Lossguide policy and determines the number of leaves in the tree.
ignored_features — Indicates the features that should be ignored in the training process.
nan_mode — The method for dealing with missing values. The options are Forbidden, Min, and Max. The default is Min. When Forbidden is used, the presence of missing values leads to errors. With Min, the missing values are taken as the minimum values for that feature. In Max, the missing values are treated as the maximum value for the feature.
leaf_estimation_method — The method used to calculate values in leaves. In classification, 10 Newton iterations are used. Regression problems using quantile or MAE loss use one Exact iteration. Multi classification uses one Netwon iteration.
leaf_estimation_backtracking — The type of backtracking to be used during gradient descent. The default is AnyImprovement. AnyImprovement decreases the descent step, up to where the loss function value is smaller than it was in the last iteration. Armijo reduces the descent step until the Armijo condition is met.
boosting_type — The boosting scheme. It can be plain for the classic gradient boosting scheme, or ordered, which offers better quality on smaller datasets.
score_function — The score type used to select the next split during tree construction. Cosine is the default option. The other available options are L2, NewtonL2, and NewtonCosine.
early_stopping_rounds — When True, sets the overfitting detector type to Iter and stops the training when the optimal metric is achieved.
classes_count — The number of classes for multi-classification problems.
task_type — Whether you are using a CPU or GPU. CPU is the default.
devices — The IDs of the GPU devices to be used for training.
cat_features — The array with the categorical columns.
text_features —Used to declare text columns in classification problems.



## Note on XGBoost

Below we will be using **XGBoost** which is an advanced version of Gradient boosting method, it literally means eXtreme Gradient Boosting. XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The XGBoost library implements the [gradient boosting decision tree algorithm](https://en.wikipedia.org/wiki/Gradient_boosting).

Different from the traditional gradient descent technique, gradient enhancement helps to predict the optimal gradient of the additional model. This technique can reduce the output error at each iteration.

In practice what we do in order to build the learner is to:

- Start with single root (contains all the training examples)

- Iterate over all features and values per feature, and evaluate each possible split loss reduction:

- gain = loss(father instances) - (loss(left branch)+loss(right branch))

- The gain for the best split must be positive (and > min_split_gain parameter), otherwise we must stop growing the branch.

**Leaf growth**

XGboost splits up to the specified max_depth hyperparameter and then starts pruning the tree backwards and removes splits beyond which there is no positive gain. It uses this approach since sometimes a split of no loss reduction may be followed by a split with loss reduction. XGBoost can also perform leaf-wise tree growth (as LightGBM).

Normally it is impossible to enumerate all the possible tree structures q. A greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. Assume that I_L and I_R are the instance sets of left and right nodes after the split. Then the loss reduction after the split is given by,

![](https://i.imgur.com/jzyLh81.png)

## Creating DataFrame for Final Submission

In [None]:

sub = pd.read_csv('../input/santander-value-prediction-challenge/sample_submission.csv')

submission_lgb = pd.DataFrame()
submission_lgb['target'] = predictions_test

sub['target'] = submission_lgb['target']

## Creating Output file for Submission

In [None]:
print(sub.head())
sub.to_csv('sub_lgb.csv', index=False)