## CatBoost

- Short for "Categorical Boosting".
- Developed by Yandex and open-sourced in 2017.



### What makes it different from other gradient boosting techniques?


#### 1. Symmetric trees

CatBoost builds **symmetric (balanced) trees**, unlike XGBoost and LightGBM.



#### What are symmetric trees?



A conventional decision tree can use a different split condition at every node, so nodes on the same level usually test different conditions.

- A symmetric tree, by contrast, uses the same split condition for every node on a given level.

Let's see how it looks:



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/367/original/img.png?1706166816' width=800></center>




#### But won't using the same split condition make predictions worse?

Yes!
- This is how CatBoost keeps its base learners weak (and avoids overfitting).
- It then adds these learners sequentially to eventually make the final prediction.

#### How do these symmetric trees help ?

This balanced tree architecture allows an efficient CPU implementation and decreases prediction time.
- When every node on a level evaluates the same condition (in this case, Income > 10k), inference time is reduced
- compared to when different conditions have to be evaluated at the same level.

Hence, symmetric trees make CatBoost faster and help it avoid overfitting.
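To make the speed argument concrete, here is a minimal sketch of prediction with one symmetric tree. It assumes a toy representation (one `(feature, threshold)` pair per level and a flat list of leaf values); the `age` feature, the thresholds other than Income > 10k, and the leaf values are hypothetical, and this is not CatBoost's internal code.

```python
def predict_symmetric_tree(row, level_splits, leaf_values):
    """Predict with a symmetric tree.

    level_splits: one (feature_name, threshold) pair per level.
    leaf_values:  2 ** depth values, indexed by the bits of the split outcomes.
    """
    leaf_index = 0
    for feature, threshold in level_splits:
        # Every node on this level asks the same question,
        # so one comparison per level is enough.
        bit = 1 if row[feature] > threshold else 0
        leaf_index = (leaf_index << 1) | bit
    return leaf_values[leaf_index]


# Hypothetical depth-2 tree: level 0 tests income, level 1 tests age.
splits = [("income", 10_000), ("age", 30)]
leaves = [0.1, 0.2, 0.3, 0.4]

high_income_young = predict_symmetric_tree({"income": 12_000, "age": 25}, splits, leaves)
low_income_older = predict_symmetric_tree({"income": 5_000, "age": 40}, splits, leaves)
```

Because the leaf is found by building an index from `depth` comparisons, there is no branching on which child node to visit next, which is the property that makes this layout friendly to fast inference.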


#### 2. Categorical feature encoding support


Other boosting algorithms (GBDT, XGBoost, LightGBM) typically involve
- encoding the categorical features explicitly before inputting them.

This is not the case with CatBoost:
- you don't have to encode the categorical features yourself
- it does the categorical encoding implicitly.
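As a rough sketch of what this looks like in practice, the snippet below passes raw string categories to CatBoost and only tells it which column is categorical through `cat_features`. The toy rows and the helper name `fit_on_raw_categories` are made up for illustration, and the small `iterations` and `depth` values are arbitrary; the import is kept inside the function so the snippet can be read without the `catboost` package installed.

```python
# Toy rows: [color, height]. The color column is left as raw strings.
X = [
    ["Blue", 1.77],
    ["Red", 1.32],
    ["Green", 1.81],
    ["Blue", 1.56],
    ["Red", 1.64],
    ["Green", 1.61],
]
y = [1, 0, 1, 0, 1, 0]

# Index of the categorical column in each row.
CAT_FEATURES = [0]


def fit_on_raw_categories(X, y, cat_features):
    """Fit a small CatBoost model without encoding the categories by hand."""
    from catboost import CatBoostClassifier  # requires `pip install catboost`

    model = CatBoostClassifier(iterations=50, depth=2, verbose=0)
    model.fit(X, y, cat_features=cat_features)
    return model


# Usage (needs catboost installed):
# model = fit_on_raw_categories(X, y, CAT_FEATURES)
# model.predict([["Blue", 1.70]])
```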

#### Which categorical encodings does CatBoost support?

CatBoost supports some traditional methods of categorical data preprocessing, such as
- One-hot Encoding and
- Frequency Encoding.
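For reference, the two traditional encodings can be sketched in plain Python as follows. This only illustrates the ideas on a made-up `colors` column; it is not how CatBoost implements them internally.

```python
from collections import Counter


def one_hot_encode(values):
    """Return one 0/1 column per distinct category (categories sorted by name)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def frequency_encode(values):
    """Replace each category with the fraction of rows that share it."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]


colors = ["Blue", "Red", "Blue", "Green"]

one_hot = one_hot_encode(colors)      # columns are: Blue, Green, Red
frequency = frequency_encode(colors)  # Blue appears in 2 of 4 rows
```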

Note: It also **supports text data as input** (and encodes it internally).
- It tokenizes the text, creates a dictionary, and applies bag-of-words (BoW) based on the dictionary size.
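The three steps in the note (tokenize, build a dictionary, apply BoW) can be sketched like this. It assumes whitespace tokenization and a dictionary that keeps the most frequent tokens; CatBoost's actual tokenizer and dictionary options are configurable and may differ, so treat the function names and the sample texts as hypothetical.

```python
from collections import Counter


def build_dictionary(texts, max_size):
    """Keep the `max_size` most frequent tokens (ties broken alphabetically)."""
    counts = Counter(token for text in texts for token in text.lower().split())
    ranked = sorted(counts, key=lambda token: (-counts[token], token))
    return ranked[:max_size]


def bag_of_words(text, dictionary):
    """One 0/1 feature per dictionary token: is the token present in the text?"""
    tokens = set(text.lower().split())
    return [1 if token in tokens else 0 for token in dictionary]


texts = ["cats love boosting", "dogs love walks"]
dictionary = build_dictionary(texts, max_size=3)
features = [bag_of_words(text, dictionary) for text in texts]
```

The dictionary size caps the number of generated features, which is why the note says BoW is applied "based on the dictionary size".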



Extra read: https://catboost.ai/en/concepts/algorithm-main-stages_text-to-numeric


#### 3. Avoiding leakage using Ordered encoding

#### How do we do categorical encoding traditionally?


- 2 categories? => binary encoding
- More than 2? => one-hot encoding
- One-hot encoding increases the sparsity of the data => use target encoding
- But target encoding introduces leakage into the data,
    - as each row's encoded value is computed from the distribution of the target variable, which includes that row's own label.
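A small sketch makes the leakage visible. It uses plain mean target encoding on made-up data (the column values below are hypothetical): any category that appears only once is encoded as exactly its own label, so the feature hands the answer to the model.

```python
from collections import defaultdict


def mean_target_encode(categories, targets):
    """Plain target encoding: replace each category by its mean target value."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return [sums[category] / counts[category] for category in categories]


colors = ["Blue", "Red", "Green", "Blue"]
loves_troll_2 = [1, 0, 1, 0]

encoded = mean_target_encode(colors, loves_troll_2)
# "Red" and "Green" each appear once, so their encoded value
# equals their own target: the label has leaked into the feature.
```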


#### How to solve the issue when there are more than 2 categories?

To deal with this problem,
- the CatBoost authors introduced a novel technique to encode features, i.e. **ordered target encoding**.
- Ordered target encoding builds upon target encoding, but it takes into account the order of the data points.

Let's see how it does that.


#### Intuition behind ordered target encoding

Let's look at how ordered encoding works.

Say, we have the following data:



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/390/original/z.png?1706172944' width=600></center>






The data has 2 features and a target variable:
- Color (categorical)
- Height (numerical)
- Loves Troll 2 (target variable)

To encode the Color column using ordered target encoding,

1. It'll first shuffle the data (a random permutation).



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/391/original/z1.png?1706173231' width=800></center>








2. It'll then perform target encoding in such a way that,
    - to calculate the encoded value for the current datapoint's category,
    - it only considers the datapoints that come before the current datapoint.

Let's see that using an example:

Say, we have to encode the first datapoint:



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/392/original/z3.png?1706173622' width=800></center>










Once it is done encoding it, it'll move to the 2nd datapoint.





<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/393/original/z4.png?1706173842' width=800></center>




Similarly, it'll encode the categorical feature of the 3rd datapoint.



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/395/original/z5.png?1706174034' width=800></center>






Notice that
- there is an ordering in the encoding of datapoints,
- i.e. from top to bottom.

Hence the name: ordered target encoding.
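The whole procedure can be sketched in a few lines of Python. This is a simplified illustration, assuming the smoothed formula `(countInClass + prior) / (totalCount + 1)` from the CatBoost documentation, a binary target, and an arbitrary `prior` of 0.05; the toy data and the optional `order` argument (used here to make the walk-through reproducible) are not from the original example.

```python
import random


def ordered_target_encode(categories, targets, prior=0.05, order=None, seed=0):
    """Encode each row using only the rows that come before it in the permutation."""
    if order is None:
        # Step 1: a random permutation of the row indices.
        order = list(range(len(categories)))
        random.Random(seed).shuffle(order)

    count_in_class = {}  # category -> earlier rows with target == 1
    total_count = {}     # category -> earlier rows with this category
    encoded = [0.0] * len(categories)

    # Step 2: walk the permuted rows from top to bottom.
    for idx in order:
        category = categories[idx]
        seen = total_count.get(category, 0)
        positives = count_in_class.get(category, 0)
        encoded[idx] = (positives + prior) / (seen + 1)
        # The current row's own target is added only AFTER it has been encoded.
        total_count[category] = seen + 1
        count_in_class[category] = positives + targets[idx]
    return encoded


colors = ["Blue", "Red", "Blue", "Blue"]
loves_troll_2 = [1, 0, 0, 1]

# Fixed order so the result is easy to follow by hand.
encoded = ordered_target_encode(colors, loves_troll_2, order=[0, 1, 2, 3])
```

With this order, the first "Blue" row sees no earlier rows, the second sees one earlier "Blue" row with target 1, and the third sees two earlier "Blue" rows of which one has target 1. No row ever uses its own label, which is what removes the leakage.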

#### (Extra Read) How are the encoding values generated?

There are multiple options available for generating the encoding. You can refer to the following article if you want to dive deep into it:

- Documentation: https://catboost.ai/en/concepts/algorithm-main-stages_cat-to-numberic

If you want to dive deep into the internals of CatBoost, you can refer to the following resource:

- Research paper: https://arxiv.org/pdf/1706.09516.pdf



### Hyperparameters - CatBoost

CatBoost provides numerous hyperparameters to tune
- to the point that it can be overwhelming.



Documentation: https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier

Hyperparameter list: https://catboost.ai/en/docs/references/training-parameters/



<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/401/original/z6.png?1706174909' width=800></center>
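As a starting point, a handful of commonly tuned parameters can be collected in a dictionary and passed to the classifier. The parameter names below exist in CatBoost, but the values are arbitrary placeholders rather than recommendations, and the helper name `build_classifier` is made up for this sketch.

```python
PARAMS = {
    "iterations": 500,       # number of trees to build
    "learning_rate": 0.05,   # step size applied to each new tree
    "depth": 6,              # depth of each symmetric tree
    "l2_leaf_reg": 3.0,      # L2 regularization on leaf values
    "one_hot_max_size": 4,   # one-hot encode categorical features with at most this many distinct values
    "random_seed": 42,       # makes training reproducible
    "verbose": 0,            # silence per-iteration logging
}


def build_classifier(params=PARAMS):
    """Create a CatBoostClassifier from a parameter dictionary."""
    from catboost import CatBoostClassifier  # requires `pip install catboost`

    return CatBoostClassifier(**params)


# Usage (needs catboost installed):
# model = build_classifier()
# model.fit(X_train, y_train, cat_features=[0])
```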






