<a href="https://colab.research.google.com/github/samiha-mahin/A-Machine-Learning-Models-Repo/blob/main/CatBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CatBoost**

## 🌟 What is CatBoost?

**CatBoost** stands for **"Categorical Boosting."**
It’s a type of **gradient boosting** algorithm, just like XGBoost and LightGBM,
but it’s **specifically designed to handle categorical features** (like names, colors, job titles, etc.) natively, so you don’t have to encode them yourself.

---

## 🏡 Real-Life Example: Predicting Car Prices

Let’s say you’re building a model to predict car prices using:

* Brand (`Toyota`, `BMW`, `Ford`)
* Fuel type (`Petrol`, `Diesel`, `Electric`)
* Year
* Engine size
* Transmission (`Manual`, `Automatic`)

🔸 Most ML models can’t use the **text-type features** (brand, fuel, transmission) directly, because they expect numbers.
You have to convert them first, using techniques like one-hot or label encoding.

But...

> 💖 **CatBoost handles these text (categorical) features automatically and smartly!** No need for manual encoding.
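For comparison, here is a minimal sketch of what "manual encoding" usually means. It uses plain Python and a made-up `fuel` column (the five values are an assumption for illustration, not data from this notebook):

```python
# Hypothetical toy column: fuel type for five cars.
fuel = ["Petrol", "Diesel", "Electric", "Diesel", "Petrol"]

# Label encoding: map each category to an integer (alphabetical order here).
categories = sorted(set(fuel))
label_map = {cat: idx for idx, cat in enumerate(categories)}
label_encoded = [label_map[value] for value in fuel]

# One-hot encoding: one 0/1 column per category.
one_hot = [[1 if value == cat else 0 for cat in categories] for value in fuel]

print(categories)     # the column order used by both encodings
print(label_encoded)  # one integer per row
print(one_hot[0])     # one 0/1 vector per row
```

Label encoding invents an order between categories that may not mean anything, and one-hot encoding adds one column per category. Those are the chores CatBoost is meant to take off your hands.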

---

## 🧠 How CatBoost Works (Easily):

Just like Gradient Boosting:

1. Start with a simple prediction (like average price).
2. Find the **errors** (residuals).
3. Build a small tree to predict those errors.
4. Add that tree’s predictions to improve the model.
5. Repeat this many times.
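The five steps above can be sketched in plain Python. This is a toy version under simplifying assumptions: one numeric feature, squared-error loss, and one-split "stumps" instead of real trees. The helper names (`fit_stump`, `boost`) and the engine-size/price numbers are made up for illustration; this is not how CatBoost is implemented internally.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (squared error)."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                      # 1. start with the average
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):                     # 5. repeat many times
        residuals = [y - p for y, p in zip(ys, preds)]   # 2. find the errors
        stump = fit_stump(xs, residuals)                 # 3. small tree on the errors
        stumps.append(stump)
        preds = [p + learning_rate * stump(x)            # 4. add its predictions
                 for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Hypothetical data: engine size (litres) and price (thousands).
xs = [1.0, 1.4, 1.6, 2.0, 2.5, 3.0]
ys = [8.0, 10.0, 11.0, 15.0, 20.0, 26.0]

model = boost(xs, ys)
mean_price = sum(ys) / len(ys)
baseline_mse = sum((y - mean_price) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(f"baseline MSE: {baseline_mse:.2f}, boosted MSE: {boosted_mse:.2f}")
```

The small `learning_rate` means each stump only nudges the prediction, which is why many rounds are needed.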

BUT CatBoost adds **two special powers**:

---

### 🔮 1. Categorical Feature Magic ✨

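* It **accepts categorical features directly**: you only tell it which columns are categorical, and it does the encoding for you.
* It converts them to numbers using **target statistics** (like the typical price seen for each brand), not dummy variables. For each row, the statistic is computed only from rows that come earlier in a random ordering, so a row's own target never leaks into its encoding.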
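Here is a rough sketch of the "statistics instead of dummy variables" idea, in plain Python. It is a simplified illustration, not CatBoost's actual code: the real library works over random permutations of the rows and has more details. The function name `ordered_target_stats`, the `prior_weight` parameter, and the toy brand/price data are assumptions made for this example.

```python
def ordered_target_stats(categories, targets, prior, prior_weight=1.0):
    """Encode each row's category using only the targets of EARLIER rows."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Smoothed average of earlier targets for this category;
        # with no history yet, this falls back to the prior.
        encoded.append((seen_sum + prior_weight * prior) / (seen_count + prior_weight))
        # Only now add the current row, so it never sees its own target.
        sums[cat] = seen_sum + y
        counts[cat] = seen_count + 1
    return encoded


# Hypothetical data: brand and price (thousands).
brands = ["Toyota", "BMW", "Toyota", "BMW", "Toyota"]
prices = [10.0, 30.0, 12.0, 34.0, 14.0]
prior = sum(prices) / len(prices)  # overall mean price, used as the prior

encoded = ordered_target_stats(brands, prices, prior)
print(encoded)  # [20.0, 20.0, 15.0, 25.0, 14.0]
```

The first Toyota and the first BMW both get the prior (20.0) because nothing has been seen yet. Later rows drift toward each brand's own average as more history accumulates.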

---

### 🛡️ 2. Overfitting Protection

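* CatBoost uses a technique called **Ordered Boosting**: the error for each training row is estimated by a model that was not trained on that row, which keeps the model from memorizing the training data too much.
* Result: often **better generalization** on test data, especially on small datasets!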

---

## 🚀 Advantages of CatBoost

| Feature                     | Why It’s Great for You                            |
| --------------------------- | ------------------------------------------------- |
| 🐱 Handles Categorical Data | No need to encode — saves time and effort         |
| 🚀 Fast and Accurate        | Competes with LightGBM and XGBoost in speed       |
| 🛡️ Less Overfitting        | Ordered boosting reduces target leakage in training |
| 🧹 Clean Workflow           | Handles missing numeric values out of the box     |

---

## ✅ When Should You Use CatBoost?

* You have **lots of categorical features** (like cities, brands, types).
* You want **high accuracy** with **minimal data preprocessing**.
* You’re dealing with **tabular data** (rows and columns).
* You want a **plug-and-play model** with less manual work.

---

## 💡 Summary Table (Super Simple)

| Concept              | CatBoost Magic                                 |
| -------------------- | ---------------------------------------------- |
| Model Type           | Gradient Boosting                              |
| Special Skill        | Auto-handling of categorical features 🐱       |
| Overfitting Control  | Yes (ordered boosting)                         |
| Accuracy             | Very high                                      |
| Speed                | Fast (often a bit slower to train than LightGBM) |
| Preprocessing Needed | Very little (just mark the categorical columns) |

---

## 🔁 CatBoost vs XGBoost vs LightGBM (Quick View)

| Feature            | XGBoost     | LightGBM      | **CatBoost**       |
| ------------------ | ----------- | ------------- | ------------------ |
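| Handles Categories | ⚠️ Opt-in (`enable_categorical` with pandas `category` columns) | ⚠️ Yes, for integer-coded or pandas `category` columns | ✅ **Yes**, including raw strings |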
| Speed              | Fast        | **Very Fast** | Medium-Fast        |
| Accuracy           | High        | High          | **High**, often with little tuning |
| Overfitting        | Can overfit | Can overfit   | ✅ Less overfitting |




# **CatBoost on Titanic Dataset**

In [2]:
%pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl (99.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [3]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier

# Load Titanic dataset
data = pd.read_csv('/content/titanic.csv')

# Select features and target
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
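X = data[features].copy()  # copy so the fills below modify X itself, not a slice of data
y = data['Survived']

# Handle missing values (assign back instead of using chained inplace calls)
X['Age'] = X['Age'].fillna(X['Age'].mean())
X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])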

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify categorical features.
# With a pandas DataFrame, CatBoost accepts column names (or column indices) here.
cat_features = ['Pclass', 'Sex', 'Embarked']


# Initialize CatBoostClassifier
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    cat_features=cat_features,
    verbose=0,  # Set to 100 to see training output
    random_state=42
)

# Train model
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"CatBoost Accuracy on Titanic: {accuracy:.2f}")

CatBoost Accuracy on Titanic: 0.80
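
A note on `cat_features`: column names work in the cell above because `X_train` is a pandas DataFrame. If you pass a plain NumPy array instead, there are no column names, so you would give column positions. A minimal sketch of that conversion, assuming the same column order as the `features` list above (the variable names `cat_feature_names` and `cat_feature_indices` are made up for this example):

```python
# Same column order as the features list used to build X.
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
cat_feature_names = ['Pclass', 'Sex', 'Embarked']

# Look up the position of each categorical column.
cat_feature_indices = [features.index(name) for name in cat_feature_names]
print(cat_feature_indices)  # [0, 1, 6]
```

The resulting list could then be passed as `cat_features` in place of the names.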
