# CatBoost Regression
CatBoost (Categorical Boosting) is a gradient boosting library developed by Yandex. It handles categorical features directly, without extensive preprocessing. CatBoost builds an ensemble of decision trees (symmetric trees by default) in which each new tree corrects the errors of the trees before it, aiming for high accuracy with efficient training and prediction.

## Advantages:
- Handles Categorical Features: Directly processes categorical features without the need for extensive preprocessing like one-hot encoding.
- Robustness to Overfitting: Includes built-in mechanisms to reduce overfitting.
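- High Performance: Often competitive with or better than other gradient boosting libraries, particularly on data with many categorical features, although results depend on the dataset.
- Ease of Use: Requires little data preprocessing, and the default hyperparameters often give a reasonable baseline.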
- GPU Support: Can leverage GPU for faster training.

## Disadvantages:
- Complexity: Can be more complex to understand and interpret compared to simpler models.
- Computationally Intensive: Training can be computationally intensive, especially with large datasets.
- Parameter Tuning: While less intensive than some methods, it still requires careful tuning for optimal performance.

## Use Cases:
- Finance: Credit scoring, risk assessment, stock price prediction.
- Healthcare: Disease progression prediction, patient risk stratification.
- Marketing: Customer segmentation, response modeling, sales forecasting.
- E-commerce: Recommendation systems, customer churn prediction.

## Scaling
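- Scaling: No, scaling is not necessary for CatBoost. It is a tree-based model, and tree splits depend only on the ordering of feature values, so monotonic rescaling (standardization, min-max scaling) does not meaningfully change the fitted model.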
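
The scale invariance of threshold-based splits can be checked with a toy example. The sketch below uses a hypothetical helper, `best_split`, that fits a single-split regression stump; it is an illustration of the general idea, not CatBoost's actual split search. The raw feature and a rescaled copy lead to the same partition of the rows.

```python
def best_split(x, y):
    """Return the row indices sent left by the best single threshold.

    Tries every midpoint between consecutive sorted unique values and keeps
    the one with the lowest total squared error (hypothetical helper).
    """
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    uniq = sorted(set(x))
    best_err, best_left = float("inf"), frozenset()
    for lo, hi in zip(uniq, uniq[1:]):
        threshold = (lo + hi) / 2
        left = frozenset(i for i, v in enumerate(x) if v <= threshold)
        left_targets = [y[i] for i in left]
        right_targets = [y[i] for i in range(len(x)) if i not in left]
        err = sse(left_targets) + sse(right_targets)
        if err < best_err:
            best_err, best_left = err, left
    return best_left


x_raw = [1, 2, 3, 4, 5]
y = [10, 20, 15, 25, 30]
# Arbitrary linear rescaling, similar in spirit to standardization
x_scaled = [(v - 3) / 1.5 for v in x_raw]

same_partition = best_split(x_raw, y) == best_split(x_scaled, y)
print(same_partition)  # True
```

Because the rows end up in the same leaves either way, the leaf averages and therefore the predictions are unchanged by the rescaling.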

## Encoding
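- Encoding: No, manual encoding such as one-hot encoding is not required. Pass the categorical column names or indices through `cat_features`, and CatBoost encodes them internally using target-based statistics (and one-hot encoding for low-cardinality features).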
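
To give a feel for target-based encoding, here is a simplified sketch of ordered target statistics. Each row's category is replaced by a smoothed mean of the targets of earlier rows with the same category, so a row never uses its own target. The function name `ordered_target_stats`, the `prior_weight` parameter, and the use of the global target mean as the prior are assumptions for illustration; CatBoost itself uses random permutations and its own prior settings.

```python
def ordered_target_stats(categories, targets, prior_weight=1.0):
    """Encode categories with statistics from preceding rows only.

    Simplified illustration: rows are processed in their given order
    (no random permutation), and the prior is the global target mean.
    """
    prior = sum(targets) / len(targets)
    sums, counts = {}, {}
    encoded = []
    for cat, target in zip(categories, targets):
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Smoothed mean of targets seen so far for this category
        encoded.append((seen_sum + prior_weight * prior) / (seen_count + prior_weight))
        # Only now does the current row's target become visible to later rows
        sums[cat] = seen_sum + target
        counts[cat] = seen_count + 1
    return encoded


cats = ["A", "B", "A", "B", "C"]
targets = [10, 20, 15, 25, 30]
encoded = ordered_target_stats(cats, targets)
print(encoded)  # [20.0, 20.0, 15.0, 20.0, 20.0]
```

The first occurrence of each category falls back to the prior (20.0 here), while later occurrences blend the prior with the targets already seen.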

In [5]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error


In [6]:
# Sample dataset (replace with your data)
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': ['A', 'B', 'A', 'B', 'C'],
    'target': [10, 20, 15, 25, 30]
})


In [7]:
# Features and target
X = data.drop('target', axis=1)
y = data['target']

# Column names by type; the categorical columns are passed to CatBoost via cat_features
categorical_features = ['feature2']
numeric_features = ['feature1']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Train

## Grid Search

In [None]:

cat_boost = CatBoostRegressor(verbose=0, random_state=42)

# Parameter grid for GridSearchCV.
# Note: this grid has 120,000 combinations (360,000 fits with cv_splits = 3),
# so trim it to a few values per parameter before running it on real data.
param_grid = {
    'iterations': [100, 200, 300, 500, 1000],
    'depth': [4, 6, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.15],
    'l2_leaf_reg': [1, 3, 5, 7, 9],
    'bagging_temperature': [0, 0.5, 1, 1.5, 2],
    'random_strength': [0, 0.5, 1, 1.5, 2],
    'border_count': [32, 64, 128, 254],
    'grow_policy': ['SymmetricTree', 'Depthwise', 'Lossguide']
}

# Cap the number of cross-validation folds so it never exceeds the number of training samples
cv_splits = min(len(y_train), 3)

# GridSearchCV
grid_search = GridSearchCV(cat_boost, param_grid, cv=cv_splits, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train,  cat_features=categorical_features)
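
Before launching an exhaustive search, it helps to count how many models it will train. The sketch below multiplies the number of values per parameter in the grid used above and then multiplies by the number of folds; the variable names `n_candidates` and `n_fits` are illustrative.

```python
from math import prod

param_grid = {
    'iterations': [100, 200, 300, 500, 1000],
    'depth': [4, 6, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.15],
    'l2_leaf_reg': [1, 3, 5, 7, 9],
    'bagging_temperature': [0, 0.5, 1, 1.5, 2],
    'random_strength': [0, 0.5, 1, 1.5, 2],
    'border_count': [32, 64, 128, 254],
    'grow_policy': ['SymmetricTree', 'Depthwise', 'Lossguide']
}

# An exhaustive grid search evaluates every combination of values
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is trained once per cross-validation fold
cv_splits = 3
n_fits = n_candidates * cv_splits

print(n_candidates, n_fits)  # 120000 360000
```

With hundreds of thousands of fits, a grid of this size is rarely practical; a smaller grid or a randomized search is usually a more realistic starting point.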


In [None]:
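print("Best Hyperparameter Index:", grid_search.best_index_)
print("Best Hyperparameters:", grid_search.best_params_)
# With scoring='neg_mean_squared_error' the score is the negated MSE, so values closer to 0 are better
print("Best Cross-Validated Score (negative MSE):", grid_search.best_score_)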
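
Because the search uses `neg_mean_squared_error`, the reported best score is negative. A small helper (hypothetical name `neg_mse_to_rmse`) can convert it to an RMSE in the units of the target; the example value below is made up for illustration.

```python
import math


def neg_mse_to_rmse(neg_mse):
    """Convert a negated-MSE score into a root mean squared error."""
    return math.sqrt(-neg_mse)


example_score = -25.0  # hypothetical best_score_ value, not a real result
print(neg_mse_to_rmse(example_score))  # 5.0
```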

## Randomized Search

In [None]:

cat_boost = CatBoostRegressor(verbose=0, random_state=42)

# Parameter distribution for RandomizedSearchCV
# (n_iter combinations are sampled from these values instead of trying all of them)
param_dist = {
    'iterations': np.arange(100, 1001, 100),
    'depth': np.arange(4, 11),
    'learning_rate': np.linspace(0.01, 0.2, 20),
    'l2_leaf_reg': np.arange(1, 10, 1),
    'bagging_temperature': np.linspace(0, 2, 20),
    'random_strength': np.linspace(0, 2, 20),
    'border_count': np.arange(32, 255, 32),
    'grow_policy': ['SymmetricTree', 'Depthwise', 'Lossguide']
}

# Cap the number of cross-validation folds so it never exceeds the number of training samples
cv_splits = min(len(y_train), 3)

# RandomizedSearchCV
random_search = RandomizedSearchCV(cat_boost, param_dist, n_iter=200, cv=cv_splits, scoring='neg_mean_squared_error', random_state=42)
random_search.fit(X_train, y_train,  cat_features=categorical_features)
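
The idea behind a randomized search can be sketched in a few lines: draw a fixed number of parameter combinations at random instead of enumerating all of them. The helper `sample_candidates` below is hypothetical and simplified; it samples with replacement from a reduced search space, whereas scikit-learn samples without replacement when every parameter is given as a list of values.

```python
import random


def sample_candidates(space, n_iter, seed=42):
    """Draw n_iter random parameter combinations from lists of values."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


# Reduced, illustrative search space
space = {
    "depth": list(range(4, 11)),
    "learning_rate": [round(0.01 * i, 2) for i in range(1, 21)],
    "grow_policy": ["SymmetricTree", "Depthwise", "Lossguide"],
}

candidates = sample_candidates(space, n_iter=5)
for params in candidates:
    print(params)
```

The cost of the search is then controlled by `n_iter` rather than by the size of the space, which is why it scales better than an exhaustive grid.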


In [None]:
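print("Best Hyperparameter Index:", random_search.best_index_)
print("Best Hyperparameters:", random_search.best_params_)
# With scoring='neg_mean_squared_error' the score is the negated MSE, so values closer to 0 are better
print("Best Cross-Validated Score (negative MSE):", random_search.best_score_)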

In [None]:
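# Evaluate the best model from the grid search on the test set
# (use random_search.best_estimator_ to evaluate the randomized search instead)
best_model = grid_search.best_estimator_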
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on Test Set: {mse}")
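
MSE is expressed in squared target units, so it is often reported alongside RMSE and MAE, which are in the same units as the target. The sketch below computes all three with a hypothetical helper, `regression_errors`, on made-up values rather than real model output.

```python
import math


def regression_errors(y_true, y_pred):
    """Return MSE, RMSE and MAE for paired true and predicted values."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / len(residuals)
    mae = sum(abs(r) for r in residuals) / len(residuals)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


# Made-up values for illustration only
errors = regression_errors([10, 20, 30], [12, 18, 33])
print(errors)
```

RMSE penalizes large errors more heavily than MAE, so a large gap between the two can hint at a few badly predicted rows.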

## Train cat_boost without search