# CatBoost Classifier
CatBoost (Categorical Boosting) is a high-performance gradient boosting library that is particularly efficient in handling categorical features. Developed by Yandex, CatBoost aims to provide state-of-the-art results without extensive hyperparameter tuning and preprocessing.

## Advantages
- Handles Categorical Features: Efficiently handles categorical features without needing extensive preprocessing like one-hot encoding.
- High Accuracy: Often achieves high accuracy and is competitive with other leading gradient boosting libraries.
- Robust to Overfitting: Uses ordered boosting and supports early stopping on a validation set, both of which help reduce overfitting.
- Fast Training: Optimized for fast training and prediction times.
- Minimal Tuning Required: Provides good default settings that work well in many scenarios, reducing the need for extensive hyperparameter tuning.

## Disadvantages
- Complexity: An ensemble of many trees is harder to interpret than a simpler model such as a single decision tree or logistic regression.
- Requires Significant Resources: Can be resource-intensive in terms of memory and computation, especially for large datasets.
- Sensitive to Hyperparameters: While good defaults are provided, fine-tuning can still be crucial for best performance.

## Use Cases
- Finance: Fraud detection, credit scoring, and risk management.
- Marketing: Customer segmentation, churn prediction, and recommendation systems.
- Healthcare: Disease prediction and patient risk stratification.
- E-commerce: Product recommendation, inventory forecasting, and sales prediction.

## Scaling (not necessary)
CatBoost does not require feature scaling because it is based on decision tree algorithms which are not sensitive to the scale of the features.
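
The scale-insensitivity claim can be checked with a hand-written decision stump. This sketch is not CatBoost's split-finding code; it is a simplified stand-in, with made-up values, that shows why standardizing a feature leaves the chosen partition of the samples unchanged.

```python
def best_split_partition(values, labels):
    """Return which samples fall on the left side of the best threshold split."""
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for threshold in candidates:
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        # Misclassifications if each side predicts its majority class.
        errors = sum(min(side.count(0), side.count(1)) for side in (left, right))
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return tuple(val <= best[1] for val in values)


# Made-up feature values and binary labels.
raw = [120.0, 135.0, 150.0, 410.0, 455.0, 500.0]
labels = [0, 0, 0, 1, 1, 1]

# Standardize by hand: subtract the mean, divide by the standard deviation.
mean = sum(raw) / len(raw)
std = (sum((v - mean) ** 2 for v in raw) / len(raw)) ** 0.5
scaled = [(v - mean) / std for v in raw]

partition_raw = best_split_partition(raw, labels)
partition_scaled = best_split_partition(scaled, labels)
print(partition_raw == partition_scaled)
```

The threshold value itself changes after scaling, but the samples sent left and right do not, because standardization preserves the ordering of the values.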

## Encoding (not necessary)
CatBoost can handle categorical features directly without needing to encode them into numerical values. You can specify categorical features using the cat_features parameter.

# Import libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from catboost import CatBoostClassifier

# Read data

In [None]:
df = pd.read_csv('Breast_Cancer.csv')
# Features: every column except the target
x = df.drop('diagnosis', axis=1)
# Target: the diagnosis column
y = df['diagnosis']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Scale data

In [None]:
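# Optional for CatBoost: tree-based models are insensitive to feature scale.
# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)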

# Train

## Grid Search

In [None]:
# silent=True suppresses CatBoost's per-iteration training log
catboost_clf = CatBoostClassifier(random_state=42, silent=True)

# 3 * 3 * 3 * 4 = 108 combinations; with cv=5 that is 540 model fits
params = {
    'iterations': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'depth': [4, 6, 10],
    'l2_leaf_reg': [1, 3, 5, 7]
}

grid_search = GridSearchCV(catboost_clf, params, scoring='accuracy', cv=5, n_jobs=-1)

# Train the grid search
grid_search.fit(x_train, y_train)

In [None]:
print("Best Hyperparameter Index:", grid_search.best_index_)
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validated Score:", grid_search.best_score_)

In [None]:
# Get the model with best hyperparameters
model = grid_search.best_estimator_
y_pred = model.predict(x_test)
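
The predictions above are not scored yet. One way to evaluate them is sketched below with scikit-learn's metrics. The helper name summarize and the demo labels "B" and "M" are assumptions made for illustration; the actual label values in the diagnosis column may differ.

```python
from sklearn.metrics import accuracy_score, confusion_matrix


def summarize(y_true, y_pred, labels=None):
    """Hypothetical helper: accuracy plus a confusion matrix (rows = true class)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=labels),
    }


# Made-up labels standing in for y_test and y_pred.
y_true_demo = ["M", "B", "B", "M", "B", "M", "B", "B"]
y_pred_demo = ["M", "B", "M", "M", "B", "B", "B", "B"]

summary = summarize(y_true_demo, y_pred_demo, labels=["B", "M"])
print("Accuracy:", summary["accuracy"])
print(summary["confusion_matrix"])
```

In this notebook the equivalent call would be summarize(y_test, y_pred), run after the prediction cell.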

## Randomized Search

In [None]:
# silent=True suppresses CatBoost's per-iteration training log
catboost_clf = CatBoostClassifier(random_state=42, silent=True)

# Candidate values; n_iter=10 of these combinations are sampled at random
params = {
    'iterations': np.arange(50, 300, 50),
    'learning_rate': np.linspace(0.01, 0.2, 20),
    'depth': np.arange(4, 11),
    'l2_leaf_reg': np.arange(1, 10)
}

random_search = RandomizedSearchCV(catboost_clf, params, scoring='accuracy', n_iter=10, cv=5, n_jobs=-1, random_state=42)

# Train the random search
random_search.fit(x_train, y_train)

In [None]:
print("Best Hyperparameter Index:", random_search.best_index_)
print("Best Hyperparameters:", random_search.best_params_)
print("Best Cross-Validated Score:", random_search.best_score_)

In [None]:
model = random_search.best_estimator_
y_pred = model.predict(x_test)
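
The randomized search above samples from fixed arrays of candidate values. An alternative is to sample from continuous distributions, sketched here with scipy.stats and scikit-learn's ParameterSampler. The ranges mirror the ones used above, but the choice of a log-uniform distribution for learning_rate and l2_leaf_reg is an assumption, not something the notebook prescribes.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

# Distributions instead of fixed grids (assumed choices for illustration).
param_distributions = {
    "learning_rate": loguniform(0.01, 0.2),  # samples spread evenly on a log scale
    "depth": randint(4, 11),                 # integers from 4 to 10 inclusive
    "l2_leaf_reg": loguniform(1, 10),
    "iterations": [50, 100, 150, 200, 250],  # lists are sampled uniformly
}

# Draw 5 candidate settings without fitting any model.
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=42))
for candidate in samples:
    print(candidate)
```

The same param_distributions dictionary can be passed to RandomizedSearchCV in place of a dictionary of arrays.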

## Train CatBoostClassifier without search

In [None]:
# Fixed hyperparameters, no search. CatBoost uses iterations where scikit-learn uses n_estimators.
# min_samples_leaf and min_samples_split are scikit-learn arguments, so they are dropped here.
model = CatBoostClassifier(iterations=200, learning_rate=0.2, random_state=42, silent=True)
model.fit(x_train, y_train)