# XGBoost Classifier
XGBoost (Extreme Gradient Boosting) is an optimized, distributed gradient boosting library designed to be efficient, flexible, and portable. It implements parallel tree boosting (also known as GBDT or GBM), which solves many data science problems quickly and accurately.

## Advantages
- High Performance: Known for its training speed and predictive accuracy; on tabular data it is often competitive with or better than other gradient boosting implementations.
- Scalability: Can handle large datasets efficiently due to its distributed computing capabilities.
- Flexibility: Supports multiple objective functions, including regression, classification, and ranking.
- Regularization: Includes built-in L1 and L2 regularization to prevent overfitting.
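- Tree Pruning: Grows each tree up to `max_depth`, then prunes back splits whose loss reduction falls below the `gamma` (`min_split_loss`) threshold, which helps avoid overfitting and produces more generalizable models.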

## Disadvantages
- Complexity: More complex to understand and tune compared to simpler models.
- Resource Intensive: Can be resource-intensive in terms of memory and computation, especially for large datasets.
- Sensitive to Hyperparameters: Requires careful tuning of hyperparameters to achieve optimal performance.

## Use Cases
- Finance: Credit scoring, fraud detection, and risk management.
- Marketing: Customer segmentation, churn prediction, and recommendation systems.
- Healthcare: Disease prediction, patient risk stratification, and diagnostic analysis.
- E-commerce: Product recommendation, inventory forecasting, and sales prediction.

## Scaling (not required)
XGBoost does not require feature scaling because it is based on decision tree algorithms, which are not sensitive to the scale of the features.
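The scale invariance can be illustrated on a single feature. The sketch below is a simplification: it uses a Gini-based split search written for this example (XGBoost itself scores splits with a gradient-based gain), and the data is synthetic. The point is only that standardizing a feature keeps the ordering of its values, so the best threshold separates the same samples.

```python
import numpy as np

def best_gini_split(feature, labels):
    """Return the indices of the samples sent left by the best Gini split."""
    order = np.argsort(feature, kind="stable")
    sorted_labels = labels[order]
    n = len(labels)
    best_score, best_k = np.inf, None
    # Try every split position k: the k smallest feature values go left.
    for k in range(1, n):
        left, right = sorted_labels[:k], sorted_labels[k:]
        p_left, p_right = left.mean(), right.mean()
        gini_left = 2 * p_left * (1 - p_left)
        gini_right = 2 * p_right * (1 - p_right)
        score = (k * gini_left + (n - k) * gini_right) / n
        if score < best_score:
            best_score, best_k = score, k
    return set(order[:best_k].tolist())

# Synthetic feature on a large scale, with noisy binary labels.
rng = np.random.default_rng(0)
raw = rng.normal(loc=5000.0, scale=1000.0, size=200)
labels = (raw + rng.normal(scale=500.0, size=200) > 5000.0).astype(int)

# Standardize the feature (zero mean, unit variance).
scaled = (raw - raw.mean()) / raw.std()

same_partition = best_gini_split(raw, labels) == best_gini_split(scaled, labels)
print("Same left/right partition after scaling:", same_partition)
```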

## Encoding (necessary)
Categorical features need to be encoded into numerical values (for example with one-hot encoding), and string class labels need to be mapped to integers.
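A minimal encoding sketch follows. The toy frame is hypothetical: the `menopause` column is invented for illustration, and the `M`/`B` label strings are an assumption about how `diagnosis` might be stored, not something the notebook confirms.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the real Breast_Cancer.csv columns may differ.
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "menopause": ["pre", "post", "pre", "post"],  # invented categorical feature
    "radius_mean": [17.99, 13.54, 12.45, 20.57],
})

# Map string class labels to integers (classes are sorted alphabetically).
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(toy["diagnosis"])

# One-hot encode the categorical feature; numeric columns pass through unchanged.
x_encoded = pd.get_dummies(toy.drop(columns="diagnosis"), columns=["menopause"])

mapping = {cls: int(code) for code, cls in enumerate(label_encoder.classes_)}
print("Label mapping:", mapping)
print("Feature columns:", x_encoded.columns.tolist())
```

`label_encoder.inverse_transform` converts predicted integers back to the original labels when results need to be reported.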

# Import library

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

# Read data

In [2]:
df = pd.read_csv('Breast_Cancer.csv')
# Separate the features from the target column
x = df.drop('diagnosis', axis=1)
y = df['diagnosis']

In [3]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

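# Scale data (optional)

In [4]:
# Optional for XGBoost: tree splits do not depend on feature scale.
# If kept, fit the scaler on the training split only to avoid test-set leakage.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)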

# Train

## Grid Search

In [5]:
# `diagnosis` is a binary target, so binary log loss is the matching metric.
# `use_label_encoder` is deprecated in recent XGBoost releases and is omitted.
xgb_clf = xgb.XGBClassifier(random_state=42, eval_metric='logloss')

params = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

grid_search = GridSearchCV(xgb_clf, params, scoring='accuracy', cv=5, n_jobs=-1, verbose=2)

# Train the grid search
grid_search.fit(x_train, y_train)  

Fitting 5 folds for each of 324 candidates, totalling 1620 fits
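The counts in the log line above follow directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A short check using the same grid:

```python
from math import prod

params = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

# One candidate per combination of values: 3 * 3 * 3 * 3 * 2 * 2
n_candidates = prod(len(values) for values in params.values())

# Every candidate is trained once per cross-validation fold.
cv_folds = 5
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
# prints: 324 candidates x 5 folds = 1620 fits
```

Because the count grows multiplicatively, adding a single three-value hyperparameter to this grid would triple the number of fits.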


In [6]:
print("Best Hyperparameter Index:", grid_search.best_index_)
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validated Score:", grid_search.best_score_)

Best Hyperparameter Index: 63
Best Hyperparameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 100, 'subsample': 1.0}
Best Cross-Validated Score: 0.9736263736263735


In [7]:
# Get the model with best hyperparameters
model = grid_search.best_estimator_
y_pred = model.predict(x_test)
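The predictions are computed above but not scored. One way to summarize them is a small helper around scikit-learn's metrics; the helper name `summarize` and the demo arrays below are made up for illustration and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def summarize(y_true, y_pred):
    """Print accuracy, confusion matrix, and per-class metrics; return accuracy."""
    accuracy = accuracy_score(y_true, y_pred)
    print(f"Accuracy: {accuracy:.3f}")
    # Rows are true classes, columns are predicted classes.
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred))
    return accuracy

# Made-up stand-in labels, only to show the call.
demo_true = [0, 1, 1, 0, 1]
demo_pred = [0, 1, 0, 0, 1]
demo_accuracy = summarize(demo_true, demo_pred)
```

In the notebook the call would be `summarize(y_test, y_pred)`. For a medical target such as `diagnosis`, the per-class recall in the report is usually more informative than accuracy alone.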

## Randomized Search

In [8]:
# `diagnosis` is a binary target, so binary log loss is the matching metric.
# `use_label_encoder` is deprecated in recent XGBoost releases and is omitted.
xgb_clf = xgb.XGBClassifier(random_state=42, eval_metric='logloss')

params = {
    'n_estimators': np.arange(50, 200, 10),
    'learning_rate': np.linspace(0.01, 0.2, 20),
    'max_depth': np.arange(3, 10),
    'min_child_weight': np.arange(1, 10),
    'subsample': np.linspace(0.7, 1.0, 5),
    'colsample_bytree': np.linspace(0.7, 1.0, 5)
}

random_search = RandomizedSearchCV(xgb_clf, params, scoring='accuracy', n_iter=100, cv=5, random_state=42, n_jobs=-1, verbose=2)

# Train the random search
random_search.fit(x_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
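The search above samples from discrete arrays. An alternative is to sample from continuous distributions, which `RandomizedSearchCV` also accepts. The sketch below is an assumption about how such a space might look, with ranges chosen to mirror the arrays above; it uses `ParameterSampler` to draw a few candidates without fitting any model.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import ParameterSampler

# uniform(loc, scale) covers [loc, loc + scale]; randint(low, high) excludes high.
param_dist = {
    "n_estimators": randint(50, 200),        # integers in [50, 199]
    "learning_rate": loguniform(0.01, 0.2),  # log-uniform over [0.01, 0.2]
    "max_depth": randint(3, 10),             # integers in [3, 9]
    "min_child_weight": randint(1, 10),      # integers in [1, 9]
    "subsample": uniform(0.7, 0.3),          # uniform over [0.7, 1.0]
    "colsample_bytree": uniform(0.7, 0.3),   # uniform over [0.7, 1.0]
}

# Draw 5 candidate settings, the same way RandomizedSearchCV samples internally.
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=42))
for candidate in samples:
    print(candidate)
```

A dictionary like `param_dist` can be passed to `RandomizedSearchCV` in place of `params`. The log-uniform distribution spreads learning-rate samples evenly across orders of magnitude instead of concentrating them near the upper end.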


In [9]:
print("Best Hyperparameter Index:", random_search.best_index_)
print("Best Hyperparameters:", random_search.best_params_)
print("Best Cross-Validated Score:", random_search.best_score_)

Best Hyperparameter Index: 37
Best Hyperparameters: {'subsample': 0.85, 'n_estimators': 140, 'min_child_weight': 5, 'max_depth': 5, 'learning_rate': 0.14, 'colsample_bytree': 0.925}
Best Cross-Validated Score: 0.9736263736263737


In [10]:
model = random_search.best_estimator_
y_pred = model.predict(x_test)

## Train XGBClassifier without search

In [11]:
# Refit directly with the best hyperparameters found by the grid search
model = xgb.XGBClassifier(random_state=42, eval_metric='logloss', colsample_bytree=0.8, learning_rate=0.1,
                          max_depth=3, min_child_weight=3, n_estimators=100, subsample=1.0)
model.fit(x_train, y_train)