# Chapter 1

### XGBoost

- XGBoost under the hood:
    - It is an ensemble method that combines many weak base learners (CART trees) into one strong learner
    - A weak base learner only needs to be slightly better at prediction than pure random chance
    - Each learner's contribution is expressed as a weight
    - The final output is the weighted sum of the predictions of all weak learners
    - (weight1 * model1 + weight2 * model2 + ... = output)
- Advantages:
    - Fast and efficient
    - Core algorithm is parallelizable
    - Often outperforms single-algorithm methods on tabular data
    - State-of-the-art performance in many ML tasks
    - Uses CART (Classification and Regression Trees) as base learners
    - An ordinary decision tree stores a decision value (e.g. a class label) at each leaf, and a single such tree tends to overfit
    - An XGBoost tree stores a real-valued score at each leaf; for classification, that score can be thresholded to obtain a class
- When to use:
    - Large training sets (rule of thumb: more than 1000 samples)
    - Moderate number of features (rule of thumb: fewer than 100)
    - Only numeric features, or a mixture of numeric and categorical features
- When not to use:
    - image processing
    - natural language processing
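
The weighted-sum idea from the list above can be sketched in plain Python. This is a toy illustration, not XGBoost's actual training procedure: the three stumps, their split thresholds, leaf scores, and weights are hand-picked assumptions, whereas XGBoost learns its trees one after another from data. The sketch only shows how real-valued leaf scores are combined and then thresholded for classification.

```python
import math

# Hypothetical weak learners: depth-1 trees (stumps) that return a
# real-valued leaf score instead of a class label.
def stump_age(row):
    return 0.8 if row["age"] > 40 else -0.6

def stump_income(row):
    return 0.5 if row["income"] > 50_000 else -0.4

def stump_visits(row):
    return 0.3 if row["visits"] > 3 else -0.2

# Hand-picked weights, one per learner (assumed for illustration only).
learners = [(1.0, stump_age), (0.5, stump_income), (0.5, stump_visits)]

def ensemble_score(row):
    """Weighted sum of the weak learners' scores: w1*m1 + w2*m2 + ..."""
    return sum(weight * model(row) for weight, model in learners)

def predict_proba(row):
    """Squash the raw score into (0, 1) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-ensemble_score(row)))

def predict(row, threshold=0.5):
    """Threshold the probability to get a class label (0 or 1)."""
    return int(predict_proba(row) >= threshold)

older = {"age": 55, "income": 30_000, "visits": 5}
younger = {"age": 25, "income": 30_000, "visits": 1}
print(ensemble_score(older), predict(older))
print(ensemble_score(younger), predict(younger))
```

No single stump decides the outcome here: for `older`, the income stump votes negative but is outweighed by the other two.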

```
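import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X (feature matrix) and y (binary labels) are assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)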
# Simple fit-predict
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, random_state=123)
xg_cl.fit(X_train, y_train)
preds = xg_cl.predict(X_test)

# Cross validation (Method 1)
dmatrix = xgb.DMatrix(data=X_train, label=y_train)
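params = {"objective": "binary:logistic", "max_depth": 4}
# Note: early stopping cannot trigger here, because early_stopping_rounds
# is not smaller than num_boost_round
cv_results = xgb.cv(dtrain=dmatrix, params=params, nfold=4, num_boost_round=10,
                    metrics="error", as_pandas=True, stratified=True,
                    early_stopping_rounds=10, verbose_eval=1)
# Cross-validated accuracy at the last kept boosting round
accuracy_cv = 1 - cv_results['test-error-mean'].iloc[-1]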
# Train the final model with the best number of boosting rounds
best_num_boost_round = len(cv_results)
final_model = xgb.train(params, dmatrix, num_boost_round=best_num_boost_round)
# Make predictions on the testing dataset
dtest = xgb.DMatrix(X_test)
y_pred_prob = final_model.predict(dtest)
y_pred_binary = np.round(y_pred_prob)  # Convert probabilities to binary predictions
accuracy_final = accuracy_score(y_test, y_pred_binary)

# Cross validation (Method 2)
from sklearn.model_selection import cross_val_score, StratifiedKFold
xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(xgb_model, X_train, y_train, cv=cv, scoring='accuracy')
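from sklearn.model_selection import cross_val_predict
# Out-of-fold predictions on the training data (StratifiedKFold needs the labels)
y_pred_cv = cross_val_predict(xgb_model, X_train, y_train, cv=cv)
accuracy_cv = accuracy_score(y_train, y_pred_cv)
# Refit on the full training set, then evaluate once on the held-out test set
xgb_model.fit(X_train, y_train)
accuracy_final = accuracy_score(y_test, xgb_model.predict(X_test))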
```
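
Method 1 relies on `early_stopping_rounds` to choose the number of boosting rounds. The selection rule can be sketched in plain Python as below. This is a simplified illustration of the idea, not XGBoost's implementation; the function name `best_round_with_early_stopping`, the `patience` parameter, and the error values are made up for the example.

```python
def best_round_with_early_stopping(errors, patience):
    """Return how many rounds to keep, given per-round validation errors.

    The loop stops once the error has not improved for `patience`
    consecutive rounds; the rounds up to the best one are kept.
    """
    best_error = float("inf")
    best_index = -1
    for i, error in enumerate(errors):
        if error < best_error:
            best_error, best_index = error, i
        elif i - best_index >= patience:
            break  # no improvement for `patience` rounds: stop early
    return best_index + 1  # rounds are counted from 1

# Hypothetical mean test errors, one value per boosting round
test_errors = [0.30, 0.25, 0.22, 0.21, 0.23, 0.22, 0.24, 0.20]
print(best_round_with_early_stopping(test_errors, patience=3))
print(best_round_with_early_stopping(test_errors, patience=10))
```

With a small `patience` the loop stops before it reaches the later improvement in the last round, while a large `patience` scans the whole series. The value is a trade-off between training time and the risk of stopping too soon.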