# XGBoost (Gradient Boosting)

## Problem Type
**XGBoost (eXtreme Gradient Boosting)** is a **supervised** learning algorithm, primarily used for:
- **Classification** problems
- **Regression** problems

### How XGBoost Works
- **Boosting technique:**
  - Sequentially builds models that correct the errors of previous models.
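  - Each new tree is fit to the errors left by the ensemble built so far (in gradient boosting, the negative gradients of the loss), rather than reweighting misclassified examples as AdaBoost does.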
- **Gradient boosting:**
  - Each tree is fit to the gradient of a differentiable loss function with respect to the current predictions (XGBoost also uses second-order information, the Hessian), so the same procedure works for both regression and classification tasks.
- **Weighted ensemble:**
  - The final prediction is the sum of all individual tree outputs, each scaled by the learning rate.
  - Boosting primarily reduces bias; shrinkage, subsampling, and regularization help keep variance in check, which improves generalization.
- **Regularization:**
  - Includes both L1 (Lasso) and L2 (Ridge) regularization to prevent overfitting.
- **Efficient computation:**
  - Uses tree pruning, parallelized split finding, and out-of-core computation to speed up training and to handle datasets that do not fit in memory.
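
The boosting loop described above can be sketched in plain Python. This is a toy illustration, not XGBoost's implementation: it assumes squared-error loss (where the negative gradient is just the residual), a single feature, and depth-1 trees. The helper names `fit_stump`, `predict_stump`, and `boost` are made up for this example.

```python
"""Toy gradient boosting for squared-error regression with depth-1 trees."""


def fit_stump(x, residuals):
    """Find the threshold on a 1-D feature that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Return the base prediction, the fitted stumps, and training MSE per round."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrink each tree's contribution by the learning rate.
        predictions = [
            pi + learning_rate * predict_stump(stump, xi)
            for xi, pi in zip(x, predictions)
        ]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)
        history.append(mse)
    return base, stumps, history


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
base, stumps, history = boost(x, y)
print(f"MSE after round 1: {history[0]:.4f}, after round 50: {history[-1]:.4f}")
```

The training error shrinks a little each round because every stump is fit to what the previous rounds left unexplained. A smaller `learning_rate` slows that descent, which is why it is usually paired with a larger number of rounds.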

### Key Tuning Parameters
- **`n_estimators`:**
  - **Description:** Number of boosting rounds (trees) to be built.
  - **Impact:** More trees generally improve performance but can lead to overfitting if too high.
  - **Default:** `100`.
- **`learning_rate`:**
  - **Description:** Shrinks the contribution of each tree by the learning rate.
  - **Impact:** Smaller values require more trees but often generalize better; typically in the range of `0.01` to `0.3`.
  - **Default:** `0.3`.
- **`max_depth`:**
  - **Description:** Maximum depth of a tree.
  - **Impact:** Deeper trees can capture more information but may lead to overfitting; often set between `3` and `10`.
  - **Default:** `6`.
- **`min_child_weight`:**
  - **Description:** Minimum sum of instance weight (hessian) needed in a child.
  - **Impact:** Higher values prevent overfitting by making the algorithm more conservative.
  - **Default:** `1`.
- **`subsample`:**
  - **Description:** Proportion of training data used to grow each tree.
  - **Impact:** Values less than `1.0` can reduce overfitting by adding randomness to each tree; typically in the range of `0.5` to `1.0`.
  - **Default:** `1.0`.
- **`colsample_bytree`:**
  - **Description:** Fraction of features used for each tree.
  - **Impact:** Reducing this can help prevent overfitting and speed up training.
  - **Default:** `1.0`.
- **`gamma`:**
  - **Description:** Minimum loss reduction required to make a further partition on a leaf node.
  - **Impact:** Higher values make the algorithm more conservative by preventing unnecessary splits.
  - **Default:** `0`.
- **`reg_alpha` and `reg_lambda`:**
  - **Description:** L1 and L2 regularization terms, respectively.
  - **Impact:** Higher values add more penalty, reducing overfitting.
  - **Default:** `reg_alpha` is `0` (no L1 penalty); `reg_lambda` is `1`.
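
To see how `reg_lambda`, `gamma`, and `min_child_weight` interact, the sketch below evaluates a single candidate split using the regularized gain formula from XGBoost's "Introduction to Boosted Trees" tutorial. It is a simplified illustration under stated assumptions: the function names are hypothetical, L1 regularization (`reg_alpha`) is left out, and the library's internal bookkeeping may differ in constant factors.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf value under L2 regularization: w* = -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda=1.0):
    """Quality term G^2 / (H + lambda) for one leaf."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right,
               reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Return the regularized gain of a split, or None if the split is not allowed.

    g_* and h_* are the sums of gradients and hessians of the rows in each child.
    """
    # min_child_weight bounds the hessian sum in each child.
    if h_left < min_child_weight or h_right < min_child_weight:
        return None
    children = leaf_score(g_left, h_left, reg_lambda) + leaf_score(
        g_right, h_right, reg_lambda
    )
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    # gamma is the minimum loss reduction a split has to achieve.
    return 0.5 * (children - parent) - gamma


# Hypothetical split: four rows per child, unit hessians (as with squared error).
print(split_gain(4.0, 4.0, -4.0, 4.0))                        # baseline settings
print(split_gain(4.0, 4.0, -4.0, 4.0, gamma=5.0))             # negative: not worth splitting
print(split_gain(4.0, 4.0, -4.0, 4.0, reg_lambda=7.0))        # larger lambda shrinks the gain
print(split_gain(4.0, 4.0, -4.0, 4.0, min_child_weight=5.0))  # None: children too light
```

A split is only worth keeping when its gain is positive, so raising `gamma` or `reg_lambda` prunes weaker splits, while `min_child_weight` rules out splits that would leave a child with too little data.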

### Pros vs Cons

| Pros                                                                | Cons                                                              |
|---------------------------------------------------------------------|-------------------------------------------------------------------|
| High predictive accuracy                                            | Computationally expensive; requires significant resources         |
| Built-in regularization to prevent overfitting                      | Complex model that can be difficult to interpret                  |
| Handles missing data internally                                     | Sensitive to noisy data                                           |
| Supports parallel processing for faster training                    | Requires careful tuning of many hyperparameters                   |
| Scales well to large datasets                                       | Tendency to overfit if `n_estimators` or `max_depth` are too high |
| Can handle imbalanced datasets via `scale_pos_weight` or sample weights | May require significant trial and error in tuning             |
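
For imbalanced binary problems, the XGBoost documentation suggests the ratio of negative to positive examples as a starting value for `scale_pos_weight`. The sketch below computes that ratio for a made-up label list; the helper name `suggested_scale_pos_weight` and the data are illustrative assumptions, and the result is a starting point to tune rather than a guaranteed optimum.

```python
from collections import Counter


def suggested_scale_pos_weight(labels):
    """Return (number of negatives) / (number of positives) for 0/1 labels."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("labels contain no positive examples")
    return counts[0] / counts[1]


# Hypothetical imbalanced labels: 90 negatives, 10 positives.
labels_example = [0] * 90 + [1] * 10
ratio = suggested_scale_pos_weight(labels_example)

# These settings could then be passed to xgb.XGBClassifier(**imbalance_params).
imbalance_params = {"objective": "binary:logistic", "scale_pos_weight": ratio}
print(imbalance_params)
```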

### Evaluation Metrics
- **Accuracy (Classification):**
  - **Description:** Ratio of correct predictions to total predictions.
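  - **Good Value:** Higher is better, but what counts as good depends on the task and class balance; compare against the accuracy of always predicting the most common class.
  - **Bad Value:** Accuracy at or below that majority-class baseline means the model adds nothing over a trivial predictor.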
- **Precision (Classification):**
  - **Description:** Proportion of positive identifications that were actually correct.
  - **Good Value:** Higher values indicate fewer false positives; important in cases with imbalanced classes.
  - **Bad Value:** Low values suggest many false positives.
- **Recall (Classification):**
  - **Description:** Proportion of actual positives that were correctly identified.
  - **Good Value:** Higher values indicate fewer false negatives; crucial in recall-sensitive applications.
  - **Bad Value:** Low values suggest many false negatives.
- **F1 Score (Classification):**
  - **Description:** Harmonic mean of Precision and Recall.
  - **Good Value:** Higher values indicate a good balance between Precision and Recall.
  - **Bad Value:** Low values suggest a poor balance, with either high false positives or false negatives.
- **Mean Squared Error (MSE) (Regression):**
  - **Description:** Average of the squared differences between predicted and actual values.
  - **Good Value:** Lower values indicate better model performance; values close to 0 are ideal. MSE is in squared units of the target, so judge it relative to the target's scale.
  - **Bad Value:** High values suggest poor predictive accuracy.
- **R-squared (R²) (Regression):**
  - **Description:** Proportion of variance in the dependent variable predictable from the independent variables.
  - **Good Value:** Closer to 1 indicates a strong model fit.
  - **Bad Value:** Values near 0 suggest the model does not explain much of the variance; negative values mean it performs worse than always predicting the mean.
- **Log Loss (Classification):**
  - **Description:** Negative log-likelihood of the true labels under the predicted class probabilities; confident wrong predictions are penalized heavily.
  - **Good Value:** Lower values are better; close to 0 indicates strong performance.
  - **Bad Value:** Higher values indicate inaccurate or poorly calibrated probabilities; a uniform guess over K classes scores ln(K).
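
The definitions above can be checked by hand with a few lines of plain Python. This is a minimal sketch for binary labels and small lists, with hypothetical function names; in practice `sklearn.metrics` provides tested versions of all of these.

```python
import math


def binary_classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; y_prob holds P(class == 1)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # Clip to avoid log(0).
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def regression_metrics(y_true, y_pred):
    """Return (MSE, R^2); assumes y_true is not constant."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return ss_res / n, 1 - ss_res / ss_tot


clf = binary_classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(clf)
print(binary_log_loss([1, 0], [0.5, 0.5]))        # Uniform guess over two classes.
print(regression_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # Predicting the mean.
print(regression_metrics([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # Worse than the mean.
```

The last two calls illustrate the R² scale: predicting the mean gives an R² of 0, and predictions that are worse than the mean give a negative value.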



In [None]:
import xgboost as xgb
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, classification_report, log_loss
from sklearn.model_selection import train_test_split

In [None]:
# Load the wine dataset
wine = load_wine()

# Separate features and target variable
X = wine.data  # Features
y = wine.target  # Target labels (wine type)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
# Define the XGBoost hyperparameters
params = {
    "objective": "multi:softprob",  # Multi-class classification with class probabilities
    "n_estimators": 1,  # Boosting rounds; each round builds one tree per class
    "learning_rate": 0.1,  # Learning rate
    "max_depth": 5,  # Maximum depth of each tree
    "min_child_weight": 3,  # Minimum sum of instance weight (hessian) in a child
    "subsample": 0.8,  # Proportion of training data used for each tree
    "colsample_bytree": 0.8,  # Fraction of features used for each tree
    "gamma": 0.1,  # Minimum loss reduction to make a further partition on a leaf node
    "reg_alpha": 10,  # L1 regularization term
    "reg_lambda": 1,  # L2 regularization term
    "random_state": 42,  # Random seed for reproducibility
}

# Initialize the XGBoost classifier with the specified parameters
model = xgb.XGBClassifier(**params)

# Train the model using the training data
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Get the predicted class probabilities
y_pred_proba = model.predict_proba(X_test)

In [None]:
# Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=wine.target_names)

print(f"Accuracy: {accuracy:.2f}")
print(f"Log Loss: {log_loss(y_test, y_pred_proba, labels=model.classes_):.4f}")
print("Classification Report:")
print(report)

In [None]:
booster = model.get_booster()

# Count the trees stored in the booster. For multi-class objectives XGBoost
# builds one tree per class per boosting round, so this equals
# n_estimators * number of classes rather than n_estimators alone.
num_trees_boost = booster.trees_to_dataframe()["Tree"].max() + 1

print(f"Total number of trees: {num_trees_boost}")

In [None]:
# Plot every tree in the booster (requires matplotlib and graphviz)
import matplotlib.pyplot as plt

for i in range(num_trees_boost):
    xgb.plot_tree(model, num_trees=i)
plt.show()