# XGBoost Gradient Boosting - Complete Guide

<!--
Project: XGBoost Gradient Boosting
Author: Molla Samser (Founder)
Designer & Tester: Rima Khatun
Website: https://rskworld.in
Email: help@rskworld.in, support@rskworld.in
Phone: +91 93305 39277
Address: Nutanhat, Mongolkote, Purba Burdwan, West Bengal, India, 713147
GitHub: https://github.com/rskworld
-->

This comprehensive guide demonstrates advanced XGBoost techniques including:
- Model training and evaluation
- Hyperparameter tuning
- Cross-validation
- Feature importance analysis
- Model interpretation with SHAP


In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# XGBoost and sklearn
import xgboost as xgb
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.datasets import make_classification, make_regression
# LabelEncoder available in utils/data_loader.py if needed

# Model interpretation
try:
    import shap
    SHAP_AVAILABLE = True
except ImportError:
    SHAP_AVAILABLE = False
    print("SHAP not available. Install with: pip install shap")

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"XGBoost version: {xgb.__version__}")


## 1. Data Preparation

We'll create synthetic datasets for both classification and regression tasks to demonstrate XGBoost capabilities.


In [None]:
# Create classification dataset
X_class, y_class = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    random_state=42
)

# Create regression dataset
X_reg, y_reg = make_regression(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    noise=10,
    random_state=42
)

# Convert to DataFrames for better visualization
feature_names = [f'feature_{i+1}' for i in range(20)]
df_class = pd.DataFrame(X_class, columns=feature_names)
df_class['target'] = y_class

df_reg = pd.DataFrame(X_reg, columns=feature_names)
df_reg['target'] = y_reg

print("Classification Dataset:")
print(df_class.head())
print(f"\nShape: {df_class.shape}")
print(f"Target distribution:\n{df_class['target'].value_counts()}")

print("\n" + "="*50)
print("\nRegression Dataset:")
print(df_reg.head())
print(f"\nShape: {df_reg.shape}")
print(f"Target statistics:\n{df_reg['target'].describe()}")


## 2. Basic XGBoost Classification Model


In [None]:
# Split classification data
X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42, stratify=y_class
)

# Create and train XGBoost classifier
xgb_classifier = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss'
)

# Train the model
xgb_classifier.fit(
    X_train_class, y_train_class,
    eval_set=[(X_test_class, y_test_class)],
    verbose=False
)

# Make predictions
y_pred_class = xgb_classifier.predict(X_test_class)
y_pred_proba = xgb_classifier.predict_proba(X_test_class)[:, 1]

# Evaluate
accuracy = accuracy_score(y_test_class, y_pred_class)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test_class, y_pred_class))


## 3. Basic XGBoost Regression Model


In [None]:
# Split regression data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Create and train XGBoost regressor
xgb_regressor = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    eval_metric='rmse'
)

# Train the model
xgb_regressor.fit(
    X_train_reg, y_train_reg,
    eval_set=[(X_test_reg, y_test_reg)],
    verbose=False
)

# Make predictions
y_pred_reg = xgb_regressor.predict(X_test_reg)

# Evaluate
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R² Score: {r2:.4f}")


## 4. Hyperparameter Tuning with GridSearchCV


In [None]:
# Define parameter grid
param_grid = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create base model
xgb_base = xgb.XGBClassifier(
    objective='binary:logistic',
    random_state=42,
    eval_metric='logloss'
)

# Grid search with cross-validation
grid_search = GridSearchCV(
    estimator=xgb_base,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    verbose=1
)

# Fit grid search
grid_search.fit(X_train_class, y_train_class)

# Best parameters and score
print("Best Parameters:")
print(grid_search.best_params_)
print(f"\nBest Cross-Validation Score: {grid_search.best_score_:.4f}")

# Evaluate on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test_class, y_test_class)
print(f"Test Set Accuracy: {test_accuracy:.4f}")
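
Grid search cost grows multiplicatively with each parameter list, so it helps to count the fits before launching the search. A small sketch, using scikit-learn's `ParameterGrid` on the same grid as above and assuming the 5-fold setting from this section:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Number of distinct parameter combinations: 3 * 3 * 3 * 2 * 2
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is trained once per fold (cv=5 above), before the final refit
n_fits = n_candidates * 5

print(f"Candidates: {n_candidates}")  # 108
print(f"Cross-validation fits: {n_fits}")  # 540
```

If that count is too high for the available time, shrinking the grid or sampling it with `RandomizedSearchCV` are common alternatives.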


## 5. Cross-Validation Techniques


In [None]:
# K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Create model
xgb_cv = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss'
)

# Perform cross-validation
cv_scores = cross_val_score(
    xgb_cv, X_class, y_class,
    cv=kfold,
    scoring='accuracy',
    n_jobs=-1
)

print("Cross-Validation Results:")
print(f"Mean Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print(f"Individual Scores: {cv_scores}")
print(f"Min Score: {cv_scores.min():.4f}")
print(f"Max Score: {cv_scores.max():.4f}")

# Visualize CV scores
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cv_scores) + 1), cv_scores, 'o-', linewidth=2, markersize=8)
plt.axhline(y=cv_scores.mean(), color='r', linestyle='--', label=f'Mean: {cv_scores.mean():.4f}')
plt.fill_between(range(1, len(cv_scores) + 1),
                 cv_scores.mean() - cv_scores.std(),
                 cv_scores.mean() + cv_scores.std(),
                 alpha=0.2, color='gray')
plt.xlabel('Fold', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('K-Fold Cross-Validation Scores', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()


## 6. Feature Importance Analysis


In [None]:
# Train model for feature importance
xgb_importance = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)
xgb_importance.fit(X_train_class, y_train_class)

# Get feature importance
importance_gain = xgb_importance.feature_importances_
importance_dict = dict(zip(feature_names, importance_gain))
importance_sorted = sorted(importance_dict.items(), key=lambda x: x[1], reverse=True)

# Create DataFrame
importance_df = pd.DataFrame(importance_sorted, columns=['Feature', 'Importance'])
importance_df = importance_df.sort_values('Importance', ascending=True)

# Plot feature importance
plt.figure(figsize=(10, 8))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='steelblue')
plt.xlabel('Importance (Gain)', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('XGBoost Feature Importance', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("Top 10 Most Important Features:")
print(importance_df.tail(10).to_string(index=False))


## 7. Model Interpretation with SHAP


In [None]:
if SHAP_AVAILABLE:
    # Create SHAP explainer
    explainer = shap.TreeExplainer(xgb_importance)
    
    # Calculate SHAP values for a sample
    shap_values = explainer.shap_values(X_test_class[:100])
    
    # Summary plot
    plt.figure(figsize=(10, 8))
    shap.summary_plot(shap_values, X_test_class[:100], feature_names=feature_names, show=False)
    plt.title('SHAP Summary Plot - Feature Impact on Model Output', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Feature importance plot
    plt.figure(figsize=(10, 8))
    shap.summary_plot(shap_values, X_test_class[:100], feature_names=feature_names, 
                     plot_type="bar", show=False)
    plt.title('SHAP Feature Importance', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print("SHAP analysis completed successfully!")
else:
    print("SHAP is not installed. Install with: pip install shap")


## 8. Learning Curves and Model Performance


In [None]:
# Get evaluation results
results = xgb_classifier.evals_result()
epochs = len(results['validation_0']['logloss'])
x_axis = range(0, epochs)

# Plot learning curves
fig, ax = plt.subplots(figsize=(10, 6))
# Only the test set was passed to eval_set in Section 2,
# so 'validation_0' holds the test-set log loss (no training curve was recorded)
ax.plot(x_axis, results['validation_0']['logloss'], label='Test')
ax.legend()
ax.set_xlabel('Epochs', fontsize=12)
ax.set_ylabel('Log Loss', fontsize=12)
ax.set_title('XGBoost Learning Curve', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Confusion Matrix
cm = confusion_matrix(y_test_class, y_pred_class)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()


## 9. Advanced Techniques

### Early Stopping
XGBoost supports early stopping: training halts once the validation metric has not improved for a set number of rounds, which helps prevent overfitting.


In [None]:
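# Model with early stopping
# Recent XGBoost releases take early_stopping_rounds in the constructor;
# passing it to fit() was deprecated in 1.6 and later removed
xgb_early_stop = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=1000,  # Upper bound; early stopping decides the actual number of rounds
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss',
    early_stopping_rounds=10
)

# Train with early stopping, monitoring log loss on the evaluation set
xgb_early_stop.fit(
    X_train_class, y_train_class,
    eval_set=[(X_test_class, y_test_class)],
    verbose=False
)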

print(f"Best iteration: {xgb_early_stop.best_iteration}")
print(f"Best score: {xgb_early_stop.best_score:.4f}")

# Evaluate
early_stop_pred = xgb_early_stop.predict(X_test_class)
early_stop_accuracy = accuracy_score(y_test_class, early_stop_pred)
print(f"Test Accuracy with Early Stopping: {early_stop_accuracy:.4f}")


## 10. Model Persistence

Save and load trained models for future use.


In [None]:
# Save model
xgb_classifier.save_model('xgboost_model.json')
print("Model saved successfully!")

# Load model
loaded_model = xgb.XGBClassifier()
loaded_model.load_model('xgboost_model.json')
print("Model loaded successfully!")

# Verify loaded model
loaded_pred = loaded_model.predict(X_test_class)
loaded_accuracy = accuracy_score(y_test_class, loaded_pred)
print(f"Loaded model accuracy: {loaded_accuracy:.4f}")
print(f"Original model accuracy: {accuracy:.4f}")
print(f"Models match: {np.array_equal(y_pred_class, loaded_pred)}")


## 11. Multi-Class Classification


In [None]:
# Create multi-class dataset
X_multi, y_multi = make_classification(
    n_samples=1500,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=42
)

# Split data
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42, stratify=y_multi
)

# Create multi-class classifier
xgb_multi = xgb.XGBClassifier(
    objective='multi:softprob',
    num_class=3,
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    eval_metric='mlogloss'
)

# Train
xgb_multi.fit(
    X_train_multi, y_train_multi,
    eval_set=[(X_test_multi, y_test_multi)],
    verbose=False
)

# Predictions
y_pred_multi = xgb_multi.predict(X_test_multi)
y_pred_proba_multi = xgb_multi.predict_proba(X_test_multi)

# Evaluate
multi_accuracy = accuracy_score(y_test_multi, y_pred_multi)
print(f"Multi-Class Accuracy: {multi_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test_multi, y_pred_multi))

# Confusion Matrix
cm_multi = confusion_matrix(y_test_multi, y_pred_multi)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_multi, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.title('Multi-Class Confusion Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()


## 12. Model Ensemble

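Combine several XGBoost models trained with different hyperparameters by majority vote. An ensemble can reduce variance, although it is not guaranteed to beat the best individual model.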


In [None]:
# Create multiple models with different configurations
models_ensemble = []
configs = [
    {'max_depth': 3, 'learning_rate': 0.1, 'n_estimators': 100},
    {'max_depth': 6, 'learning_rate': 0.1, 'n_estimators': 100},
    {'max_depth': 9, 'learning_rate': 0.05, 'n_estimators': 200},
]

for i, config in enumerate(configs):
    model = xgb.XGBClassifier(
        objective='binary:logistic',
        random_state=42 + i,
        eval_metric='logloss',
        **config
    )
    model.fit(X_train_class, y_train_class, eval_set=[(X_test_class, y_test_class)], verbose=False)
    models_ensemble.append(model)

# Ensemble predictions (voting)
predictions = np.array([model.predict(X_test_class) for model in models_ensemble])
ensemble_pred = (predictions.mean(axis=0) > 0.5).astype(int)

# Evaluate
ensemble_accuracy = accuracy_score(y_test_class, ensemble_pred)
print(f"Ensemble Accuracy: {ensemble_accuracy:.4f}")

# Individual accuracies
print("\nIndividual Model Accuracies:")
for i, model in enumerate(models_ensemble):
    pred = model.predict(X_test_class)
    acc = accuracy_score(y_test_class, pred)
    print(f"  Model {i+1}: {acc:.4f}")


## 13. Feature Engineering

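Create new features that may improve model performance. Tree ensembles already capture many nonlinearities and interactions on their own, so compare against the baseline before keeping engineered features.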


In [None]:
# Create DataFrame with features
df_features = pd.DataFrame(X_class, columns=feature_names)

# Feature engineering
df_features['feature_1_squared'] = df_features['feature_1'] ** 2
df_features['feature_2_squared'] = df_features['feature_2'] ** 2
df_features['feature_1_x_feature_2'] = df_features['feature_1'] * df_features['feature_2']
df_features['feature_mean'] = df_features[feature_names].mean(axis=1)
df_features['feature_std'] = df_features[feature_names].std(axis=1)

# Prepare engineered data
X_engineered = df_features.values
X_train_eng, X_test_eng, y_train_eng, y_test_eng = train_test_split(
    X_engineered, y_class, test_size=0.2, random_state=42, stratify=y_class
)

# Train model with engineered features
xgb_eng = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)

xgb_eng.fit(X_train_eng, y_train_eng, eval_set=[(X_test_eng, y_test_eng)], verbose=False)

# Evaluate
y_pred_eng = xgb_eng.predict(X_test_eng)
eng_accuracy = accuracy_score(y_test_eng, y_pred_eng)

print(f"Original Features Accuracy: {accuracy:.4f}")
print(f"Engineered Features Accuracy: {eng_accuracy:.4f}")
print(f"Improvement: {eng_accuracy - accuracy:.4f}")


## Conclusion

This notebook demonstrated:
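- ✅ Basic XGBoost classification and regression
- ✅ Hyperparameter tuning with GridSearchCV
- ✅ Cross-validation techniques
- ✅ Feature importance analysis
- ✅ Model interpretation with SHAP
- ✅ Learning curves and confusion matrices
- ✅ Early stopping
- ✅ Model persistence
- ✅ Multi-class classification
- ✅ Model ensembling
- ✅ Feature engineering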

For more advanced techniques and custom implementations, refer to the Python scripts in this project.

---

**Project Information:**
- **Author:** Molla Samser (Founder)
- **Designer & Tester:** Rima Khatun
- **Website:** https://rskworld.in
- **Email:** help@rskworld.in, support@rskworld.in
- **Phone:** +91 93305 39277
- **Address:** Nutanhat, Mongolkote, Purba Burdwan, West Bengal, India, 713147
- **GitHub:** https://github.com/rskworld
