# 4.2 Build XGBoost Models

## Introduction

In the previous notebook, we explored the theory behind gradient boosting. Now we put that knowledge into practice by building **XGBoost** (eXtreme Gradient Boosting) classifiers for our student departure prediction problem.

XGBoost has become one of the most popular machine learning algorithms due to its:
- Excellent predictive performance
- Built-in regularization to prevent overfitting
- Efficient handling of missing values
- Scikit-learn compatible API

### Learning Objectives

By the end of this notebook, you will be able to:

1. Build XGBoost classification models using the scikit-learn API
2. Understand and configure key XGBoost hyperparameters
3. Integrate XGBoost into scikit-learn preprocessing pipelines
4. Handle class imbalance in gradient boosting models
5. Extract and visualize feature importance from XGBoost models

## 1. Setup and Data Loading

### 1.1 Import Libraries

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Scikit-learn
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, classification_report, confusion_matrix
)

# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier

print(f"XGBoost version: {xgb.__version__}")
print("Libraries loaded successfully!")

### 1.2 Load Student Data

In [None]:
# Build the student departure dataset
# Features describe semesters 1-2; the target is departure in semester 3

# This notebook uses a synthetic dataset that mirrors real institutional data
np.random.seed(42)
n_students = 2000

# Generate synthetic student data
data = {
    'STUDENT_ID': range(1, n_students + 1),
    'HS_GPA': np.random.normal(3.2, 0.5, n_students).clip(2.0, 4.0),
    'MATH_PLACEMENT': np.random.choice(['Remedial', 'College-Ready', 'Advanced'], n_students, p=[0.2, 0.5, 0.3]),
    'FIRST_GEN': np.random.choice(['Yes', 'No'], n_students, p=[0.35, 0.65]),
    'PELL_ELIGIBLE': np.random.choice(['Yes', 'No'], n_students, p=[0.40, 0.60]),
    'RESIDENCY': np.random.choice(['In-State', 'Out-of-State', 'International'], n_students, p=[0.7, 0.2, 0.1]),
    'UNITS_ATTEMPT_1': np.random.normal(14, 2, n_students).clip(6, 18).astype(int),
    'GPA_1': np.random.normal(2.8, 0.7, n_students).clip(0.0, 4.0),
    'DFW_RATE_1': np.random.beta(2, 8, n_students),
    'UNITS_ATTEMPT_2': np.random.normal(14, 2, n_students).clip(6, 18).astype(int),
    'GPA_2': np.random.normal(2.9, 0.6, n_students).clip(0.0, 4.0),
    'DFW_RATE_2': np.random.beta(2, 8, n_students),
}

df = pd.DataFrame(data)

# Calculate cumulative metrics
df['CUM_GPA'] = (df['GPA_1'] + df['GPA_2']) / 2
df['CUM_UNITS'] = df['UNITS_ATTEMPT_1'] + df['UNITS_ATTEMPT_2']
df['AVG_DFW'] = (df['DFW_RATE_1'] + df['DFW_RATE_2']) / 2

# Generate target variable (DEPARTED) based on realistic factors
departure_prob = (
    0.3  # Base rate
    - 0.15 * (df['CUM_GPA'] - 2.5)  # Lower GPA = higher departure
    + 0.3 * df['AVG_DFW']  # Higher DFW = higher departure
    + 0.05 * (df['FIRST_GEN'] == 'Yes')  # First-gen slightly higher
    - 0.02 * (df['HS_GPA'] - 3.0)  # Lower HS GPA = higher departure
    + 0.05 * (df['MATH_PLACEMENT'] == 'Remedial')  # Remedial math higher
)
departure_prob = departure_prob.clip(0.05, 0.95)
df['DEPARTED'] = np.random.binomial(1, departure_prob)

print(f"Dataset shape: {df.shape}")
print(f"\nDeparture rate: {df['DEPARTED'].mean():.1%}")
df.head()

In [None]:
# Examine the data types and distributions
print("Data Types:")
print(df.dtypes)
print("\n" + "="*50)
print("\nNumerical Features Summary:")
df.describe().round(2)

### 1.3 Train-Test Split

In [None]:
# Define features and target
# Identify categorical and numerical columns
categorical_cols = ['MATH_PLACEMENT', 'FIRST_GEN', 'PELL_ELIGIBLE', 'RESIDENCY']
numerical_cols = ['HS_GPA', 'UNITS_ATTEMPT_1', 'GPA_1', 'DFW_RATE_1', 
                  'UNITS_ATTEMPT_2', 'GPA_2', 'DFW_RATE_2', 
                  'CUM_GPA', 'CUM_UNITS', 'AVG_DFW']

feature_cols = categorical_cols + numerical_cols
target_col = 'DEPARTED'

X = df[feature_cols]
y = df[target_col]

# Train-test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} students")
print(f"Test set: {X_test.shape[0]} students")
print(f"\nTraining departure rate: {y_train.mean():.1%}")
print(f"Test departure rate: {y_test.mean():.1%}")

## 2. Introduction to XGBoost

### 2.1 XGBoost Key Features

XGBoost extends the basic gradient boosting algorithm with several innovations:

| Feature | Description |
|:--------|:------------|
| **Regularization** | L1 and L2 penalties on leaf weights help limit overfitting |
| **Sparsity Awareness** | Efficiently handles missing values by learning optimal direction |
| **Parallel Processing** | Tree construction parallelized at the feature level |
| **Cache Optimization** | Optimized data structures for CPU cache efficiency |
| **Out-of-Core** | Can train on datasets larger than memory |
| **Tree Pruning** | Grows each tree to `max_depth`, then prunes back splits whose loss reduction falls below `gamma` |
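
To make the regularization, pruning, and sparsity rows more concrete, here is a minimal pure-Python sketch of the split-gain formula described in the XGBoost paper. `G` and `H` stand for the sums of first- and second-order gradients in a node, `lam` plays the role of `reg_lambda`, and `gamma` is the minimum loss reduction. The function names are illustrative (they are not part of the XGBoost API), the numbers are made up, and the L1 term is left out for brevity.

```python
def leaf_score(G, H, lam):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return G * G / (H + lam)


def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    """Loss reduction from splitting one node into two children."""
    parent = leaf_score(G_left + G_right, H_left + H_right, lam)
    children = leaf_score(G_left, H_left, lam) + leaf_score(G_right, H_right, lam)
    return 0.5 * (children - parent) - gamma


def default_direction(left, right, missing, lam=1.0, gamma=0.0):
    """Choose the branch for rows whose split feature is missing.

    Each argument is a (G, H) pair of gradient sums.
    """
    gain_if_left = split_gain(left[0] + missing[0], left[1] + missing[1],
                              right[0], right[1], lam, gamma)
    gain_if_right = split_gain(left[0], left[1],
                               right[0] + missing[0], right[1] + missing[1],
                               lam, gamma)
    if gain_if_left >= gain_if_right:
        return "left", gain_if_left
    return "right", gain_if_right


gain = split_gain(-4.0, 2.0, 3.0, 2.0, lam=1.0, gamma=0.0)
pruned_gain = split_gain(-4.0, 2.0, 3.0, 2.0, lam=1.0, gamma=5.0)
direction, best_gain = default_direction((-4.0, 2.0), (3.0, 2.0), (-1.0, 1.0))

print(f"Gain with gamma=0: {gain:.4f}")
print(f"Gain with gamma=5: {pruned_gain:.4f} (negative, so the split would be pruned)")
print(f"Missing rows go {direction} (gain {best_gain:.4f})")
```

A larger `lam` shrinks every score toward zero, and a larger `gamma` raises the bar a split has to clear, which is why both act as regularizers.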

### 2.2 XGBoost vs scikit-learn GradientBoosting

While scikit-learn has its own `GradientBoostingClassifier`, XGBoost offers several advantages:

In [None]:
# Comparison table
comparison_data = {
    'Aspect': [
        'Speed',
        'Regularization',
        'Missing Values',
        'Parallel Training',
        'GPU Support',
        'Early Stopping',
        'Feature Importance',
        'scikit-learn Compatible'
    ],
    'sklearn GradientBoosting': [
        'Slower',
        'Limited (via tree params)',
        'Requires imputation',
        'No',
        'No',
        'Yes (n_iter_no_change)',
        'Basic',
        'Yes'
    ],
    'XGBoost': [
        'Typically faster (multithreaded)',
        'Built-in L1/L2',
        'Native support',
        'Yes',
        'Yes',
        'Built-in',
        'Multiple types',
        'Yes (wrapper)'
    ]
}

comparison_df = pd.DataFrame(comparison_data)
comparison_df

## 3. Building XGBoost Classifiers

### 3.1 Basic XGBoost Model

Let's start with a simple XGBoost model using mostly default parameters. With its default settings, XGBoost expects numerical features, so we first need to encode our categorical variables.

In [None]:
# First, let's manually encode categorical features for the basic example
# We'll use a pipeline approach later

# One-hot encode categorical columns
X_train_encoded = pd.get_dummies(X_train, columns=categorical_cols, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, columns=categorical_cols, drop_first=True)

# Ensure same columns in train and test
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

print(f"Encoded training features: {X_train_encoded.shape[1]} columns")
print(f"\nEncoded columns:")
print(X_train_encoded.columns.tolist())

In [None]:
# Build a basic XGBoost classifier
xgb_basic = XGBClassifier(
    n_estimators=100,       # Number of boosting rounds (trees)
    max_depth=6,            # Maximum tree depth
    learning_rate=0.1,      # Step size shrinkage
    random_state=42,
    eval_metric='logloss',  # Evaluation metric
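    use_label_encoder=False # Silences a warning on older XGBoost (1.3-1.5); drop this argument on recent releases, where it is deprecated or unused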
)

# Train the model
xgb_basic.fit(X_train_encoded, y_train)

# Make predictions
y_pred = xgb_basic.predict(X_test_encoded)
y_pred_proba = xgb_basic.predict_proba(X_test_encoded)[:, 1]

# Evaluate
print("Basic XGBoost Model Performance")
print("=" * 40)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_pred_proba):.3f}")

In [None]:
# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig = go.Figure(data=go.Heatmap(
    z=cm,
    x=['Predicted Retained', 'Predicted Departed'],
    y=['Actual Retained', 'Actual Departed'],
    text=cm,
    texttemplate='%{text}',
    textfont=dict(size=20),
    colorscale='Blues',
    showscale=True
))

fig.update_layout(
    title='Confusion Matrix: Basic XGBoost Model',
    xaxis_title='Predicted',
    yaxis_title='Actual',
    height=400
)

fig.show()
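
To connect the confusion matrix to the metrics printed earlier, here is a small pure-Python sketch that derives accuracy, precision, recall, and F1 from the four cell counts. The counts are made up for illustration and are not results from the model above; the helper name is hypothetical.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute standard binary metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


# Made-up counts, laid out like confusion_matrix output: [[TN, FP], [FN, TP]]
example_cm = [[250, 30], [70, 50]]
(tn, fp), (fn, tp) = example_cm

example_metrics = metrics_from_confusion(tn, fp, fn, tp)
for name, value in example_metrics.items():
    print(f"{name:<10} {value:.3f}")
```

In this made-up case accuracy looks reasonable at 0.750 while recall is only about 0.417, which is the pattern to watch for when departures are the minority class.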

### 3.2 Understanding Key Parameters

XGBoost has many hyperparameters. Here are the most important ones for classification:

In [None]:
# Key XGBoost parameters explained
params_explanation = {
    'Parameter': [
        'n_estimators',
        'max_depth',
        'learning_rate (eta)',
        'subsample',
        'colsample_bytree',
        'min_child_weight',
        'gamma',
        'reg_alpha',
        'reg_lambda',
        'scale_pos_weight'
    ],
    'Description': [
        'Number of boosting rounds (trees)',
        'Maximum depth of each tree',
        'Step size shrinkage (learning rate)',
        'Fraction of samples used per tree',
        'Fraction of features used per tree',
        'Minimum sum of instance weight in a leaf',
        'Minimum loss reduction for split',
        'L1 regularization (Lasso)',
        'L2 regularization (Ridge)',
        'Weight multiplier applied to the positive class'
    ],
    'Typical Range': [
        '100-1000',
        '3-10',
        '0.01-0.3',
        '0.5-1.0',
        '0.5-1.0',
        '1-10',
        '0-5',
        '0-1',
        '0-1',
        '1 or (neg/pos) ratio'
    ],
    'Effect': [
        'More = more complex, may overfit',
        'Higher = more complex, may overfit',
        'Lower = more regularization',
        'Lower = more regularization',
        'Lower = more regularization',
        'Higher = more conservative',
        'Higher = more conservative',
        'Higher = more regularization (sparse)',
        'Higher = more regularization',
        'Handles class imbalance'
    ]
}

params_df = pd.DataFrame(params_explanation)
params_df
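
The interaction between `learning_rate` and `n_estimators` can be sketched with a toy calculation. It assumes an idealized booster in which every tree fits the current residual perfectly, so each round removes a fraction `learning_rate` of the remaining error. Real trees fit the residual only approximately, so treat this as intuition rather than a prediction of how XGBoost behaves.

```python
def rounds_to_tolerance(learning_rate, initial_residual=1.0, tol=0.01, max_rounds=10_000):
    """Count boosting rounds until the residual shrinks below tol."""
    residual = initial_residual
    for rounds in range(1, max_rounds + 1):
        # Each round removes a fraction of whatever error is left
        residual -= learning_rate * residual
        if abs(residual) <= tol:
            return rounds
    return max_rounds


shrinkage_table = {lr: rounds_to_tolerance(lr) for lr in [0.3, 0.1, 0.05, 0.01]}
for lr, rounds in shrinkage_table.items():
    print(f"learning_rate={lr:<5} rounds needed: {rounds}")
```

Under this idealization a learning rate of 0.01 needs roughly 35 times as many rounds as 0.3 to reach the same residual, which is why smaller learning rates are usually paired with larger `n_estimators`.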

In [None]:
# Demonstrate the effect of key parameters
# Compare different learning rates

learning_rates = [0.01, 0.05, 0.1, 0.3]
results = []

for lr in learning_rates:
    model = XGBClassifier(
        n_estimators=100,
        max_depth=6,
        learning_rate=lr,
        random_state=42,
        eval_metric='logloss',
        use_label_encoder=False
    )
    model.fit(X_train_encoded, y_train)
    
    train_score = roc_auc_score(y_train, model.predict_proba(X_train_encoded)[:, 1])
    test_score = roc_auc_score(y_test, model.predict_proba(X_test_encoded)[:, 1])
    
    results.append({
        'Learning Rate': lr,
        'Train AUC': train_score,
        'Test AUC': test_score,
        'Gap (Overfit)': train_score - test_score
    })

results_df = pd.DataFrame(results)

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=results_df['Learning Rate'],
    y=results_df['Train AUC'],
    mode='lines+markers',
    name='Train AUC',
    line=dict(color='blue', width=2)
))

fig.add_trace(go.Scatter(
    x=results_df['Learning Rate'],
    y=results_df['Test AUC'],
    mode='lines+markers',
    name='Test AUC',
    line=dict(color='green', width=2)
))

fig.update_layout(
    title='Effect of Learning Rate on Model Performance',
    xaxis_title='Learning Rate',
    yaxis_title='ROC-AUC Score',
    height=400,
    xaxis_type='log'
)

fig.show()

print("\nLearning Rate Comparison:")
results_df.round(4)

### 3.3 XGBoost with Class Weights

Student departure prediction often involves imbalanced classes (e.g., 20-30% departure rate). XGBoost provides `scale_pos_weight` to handle this.
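
One way to think about `scale_pos_weight` is as a multiplier on the loss contribution of every positive (departed) example. The sketch below shows the effect on the gradient of the log loss with respect to the raw margin. The function name and the weight of 3.0 are illustrative, and this is a simplified view of what happens inside the library.

```python
import math


def weighted_logloss_grad(y_true, margin, scale_pos_weight=1.0):
    """Gradient of the weighted log loss with respect to the raw margin (logit)."""
    p = 1.0 / (1.0 + math.exp(-margin))      # Predicted probability
    weight = scale_pos_weight if y_true == 1 else 1.0
    return weight * (p - y_true)


# A positive and a negative example, both currently predicted at p = 0.5
g_pos_plain = weighted_logloss_grad(1, 0.0)
g_pos_weighted = weighted_logloss_grad(1, 0.0, scale_pos_weight=3.0)
g_neg = weighted_logloss_grad(0, 0.0, scale_pos_weight=3.0)

print(f"Positive, unweighted: {g_pos_plain:+.2f}")
print(f"Positive, weight 3.0: {g_pos_weighted:+.2f}")
print(f"Negative, weight 3.0: {g_neg:+.2f}")
```

With the weight applied, a misclassified departure pulls three times as hard on the next tree as a misclassified retained student, which is what shifts predictions toward the positive class.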

In [None]:
# Calculate the class imbalance ratio
negative_count = (y_train == 0).sum()
positive_count = (y_train == 1).sum()
scale_pos_weight = negative_count / positive_count

print(f"Class distribution in training set:")
print(f"  Retained (0): {negative_count} ({negative_count/len(y_train):.1%})")
print(f"  Departed (1): {positive_count} ({positive_count/len(y_train):.1%})")
print(f"\nScale positive weight: {scale_pos_weight:.2f}")

In [None]:
# Compare models with and without class weighting

# Without class weighting
xgb_unweighted = XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss',
    use_label_encoder=False
)
xgb_unweighted.fit(X_train_encoded, y_train)

# With class weighting
xgb_weighted = XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    eval_metric='logloss',
    use_label_encoder=False
)
xgb_weighted.fit(X_train_encoded, y_train)

# Compare predictions
y_pred_unweighted = xgb_unweighted.predict(X_test_encoded)
y_pred_weighted = xgb_weighted.predict(X_test_encoded)

print("Comparison: Unweighted vs Weighted XGBoost")
print("=" * 50)
print(f"{'Metric':<15} {'Unweighted':>12} {'Weighted':>12}")
print("-" * 50)
print(f"{'Accuracy':<15} {accuracy_score(y_test, y_pred_unweighted):>12.3f} {accuracy_score(y_test, y_pred_weighted):>12.3f}")
print(f"{'Precision':<15} {precision_score(y_test, y_pred_unweighted):>12.3f} {precision_score(y_test, y_pred_weighted):>12.3f}")
print(f"{'Recall':<15} {recall_score(y_test, y_pred_unweighted):>12.3f} {recall_score(y_test, y_pred_weighted):>12.3f}")
print(f"{'F1 Score':<15} {f1_score(y_test, y_pred_unweighted):>12.3f} {f1_score(y_test, y_pred_weighted):>12.3f}")

In [None]:
# Visualize the difference in confusion matrices
cm_unweighted = confusion_matrix(y_test, y_pred_unweighted)
cm_weighted = confusion_matrix(y_test, y_pred_weighted)

fig = make_subplots(rows=1, cols=2, subplot_titles=('Unweighted', 'Weighted (scale_pos_weight)'))

for col, cm_panel in enumerate([cm_unweighted, cm_weighted], 1):
    fig.add_trace(go.Heatmap(
        z=cm_panel,
        x=['Pred Retained', 'Pred Departed'],
        y=['Actual Retained', 'Actual Departed'],
        text=cm_panel,
        texttemplate='%{text}',
        textfont=dict(size=16),
        colorscale='Blues',
        showscale=False
    ), row=1, col=col)

fig.update_layout(
    title='Effect of Class Weighting on Predictions',
    height=400
)

fig.show()

**Interpretation**: Class weighting typically improves **recall** (identifying more actual departures) at the cost of **precision** (more false positives). This trade-off depends on your institutional priorities:
- High recall: Catch more at-risk students (some false alarms)
- High precision: Only intervene when confident (may miss some at-risk students)
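
Class weighting is one lever for this trade-off; the decision threshold applied to predicted probabilities is another. The sketch below uses made-up labels and probabilities (not model output) to show how precision and recall move as the threshold changes. The helper name is hypothetical.

```python
def precision_recall_at(threshold, y_true, y_prob):
    """Precision and recall when predicting positive for prob >= threshold."""
    y_hat = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 1)
    fp = sum(1 for t, h in zip(y_true, y_hat) if t == 0 and h == 1)
    fn = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 6 retained students (0) and 4 departed students (1)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob_demo = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.70, 0.90]

threshold_results = {t: precision_recall_at(t, y_true_demo, y_prob_demo)
                     for t in [0.30, 0.50, 0.65]}
for t, (prec, rec) in threshold_results.items():
    print(f"threshold={t:.2f}  precision={prec:.3f}  recall={rec:.3f}")
```

Lowering the threshold catches every departure in this toy data at the price of more false alarms, while raising it does the opposite.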

## 4. XGBoost in scikit-learn Pipelines

### 4.1 Preprocessing Pipeline

Using pipelines ensures consistent preprocessing between training and prediction, and enables proper cross-validation.

In [None]:
# Define preprocessing for different column types

# Numerical preprocessing: StandardScaler (not needed for tree models, whose splits are unaffected by rescaling; kept so the preprocessor can be reused with scale-sensitive models)
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Categorical preprocessing: One-hot encoding
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

# Combine preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)

print("Preprocessing pipeline created!")
print(f"  Numerical columns ({len(numerical_cols)}): {numerical_cols}")
print(f"  Categorical columns ({len(categorical_cols)}): {categorical_cols}")

### 4.2 Full Pipeline with XGBoost

In [None]:
# Create full pipeline with XGBoost
xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        eval_metric='logloss',
        use_label_encoder=False
    ))
])

# Train the pipeline (using original X_train, not encoded)
xgb_pipeline.fit(X_train, y_train)

# Predict on test set
y_pred_pipeline = xgb_pipeline.predict(X_test)
y_pred_proba_pipeline = xgb_pipeline.predict_proba(X_test)[:, 1]

print("XGBoost Pipeline Performance")
print("=" * 40)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_pipeline):.3f}")
print(f"Precision: {precision_score(y_test, y_pred_pipeline):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred_pipeline):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_pipeline):.3f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_pred_proba_pipeline):.3f}")

In [None]:
# The pipeline can now be used on new data directly
# Simulating a new student
new_student = pd.DataFrame({
    'MATH_PLACEMENT': ['College-Ready'],
    'FIRST_GEN': ['Yes'],
    'PELL_ELIGIBLE': ['Yes'],
    'RESIDENCY': ['In-State'],
    'HS_GPA': [3.2],
    'UNITS_ATTEMPT_1': [15],
    'GPA_1': [2.4],
    'DFW_RATE_1': [0.2],
    'UNITS_ATTEMPT_2': [14],
    'GPA_2': [2.6],
    'DFW_RATE_2': [0.15],
    'CUM_GPA': [2.5],
    'CUM_UNITS': [29],
    'AVG_DFW': [0.175]
})

# Predict using pipeline
departure_prob = xgb_pipeline.predict_proba(new_student)[0, 1]
prediction = 'Departed' if xgb_pipeline.predict(new_student)[0] == 1 else 'Retained'

print("New Student Prediction")
print("=" * 40)
print(f"Predicted outcome: {prediction}")
print(f"Departure probability: {departure_prob:.1%}")

### 4.3 Cross-Validation with Pipelines

In [None]:
# Perform cross-validation with the full pipeline
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validate on multiple metrics
cv_accuracy = cross_val_score(xgb_pipeline, X_train, y_train, cv=cv, scoring='accuracy')
cv_roc_auc = cross_val_score(xgb_pipeline, X_train, y_train, cv=cv, scoring='roc_auc')
cv_f1 = cross_val_score(xgb_pipeline, X_train, y_train, cv=cv, scoring='f1')

print("5-Fold Cross-Validation Results")
print("=" * 50)
print(f"{'Metric':<12} {'Mean':>10} {'Std':>10} {'Min':>10} {'Max':>10}")
print("-" * 50)
print(f"{'Accuracy':<12} {cv_accuracy.mean():>10.3f} {cv_accuracy.std():>10.3f} {cv_accuracy.min():>10.3f} {cv_accuracy.max():>10.3f}")
print(f"{'ROC-AUC':<12} {cv_roc_auc.mean():>10.3f} {cv_roc_auc.std():>10.3f} {cv_roc_auc.min():>10.3f} {cv_roc_auc.max():>10.3f}")
print(f"{'F1 Score':<12} {cv_f1.mean():>10.3f} {cv_f1.std():>10.3f} {cv_f1.min():>10.3f} {cv_f1.max():>10.3f}")
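
Calling `cross_val_score` once per metric refits the model for every metric. As a hedged alternative, scikit-learn's `cross_validate` accepts a list of scorers and fits each fold only once. The sketch below uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in estimator so that it runs on its own; a pipeline such as `xgb_pipeline` could be passed in the same position.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data and estimator, used only to show the API
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_metrics = ['accuracy', 'roc_auc', 'f1']

cv_scores = cross_validate(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    X_demo, y_demo,
    cv=demo_cv,
    scoring=demo_metrics,   # All metrics computed from the same fitted folds
)

for metric in demo_metrics:
    scores = cv_scores[f'test_{metric}']
    print(f"{metric:<10} mean={scores.mean():.3f} std={scores.std():.3f}")
```

The returned dictionary also contains `fit_time` and `score_time`, which can help when comparing the cost of different configurations.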

In [None]:
# Visualize cross-validation results
cv_results = pd.DataFrame({
    'Fold': list(range(1, 6)) * 3,
    'Score': list(cv_accuracy) + list(cv_roc_auc) + list(cv_f1),
    'Metric': ['Accuracy'] * 5 + ['ROC-AUC'] * 5 + ['F1 Score'] * 5
})

fig = px.box(cv_results, x='Metric', y='Score', color='Metric',
             title='Cross-Validation Score Distribution by Metric',
             points='all')

fig.update_layout(height=400, showlegend=False)
fig.show()

## 5. Feature Importance

### 5.1 Built-in Feature Importance

XGBoost provides feature importance scores that indicate how useful each feature was in constructing the boosted trees.

In [None]:
# Get the trained XGBoost classifier from the pipeline
xgb_model = xgb_pipeline.named_steps['classifier']

# Get feature names after preprocessing
# For numerical features, names stay the same
# For categorical features, get the names from the OneHotEncoder
cat_encoder = xgb_pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
cat_feature_names = cat_encoder.get_feature_names_out(categorical_cols).tolist()

all_feature_names = numerical_cols + cat_feature_names

# Get feature importances (the scikit-learn wrapper defaults to 'gain' for tree boosters, normalized to sum to 1)
importances = xgb_model.feature_importances_

# Create DataFrame for visualization
importance_df = pd.DataFrame({
    'Feature': all_feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=True)

# Plot feature importance
fig = px.bar(importance_df.tail(15), x='Importance', y='Feature', orientation='h',
             title='XGBoost Feature Importance (Top 15)',
             color='Importance', color_continuous_scale='Blues')

fig.update_layout(height=500, yaxis_title='', xaxis_title='Importance Score')
fig.show()

### 5.2 Importance Types

XGBoost supports multiple importance types:

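- **weight**: Number of times a feature is used to split the data across all trees
- **gain**: Average gain (reduction in the training loss) across all splits that use the feature
- **cover**: Average coverage of the splits that use the feature, measured as the sum of second-order gradients (roughly the number of samples) reaching those splits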
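
The three importance types can rank features differently. The sketch below computes them from a hypothetical list of split records; the feature names are borrowed from this notebook, but the gain and cover numbers are invented for illustration, and the helper is not part of XGBoost.

```python
from collections import defaultdict

# Hypothetical split records: (feature, gain, cover)
toy_splits = [
    ('CUM_GPA', 12.0, 800.0),
    ('CUM_GPA', 6.0, 300.0),
    ('AVG_DFW', 4.0, 500.0),
    ('CUM_GPA', 3.0, 100.0),
    ('FIRST_GEN_Yes', 1.0, 400.0),
]


def summarize_importance(splits):
    """Aggregate split records into weight, average gain, and average cover."""
    totals = defaultdict(lambda: {'weight': 0, 'gain': 0.0, 'cover': 0.0})
    for feature, gain, cover in splits:
        totals[feature]['weight'] += 1
        totals[feature]['gain'] += gain
        totals[feature]['cover'] += cover
    return {
        feature: {
            'weight': t['weight'],
            'gain': t['gain'] / t['weight'],
            'cover': t['cover'] / t['weight'],
        }
        for feature, t in totals.items()
    }


toy_importance = summarize_importance(toy_splits)
for feature, stats in toy_importance.items():
    print(f"{feature:<15} weight={stats['weight']}  "
          f"gain={stats['gain']:.1f}  cover={stats['cover']:.1f}")
```

In this toy data `CUM_GPA` leads on weight and gain, while `AVG_DFW` leads on cover, so the choice of importance type changes the ranking.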

In [None]:
# Get different importance types using the booster
booster = xgb_model.get_booster()

# Get importance scores for each type
importance_types = ['weight', 'gain', 'cover']
importance_data = []

for imp_type in importance_types:
    scores = booster.get_score(importance_type=imp_type)
    # XGBoost falls back to f0, f1, f2... as feature names when trained on a NumPy array (as the pipeline output is here)
    for i, feat in enumerate(all_feature_names):
        feat_key = f'f{i}'
        score = scores.get(feat_key, 0)
        importance_data.append({
            'Feature': feat,
            'Type': imp_type,
            'Score': score
        })

importance_compare_df = pd.DataFrame(importance_data)

# Pivot for comparison
pivot_df = importance_compare_df.pivot(index='Feature', columns='Type', values='Score').fillna(0)

# Normalize each column to [0, 1] for comparison
pivot_normalized = pivot_df.div(pivot_df.max())

# Get top 10 by gain (usually most informative)
top_features = pivot_df.nlargest(10, 'gain').index.tolist()

# Create comparison plot
fig = go.Figure()

for imp_type in importance_types:
    fig.add_trace(go.Bar(
        name=imp_type.capitalize(),
        x=top_features,
        y=pivot_normalized.loc[top_features, imp_type]
    ))

fig.update_layout(
    title='Feature Importance by Different Metrics (Top 10 by Gain)',
    xaxis_title='Feature',
    yaxis_title='Normalized Importance',
    barmode='group',
    height=450
)

fig.show()

In [None]:
# Display feature importance table
print("Feature Importance Comparison (Top 10 by Gain)")
print("=" * 60)
pivot_df.loc[top_features].round(2)

**Interpretation of Importance Types**:

| Type | Interpretation | Use Case |
|:-----|:---------------|:---------|
| **Weight** | How often a feature is used in splits | Identifies frequently used features |
| **Gain** | Average loss reduction from splits on the feature | Best for identifying predictive power |
| **Cover** | Average number of samples (Hessian-weighted) reaching splits on the feature | Shows feature reach/coverage |

For understanding which features drive predictions, **gain** is typically most informative. However, for a complete picture of feature importance, we recommend using **SHAP values** (covered in notebook 4.4).

## 6. Summary

In this notebook, we covered:

### Key Concepts

1. **XGBoost Basics**:
   - XGBClassifier provides a scikit-learn compatible interface
   - Built-in regularization (L1, L2) helps limit overfitting
   - Handles missing values natively

2. **Key Parameters**:
   - `n_estimators`: Number of boosting rounds
   - `max_depth`: Controls tree complexity
   - `learning_rate`: Shrinkage factor (lower = more regularization)
   - `scale_pos_weight`: Handles class imbalance

3. **Pipeline Integration**:
   - XGBoost works seamlessly with sklearn pipelines
   - Pipelines ensure consistent preprocessing
   - Cross-validation validates the entire workflow

4. **Feature Importance**:
   - Multiple importance types: weight, gain, cover
   - Gain is most informative for predictive power

### Summary Table

| Aspect | Key Point |
|:-------|:----------|
| API | XGBClassifier is sklearn-compatible |
| Regularization | L1 (reg_alpha) and L2 (reg_lambda) |
| Class Imbalance | Use scale_pos_weight |
| Pipeline | Combine with ColumnTransformer |
| Feature Importance | Use 'gain' for predictive power |

### Next Steps

In the next notebook, we will explore **LightGBM** and **CatBoost** as alternatives to XGBoost, each with unique strengths.

**Proceed to:** `4.3 Build LightGBM and CatBoost Models`