# XGBoost RPC — Report Template

This notebook is a parameterized template that trains an XGBoost classifier to predict the target `rpc_label` (falling back to `rpc` when `rpc_label` is absent from the dataset).
It covers data description, preprocessing, training, evaluation, and feature importance. Run it with papermill to override the parameters below; the final cell exports an HTML report.
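
As a rough illustration of the papermill run mentioned above, the sketch below builds a parameter dictionary and passes it to `papermill.execute_notebook`. It assumes the parameters cell carries the `parameters` tag, which papermill needs in order to inject overrides. The helper names `build_parameters` and `run_report` and the output notebook path are hypothetical.

```python
def build_parameters(data_path, out_dir, test_size=0.2, random_state=42):
    """Collect overrides for the notebook's parameters cell."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    return {
        "data_path": str(data_path),
        "out_dir": str(out_dir),
        "test_size": test_size,
        "random_state": random_state,
    }


def run_report(input_nb, output_nb, parameters):
    """Execute the notebook with papermill (hypothetical wrapper)."""
    # Imported lazily so build_parameters works without papermill installed.
    import papermill as pm

    return pm.execute_notebook(str(input_nb), str(output_nb), parameters=parameters)


params = build_parameters("data/synthetic_callcenter_accounts.csv", "ml_models")
# Example call; the output notebook name is an assumption:
# run_report("notebooks/xgboost_template.ipynb", "ml_models/xgboost_executed.ipynb", params)
```

Keeping the parameter assembly separate from the papermill call makes it easy to validate values before a long training run starts.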

Note: This notebook runs in the `fresh_venv` Python environment.

In [None]:
# Parameters
import os
from pathlib import Path

workspace_root = Path(os.getcwd()).resolve()
if workspace_root.name == 'notebooks':
    workspace_root = workspace_root.parent

data_path = '../data/synthetic_callcenter_accounts.csv'  # overridden by papermill
out_dir = str(workspace_root / 'ml_models')
test_size = 0.2
random_state = 42

print(f"Output directory set to: {out_dir}")

In [None]:
# Imports
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='xgboost')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder
%matplotlib inline

In [None]:
# Load data and quick overview

df = pd.read_csv(data_path)
print('Data path:', data_path)
print('Rows, columns:', df.shape)
display(df.head())
display(df.describe(include='all').T)

## Data description and ML context

- The target is `rpc_label` (or `rpc` if `rpc_label` is missing); every other column is used as a feature.
- XGBoost is a gradient-boosted trees algorithm well suited to tabular data with mixed feature types.
- Preprocessing matters: missing values must be handled and categorical features encoded. Tree-based models are generally insensitive to feature scaling, so scaling is only needed in special cases.
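
One way to keep the preprocessing described above from leaking test-set statistics is to fit the imputer and encoder on the training split only. The sketch below uses scikit-learn's `ColumnTransformer`, `SimpleImputer`, and `OneHotEncoder`. It is an alternative to the lightweight preprocessing cell in this notebook, which imputes and encodes before splitting. The toy columns `balance` and `channel` are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "balance": [100.0, np.nan, 300.0, 500.0],
    "channel": ["phone", "sms", "phone", "email"],
})
train, test = toy.iloc[:3], toy.iloc[3:]

preprocess = ColumnTransformer(
    [
        # Median is learned from the training rows only.
        ("num", SimpleImputer(strategy="median"), ["balance"]),
        # Categories unseen during fit are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_t = preprocess.fit_transform(train)  # fit on train only
X_test_t = preprocess.transform(test)        # reuse the fitted statistics

print(X_train_t)
print(X_test_t)
```

Here the missing `balance` in the training rows is filled with the training median, and the test row's unseen `email` category produces an all-zero one-hot block instead of an error.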

In [None]:
# Preprocessing (lightweight): ensure target exists and encode categorical features
if 'rpc_label' in df.columns:
    df['rpc_label'] = df['rpc_label'].astype(int)
elif 'rpc' in df.columns:
    df['rpc_label'] = df['rpc'].astype(int)
else:
    raise ValueError('No rpc_label or rpc target found in data for XGBoost.')

# Fill numeric NAs with median
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
if 'rpc_label' in num_cols:
    num_cols.remove('rpc_label')
for c in num_cols:
    df[c] = df[c].fillna(df[c].median())

# Convert object columns to string and fillna
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
for c in cat_cols:
    df[c] = df[c].fillna('missing').astype(str)

# One-hot encode all object columns (simple approach; high-cardinality columns may need a different encoding)
# OneHotEncoder is already imported in the imports cell.
if len(cat_cols) > 0:
    ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    X_cat = ohe.fit_transform(df[cat_cols])
    cat_names = ohe.get_feature_names_out(cat_cols)
    X_cat_df = pd.DataFrame(X_cat, columns=cat_names, index=df.index)
    df = pd.concat([df.drop(columns=cat_cols), X_cat_df], axis=1)

In [None]:
# Prepare X and y, split, and run a small grid search
y = df['rpc_label']
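# Also drop the raw 'rpc' column if present, so the target does not leak into the features
X = df.drop(columns=['rpc_label', 'rpc'], errors='ignore')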
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y, random_state=random_state)
model = xgb.XGBClassifier(eval_metric='logloss')
param_grid = {'n_estimators':[50,100], 'max_depth':[3,5], 'learning_rate':[0.05,0.1]}
grid = GridSearchCV(model, param_grid, scoring='roc_auc', cv=3, n_jobs=1, verbose=0)
grid.fit(X_train, y_train)
best = grid.best_estimator_
y_proba = best.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, y_proba)
y_pred = best.predict(X_test)
print('Best params:', grid.best_params_)
print('AUC:', auc)
# classification_report returns a formatted string, so print it rather than display it
print(classification_report(y_test, y_pred))
display(confusion_matrix(y_test, y_pred))

In [None]:
# Feature importance
fi = best.feature_importances_
feat_names = X.columns
fi_df = pd.DataFrame({'feature':feat_names, 'importance':fi}).sort_values('importance', ascending=False).head(20)
plt.figure(figsize=(8,6))
plt.barh(fi_df['feature'][::-1], fi_df['importance'][::-1])
plt.title('Top features (approx)')
plt.show()

## Notes on model and importance

- XGBoost is a gradient boosting ensemble of decision trees and typically performs very well on tabular data.
- Pay attention to class imbalance, feature engineering, and proper CV when tuning.
- For portfolio narrative: describe why certain features are important and how the model may be used in production (e.g., ranking, routing).
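
For the class-imbalance point above, one common option is XGBoost's `scale_pos_weight` parameter, which the XGBoost documentation suggests setting to roughly the ratio of negative to positive examples. The sketch below computes that ratio; the helper name and the example labels are hypothetical, and whether this weighting helps on the RPC data is untested here.

```python
from collections import Counter


def scale_pos_weight(labels):
    """Return the negative-to-positive ratio for binary 0/1 labels."""
    counts = Counter(int(v) for v in labels)
    pos, neg = counts.get(1, 0), counts.get(0, 0)
    if pos == 0:
        raise ValueError("labels contain no positive examples")
    return neg / pos


example_labels = [0, 0, 0, 0, 0, 0, 1, 1]  # hypothetical labels, not real data
weight = scale_pos_weight(example_labels)
print(weight)  # 3.0

# The value could then be passed to the classifier, for example:
# model = xgb.XGBClassifier(eval_metric='logloss', scale_pos_weight=weight)
```

Computing the ratio from `y_train` rather than from the full dataset keeps the test split out of the tuning decision.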

## Generate HTML Report

The following cell will generate an HTML report from this notebook and save it in the `ml_models` directory.

In [None]:
# Generate HTML report
import os
import warnings

import nbformat
from nbconvert import HTMLExporter

try:
    warnings.filterwarnings('ignore', category=UserWarning)
    
    notebook_name = 'xgboost_template.ipynb'
    notebook_path = str(workspace_root/'notebooks'/notebook_name)
    
    print(f"Notebook path: {notebook_path}")
    print(f"Output directory: {out_dir}")
    
    os.makedirs(out_dir, exist_ok=True)
    
    with open(notebook_path, 'r', encoding='utf-8') as f:
        current = nbformat.read(f, as_version=4)

    html_exporter = HTMLExporter()
    html_exporter.template_name = 'classic'
    
    body, _ = html_exporter.from_notebook_node(current)

    output_file = os.path.join(out_dir, 'xgboost_report.html')
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(body)

    print(f"Report saved to: {output_file}")
except Exception as e:
    print(f"Current working directory: {os.getcwd()}")
    print(f"Error: {str(e)}")
    
    print("\nDebug information:")
    print(f"Workspace root: {workspace_root}")
    print(f"Notebook path: {notebook_path}")
    print(f"Output directory: {out_dir}")