## 0. Install Dependencies

Run this cell once to ensure all required packages are available in the current kernel.

In [6]:
# Install dependencies (run once per kernel)
import subprocess, sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q',
    'boto3', 's3fs', 'pandas', 'numpy', 'matplotlib', 'seaborn',
    'scikit-learn', 'xgboost', 'lightgbm', 'sagemaker', 'pyyaml'])


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m


0
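To confirm that the install took effect without re-running pip, one option is to query the installed distribution metadata. The sketch below uses only the standard library. `report_versions` is a hypothetical helper name, and the package list simply mirrors the install cell above.

```python
from importlib import metadata

# Same distributions as the install cell above
REQUIRED = ['boto3', 's3fs', 'pandas', 'numpy', 'matplotlib', 'seaborn',
            'scikit-learn', 'xgboost', 'lightgbm', 'sagemaker', 'pyyaml']


def report_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = report_versions(REQUIRED)
missing = [name for name, version in versions.items() if version is None]
print('Missing packages:', missing if missing else 'none')
```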

# Production Promotion — Fraud Detection Model

This notebook demonstrates the complete workflow for evaluating a trained model,
comparing it against a production baseline, and promoting the winning configuration
to the production pipeline.

**Workflow steps:**
1. Train an XGBoost model on the fraud detection dataset
2. Evaluate the model with standard metrics and visualizations
3. Compare against the production baseline
4. Check production quality thresholds
5. Validate and write hyperparameters to Parameter Store
6. Generate and write a production configuration file to S3
7. Trigger the production pipeline for retraining
8. Deploy a challenger endpoint for A/B testing

**Requirements covered:** 6.1–6.5 (Model Evaluation), 8.1–8.3 (Parameter Store),
9.1–9.3 (Configuration Files), 10.1 (Pipeline Trigger), 11.1 (A/B Testing)

## 1. Setup and Imports

In [7]:
import sys
import io
import json

import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier

# Add project src to path
sys.path.insert(0, '../src')
from model_evaluation import ModelEvaluator
from production_integration import ProductionIntegrator
from ab_testing import ABTestingManager
from experiment_tracking import ExperimentTracker

sns.set_theme(style='whitegrid')
%matplotlib inline

print('All modules imported successfully.')

All modules imported successfully.


## 2. Load Data from S3 and Train a Model

Load the processed fraud detection dataset from the `fraud-detection-data-<suffix>` bucket
and train an XGBoost model with the hyperparameters we want to promote.

In [8]:
import os
BUCKET_SUFFIX = os.environ.get('BUCKET_SUFFIX', 'quannh0308-20260222')
BUCKET_NAME = f'fraud-detection-data-{BUCKET_SUFFIX}'
DATA_PREFIX = 'prepared'

# Read partitioned Parquet directories directly from S3
train_df = pd.read_parquet(f's3://{BUCKET_NAME}/{DATA_PREFIX}/train.parquet/')
test_df = pd.read_parquet(f's3://{BUCKET_NAME}/{DATA_PREFIX}/test.parquet/')

TARGET = 'Class'
FEATURES = [c for c in train_df.columns if c != TARGET]

X_train = train_df[FEATURES]
y_train = train_df[TARGET]
X_test = test_df[FEATURES]
y_test = test_df[TARGET]

print(f'Training set:  {X_train.shape[0]:,} rows, {X_train.shape[1]} features')
print(f'Test set:      {X_test.shape[0]:,} rows, {X_test.shape[1]} features')

Training set:  199,824 rows, 30 features
Test set:      42,337 rows, 30 features


### Load Experiment Results

Load the best hyperparameters, algorithm, and feature set from previous notebooks.
These values were saved to `experiment_results.json` by notebooks 02, 03, and 04.

In [9]:
import json
from pathlib import Path

# Fallback hyperparameters, used when no tuning results are available
DEFAULT_HYPERPARAMETERS = {
    'objective': 'binary:logistic',
    'num_round': 150,
    'max_depth': 7,
    'eta': 0.15,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}

# Load experiment results from previous notebooks
results_path = Path('../experiment_results.json')
hyperparameters = dict(DEFAULT_HYPERPARAMETERS)

if results_path.exists():
    with open(results_path) as f:
        experiment_results = json.load(f)

    # Use best hyperparameters from notebook 02
    if 'hyperparameter_tuning' in experiment_results:
        tuning = experiment_results['hyperparameter_tuning']
        hyperparameters = tuning['best_params']
        print(f'Loaded hyperparameters from experiment_results.json (method: {tuning["method"]})')
    else:
        print('No hyperparameter tuning results found, using defaults.')

    # Show algorithm comparison results if available
    if 'algorithm_comparison' in experiment_results:
        best_algo = experiment_results['algorithm_comparison']
        print(f'Best algorithm from comparison: {best_algo["best_algorithm"]} (accuracy={best_algo["best_accuracy"]:.4f})')

    # Show feature engineering results if available
    if 'feature_engineering' in experiment_results:
        feat_eng = experiment_results['feature_engineering']
        print(f'Feature selection: {feat_eng["n_features"]} features via {feat_eng["selection_method"]}')
else:
    print('No experiment_results.json found. Using default hyperparameters.')

print('\nHyperparameters for promotion:')
for k, v in hyperparameters.items():
    print(f'  {k}: {v}')

Loaded hyperparameters from experiment_results.json (method: grid_search)
Best algorithm from comparison: RandomForest (accuracy=0.9994)
Feature selection: 20 features via univariate_f_classif

Hyperparameters for promotion:
  objective: binary:logistic
  num_round: 150
  max_depth: 7
  eta: 0.2
  subsample: 0.8
  colsample_bytree: 0.8


In [10]:
# Train XGBoost model with the loaded hyperparameters
model = XGBClassifier(
    objective=hyperparameters['objective'],
    n_estimators=hyperparameters['num_round'],
    max_depth=hyperparameters['max_depth'],
    learning_rate=hyperparameters['eta'],
    subsample=hyperparameters['subsample'],
    colsample_bytree=hyperparameters['colsample_bytree'],
    eval_metric='logloss',
)

model.fit(X_train, y_train)
print('XGBoost model trained successfully.')

XGBoost model trained successfully.
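The stored hyperparameters follow the naming of the SageMaker built-in XGBoost algorithm (`num_round`, `eta`), while `XGBClassifier` expects `n_estimators` and `learning_rate`. The training cell above translates them by hand. A small helper could keep that mapping in one place. This is a sketch: `NAME_MAP` and `to_sklearn_params` are hypothetical names, not part of the project modules.

```python
# Hypothetical mapping from stored names to XGBClassifier keyword arguments
NAME_MAP = {'num_round': 'n_estimators', 'eta': 'learning_rate'}


def to_sklearn_params(hyperparameters):
    """Rename keys listed in NAME_MAP; pass every other key through unchanged."""
    return {NAME_MAP.get(name, name): value for name, value in hyperparameters.items()}


# Values copied from the "Hyperparameters for promotion" output above
stored = {
    'objective': 'binary:logistic',
    'num_round': 150,
    'max_depth': 7,
    'eta': 0.2,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}
sklearn_params = to_sklearn_params(stored)
print(sklearn_params)
```

With this in place, the model could be constructed as `XGBClassifier(**to_sklearn_params(hyperparameters), eval_metric='logloss')`.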


In [11]:
# Generate predictions for evaluation
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

print(f'Predictions generated for {len(y_pred):,} test samples.')

Predictions generated for 42,337 test samples.


## 3. Model Evaluation

Use the `ModelEvaluator` to calculate standard classification metrics and generate
diagnostic visualizations.

**Requirement 6.1**: Calculate accuracy, precision, recall, F1 score, and AUC-ROC  
**Requirement 6.2**: Generate confusion matrices  
**Requirement 6.3**: Generate ROC curves and precision-recall curves

In [12]:
evaluator = ModelEvaluator()

# Calculate all metrics (Req 6.1)
metrics = evaluator.calculate_metrics(y_test, y_pred, y_pred_proba)

print('Model Metrics:')
for name, value in metrics.items():
    print(f'  {name:15s}: {value:.4f}')

Model Metrics:
  accuracy       : 0.9991
  precision      : 0.8108
  recall         : 0.6977
  f1_score       : 0.7500
  auc_roc        : 0.8892


In [13]:
# Confusion matrix (Req 6.2)
cm = evaluator.plot_confusion_matrix(y_test, y_pred, save_path='confusion_matrix.png')

print('Confusion Matrix:')
print(cm)

Confusion Matrix:
[[42237    14]
 [   26    60]]
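The four cells of the confusion matrix are enough to re-derive the threshold-based metrics reported earlier. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predictions, so the matrix reads `[[TN, FP], [FN, TP]]`). The derived values match the `Model Metrics` output, which supports that assumption.

```python
# Cell values copied from the confusion matrix printed above
tn, fp = 42237, 14
fn, tp = 26, 60

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.4f} precision={precision:.4f} '
      f'recall={recall:.4f} f1={f1:.4f}')
# accuracy=0.9991 precision=0.8108 recall=0.6977 f1=0.7500
```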


In [14]:
# ROC curve (Req 6.3)
fpr, tpr, auc_score = evaluator.plot_roc_curve(y_test, y_pred_proba, save_path='roc_curve.png')

print(f'AUC-ROC: {auc_score:.4f}')

AUC-ROC: 0.8892


In [15]:
# Precision-recall curve (Req 6.3)
precision_vals, recall_vals = evaluator.plot_precision_recall_curve(
    y_test, y_pred_proba, save_path='pr_curve.png'
)

print('Precision-Recall curve saved to pr_curve.png')

Precision-Recall curve saved to pr_curve.png


## 4. Baseline Comparison

Compare the current model against the production baseline to quantify how each metric changes.

**Requirement 6.4**: Compare experiment results against baseline metrics from production

In [16]:
# Production baseline metrics (from the current deployed model)
baseline_metrics = {
    'accuracy': 0.952,
    'precision': 0.89,
    'recall': 0.85,
    'f1_score': 0.87,
    'auc_roc': 0.94,
}

comparison = evaluator.compare_to_baseline(metrics, baseline_metrics)

print('Comparison to Production Baseline:')
print(f'{"Metric":15s} {"Current":>10s} {"Baseline":>10s} {"Diff":>10s} {"Change":>10s} {"Improved":>10s}')
print('-' * 70)
for metric_name, comp in comparison.items():
    print(
        f'{metric_name:15s} '
        f'{comp["current"]:10.4f} '
        f'{comp["baseline"]:10.4f} '
        f'{comp["difference"]:+10.4f} '
        f'{comp["percent_change"]:+9.2f}% '
        f'{"✓" if comp["improved"] else "✗":>10s}'
    )

Comparison to Production Baseline:
Metric             Current   Baseline       Diff     Change   Improved
----------------------------------------------------------------------
accuracy            0.9991     0.9520    +0.0471     +4.94%          ✓
precision           0.8108     0.8900    -0.0792     -8.90%          ✗
recall              0.6977     0.8500    -0.1523    -17.92%          ✗
f1_score            0.7500     0.8700    -0.1200    -13.79%          ✗
auc_roc             0.8892     0.9400    -0.0508     -5.41%          ✗
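The `Diff` and `Change` columns are consistent with an absolute difference and a change relative to the baseline value. A minimal sketch of that arithmetic follows, assuming higher is better for every metric. `compare` is a hypothetical stand-in, not the actual `ModelEvaluator.compare_to_baseline` implementation, and the unrounded current values are copied from the production config printed in Section 8.

```python
def compare(current, baseline):
    """Per-metric difference, percent change relative to baseline, and improvement flag."""
    result = {}
    for name, base in baseline.items():
        diff = current[name] - base
        result[name] = {
            'current': current[name],
            'baseline': base,
            'difference': diff,
            'percent_change': 100.0 * diff / base,
            'improved': diff > 0,
        }
    return result


current = {
    'accuracy': 0.999055199943312,
    'precision': 0.8108108108108109,
    'recall': 0.6976744186046512,
    'f1_score': 0.75,
    'auc_roc': 0.88917972493289,
}
baseline = {'accuracy': 0.952, 'precision': 0.89, 'recall': 0.85,
            'f1_score': 0.87, 'auc_roc': 0.94}

comparison = compare(current, baseline)
for name, row in comparison.items():
    print(f'{name:10s} {row["difference"]:+.4f} {row["percent_change"]:+.2f}%')
```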


## 5. Production Threshold Check

Verify that the model meets the minimum production quality threshold (accuracy >= 0.90).

**Requirement 6.5**: Mark models meeting accuracy >= 0.90 as production-quality

In [17]:
# Full model evaluation with threshold check
eval_results = evaluator.evaluate_model(
    model, X_test, y_test, baseline_metrics=baseline_metrics
)

meets_threshold = eval_results['meets_production_threshold']
accuracy = eval_results['metrics']['accuracy']

if meets_threshold:
    print(f'✓ Model MEETS production threshold (accuracy={accuracy:.4f} >= 0.90)')
    print('  Proceeding with production promotion.')
else:
    print(f'✗ Model DOES NOT meet production threshold (accuracy={accuracy:.4f} < 0.90)')
    print('  Consider further tuning before promoting.')

✓ Model MEETS production threshold (accuracy=0.9991 >= 0.90)
  Proceeding with production promotion.
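Fraud cases are rare in this test set (86 of 42,337 rows, per the confusion matrix in Section 3), so accuracy on its own is a weak gate. The sketch below shows that a classifier predicting "not fraud" for every row would also clear the 0.90 threshold, and outlines a stricter check that sets a floor for several metrics. `failed_checks` and every floor other than accuracy are illustrative assumptions, not project requirements.

```python
# Class counts taken from the confusion matrix in Section 3
negatives = 42237 + 14   # legitimate transactions
positives = 26 + 60      # fraud cases

# Accuracy of a trivial model that always predicts the negative class
majority_accuracy = negatives / (negatives + positives)
print(f'Always-negative accuracy: {majority_accuracy:.4f}')


def failed_checks(metrics, minimums):
    """Return the metric names that fall below their floor (empty list means pass)."""
    return [name for name, floor in minimums.items() if metrics[name] < floor]


metrics = {'accuracy': 0.9991, 'precision': 0.8108, 'recall': 0.6977}
# Only the accuracy floor comes from Requirement 6.5; the others are hypothetical
minimums = {'accuracy': 0.90, 'precision': 0.80, 'recall': 0.80}
print('Below minimum:', failed_checks(metrics, minimums))
```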


## 6. Hyperparameter Validation

Before writing to Parameter Store, validate that all hyperparameters have correct
names and values within acceptable ranges.

**Requirement 8.2**: Validate parameter names and value formats before writing

In [18]:
# Initialize production integrator with experiment tracker
tracker = ExperimentTracker(region_name='us-east-1')
integrator = ProductionIntegrator(experiment_tracker=tracker)

# Validate hyperparameters (Req 8.2)
try:
    integrator.validate_hyperparameters(hyperparameters)
    print('✓ Hyperparameters validated successfully.')
    print(f'  Parameters: {list(hyperparameters.keys())}')
except ValueError as e:
    print(f'✗ Validation failed: {e}')

✓ Hyperparameters validated successfully.
  Parameters: ['objective', 'num_round', 'max_depth', 'eta', 'subsample', 'colsample_bytree']


In [19]:
# Demonstrate validation catching invalid parameters
invalid_params = {
    'objective': 'binary:logistic',
    'num_round': 150,
    'max_depth': 25,  # Out of range (max 20)
    'eta': 0.15,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}

try:
    integrator.validate_hyperparameters(invalid_params)
    print('Validation passed (unexpected).')
except ValueError as e:
    print(f'✓ Validation correctly caught error: {e}')

✓ Validation correctly caught error: Hyperparameter 'max_depth' value 25 out of valid range [1, 20]
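A range check of this kind can be expressed as a lookup table of allowed bounds. The sketch below reproduces the error message format shown above. Only the `max_depth` range of [1, 20] comes from the notebook output. Every other bound, along with the names `VALID_RANGES` and `validate`, is an assumption for illustration and may differ from `ProductionIntegrator.validate_hyperparameters`.

```python
# Allowed numeric ranges; only max_depth is confirmed by the output above
VALID_RANGES = {
    'max_depth': (1, 20),
    'eta': (0.0, 1.0),               # assumed
    'subsample': (0.0, 1.0),         # assumed
    'colsample_bytree': (0.0, 1.0),  # assumed
}


def validate(hyperparameters):
    """Raise ValueError on the first out-of-range value; return True otherwise."""
    for name, value in hyperparameters.items():
        if name not in VALID_RANGES:
            continue  # this sketch registers no range for the parameter
        low, high = VALID_RANGES[name]
        if not low <= value <= high:
            raise ValueError(
                f"Hyperparameter '{name}' value {value} out of valid range [{low}, {high}]"
            )
    return True


try:
    validate({'objective': 'binary:logistic', 'max_depth': 25, 'eta': 0.15})
except ValueError as err:
    message = str(err)
    print(message)
```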


## 7. Parameter Store Update

Write the validated hyperparameters to AWS Systems Manager Parameter Store.
A backup of the current values is created automatically before overwriting.

**Requirement 8.1**: Write hyperparameters to Parameter Store paths matching production pipeline  
**Requirement 8.3**: Create a backup of previous values with timestamp

In [20]:
# Write hyperparameters to Parameter Store (Req 8.1, 8.3)
backup_key = integrator.write_hyperparameters_to_parameter_store(hyperparameters)

print(f'\nBackup saved to: {backup_key}')
print('\nParameter Store paths updated:')
for param_name in hyperparameters:
    print(f'  /fraud-detection/hyperparameters/{param_name} = {hyperparameters[param_name]}')


Backup saved to: parameter-store-backups/backup-20260222-234304.yaml

Parameter Store paths updated:
  /fraud-detection/hyperparameters/objective = binary:logistic
  /fraud-detection/hyperparameters/num_round = 150
  /fraud-detection/hyperparameters/max_depth = 7
  /fraud-detection/hyperparameters/eta = 0.2
  /fraud-detection/hyperparameters/subsample = 0.8
  /fraud-detection/hyperparameters/colsample_bytree = 0.8
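A write like this maps onto the Systems Manager `put_parameter` API. The sketch below assumes each hyperparameter is stored as its own `String` parameter under the `/fraud-detection/hyperparameters/` prefix shown above. The real `ProductionIntegrator` may type, batch, or back up values differently. The client is passed in so the function can be exercised without AWS credentials. In practice it would be `boto3.client('ssm')`, and `RecordingClient` is a test double invented for this example.

```python
PREFIX = '/fraud-detection/hyperparameters'


def write_parameters(ssm_client, hyperparameters, prefix=PREFIX):
    """Write each hyperparameter as a String parameter, overwriting existing values."""
    written = []
    for name, value in hyperparameters.items():
        path = f'{prefix}/{name}'
        ssm_client.put_parameter(
            Name=path,
            Value=str(value),   # Parameter Store values are strings
            Type='String',
            Overwrite=True,
        )
        written.append(path)
    return written


class RecordingClient:
    """Stand-in for boto3.client('ssm') that records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def put_parameter(self, **kwargs):
        self.calls.append(kwargs)
        return {'Version': 1}


client = RecordingClient()
paths = write_parameters(client, {'max_depth': 7, 'eta': 0.2})
print(paths)
```

Because values are stored as strings, the consumer has to convert them back to numbers when reading.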


## 8. Configuration File Generation

Generate a production configuration file in YAML format and write it to S3.
The config includes the algorithm, hyperparameters, performance metrics, test date,
and approver name.

**Requirement 9.1**: Generate production configuration files in YAML format  
**Requirement 9.2**: Include algorithm, hyperparameters, metrics, test date, approver  
**Requirement 9.3**: Write config to `s3://fraud-detection-config/production-model-config.yaml`

In [21]:
EXPERIMENT_ID = 'exp-xgboost-optimized-20240115'
APPROVER = 'data-science-team'

# Generate production config (Req 9.1, 9.2)
config = integrator.generate_production_config(
    experiment_id=EXPERIMENT_ID,
    hyperparameters=hyperparameters,
    metrics=metrics,
    approver=APPROVER,
)

print('Generated production config:')
print(json.dumps(config, indent=2, default=str))

Generated production config:
{
  "model": {
    "algorithm": "xgboost",
    "version": "exp-xgboost-optimized-20240115",
    "hyperparameters": {
      "objective": "binary:logistic",
      "num_round": 150,
      "max_depth": 7,
      "eta": 0.2,
      "subsample": 0.8,
      "colsample_bytree": 0.8
    },
    "performance": {
      "accuracy": 0.999055199943312,
      "precision": 0.8108108108108109,
      "recall": 0.6976744186046512,
      "f1_score": 0.75,
      "auc_roc": 0.88917972493289
    },
    "tested_date": "2026-02-22",
    "approved_by": "data-science-team"
  }
}


In [22]:
# Validate config schema before writing
integrator.validate_config_schema(config)
print('✓ Configuration schema validated.')

✓ Configuration schema validated.
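A schema check for this config can be as simple as confirming that the fields named in Requirement 9.2 are present. The following is a sketch only: `check_config` is hypothetical, and the required keys are read off the generated config printed above rather than taken from the real `validate_config_schema`.

```python
REQUIRED_MODEL_KEYS = {'algorithm', 'version', 'hyperparameters',
                       'performance', 'tested_date', 'approved_by'}


def check_config(config):
    """Raise ValueError when the config lacks the expected structure."""
    model = config.get('model')
    if not isinstance(model, dict):
        raise ValueError("Config must contain a 'model' mapping")
    missing = REQUIRED_MODEL_KEYS - model.keys()
    if missing:
        raise ValueError(f'Missing config keys: {sorted(missing)}')
    for section in ('hyperparameters', 'performance'):
        if not isinstance(model[section], dict) or not model[section]:
            raise ValueError(f"'{section}' must be a non-empty mapping")
    return True


# Abbreviated copy of the config generated above
example = {
    'model': {
        'algorithm': 'xgboost',
        'version': 'exp-xgboost-optimized-20240115',
        'hyperparameters': {'max_depth': 7, 'eta': 0.2},
        'performance': {'accuracy': 0.999055199943312},
        'tested_date': '2026-02-22',
        'approved_by': 'data-science-team',
    }
}
print(check_config(example))
```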


In [23]:
# Write config to S3 (Req 9.3)
integrator.write_config_to_s3(config)

print('\nConfig written to: s3://fraud-detection-config/production-model-config.yaml')
print('Previous config archived with timestamp.')


Config written to: s3://fraud-detection-config/production-model-config.yaml
Previous config archived with timestamp.


## 9. Pipeline Trigger

Trigger the production pipeline (Step Functions) to retrain the model with the
newly promoted hyperparameters.

**Requirement 10.1**: Trigger the production pipeline Step Functions execution

In [24]:
# Trigger production pipeline retraining (Req 10.1)
execution_arn = integrator.trigger_production_pipeline(EXPERIMENT_ID)

print(f'Pipeline execution ARN: {execution_arn}')

Pipeline execution ARN: arn:aws:states:us-east-1:297925986341:execution:FraudDetectionTraining-dev:experiment-exp-xgboost-optimized-20240115-20260222-234307


In [25]:
# Check pipeline status
status = integrator.check_pipeline_status(execution_arn)

print('Pipeline Status:')
for key, value in status.items():
    print(f'  {key}: {value}')

Pipeline Status:
  status: FAILED
  startDate: 2026-02-22 23:43:08.071000+01:00
  stopDate: 2026-02-22 23:43:08.142000+01:00
  output: None


## 10. Full Promotion Workflow (One-Liner)

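The `promote_to_production` method orchestrates the entire promotion workflow in a
single call: Parameter Store update, config file generation, S3 write, and optional
pipeline trigger. Running it after Sections 7–9 repeats those steps, so the output
below shows a second Parameter Store backup and a second pipeline execution.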

In [26]:
# Complete promotion workflow
result = integrator.promote_to_production(
    experiment_id=EXPERIMENT_ID,
    hyperparameters=hyperparameters,
    metrics=metrics,
    approver=APPROVER,
    trigger_pipeline=True,
)

print('\nPromotion Result:')
print(f'  Experiment ID:  {result["promotion_event"]["experiment_id"]}')
print(f'  Timestamp:      {result["promotion_event"]["timestamp"]}')
print(f'  Approver:       {result["promotion_event"]["approver"]}')
print(f'  Backup Key:     {result["promotion_event"]["backup_key"]}')
print(f'  Execution ARN:  {result["execution_arn"]}')


Promotion Result:
  Experiment ID:  exp-xgboost-optimized-20240115
  Timestamp:      2026-02-22T23:43:10.837332
  Approver:       data-science-team
  Backup Key:     parameter-store-backups/backup-20260222-234308.yaml
  Execution ARN:  arn:aws:states:us-east-1:297925986341:execution:FraudDetectionTraining-dev:experiment-exp-xgboost-optimized-20240115-20260222-234310


## 11. A/B Testing Workflow

Deploy a challenger model endpoint alongside the production champion and compare
their performance before fully switching over.

**Requirement 11.1**: Deploy a challenger model endpoint alongside the production champion

> **Note:** The A/B testing section requires a deployed SageMaker endpoint and model artifacts in S3. If you haven't deployed a model yet, skip this section and come back after the training pipeline has completed successfully.

In [27]:
ab_manager = ABTestingManager()

# Deploy challenger endpoint (Req 11.1)
MODEL_DATA_URL = 's3://fraud-detection-models/xgboost/model.tar.gz'

challenger_endpoint = ab_manager.deploy_challenger_endpoint(
    model_data_url=MODEL_DATA_URL,
    experiment_id=EXPERIMENT_ID,
    instance_type='ml.m5.xlarge',
)

print(f'Challenger endpoint deployed: {challenger_endpoint}')

TypeError: ABTestingManager.deploy_challenger_endpoint() got an unexpected keyword argument 'model_data_url'
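The `TypeError` means the keyword `model_data_url` does not match any parameter name of the method. Rather than guessing the right name, the accepted parameters can be listed with `inspect.signature`. The class below is a made-up stand-in used only to keep the example runnable, and its parameter names are not the real `ABTestingManager` signature. In the notebook, the equivalent call would be `inspect.signature(ab_manager.deploy_challenger_endpoint)`.

```python
import inspect


class StandInManager:
    """Hypothetical stand-in; the parameter names here are invented for the demo."""

    def deploy_challenger_endpoint(self, experiment_id, instance_type='ml.m5.xlarge'):
        return f'{experiment_id}-challenger'


def accepted_keywords(func):
    """Return the parameter names a callable accepts, in declaration order."""
    return list(inspect.signature(func).parameters)


manager = StandInManager()
names = accepted_keywords(manager.deploy_challenger_endpoint)
print(names)
print('model_data_url' in names)
```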

In [None]:
# Generate traffic split configuration for gradual rollout
CHAMPION_ENDPOINT = 'fraud-detection-production'

traffic_config = ab_manager.generate_traffic_split_config(
    champion_endpoint=CHAMPION_ENDPOINT,
    challenger_endpoint=challenger_endpoint,
)

print('Traffic Split Configuration:')
print(json.dumps(traffic_config, indent=2, default=str))

print('\nRollout Plan:')
for stage in traffic_config.get('rollout_plan', []):
    print(f'  Stage {stage["stage"]}: {stage["challenger_traffic"]}% challenger traffic '
          f'for {stage["duration_hours"]}h')
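The loop above expects each rollout stage to carry `stage`, `challenger_traffic`, and `duration_hours` keys. A staged plan with that shape could be built as follows. The percentages follow the 1% → 10% → 50% → 100% sequence mentioned in the summary, while `build_rollout_plan`, the `champion_traffic` key, and the 24-hour duration are assumptions rather than the output of `generate_traffic_split_config`.

```python
def build_rollout_plan(percentages=(1, 10, 50, 100), duration_hours=24):
    """Build one dict per stage; champion traffic is the remainder of 100%."""
    plan = []
    for index, pct in enumerate(percentages, start=1):
        if not 0 <= pct <= 100:
            raise ValueError(f'Traffic percentage {pct} is outside [0, 100]')
        plan.append({
            'stage': index,
            'challenger_traffic': pct,
            'champion_traffic': 100 - pct,   # assumed key, not in the notebook
            'duration_hours': duration_hours,
        })
    return plan


rollout_plan = build_rollout_plan()
for stage in rollout_plan:
    print(f'  Stage {stage["stage"]}: {stage["challenger_traffic"]}% challenger traffic '
          f'for {stage["duration_hours"]}h')
```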

In [None]:
# Compare champion and challenger endpoints with test data
test_records = X_test.head(100).to_dict(orient='records')

comparison_result = ab_manager.compare_endpoints(
    champion_endpoint=CHAMPION_ENDPOINT,
    challenger_endpoint=challenger_endpoint,
    test_data=test_records,
)

print('Endpoint Comparison:')
print(json.dumps(comparison_result, indent=2, default=str))

In [None]:
# Promote challenger to champion (when A/B test results are positive)
# Uncomment the call below to execute the promotion:
# ab_manager.promote_challenger_to_champion(
#     champion_endpoint=CHAMPION_ENDPOINT,
#     challenger_endpoint=challenger_endpoint,
# )

print('To promote the challenger to champion, uncomment the call above and re-run this cell.')
print('This will update the production endpoint and clean up the challenger.')

## 12. Summary and Next Steps

### What We Accomplished

1. **Trained** an XGBoost model with optimized hyperparameters
2. **Evaluated** the model using accuracy, precision, recall, F1, and AUC-ROC
3. **Visualized** results with confusion matrix, ROC curve, and precision-recall curve
4. **Compared** against the production baseline: accuracy improved (+4.94%), while precision, recall, F1, and AUC-ROC all fell below the baseline
5. **Validated** hyperparameters before promotion
6. **Updated** Parameter Store with new hyperparameters (with automatic backup)
7. **Generated** a production configuration file and wrote it to S3
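8. **Triggered** the production pipeline for retraining (the recorded execution ended with status `FAILED` and needs investigation)
9. **Outlined** the challenger endpoint deployment for A/B testing (the recorded `deploy_challenger_endpoint()` call raised a `TypeError`, so the remaining A/B cells were not run)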

### Next Steps

- **Monitor the A/B test** — track champion vs challenger metrics over the rollout
  stages (1% → 10% → 50% → 100%).
- **Promote the challenger** — once the challenger consistently outperforms the
  champion, call `promote_challenger_to_champion()` to complete the switch.
- **Iterate** — use the other notebooks to explore new features, algorithms, or
  hyperparameter configurations and repeat this promotion workflow.
- **Rollback if needed** — use `integrator.rollback_parameter_store(backup_key)` to
  restore previous Parameter Store values if issues arise.