# Day 5 Lab 2: Model Training with XGBoost
## SecureBank Customer Churn Prediction - Training

**Objective:** Train an XGBoost classification model for customer churn prediction

**What You'll Learn:**
- Configure the XGBoost algorithm in SageMaker
- Set hyperparameters for a banking use case
- Launch and monitor training jobs
- Evaluate model performance
- Save model artifacts

## Step 1: Initialize SageMaker and Load Data Paths

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator
import time

# Initialize
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'securebank/churn-prediction'
role = get_execution_role()
region = boto3.Session().region_name

# Data paths from Lab 1
train_path = f's3://{bucket}/{prefix}/data/train'
val_path = f's3://{bucket}/{prefix}/data/validation'
output_path = f's3://{bucket}/{prefix}/models'

print(f"Training data: {train_path}")
print(f"Validation data: {val_path}")
print(f"Model output: {output_path}")

## Step 2: Verify Training Data Exists

In [None]:
# Verify data exists in S3
import boto3
s3 = boto3.client('s3')

def check_s3_path(s3_path):
    """Check if S3 path has objects"""
    bucket_name = s3_path.split('/')[2]
    prefix_path = '/'.join(s3_path.split('/')[3:])
    
    try:
        response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix_path, MaxKeys=1)
        return 'Contents' in response and len(response['Contents']) > 0
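    except Exception as e:
        # Report the failure (e.g., missing permissions) instead of hiding it
        print(f"   Could not list {s3_path}: {e}")
        return False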

print("🔍 Verifying training data...\n")

train_exists = check_s3_path(train_path)
val_exists = check_s3_path(val_path)

if train_exists:
    print(f"✅ Training data found: {train_path}")
else:
    print(f"❌ Training data NOT found: {train_path}")
    print("   Please run Lab 1 first to prepare the data!")

if val_exists:
    print(f"✅ Validation data found: {val_path}")
else:
    print(f"❌ Validation data NOT found: {val_path}")
    print("   Please run Lab 1 first to prepare the data!")

if not (train_exists and val_exists):
    raise FileNotFoundError("Training or validation data not found. Please complete Lab 1 first!")

print("\n✅ All data verified and ready for training!")
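
The `check_s3_path` helper splits the S3 URI by hand with `split('/')`. As a minimal sketch of an alternative, the standard-library `urlparse` can do the same split and reject malformed input. The function name `split_s3_uri` and the bucket name `example-bucket` are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(s3_uri)
    # Reject anything that is not an s3:// URI with a bucket name
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The path starts with '/', which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name, for illustration only
bucket_name, key_prefix = split_s3_uri(
    "s3://example-bucket/securebank/churn-prediction/data/train"
)
print(bucket_name)
print(key_prefix)
```

The two returned values can be passed as `Bucket` and `Prefix` to `list_objects_v2`, exactly as the helper above does.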

## Step 3: Configure XGBoost Algorithm

In [None]:
# Get XGBoost container image
from sagemaker import image_uris
container = image_uris.retrieve('xgboost', region, '1.5-1')

print(f"XGBoost container: {container}")

## Step 3 (continued): Set Hyperparameters for Banking Churn Prediction

In [None]:
# Hyperparameters for binary churn classification (reasonable starting values)
hyperparameters = {
    'objective': 'binary:logistic',  # Binary classification, outputs probabilities
    'num_round': 100,                # Number of boosting rounds
    'max_depth': 5,                  # Maximum tree depth
    'eta': 0.2,                      # Learning rate
    'subsample': 0.8,                # Fraction of rows sampled per tree
    'colsample_bytree': 0.8,         # Fraction of columns sampled per tree
    'min_child_weight': 3,           # Minimum sum of instance weights in a child node
    'eval_metric': 'auc',            # AUC is less sensitive to class imbalance than accuracy
    'scale_pos_weight': 1            # Positive-class weight; 1 applies no reweighting
}

print("Hyperparameters:")
for key, value in hyperparameters.items():
    print(f"  {key}: {value}")
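
With `scale_pos_weight` set to 1, churners and non-churners are weighted equally. If the churn class turns out to be rare, a common starting heuristic is the ratio of negative to positive examples. The sketch below assumes labels are encoded as 0 (stayed) and 1 (churned); the helper name and the example labels are made up for illustration and are not the lab's data.

```python
def suggested_scale_pos_weight(labels):
    """Return negatives / positives, a common starting value for scale_pos_weight."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("No positive examples; cannot compute a ratio")
    return negatives / positives


# Made-up labels for illustration: 80 non-churners, 20 churners
example_labels = [0] * 80 + [1] * 20
print(suggested_scale_pos_weight(example_labels))  # 4.0
```

Treat the result as a starting point to validate against the AUC on the validation set, not as a guaranteed improvement.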

## Step 4: Create SageMaker Estimator

In [None]:
# Create estimator
xgb_estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',  # Lower-cost option: about $0.115/hr, roughly half the price of ml.m5.xlarge (varies by region)
    output_path=output_path,
    sagemaker_session=sess,
    hyperparameters=hyperparameters
)

print("✅ XGBoost Estimator created")
print(f"   Instance type: ml.m5.large (cost-optimized)")
print(f"   Instance count: 1")

## Step 5: Start Training Job

In [None]:
# Define input data channels
train_input = TrainingInput(train_path, content_type='text/csv')
val_input = TrainingInput(val_path, content_type='text/csv')

# Start training
print("🚀 Starting training job...")
print("   This will take approximately 5-10 minutes")
print("   You can monitor progress in the SageMaker console\n")

xgb_estimator.fit({
    'train': train_input,
    'validation': val_input
})

print("\n✅ Training completed!")
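
To put the instance choice in context, the arithmetic below gives a rough cost estimate from the figures this lab quotes: about $0.115 per hour for `ml.m5.large` and a 5-10 minute run. It is a sketch that assumes cost scales linearly with run time and ignores storage and any other charges; check current regional pricing before relying on it.

```python
def estimate_training_cost(hourly_rate_usd, duration_minutes, instance_count=1):
    """Rough compute cost: hourly rate x hours x number of instances."""
    return hourly_rate_usd * (duration_minutes / 60) * instance_count


# Figures quoted in this lab: $0.115/hr and a 5-10 minute training run
low = estimate_training_cost(0.115, 5)
high = estimate_training_cost(0.115, 10)
print(f"Estimated compute cost: ${low:.4f} to ${high:.4f}")
```

Under these assumptions a single run costs on the order of one to two cents.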

## Step 6: Retrieve Training Metrics

In [None]:
# Get training job name
training_job_name = xgb_estimator.latest_training_job.name
print(f"Training Job: {training_job_name}")

# Get model artifacts location
model_data = xgb_estimator.model_data
print(f"\nModel Artifacts: {model_data}")

# Display training metrics
print("\n📊 Training Metrics:")
print("   Check CloudWatch Logs for detailed metrics")
print(f"   Log Group: /aws/sagemaker/TrainingJobs")
print(f"   Log Stream: {training_job_name}/algo-1-*")
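
Besides CloudWatch Logs, the final metric values of a job are also returned by the SageMaker `DescribeTrainingJob` API in the `FinalMetricDataList` field. The sketch below pulls them into a dictionary. The helper name `extract_final_metrics` is hypothetical, and the sample response uses placeholder values, not real training results.

```python
def extract_final_metrics(job_description):
    """Map metric name to final value from a DescribeTrainingJob response."""
    metrics = job_description.get("FinalMetricDataList", [])
    return {m["MetricName"]: m["Value"] for m in metrics}


# Placeholder response fragment for illustration (values are NOT real results)
sample_description = {
    "TrainingJobName": "example-job",
    "FinalMetricDataList": [
        {"MetricName": "train:auc", "Value": 0.5},
        {"MetricName": "validation:auc", "Value": 0.5},
    ],
}

for name, value in extract_final_metrics(sample_description).items():
    print(f"  {name}: {value}")

# In the notebook (requires AWS credentials), the real response would come from:
#   sm_client = boto3.client("sagemaker")
#   description = sm_client.describe_training_job(TrainingJobName=training_job_name)
#   print(extract_final_metrics(description))
```

Metric names such as `validation:auc` depend on the `eval_metric` setting and on which data channels were supplied to the job.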

## Step 7: Save Model Information for Deployment

In [None]:
# Save model data path for Lab 3
with open('model_data_path.txt', 'w') as f:
    f.write(model_data)

print("✅ Model information saved")
print(f"\nModel ready for deployment in Lab 3!")
print(f"Model S3 Path: {model_data}")
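
When Lab 3 reads `model_data_path.txt` back, a quick sanity check catches an empty or truncated file early. The sketch below assumes the artifact path is an `s3://` URI ending in `.tar.gz`, which is how SageMaker training jobs typically package models. The helper name and the example path are hypothetical.

```python
import os
import tempfile


def read_model_data_path(path):
    """Read the saved S3 model path and check that it looks like a model artifact."""
    with open(path) as f:
        model_path = f.read().strip()
    # Assumption: artifacts are stored as an s3:// URI ending in .tar.gz
    if not (model_path.startswith("s3://") and model_path.endswith(".tar.gz")):
        raise ValueError(f"Unexpected model path: {model_path!r}")
    return model_path


# Demonstration with a temporary file and a hypothetical S3 path
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, "model_data_path.txt")
    with open(demo_file, "w") as f:
        f.write("s3://example-bucket/models/example-job/output/model.tar.gz\n")
    print(read_model_data_path(demo_file))
```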

## Summary

**What We Accomplished:**
- ✅ Configured the XGBoost algorithm with hyperparameters for banking churn prediction
- ✅ Created a SageMaker Estimator on an ml.m5.large instance (cost-optimized)
- ✅ Launched a training job with train and validation data
- ✅ Monitored training progress and completion
- ✅ Saved model artifacts to S3

**Model Performance:**
- Training completed successfully
- Model artifacts saved to S3
- Ready for deployment to real-time endpoint

**Next Steps:**
- Proceed to Lab 3 for model deployment
- Create real-time inference endpoint
- Test predictions with sample data