# End-to-End Customer Churn Prediction on AWS

This case study demonstrates a complete workflow for predicting customer churn using AWS services. We'll use a synthetic telecommunications customer dataset to predict which customers are likely to cancel their service.

## Prerequisites

- An AWS account with appropriate permissions
- AWS CLI configured with your credentials
- Python 3.7 or later
- Required Python packages: boto3, sagemaker, pandas, numpy, scikit-learn, matplotlib, seaborn, s3fs (s3fs lets pandas read `s3://` paths directly)

Install required packages:

In [None]:
%%bash
pip install boto3 sagemaker pandas numpy scikit-learn matplotlib seaborn s3fs

## Step 1: Data Preparation

First, we'll create a sample dataset and upload it to S3. In a real-world scenario, you would replace this with your own data.

In [None]:
import pandas as pd
import numpy as np
import boto3
import io

# Create sample data
np.random.seed(42)
n_customers = 1000

data = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'tenure': np.random.randint(1, 72, n_customers),
    'monthly_charges': np.random.uniform(20, 100, n_customers),
    'total_charges': np.random.uniform(100, 5000, n_customers),
    'contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_customers),
    'online_security': np.random.choice(['No', 'Yes'], n_customers),
    'tech_support': np.random.choice(['No', 'Yes'], n_customers),
    'streaming_tv': np.random.choice(['No', 'Yes'], n_customers),
    'streaming_movies': np.random.choice(['No', 'Yes'], n_customers),
    'churn': np.random.choice([0, 1], n_customers, p=[0.7, 0.3])  # 30% churn rate
})

# Overwrite the random placeholder so total_charges is consistent with tenure and monthly_charges
data['total_charges'] = data['tenure'] * data['monthly_charges']

# Upload to S3
bucket_name = 'your-bucket-name'  # Replace with your S3 bucket name
key = 'churn_data/telco_churn.csv'

s3 = boto3.client('s3')
csv_buffer = io.StringIO()
data.to_csv(csv_buffer, index=False)
s3.put_object(Bucket=bucket_name, Key=key, Body=csv_buffer.getvalue())

print(f"Data uploaded to s3://{bucket_name}/{key}")

# To use your own data, comment out the code above and use:
# s3.upload_file('path/to/your/data.csv', bucket_name, key)
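
Because `churn` in the sample data is drawn independently of every feature, a model trained on it should score close to chance, which is worth keeping in mind when reading the metrics in Step 5. The following is a minimal sketch of that check. It regenerates a comparable dataset with NumPy's `default_rng` (an assumption: the cell above uses the legacy `np.random.seed` API, so the exact values differ) and reports the churn rate and each numeric feature's correlation with the label.

```python
import numpy as np

# Hypothetical stand-in for the sample data above; exact values differ from the notebook's
rng = np.random.default_rng(42)
n_customers = 1000
tenure = rng.integers(1, 72, n_customers)
monthly_charges = rng.uniform(20, 100, n_customers)
total_charges = tenure * monthly_charges
churn = rng.choice([0, 1], n_customers, p=[0.7, 0.3])

churn_rate = churn.mean()

# Pearson correlation of each numeric feature with the label
features = {
    "tenure": tenure,
    "monthly_charges": monthly_charges,
    "total_charges": total_charges,
}
correlations = {
    name: float(np.corrcoef(values, churn)[0, 1])
    for name, values in features.items()
}

print(f"Churn rate: {churn_rate:.3f}")
for name, corr in correlations.items():
    print(f"{name}: {corr:+.3f}")
```

Correlations near zero confirm that the synthetic labels carry no learnable signal, so the pipeline is being exercised for its mechanics rather than its accuracy.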

## Step 2: Data Exploration and Preprocessing

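We'll run a SageMaker Processing job that encodes the categorical columns, scales the features, and splits the data into training, validation, and test sets.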

In [None]:
import sagemaker
from sagemaker.session import Session
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

sagemaker_session = Session()
role = sagemaker.get_execution_role()

# Create a ScriptProcessor
processor = ScriptProcessor(
    command=['python3'],
    image_uri=sagemaker.image_uris.retrieve(
        framework="sklearn",
        region=sagemaker_session.boto_region_name,
        version="0.23-1"),
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

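# Run the processing job (preprocess.py must already exist in the working directory)
processor.run(
    code='preprocess.py',
    inputs=[ProcessingInput(
        source=f's3://{bucket_name}/{key}',
        destination='/opt/ml/processing/input'
    )],
    outputs=[
        # Write each split to a fixed S3 prefix so the later steps can locate it
        ProcessingOutput(output_name='train', source='/opt/ml/processing/train',
                         destination=f's3://{bucket_name}/churn_data/train'),
        ProcessingOutput(output_name='validation', source='/opt/ml/processing/validation',
                         destination=f's3://{bucket_name}/churn_data/validation'),
        ProcessingOutput(output_name='test', source='/opt/ml/processing/test',
                         destination=f's3://{bucket_name}/churn_data/test')
    ],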
    arguments=['--input-data', '/opt/ml/processing/input/telco_churn.csv']
)

print("Data preprocessing completed.")

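Before running the processing job above, save the following script as `preprocess.py` in the notebook's working directory. The `%%writefile` cell magic writes the cell body to that file instead of executing it:

In [None]:
%%writefile preprocess.py
import argparse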
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_telco_data(df):
    # Convert categorical variables to numeric
    df['contract'] = pd.Categorical(df['contract']).codes
    df['online_security'] = pd.Categorical(df['online_security']).codes
    df['tech_support'] = pd.Categorical(df['tech_support']).codes
    df['streaming_tv'] = pd.Categorical(df['streaming_tv']).codes
    df['streaming_movies'] = pd.Categorical(df['streaming_movies']).codes

    # Separate features and target
    X = df.drop(['customer_id', 'churn'], axis=1)
    y = df['churn']

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    return pd.DataFrame(X_scaled, columns=X.columns), y

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-data', type=str, required=True)
    args = parser.parse_args()

    # Read input data
    input_data = pd.read_csv(args.input_data)
    print('Shape of input data:', input_data.shape)

    # Preprocess data
    X, y = preprocess_telco_data(input_data)
    print('Shape of preprocessed features:', X.shape)

    # Split data into train, validation, and test sets
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

    # Save preprocessed datasets
    pd.concat([X_train, y_train], axis=1).to_csv('/opt/ml/processing/train/train.csv', index=False)
    pd.concat([X_val, y_val], axis=1).to_csv('/opt/ml/processing/validation/validation.csv', index=False)
    pd.concat([X_test, y_test], axis=1).to_csv('/opt/ml/processing/test/test.csv', index=False)

    print('Preprocessing completed.')
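
The script above fits `StandardScaler` on the full dataset before splitting, so the validation and test rows influence the scaling statistics. A common alternative is to compute the statistics from the training split only. The sketch below illustrates that idea with plain NumPy on a hypothetical feature matrix; it is not the script's actual behavior. Tree-based models such as XGBoost are insensitive to monotone feature scaling, so the practical effect here is small.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # hypothetical feature matrix

# Shuffle the row indices once, then cut them 70% / 15% / 15%
indices = rng.permutation(len(X))
n_train = len(X) * 70 // 100
n_val = len(X) * 15 // 100
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

# Compute the scaling statistics from the training rows only
mean = X[train_idx].mean(axis=0)
std = X[train_idx].std(axis=0)

# Apply the same training-derived statistics to every split
X_train = (X[train_idx] - mean) / std
X_val = (X[val_idx] - mean) / std
X_test = (X[test_idx] - mean) / std

print(X_train.shape, X_val.shape, X_test.shape)
```

With scikit-learn, the equivalent is to call `fit` on the training split and `transform` on the other two.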

## Step 3: Model Training

We'll train an XGBoost model with the SageMaker XGBoost framework estimator in script mode, which runs our own `train.py` inside the managed XGBoost container.

In [None]:
from sagemaker.xgboost import XGBoost

# Set up the estimator
xgb = XGBoost(
    entry_point='train.py',
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    framework_version='1.0-1',
    output_path=f's3://{bucket_name}/model_output',
    hyperparameters={
        'max_depth': 5,
        'eta': 0.2,
        'gamma': 4,
        'min_child_weight': 6,
        'subsample': 0.8,
        'objective': 'binary:logistic',
        'num_round': 100
    }
)

# Train the model
xgb.fit({
    'train': f's3://{bucket_name}/churn_data/train',
    'validation': f's3://{bucket_name}/churn_data/validation'
})

print("Model training completed.")
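
In script mode, SageMaker hands each entry of the `hyperparameters` dict to the entry-point script as a `--name value` command-line argument, so the script's parser needs to accept every key. The sketch below approximates that argument list locally and contrasts a parser that declares only some keys with one that declares all of them. The `build_parser` helper is hypothetical and exists only for this illustration.

```python
import argparse

# Same keys as the estimator's `hyperparameters` dict above
hyperparameters = {
    "max_depth": 5,
    "eta": 0.2,
    "gamma": 4,
    "min_child_weight": 6,
    "subsample": 0.8,
    "objective": "binary:logistic",
    "num_round": 100,
}

# Approximation of the argument list the training container builds
argv = []
for name, value in hyperparameters.items():
    argv += [f"--{name}", str(value)]


def build_parser(names_and_types):
    """Hypothetical helper: declare one --name option per entry."""
    parser = argparse.ArgumentParser()
    for name, arg_type in names_and_types.items():
        parser.add_argument(f"--{name}", type=arg_type)
    return parser


# A parser that declares only three keys leaves the rest unrecognized
partial = build_parser({"num_round": int, "max_depth": int, "eta": float})
_, unknown = partial.parse_known_args(argv)
print("Undeclared arguments:", unknown)

# A parser that declares every key parses the full list cleanly
full = build_parser({
    "num_round": int,
    "max_depth": int,
    "eta": float,
    "gamma": float,
    "min_child_weight": float,
    "subsample": float,
    "objective": str,
})
args = full.parse_args(argv)
print(vars(args))
```

Calling `parse_args` on the partial parser would exit with an "unrecognized arguments" error, which is how an undeclared hyperparameter surfaces as a failed training job.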

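Before calling `fit` above, save the following script as `train.py` in the notebook's working directory. The `%%writefile` cell magic writes the cell body to that file instead of executing it:

In [None]:
%%writefile train.py
import argparse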
import os
import pandas as pd
import xgboost as xgb

if __name__ =='__main__':
    parser = argparse.ArgumentParser()
    
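    # Every key in the estimator's `hyperparameters` dict arrives as a
    # command-line argument, so each one needs a matching parser entry.
    parser.add_argument('--num_round', type=int, default=999)
    parser.add_argument('--max_depth', type=int, default=3)
    parser.add_argument('--eta', type=float, default=0.1)
    parser.add_argument('--gamma', type=float, default=0.0)
    parser.add_argument('--min_child_weight', type=float, default=1.0)
    parser.add_argument('--subsample', type=float, default=1.0)
    parser.add_argument('--objective', type=str, default='binary:logistic')
    
    # SageMaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))
    
    args = parser.parse_args()
    
    # Each channel directory holds the CSV written by the processing job
    train = pd.read_csv(f'{args.train}/train.csv')
    validation = pd.read_csv(f'{args.validation}/validation.csv')
    
    X_train = train.drop('churn', axis=1)
    y_train = train['churn']
    X_validation = validation.drop('churn', axis=1)
    y_validation = validation['churn']
    
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalidation = xgb.DMatrix(X_validation, label=y_validation)
    
    # Forward every parsed hyperparameter to XGBoost
    params = {
        'max_depth': args.max_depth,
        'eta': args.eta,
        'gamma': args.gamma,
        'min_child_weight': args.min_child_weight,
        'subsample': args.subsample,
        'objective': args.objective
    }
    
    model = xgb.train(params, dtrain, args.num_round, evals=[(dvalidation, 'validation')])
    
    # SageMaker packages everything in model_dir into model.tar.gz
    model.save_model(f'{args.model_dir}/xgboost-model')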

## Step 4: Model Deployment

Now, let's deploy our trained model to a SageMaker endpoint.

In [None]:
predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

print(f"Model deployed. Endpoint name: {predictor.endpoint_name}")
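
If the endpoint is called with `text/csv` content, each request body is a block of headerless CSV rows containing only the feature columns, and the response is a set of scores in CSV form. The sketch below shows that round trip with the standard library only. The helper names `rows_to_csv` and `parse_scores` and the sample response bodies are hypothetical, and the exact response layout depends on the container version.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize feature rows to headerless CSV text, one record per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue().rstrip("\n")


def parse_scores(body):
    """Parse a CSV response body into a flat list of floats."""
    reader = csv.reader(io.StringIO(body))
    return [float(cell) for row in reader for cell in row if cell.strip()]


# Two hypothetical, already scaled feature rows (no customer_id, no churn label)
payload = rows_to_csv([[0.5, -1.2, 0.0], [1.1, 0.3, -0.7]])
print(payload)

# Hypothetical response bodies: one score per line, or comma-separated on one line
scores_newline = parse_scores("0.12\n0.87\n")
scores_comma = parse_scores("0.12,0.87")
print(scores_newline, scores_comma)
```

Flattening the parsed response keeps the downstream code independent of which of the two layouts the endpoint returns.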

## Step 5: Inference and Evaluation

We'll use our deployed model to make predictions on the test set and evaluate its performance.

In [None]:
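import io
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

# Load test data (pandas needs the s3fs package to read s3:// paths)
test_data = pd.read_csv(f's3://{bucket_name}/churn_data/test/test.csv')
X_test = test_data.drop('churn', axis=1)
y_test = test_data['churn']

# The XGBoost predictor serializes requests as LibSVM by default; switch to CSV
# so a NumPy array can be sent as-is
predictor.serializer = CSVSerializer()
predictor.deserializer = CSVDeserializer()

# Make predictions; the CSV deserializer returns strings, so convert to a float array
predictions = np.array(predictor.predict(X_test.values), dtype=float).ravel()

# Convert predicted probabilities to binary predictions
y_pred = (predictions > 0.5).astype(int)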

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc_roc = roc_auc_score(y_test, predictions)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"AUC-ROC: {auc_roc:.4f}")

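# Visualize feature importance
# `initial_args` is forwarded to the endpoint invocation call and cannot request SHAP values,
# so load the trained booster locally instead. Assumes `pip install xgboost` in the notebook
# environment, with a version that can read the model format saved by the training container.
import tarfile
import xgboost
from sagemaker.s3 import S3Downloader

S3Downloader.download(xgb.model_data, 'model')
with tarfile.open('model/model.tar.gz') as tar:
    tar.extractall('model')
booster = xgboost.Booster()
booster.load_model('model/xgboost-model')

# pred_contribs returns one SHAP value per feature plus a trailing bias column
contribs = booster.predict(xgboost.DMatrix(X_test), pred_contribs=True)[:, :-1]
feature_names = X_test.columns

plt.figure(figsize=(10, 6))
sns.barplot(x=np.mean(np.abs(contribs), axis=0), y=feature_names)
plt.title('Feature Importance')
plt.xlabel('Mean |SHAP Value|')
plt.tight_layout()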

# Save plot to S3
img_data = io.BytesIO()
plt.savefig(img_data, format='png')
img_data.seek(0)
s3.put_object(Body=img_data, Bucket=bucket_name, Key='churn_prediction/feature_importance.png')

print(f"Feature importance plot saved to s3://{bucket_name}/churn_prediction/feature_importance.png")

# Clean up
predictor.delete_endpoint()
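
The 0.5 cutoff used above is only a default. Lowering it flags more customers as likely churners, which raises recall at the expense of precision. The sketch below computes precision, recall, and F1 directly from confusion-matrix counts at several thresholds. The labels and scores are hypothetical and are not output from the deployed model.

```python
import numpy as np

# Hypothetical true labels and predicted churn probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.35, 0.8, 0.6, 0.1, 0.7, 0.2])


def metrics_at(threshold):
    """Return (precision, recall, f1) when scores above `threshold` count as churn."""
    y_hat = (y_score > threshold).astype(int)
    tp = int(np.sum((y_hat == 1) & (y_true == 1)))  # churners correctly flagged
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))  # loyal customers flagged by mistake
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))  # churners missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


for threshold in (0.3, 0.5, 0.7):
    p, r, f = metrics_at(threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For churn, a missed churner usually costs more than an unnecessary retention offer, so a threshold below 0.5 is often worth evaluating against the validation set.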

## Conclusion

This end-to-end example demonstrates how to:

1. Prepare and upload data to S3
2. Preprocess data using SageMaker Processing
3. Train a model using the SageMaker XGBoost framework container in script mode
4. Deploy the trained model to a SageMaker endpoint
5. Make predictions and evaluate the model's performance
6. Visualize feature importance and save the plot to S3

Key points to remember:

- Replace the sample data with your own dataset for real-world applications
- Adjust hyperparameters and model architecture based on your specific use case
- Consider using SageMaker Experiments to track multiple training runs
- Implement proper error handling and logging for production environments
- Set up monitoring for the deployed model to track its performance over time
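
As one illustration of the error-handling point, an endpoint keeps incurring charges until it is deleted, so cleanup should run even when inference fails. The sketch below wraps a predictor in a context manager that logs the failure and always calls `delete_endpoint()`. `FakePredictor` and `managed_endpoint` are hypothetical names used only here; the fake class stands in for a real SageMaker predictor so the example runs without AWS access.

```python
import logging
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("churn")


class FakePredictor:
    """Hypothetical stand-in exposing predict() and delete_endpoint()."""

    def __init__(self):
        self.deleted = False

    def predict(self, rows):
        raise RuntimeError("simulated inference failure")

    def delete_endpoint(self):
        self.deleted = True


@contextmanager
def managed_endpoint(predictor):
    """Yield the predictor and delete its endpoint on exit, even after an error."""
    try:
        yield predictor
    except Exception:
        logger.exception("Inference failed")
        raise
    finally:
        predictor.delete_endpoint()
        logger.info("Endpoint deleted")


fake = FakePredictor()
failed = False
try:
    with managed_endpoint(fake) as active:
        active.predict([[0.1, 0.2]])
except RuntimeError:
    failed = True

print(f"failed={failed}, deleted={fake.deleted}")
```

The same wrapper can be applied to the `predictor` object from Step 4 in place of the final manual `delete_endpoint()` call.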

By following this workflow, you can build and deploy machine learning models on AWS for various business problems, leveraging the scalability and managed services provided by the AWS ecosystem.