# Day 5 Lab 1: SageMaker Setup & Data Preparation
## SecureBank Customer Churn Prediction

**Objective:** Set up a SageMaker environment and prepare customer churn data for ML training.

**What You'll Learn:**
- Configure a SageMaker session and S3 bucket
- Load and explore a sample customer churn dataset
- Preprocess the data: encode the target and one-hot encode categorical features
- Split data into training and validation sets
- Upload prepared data to S3

## Step 1: Import Libraries and Initialize SageMaker

In [None]:
import pandas as pd
import numpy as np
import boto3
import sagemaker
from sagemaker import get_execution_role
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

# Initialize SageMaker session
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'securebank/churn-prediction'
role = get_execution_role()

print("SageMaker session initialized")
print(f"S3 Bucket: {bucket}")
print(f"IAM Role: {role}")
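
For orientation, the S3 locations used later in this lab follow directly from `bucket` and `prefix`. The sketch below builds them with plain string formatting. The bucket name is a made-up placeholder (the default bucket normally has the form `sagemaker-<region>-<account-id>`), and `s3_uri` is a hypothetical helper, not part of the SageMaker SDK.

```python
def s3_uri(bucket, *parts):
    """Join a bucket and key parts into an s3:// URI (hypothetical helper)."""
    key = "/".join(p.strip("/") for p in parts if p)
    return f"s3://{bucket}/{key}"

# Placeholder values for illustration only; use sess.default_bucket() in the lab.
example_bucket = "sagemaker-us-east-1-123456789012"
prefix = "securebank/churn-prediction"

train_uri = s3_uri(example_bucket, prefix, "data/train", "train.csv")
val_uri = s3_uri(example_bucket, prefix, "data/validation", "validation.csv")
print(train_uri)
print(val_uri)
```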

## Step 2: Load Sample Banking Data

We'll use a sample telecom churn dataset as a proxy for banking customer churn.

In [None]:
# Download sample churn dataset
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-examples/main/use-cases/customer_churn/churn.txt

# Load data
df = pd.read_csv('churn.txt')

print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:")
df.head()
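
Later steps rely on the `Churn?` and `Phone` columns being present, so it can help to confirm the header before going further. A minimal sketch of such a check follows; `missing_columns` is a hypothetical helper, and the sample text is a made-up stand-in for the first lines of the real file.

```python
import csv
import io

REQUIRED_COLUMNS = {"Churn?", "Phone"}  # columns referenced in later steps

def missing_columns(text_stream, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(text_stream), [])
    return sorted(set(required) - {name.strip() for name in header})

# Made-up sample standing in for the first lines of churn.txt
sample = io.StringIO("State,Phone,Day Mins,Churn?\nKS,382-4657,265.1,False.\n")
print(missing_columns(sample))  # an empty list means nothing is missing

# In the notebook, the same check could be run on the real file:
# with open("churn.txt", newline="") as f:
#     print(missing_columns(f))
```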

## Step 3: Exploratory Data Analysis

In [None]:
# Check data info
print("Dataset Information:")
df.info()

print("\nStatistical Summary:")
print(df.describe())  # print explicitly: only a cell's last expression is displayed automatically

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Check target distribution
print("\nChurn Distribution:")
print(df['Churn?'].value_counts())
print(f"\nChurn Rate: {df['Churn?'].value_counts(normalize=True)['True.']:.2%}")
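
The churn rate matters because it sets the accuracy a trivial model reaches by always predicting the majority class. The sketch below computes that baseline for a made-up label list; the 15/85 split is illustrative and not taken from the dataset.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Illustrative labels only: 15 churners out of 100 customers
toy_labels = ["True."] * 15 + ["False."] * 85
label, accuracy = majority_baseline(toy_labels)
print(f"Always predicting {label!r} gives {accuracy:.0%} accuracy")
```

On the real data, the same function could be called as `majority_baseline(df['Churn?'])` to get the figure a trained model should beat.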

## Step 4: Data Preprocessing

In [None]:
# Convert target to binary (1 for churn, 0 for no churn)
df['Churn'] = (df['Churn?'] == 'True.').astype(int)

# Drop original churn column and phone number
df = df.drop(['Churn?', 'Phone'], axis=1)

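# One-hot encode categorical columns as 0/1 integers
# (recent pandas versions default to booleans, which would be written to CSV as True/False)
df = pd.get_dummies(df, dtype=int)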

# Move target column to first position (SageMaker's built-in XGBoost expects the label in the first CSV column)
cols = df.columns.tolist()
cols = ['Churn'] + [c for c in cols if c != 'Churn']
df = df[cols]

print(f"Preprocessed data shape: {df.shape}")
print(f"\nFeatures: {df.columns.tolist()[:10]}...")
df.head()
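
To make the encoding step concrete, the sketch below applies the same operations to a three-row toy frame. The rows are invented. `dtype=int` is passed so the indicator columns come out as 0/1 integers, since newer pandas versions otherwise produce booleans.

```python
import pandas as pd

# Invented rows that mimic a few columns of the churn data
toy = pd.DataFrame({
    "State": ["KS", "OH", "KS"],
    "Int'l Plan": ["no", "yes", "no"],
    "Day Mins": [265.1, 161.6, 243.4],
    "Churn": [0, 1, 0],
})

# One-hot encode the text columns as 0/1 integers
encoded = pd.get_dummies(toy, dtype=int)

# Put the label first, keeping the remaining columns in order
encoded = encoded[["Churn"] + [c for c in encoded.columns if c != "Churn"]]
print(encoded.columns.tolist())
```

Each text column becomes one indicator column per distinct value, which is why the preprocessed frame is much wider than the raw one.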

## Step 5: Train/Validation Split

In [None]:
# Split data: 80% train, 20% validation
train_data, val_data = train_test_split(df, test_size=0.2, random_state=42, stratify=df['Churn'])

print(f"Training set: {train_data.shape}")
print(f"Validation set: {val_data.shape}")
print(f"\nTraining set churn rate: {train_data['Churn'].mean():.2%}")
print(f"Validation set churn rate: {val_data['Churn'].mean():.2%}")
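
The `stratify` argument keeps the churn rate the same in both splits. The pure-Python sketch below illustrates the idea on made-up labels; it is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # per-class validation count
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Made-up labels: 20 churners and 80 non-churners
labels = [1] * 20 + [0] * 80
train_idx, val_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
val_rate = sum(labels[i] for i in val_idx) / len(val_idx)
print(f"train churn rate: {train_rate:.2%}, validation churn rate: {val_rate:.2%}")
```

Without stratification, a random 20% sample of an imbalanced dataset can end up with noticeably more or fewer churners than the training set, which makes validation metrics harder to interpret.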

## Step 6: Upload Data to S3

In [None]:
# Save to CSV (no header, no index - XGBoost format)
train_data.to_csv('train.csv', header=False, index=False)
val_data.to_csv('validation.csv', header=False, index=False)

# Upload to S3
train_s3_path = sess.upload_data('train.csv', bucket=bucket, key_prefix=f'{prefix}/data/train')
val_s3_path = sess.upload_data('validation.csv', bucket=bucket, key_prefix=f'{prefix}/data/validation')

print(f"✅ Training data uploaded to: {train_s3_path}")
print(f"✅ Validation data uploaded to: {val_s3_path}")
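
Before launching a training job it can be worth confirming that the files really are headerless, all-numeric CSVs with the label in the first column. A minimal sketch of such a check follows, assuming binary 0/1 labels; `check_training_csv` is a hypothetical helper, and the sample file is generated on the fly.

```python
import csv
import os
import tempfile

def check_training_csv(path, max_rows=100):
    """Return a list of problems found in the first rows of a headerless CSV."""
    problems = []
    with open(path, newline="") as f:
        for row_num, row in enumerate(csv.reader(f), start=1):
            if row_num > max_rows:
                break
            if not row:
                continue  # skip blank lines
            try:
                values = [float(v) for v in row]
            except ValueError:
                problems.append(f"row {row_num}: non-numeric value (header or unencoded column?)")
                continue
            if values[0] not in (0.0, 1.0):
                problems.append(f"row {row_num}: label {row[0]!r} is not 0 or 1")
    return problems

# Write a small made-up file and check it
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", newline="") as f:
        csv.writer(f).writerows([[0, 265.1, 1, 0], [1, 161.6, 0, 1]])
    print(check_training_csv(sample_path))  # an empty list means no problems were found
```

In the notebook, the same function could be pointed at `train.csv` and `validation.csv` before the upload.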

## Summary

**What We Accomplished:**
- ✅ Initialized SageMaker session and S3 bucket
- ✅ Loaded and explored a sample telecom churn dataset as a proxy for banking customer data
- ✅ Encoded the target as 0/1 and one-hot encoded the categorical features
- ✅ Split data into training (80%) and validation (20%) sets
- ✅ Uploaded prepared data to S3 in XGBoost format

**Next Steps:**
- Proceed to Lab 2 to train the XGBoost model
- Configure hyperparameters for optimal performance
- Monitor training job progress