# Day 6 Lab 2: Feature Engineering for Banking ML

**AWS GenAI Banking Workshop**  
**Duration:** 45 minutes  
**Objective:** Create and transform features for banking ML models

---

## What You'll Learn
- Banking-specific feature engineering
- Handling missing data and outliers
- Feature scaling and encoding
- Time-series features for transactions
- Risk indicators and derived metrics

---

## Prerequisites
- Complete Day 6 Lab 1 (SageMaker Studio Setup)
- Have `customer_data.csv` file available

## 1. Environment Setup

In [None]:
# Install packages
import sys
!{sys.executable} -m pip install -q sagemaker boto3 pandas numpy scikit-learn matplotlib seaborn

In [None]:
import boto3
import sagemaker
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully")

In [None]:
# Initialize SageMaker session
try:
    sess = sagemaker.Session()
    bucket = sess.default_bucket()
    role = sagemaker.get_execution_role()
    region = sess.boto_region_name
    print("✅ SageMaker session initialized")
    print(f"📍 Region: {region}")
    print(f"🪣 Bucket: {bucket}")
except Exception as e:
    print(f"ℹ️  Running in local mode: {str(e)[:50]}")
    bucket = 'my-sagemaker-bucket'
    region = 'us-east-1'
    role = None  # no execution role is available outside SageMaker

## 2. Load Banking Data

Load the customer data from Lab 1. The cell tries three sources in order: the local file, then S3, then generated sample data.

In [None]:
# Try to load data from multiple sources
df = None

# Option 1: Load from local file (created in Lab 1)
try:
    df = pd.read_csv('customer_data.csv')
    print(f"✅ Loaded {len(df)} customer records from local file")
except FileNotFoundError:
    print("ℹ️  Local file not found, trying S3...")
    
    # Option 2: Try to load from S3
    try:
        from datetime import datetime
        today = datetime.now().strftime('%Y-%m-%d')
        prefix = f"securebank-ml-project/{today}"
        s3_path = f"s3://{bucket}/{prefix}/data/raw/customer_data.csv"
        df = pd.read_csv(s3_path)
        print(f"✅ Loaded {len(df)} customer records from S3")
    except Exception as e:
        print(f"ℹ️  S3 load failed: {str(e)[:50]}")
        print("\n⚠️  Generating sample data instead...")
        
        # Option 3: Generate sample data
        np.random.seed(42)
        n_customers = 1000
        
        df = pd.DataFrame({
            'customer_id': [f'CUST{str(i).zfill(6)}' for i in range(1, n_customers + 1)],
            'age': np.random.randint(18, 80, n_customers),
            'account_balance': np.random.exponential(50000, n_customers),
            'credit_score': np.random.randint(300, 850, n_customers),
            'num_products': np.random.randint(1, 5, n_customers),
            'tenure_months': np.random.randint(1, 240, n_customers),
            'has_credit_card': np.random.choice([0, 1], n_customers),
            'is_active_member': np.random.choice([0, 1], n_customers, p=[0.3, 0.7]),
            'monthly_transactions': np.random.poisson(25, n_customers),
            'churn': np.random.choice([0, 1], n_customers, p=[0.8, 0.2])
        })
        print(f"✅ Generated {len(df)} sample customer records")

# Display sample
print("\n📊 Sample Data:")
df.head()

In [None]:
# Data overview
print("📈 Dataset Info:")
print(f"   Shape: {df.shape}")
print(f"   Columns: {list(df.columns)}")
print(f"   Churn rate: {df['churn'].mean():.1%}")
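
Because `df` can come from three different sources, it can help to confirm that the expected columns are present and the values are plausible before engineering features. The following is a minimal sketch, assuming the column names used by the sample-data generator above. The helper name `validate_customer_frame` and the specific checks are illustrative and not part of the lab.

```python
import pandas as pd

# Columns the rest of this lab relies on (taken from the sample-data generator).
REQUIRED_COLUMNS = [
    'customer_id', 'age', 'account_balance', 'credit_score', 'num_products',
    'tenure_months', 'has_credit_card', 'is_active_member',
    'monthly_transactions', 'churn',
]

def validate_customer_frame(frame: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the value checks below need every column
    if frame['customer_id'].duplicated().any():
        problems.append("duplicate customer_id values")
    if (frame['account_balance'] < 0).any():
        problems.append("negative account_balance")
    if (frame['num_products'] < 0).any():
        problems.append("negative num_products")
    if not frame['churn'].dropna().isin([0, 1]).all():
        problems.append("churn contains values other than 0/1")
    return problems

# Tiny illustrative frame (made-up values, not lab data)
demo = pd.DataFrame({
    'customer_id': ['CUST000001', 'CUST000002'],
    'age': [34, 51], 'account_balance': [1200.0, -5.0],
    'credit_score': [710, 580], 'num_products': [2, 1],
    'tenure_months': [24, 60], 'has_credit_card': [1, 0],
    'is_active_member': [1, 1], 'monthly_transactions': [30, 12],
    'churn': [0, 1],
})
demo_problems = validate_customer_frame(demo)
print(demo_problems)
```

In the lab itself, the same helper could be called as `validate_customer_frame(df)` right after loading.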

## 3. Create Banking Features

Let's create derived features that are meaningful for banking:

In [None]:
# Create derived features
print("🔧 Creating derived features...\n")

# Financial ratios (the +1 in each denominator avoids division by zero)
df['balance_per_product'] = df['account_balance'] / (df['num_products'] + 1)
df['transactions_per_month'] = df['monthly_transactions'] / (df['tenure_months'] + 1)
df['avg_transaction_value'] = df['account_balance'] / (df['monthly_transactions'] + 1)

# Simulated credit utilization (in a real scenario, this would come from a credit bureau)
df['credit_utilization'] = np.random.uniform(0, 1, len(df))

# Risk indicators
df['high_risk'] = ((df['credit_score'] < 600) | (df['account_balance'] < 1000)).astype(int)
df['vip_customer'] = ((df['account_balance'] > 100000) & (df['num_products'] >= 3)).astype(int)

# Age groups
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 50, 65, 100], 
                          labels=['18-25', '26-35', '36-50', '51-65', '65+'])

# Engagement score (composite metric)
df['engagement_score'] = (
    (df['is_active_member'] * 0.3) + 
    (df['num_products'] / 4 * 0.3) + 
    (df['monthly_transactions'] / 50 * 0.4)
).clip(0, 1)

print("✅ Created 8 new features")
print("\n📊 Sample of new features:")
df[['customer_id', 'balance_per_product', 'high_risk', 'vip_customer', 'engagement_score']].head(10)
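
To make the engagement score weights concrete, here is a small worked example. It wraps the same formula from the cell above in a function (the name `engagement_score` as a function and the three example customers are made up for illustration).

```python
import pandas as pd

def engagement_score(is_active_member, num_products, monthly_transactions):
    """Same weighting as the lab cell: 30% activity flag,
    30% product count (scaled by 4), 40% transactions (scaled by 50)."""
    raw = (is_active_member * 0.3
           + num_products / 4 * 0.3
           + monthly_transactions / 50 * 0.4)
    return raw.clip(0, 1)

# Three made-up customers
examples = pd.DataFrame({
    'is_active_member': [1, 0, 1],
    'num_products': [2, 1, 4],
    'monthly_transactions': [25, 10, 75],
})
examples['engagement_score'] = engagement_score(
    examples['is_active_member'],
    examples['num_products'],
    examples['monthly_transactions'],
)
print(examples)
```

The first customer scores 0.3 + 0.15 + 0.2 = 0.65. The third would reach 0.3 + 0.3 + 0.6 = 1.2 before clipping, so `.clip(0, 1)` caps it at 1.0. With four products and an active membership, the cap only binds when a customer makes more than 50 transactions a month.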

In [None]:
# Feature statistics
print("📊 New Feature Statistics:\n")
new_features = ['balance_per_product', 'transactions_per_month', 'avg_transaction_value', 
                'credit_utilization', 'engagement_score']
df[new_features].describe()
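
`age_group` is categorical, so it does not appear in the numeric statistics above and most scikit-learn estimators cannot use it directly. One common option is one-hot encoding with `pd.get_dummies`. The sketch below uses made-up ages with the same bins and labels as the lab cell; the `age` prefix for the new columns is an arbitrary choice.

```python
import pandas as pd

# Made-up ages for illustration; bins and labels mirror the lab cell.
ages = pd.DataFrame({'age': [22, 30, 47, 65, 71]})
ages['age_group'] = pd.cut(ages['age'], bins=[0, 25, 35, 50, 65, 100],
                           labels=['18-25', '26-35', '36-50', '51-65', '65+'])

# One 0/1 indicator column per age band; the original categorical column is dropped.
encoded = pd.get_dummies(ages, columns=['age_group'], prefix='age', dtype=int)
print(encoded)
```

Since the age bands are ordered, an ordinal encoding (for example via the category codes) is a reasonable alternative for tree-based models.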

## 4. Handle Missing Data

In [None]:
# Check for missing values
print("🔍 Checking for missing values...\n")
missing = df.isnull().sum()

if missing.sum() == 0:
    print("✅ No missing values found!")
else:
    print("Missing values:")
    print(missing[missing > 0])
    
    # Fill missing values with median for numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    print("\n✅ Missing values handled with median imputation")
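
The learning objectives also mention outliers, which the cell above does not address. Skewed columns such as `account_balance` often carry a long right tail. The following is a minimal sketch of one common approach, clipping to the interquartile-range (IQR) fences. The helper name `clip_outliers_iqr`, the multiplier `k=1.5`, and the synthetic balances are assumptions for illustration, not part of the lab.

```python
import numpy as np
import pandas as pd

def clip_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Synthetic balances with a long right tail (made-up data)
rng = np.random.default_rng(0)
balances = pd.Series(rng.exponential(50000, 1000), name='account_balance')
clipped = clip_outliers_iqr(balances)

print(f"max before: {balances.max():,.0f}  max after: {clipped.max():,.0f}")
print(f"values clipped: {(clipped != balances).sum()}")
```

Whether to clip, transform (for example with a log), or keep extreme balances is a business decision, since very large balances may be exactly the customers of interest.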

## 5. Feature Scaling

Standardize the continuous numeric features to mean 0 and standard deviation 1 (binary flags and `num_products` are left unscaled):

In [None]:
# Scale numeric features
print("⚖️  Scaling numeric features...\n")

scaler = StandardScaler()
numeric_features = ['age', 'account_balance', 'credit_score', 'tenure_months', 
                    'monthly_transactions', 'balance_per_product', 
                    'transactions_per_month', 'avg_transaction_value', 
                    'credit_utilization', 'engagement_score']

df_scaled = df.copy()
df_scaled[numeric_features] = scaler.fit_transform(df[numeric_features])

print("✅ Features scaled")
print("\n📊 Scaled feature statistics (should have mean≈0, std≈1):")
df_scaled[numeric_features].describe().loc[['mean', 'std']]
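
Fitting the scaler on every row, as the cell above does, lets test-set statistics influence the training features once the data is split for modeling. A minimal sketch of the split-then-fit alternative follows, using the `train_test_split` import from the setup cell. The synthetic frame, the 80/20 split, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab's DataFrame (made-up values).
rng = np.random.default_rng(42)
data = pd.DataFrame({
    'age': rng.integers(18, 80, 500).astype(float),
    'account_balance': rng.exponential(50000, 500),
    'churn': rng.choice([0, 1], 500, p=[0.8, 0.2]),
})
features = ['age', 'account_balance']

# Split first, stratifying on the target so both splits keep a similar churn rate.
train_df, test_df = train_test_split(
    data, test_size=0.2, stratify=data['churn'], random_state=42
)

# Fit on the training rows only, then reuse the fitted statistics on the test rows.
split_scaler = StandardScaler()
train_scaled = train_df.copy()
test_scaled = test_df.copy()
train_scaled[features] = split_scaler.fit_transform(train_df[features])
test_scaled[features] = split_scaler.transform(test_df[features])

print(train_scaled[features].mean().round(3))
print(test_scaled[features].mean().round(3))
```

The training means are zero by construction, while the test means are only close to zero. That small gap is expected and reflects what the model will see on unseen data.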

## 6. Feature Visualization

In [None]:
# Visualize feature distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Feature Engineering Results', fontsize=16, fontweight='bold')

# 1. Risk distribution
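# value_counts() sorts by frequency, so fix the 0/1 order to keep bars aligned with their labels
risk_counts = df['high_risk'].value_counts().reindex([0, 1], fill_value=0)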
axes[0,0].bar(['Low Risk', 'High Risk'], risk_counts.values, color=['green', 'red'])
axes[0,0].set_title('Risk Distribution')
axes[0,0].set_ylabel('Count')
axes[0,0].grid(True, alpha=0.3)

# 2. VIP customers
# Same fix: fixed 0/1 order, and a zero count if one class is absent
vip_counts = df['vip_customer'].value_counts().reindex([0, 1], fill_value=0)
axes[0,1].bar(['Regular', 'VIP'], vip_counts.values, color=['gray', 'gold'])
axes[0,1].set_title('VIP Customer Distribution')
axes[0,1].set_ylabel('Count')
axes[0,1].grid(True, alpha=0.3)

# 3. Engagement score distribution
axes[1,0].hist(df['engagement_score'], bins=30, color='blue', alpha=0.7, edgecolor='black')
axes[1,0].set_title('Engagement Score Distribution')
axes[1,0].set_xlabel('Engagement Score')
axes[1,0].set_ylabel('Count')
axes[1,0].grid(True, alpha=0.3)

# 4. Feature correlation with churn
feature_corr = df[numeric_features + ['churn']].corr()['churn'].drop('churn').sort_values()
feature_corr.plot(kind='barh', ax=axes[1,1], color='purple')
axes[1,1].set_title('Feature Correlation with Churn')
axes[1,1].set_xlabel('Correlation')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Visualizations complete")

## 7. Save Processed Features

In [None]:
# Save locally
local_file = 'customer_features.csv'
df_scaled.to_csv(local_file, index=False)
print(f"✅ Features saved locally as: {local_file}")

# Try to save to S3 (optional)
try:
    from datetime import datetime
    today = datetime.now().strftime('%Y-%m-%d')
    prefix = f"securebank-ml-project/{today}"
    output_path = f"s3://{bucket}/{prefix}/data/processed/features.csv"
    df_scaled.to_csv(output_path, index=False)
    print(f"✅ Features saved to S3: {output_path}")
except Exception as e:
    print(f"ℹ️  S3 upload skipped: {str(e)[:50]}")
    print("   Features are available locally")
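
The saved CSV holds scaled values, but the scaler's fitted means and standard deviations exist only in memory. Customers scored later need to be transformed with those same statistics. The following is a minimal sketch of persisting a fitted scaler, assuming `joblib` (installed as a scikit-learn dependency). The file name and the small training matrix are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training matrix standing in for df[numeric_features].
train_values = np.array([[25.0, 1000.0], [40.0, 52000.0], [63.0, 250000.0]])
fitted_scaler = StandardScaler().fit(train_values)

# Hypothetical file name; in the lab this could sit next to customer_features.csv.
scaler_path = os.path.join(tempfile.gettempdir(), 'feature_scaler.joblib')
joblib.dump(fitted_scaler, scaler_path)

# Later (for example in an inference script): reload and apply the same transform.
restored_scaler = joblib.load(scaler_path)
new_customer = np.array([[40.0, 52000.0]])
print(restored_scaler.transform(new_customer))
```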

In [None]:
# Feature summary
feature_cols = [col for col in df_scaled.columns if col not in ['customer_id', 'churn', 'age_group']]
original_features = ['age', 'account_balance', 'credit_score', 'num_products', 'tenure_months',
                     'has_credit_card', 'is_active_member', 'monthly_transactions']
derived_features = [col for col in feature_cols if col not in original_features]

print("\n📊 Feature Engineering Summary:")
print("="*60)
print(f"Total features: {len(feature_cols)}")
print(f"Original features: {len(original_features)}")
print(f"Derived numeric features: {len(derived_features)} (age_group is categorical and excluded)")
print(f"\nFeature list: {feature_cols}")
print("="*60)

## 8. Next Steps

### Continue with:
1. **Model Training**: Use these features to train ML models
2. **Feature Selection**: Identify most important features
3. **Model Evaluation**: Test model performance

### Key Takeaways:
- ✅ Created 8 derived features from raw data
- ✅ Handled missing values with median imputation
- ✅ Scaled features for ML algorithms
- ✅ Identified risk indicators and VIP customers
- ✅ Calculated engagement scores

### Banking ML Best Practices:
1. **Domain Knowledge**: Use banking expertise to create meaningful features
2. **Feature Engineering**: Often more important than algorithm choice
3. **Scaling**: Essential for distance-based algorithms
4. **Validation**: Always validate features with business stakeholders
5. **Documentation**: Document feature definitions for compliance

## Summary

✅ Loaded banking customer data  
✅ Created 8 derived features  
✅ Handled missing values  
✅ Scaled features for ML  
✅ Visualized feature distributions  
✅ Saved processed features  

**Next**: Model training with SageMaker!

---

**Questions or Issues?**
- Verify data from Lab 1 is available
- Check feature engineering logic
- Validate business rules with stakeholders
- Contact support@greatlearning.com for help