# Day 6 Lab 1: SageMaker Studio Setup & Exploration

**AWS GenAI Banking Workshop**  
**Duration:** 30 minutes  
**Objective:** Set up SageMaker Studio and explore its features for banking ML workflows

---

## What You'll Learn
- Set up SageMaker Studio environment
- Explore Studio UI and features
- Create and manage notebooks
- Generate sample banking data
- Outline a banking-specific ML project structure

---

## 1. Environment Setup

First, let's install the required packages and verify our environment:

In [None]:
# Install required packages (run this first!)
import sys
!{sys.executable} -m pip install -q sagemaker boto3 pandas numpy matplotlib seaborn scikit-learn

In [None]:
# Import libraries
import boto3
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")
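
A successful import does not tell you which versions were installed. The sketch below is one way to list them using only the standard library; the `installed_versions` helper and the `PACKAGES` list are illustrative names, not part of the workshop material.

```python
from importlib import metadata

# Distribution names from the pip install cell above
PACKAGES = ["sagemaker", "boto3", "pandas", "numpy",
            "matplotlib", "seaborn", "scikit-learn"]

def installed_versions(packages):
    """Return {package: version string, or None when the package is missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Recording these versions alongside your results makes it easier to reproduce a run later.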

## 2. Initialize SageMaker Session

Let's connect to SageMaker and get our environment details:

In [None]:
# Initialize SageMaker session
try:
    sess = sagemaker.Session()
    role = get_execution_role()
    region = sess.boto_region_name
    bucket = sess.default_bucket()
    
    print("✅ SageMaker Session initialized")
    print(f"📍 Region: {region}")
    print(f"🪣 S3 Bucket: {bucket}")
    print(f"🔐 IAM Role: {role[:50]}...")
except Exception as e:
    print(f"⚠️  Error initializing SageMaker: {str(e)}")
    print("   This is normal if running outside SageMaker Studio")
    # Set defaults for local testing
    region = 'us-east-1'
    bucket = 'my-sagemaker-bucket'
    role = 'arn:aws:iam::123456789012:role/SageMakerRole'
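
The fallback values above are placeholders, so it can help to check what a role ARN actually contains before passing it to a training or processing job. The following is a minimal sketch; `parse_role_arn` is a hypothetical helper written for this lab, not a SageMaker SDK function.

```python
def parse_role_arn(arn):
    """Split an IAM role ARN into partition, account ID, and role name."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "iam":
        raise ValueError(f"Not an IAM ARN: {arn!r}")
    resource_type, _, resource = parts[5].partition("/")
    if resource_type != "role" or not resource:
        raise ValueError(f"Not an IAM role ARN: {arn!r}")
    return {
        "partition": parts[1],
        "account_id": parts[4],
        # Roles may include a path (e.g. role/service-role/Name); keep the last segment
        "role_name": resource.split("/")[-1],
    }

info = parse_role_arn("arn:aws:iam::123456789012:role/SageMakerRole")
print(info)
if info["account_id"] == "123456789012":
    print("This looks like the placeholder role from the local-testing fallback.")
```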

## 3. Verify Studio Environment

Check if we're running in SageMaker Studio and what resources are available:

In [None]:
# Check Studio domain information
sm_client = boto3.client('sagemaker', region_name=region)

try:
    # List Studio domains
    domains = sm_client.list_domains()
    
    if domains['Domains']:
        domain_id = domains['Domains'][0]['DomainId']
        domain_info = sm_client.describe_domain(DomainId=domain_id)
        
        print("✅ SageMaker Studio Domain Found")
        print(f"   Domain ID: {domain_id}")
        print(f"   Domain Name: {domain_info['DomainName']}")
        print(f"   Status: {domain_info['Status']}")
        print(f"   Auth Mode: {domain_info['AuthMode']}")
    else:
        print("ℹ️  No Studio domain found. You may be using a notebook instance.")
except Exception as e:
    print("ℹ️  Running in notebook instance or local mode")
    print(f"   Details: {str(e)[:100]}")
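
Note that the cell above reports the first domain in the account and region, which is not necessarily the domain this notebook is running in. As a complementary check, Studio notebooks are generally expected to expose a local metadata file; the path used below is an assumption about the Studio runtime, so treat a missing file as "unknown" rather than proof that you are outside Studio. `read_studio_metadata` is an illustrative helper name.

```python
import json
from pathlib import Path

def read_studio_metadata(path="/opt/ml/metadata/resource-metadata.json"):
    """Return the parsed metadata dict, or None if the file is absent or unreadable.

    The default path is an assumption about where Studio places its metadata.
    """
    metadata_file = Path(path)
    if not metadata_file.is_file():
        return None
    try:
        return json.loads(metadata_file.read_text())
    except (OSError, json.JSONDecodeError):
        return None

studio_metadata = read_studio_metadata()
if studio_metadata is None:
    print("No Studio metadata file found (notebook instance or local environment?)")
else:
    # Print only the key names, since values may include account-specific identifiers
    print("Studio metadata keys:", sorted(studio_metadata))
```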

## 4. Generate Sample Banking Dataset

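Let's create a synthetic banking customer dataset for our ML experiments. Each column is sampled independently, so the values fall in plausible ranges but carry no real relationship between the features and churn: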

In [None]:
# Generate sample banking customer data
np.random.seed(42)
n_customers = 1000

print("🏦 Generating sample banking customer data...\n")

banking_data = pd.DataFrame({
    'customer_id': [f'CUST{str(i).zfill(6)}' for i in range(1, n_customers + 1)],
    'age': np.random.randint(18, 80, n_customers),
    'account_balance': np.random.exponential(50000, n_customers),
    'credit_score': np.random.randint(300, 851, n_customers),  # upper bound is exclusive, so 851 allows a score of 850
    'num_products': np.random.randint(1, 5, n_customers),
    'tenure_months': np.random.randint(1, 240, n_customers),
    'has_credit_card': np.random.choice([0, 1], n_customers),
    'is_active_member': np.random.choice([0, 1], n_customers, p=[0.3, 0.7]),
    'monthly_transactions': np.random.poisson(25, n_customers),
    'churn': np.random.choice([0, 1], n_customers, p=[0.8, 0.2])
})

# Display sample
print("📊 Sample Banking Customer Data:")
print(banking_data.head(10))
print(f"\n📈 Dataset shape: {banking_data.shape}")
print(f"📉 Churn rate: {banking_data['churn'].mean():.1%}")
print(f"💰 Average balance: ${banking_data['account_balance'].mean():,.2f}")
print(f"📊 Average credit score: {banking_data['credit_score'].mean():.0f}")
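
Because `churn` in the cell above is drawn independently of every other column, a classifier trained on this data should perform close to chance. If you want a dataset with a learnable signal, one option is to derive the churn probability from a logistic function of a few features. The sketch below is an illustration only: the coefficients are hypothetical values picked for the example, not estimates from real banking data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

demo = pd.DataFrame({
    "is_active_member": rng.choice([0, 1], size=n, p=[0.3, 0.7]),
    "num_products": rng.integers(1, 5, size=n),      # 1 to 4 products
    "tenure_months": rng.integers(1, 240, size=n),   # 1 to 239 months
})

# Hypothetical coefficients: active members, customers with more products,
# and longer-tenured customers are made less likely to churn.
logit = (
    -0.6
    - 1.0 * demo["is_active_member"]
    - 0.3 * (demo["num_products"] - 1)
    - 0.004 * demo["tenure_months"]
)
churn_prob = 1.0 / (1.0 + np.exp(-logit))

# Draw each customer's churn label from their individual probability
demo["churn"] = (rng.random(n) < churn_prob).astype(int)

print(f"Churn rate: {demo['churn'].mean():.1%}")
print(demo.groupby("is_active_member")["churn"].mean())
```

With labels generated this way, the group means and any later model have a real pattern to recover, which makes downstream labs easier to sanity-check.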

## 5. Data Exploration

Let's explore the dataset with some basic statistics and visualizations:

In [None]:
# Statistical summary
print("📊 Statistical Summary:\n")
banking_data.describe()

In [None]:
# Check for missing values
print("🔍 Missing Values Check:\n")
missing = banking_data.isnull().sum()
if missing.sum() == 0:
    print("✅ No missing values found!")
else:
    print(missing[missing > 0])
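
Missing values are only one kind of data-quality problem. A range check catches values that are present but implausible. The sketch below assumes the inclusive ranges implied by the generator settings in Section 4; `validate_ranges` and `EXPECTED_RANGES` are illustrative names, and the small `sample` frame exists only to show a failing case.

```python
import pandas as pd

# Inclusive ranges implied by the Section 4 generator (assumed, adjust for real data)
EXPECTED_RANGES = {
    "age": (18, 79),
    "credit_score": (300, 850),
    "num_products": (1, 4),
    "tenure_months": (1, 239),
}

def validate_ranges(df, ranges):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for column, (low, high) in ranges.items():
        if column not in df.columns:
            problems.append(f"{column}: column missing")
            continue
        out_of_range = df[(df[column] < low) | (df[column] > high)]
        if not out_of_range.empty:
            problems.append(
                f"{column}: {len(out_of_range)} value(s) outside [{low}, {high}]"
            )
    return problems

# Deliberately flawed sample: one under-age customer, one impossible score,
# and no tenure_months column
sample = pd.DataFrame({
    "age": [25, 17, 64],
    "credit_score": [700, 851, 300],
    "num_products": [1, 4, 2],
})

for problem in validate_ranges(sample, EXPECTED_RANGES):
    print("Problem:", problem)
```

In the notebook you would call `validate_ranges(banking_data, EXPECTED_RANGES)` instead of using the sample frame.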

In [None]:
# Visualize churn distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Banking Customer Analysis', fontsize=16, fontweight='bold')

# 1. Age distribution by churn
for churn_val in [0, 1]:
    data = banking_data[banking_data['churn'] == churn_val]['age']
    axes[0,0].hist(data, alpha=0.6, bins=20, label=f'Churn={churn_val}')
axes[0,0].set_title('Age Distribution by Churn')
axes[0,0].set_xlabel('Age')
axes[0,0].set_ylabel('Count')
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

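# 2. Account balance by churn
banking_data.boxplot(column='account_balance', by='churn', ax=axes[0,1])
# boxplot(by=...) replaces the figure title with "Boxplot grouped by churn", so restore it
fig.suptitle('Banking Customer Analysis', fontsize=16, fontweight='bold')
axes[0,1].set_title('Account Balance by Churn')
axes[0,1].set_xlabel('Churn')
axes[0,1].set_ylabel('Account Balance ($)')
plt.sca(axes[0,1])
plt.xticks([1, 2], ['No Churn', 'Churn'])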

# 3. Churn by number of products
churn_by_products = banking_data.groupby(['num_products', 'churn']).size().unstack(fill_value=0)
churn_by_products.plot(kind='bar', ax=axes[1,0], color=['green', 'red'])
axes[1,0].set_title('Churn by Number of Products')
axes[1,0].set_xlabel('Number of Products')
axes[1,0].set_ylabel('Count')
axes[1,0].legend(['No Churn', 'Churn'])
axes[1,0].grid(True, alpha=0.3)
plt.sca(axes[1,0])
plt.xticks(rotation=0)

# 4. Correlation heatmap
corr = banking_data.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', ax=axes[1,1], 
            cbar_kws={'label': 'Correlation'})
axes[1,1].set_title('Feature Correlation Heatmap')

plt.tight_layout()
plt.show()

print("✅ Visualizations complete")
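
Plots are easier to interpret next to the numbers behind them. As a rough sketch, the snippet below bins customers into age bands and reports the churn rate per band. It builds its own small random frame so it runs on its own; the band edges and labels are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=500),
    "churn": rng.choice([0, 1], size=500, p=[0.8, 0.2]),
})

# Left-closed bands: [18, 30), [30, 45), [45, 60), [60, 80)
bins = [18, 30, 45, 60, 80]
labels = ["18-29", "30-44", "45-59", "60-79"]
df["age_band"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

summary = df.groupby("age_band", observed=True)["churn"].agg(
    customers="count",
    churn_rate="mean",
)
print(summary)
```

To apply this to the lab data, replace `df` with `banking_data`. Since churn there is sampled independently of age, the rates per band should differ only by random noise.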

## 6. Save Dataset

Save the dataset locally and optionally to S3:

In [None]:
# Save locally
local_file = 'customer_data.csv'
banking_data.to_csv(local_file, index=False)
print(f"✅ Data saved locally as: {local_file}")

# Try to save to S3 (optional).
# pandas writes to s3:// paths through the optional s3fs package; if it is
# missing, or the role cannot write to the bucket, the upload is skipped.
try:
    project_name = "securebank-ml-project"
    prefix = f"{project_name}/{datetime.now().strftime('%Y-%m-%d')}"
    s3_path = f"s3://{bucket}/{prefix}/data/raw/customer_data.csv"

    banking_data.to_csv(s3_path, index=False)
    print(f"✅ Data uploaded to S3: {s3_path}")
except Exception as e:
    print(f"ℹ️  S3 upload skipped: {str(e)[:100]}")
    print("   Data is available locally")
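
The S3 path above follows a `project/date/data/raw/` layout. If later labs write more artifacts, building paths through one helper keeps that layout consistent. The following is a sketch under assumptions: `build_s3_uri` is a hypothetical helper, and the `processed` stage is an invented example of a second folder, not something this lab creates.

```python
from datetime import date

ALLOWED_STAGES = {"raw", "processed"}  # "processed" is a hypothetical second stage

def build_s3_uri(bucket, project, stage, filename, run_date=None):
    """Build s3://bucket/project/YYYY-MM-DD/data/stage/filename."""
    if stage not in ALLOWED_STAGES:
        raise ValueError(f"Unknown stage {stage!r}; expected one of {sorted(ALLOWED_STAGES)}")
    run_date = run_date or date.today()
    key = f"{project}/{run_date:%Y-%m-%d}/data/{stage}/{filename}"
    return f"s3://{bucket}/{key}"

uri = build_s3_uri(
    bucket="my-sagemaker-bucket",
    project="securebank-ml-project",
    stage="raw",
    filename="customer_data.csv",
    run_date=date(2024, 1, 15),
)
print(uri)
```

Passing `run_date` explicitly, rather than always using today's date, lets a rerun write to the same prefix as the original run.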

## 7. Banking ML Use Cases

Common ML use cases in banking that we can explore with this data:

In [None]:
use_cases = {
    "Customer Churn Prediction": {
        "type": "Classification",
        "algorithm": "XGBoost",
        "business_value": "Reduce customer attrition by 15-20%",
        "data_required": "Transaction history, demographics, product usage"
    },
    "Credit Risk Assessment": {
        "type": "Classification",
        "algorithm": "Random Forest",
        "business_value": "Reduce default rate by 10-15%",
        "data_required": "Credit history, income, employment, debt-to-income"
    },
    "Fraud Detection": {
        "type": "Anomaly Detection",
        "algorithm": "Isolation Forest / AutoEncoder",
        "business_value": "Prevent $1M+ in fraud losses annually",
        "data_required": "Transaction patterns, device info, location data"
    },
    "Loan Default Prediction": {
        "type": "Classification",
        "algorithm": "Gradient Boosting",
        "business_value": "Improve loan portfolio quality by 20%",
        "data_required": "Loan history, payment behavior, financial ratios"
    },
    "Customer Lifetime Value": {
        "type": "Regression",
        "algorithm": "Linear Regression / XGBoost",
        "business_value": "Optimize marketing spend by 25%",
        "data_required": "Transaction history, product adoption, engagement"
    }
}

print("🏦 Banking ML Use Cases:\n")
print("="*80)
for use_case, details in use_cases.items():
    print(f"\n📊 {use_case}")
    print(f"   Type: {details['type']}")
    print(f"   Algorithm: {details['algorithm']}")
    print(f"   Business Value: {details['business_value']}")
    print(f"   Data Required: {details['data_required']}")
print("\n" + "="*80)
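
A nested dictionary like `use_cases` can also be viewed as a table, which makes it easier to sort or group the entries. The sketch below copies three of the entries above into a standalone dictionary so that it runs on its own.

```python
import pandas as pd

# Three entries copied from the use_cases dictionary above
use_cases = {
    "Customer Churn Prediction": {
        "type": "Classification",
        "algorithm": "XGBoost",
    },
    "Fraud Detection": {
        "type": "Anomaly Detection",
        "algorithm": "Isolation Forest / AutoEncoder",
    },
    "Loan Default Prediction": {
        "type": "Classification",
        "algorithm": "Gradient Boosting",
    },
}

# orient="index" turns each top-level key into a row
table = pd.DataFrame.from_dict(use_cases, orient="index")
table.index.name = "use_case"
print(table)

# Count how many use cases fall under each problem type
print(table.groupby("type").size())
```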

## 8. SageMaker Studio Features

### Available Features in SageMaker Studio:

1. **Notebooks**: Interactive development (you're using one now!)
2. **Experiments**: Track and compare ML experiments
3. **Pipelines**: Automate ML workflows
4. **Model Registry**: Version and manage models
5. **Feature Store**: Centralize feature management
6. **Data Wrangler**: Visual data preparation
7. **Autopilot**: AutoML capabilities
8. **Debugger**: Debug training jobs
9. **Model Monitor**: Monitor deployed models

### Explore these features:
- Click the **SageMaker resources** icon (left sidebar)
- Browse **Experiments and trials**
- Check **Model registry**
- View **Pipelines**

In [None]:
# List available SageMaker resources
print("🔍 Checking available SageMaker resources...\n")

# Check for experiments
try:
    experiments = sm_client.list_experiments(MaxResults=5)
    print(f"✅ Experiments: {len(experiments.get('ExperimentSummaries', []))} found")
except Exception as e:
    print(f"ℹ️  Experiments: {str(e)[:50]}")

# Check for pipelines
try:
    pipelines = sm_client.list_pipelines(MaxResults=5)
    print(f"✅ Pipelines: {len(pipelines.get('PipelineSummaries', []))} found")
except Exception as e:
    print(f"ℹ️  Pipelines: {str(e)[:50]}")

# Check for models
try:
    models = sm_client.list_models(MaxResults=5)
    print(f"✅ Models: {len(models.get('Models', []))} found")
except Exception as e:
    print(f"ℹ️  Models: {str(e)[:50]}")

# Check for endpoints
try:
    endpoints = sm_client.list_endpoints(MaxResults=5)
    print(f"✅ Endpoints: {len(endpoints.get('Endpoints', []))} found")
except Exception as e:
    print(f"ℹ️  Endpoints: {str(e)[:50]}")
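
The four checks above share the same shape, so they can be driven from a small table of method names and response keys. Keep in mind that with `MaxResults=5` each count is capped at five, so "5 found" means "at least 5". The sketch below uses a `FakeSageMakerClient` stand-in, invented for this example, so it runs without AWS credentials; in the notebook you would pass `sm_client` instead.

```python
# (label, client method name, key holding the list in the response)
RESOURCE_CHECKS = [
    ("Experiments", "list_experiments", "ExperimentSummaries"),
    ("Pipelines", "list_pipelines", "PipelineSummaries"),
    ("Models", "list_models", "Models"),
    ("Endpoints", "list_endpoints", "Endpoints"),
]

def count_resources(client, checks, max_results=5):
    """Return {label: count, or an 'unavailable (...)' string on error}."""
    results = {}
    for label, method_name, response_key in checks:
        try:
            response = getattr(client, method_name)(MaxResults=max_results)
            results[label] = len(response.get(response_key, []))
        except Exception as exc:  # broad on purpose, mirroring the cell above
            results[label] = f"unavailable ({str(exc)[:50]})"
    return results

class FakeSageMakerClient:
    """Hypothetical stand-in that mimics the shape of the list_* responses."""
    def list_experiments(self, MaxResults):
        return {"ExperimentSummaries": [{}, {}]}
    def list_pipelines(self, MaxResults):
        return {"PipelineSummaries": []}
    def list_models(self, MaxResults):
        raise RuntimeError("AccessDenied")
    def list_endpoints(self, MaxResults):
        return {}

for label, outcome in count_resources(FakeSageMakerClient(), RESOURCE_CHECKS).items():
    print(f"{label}: {outcome}")
```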

## 9. Next Steps

### Continue with:
1. **Day 6 Lab 2**: Feature Engineering for Banking (`day6_feature_engineering.ipynb`)
2. **Day 6 Lab 3**: Model Training with SageMaker Pipelines

### Resources:
- [SageMaker Studio Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html)
- [SageMaker Examples](https://github.com/aws/amazon-sagemaker-examples)
- [Banking ML Best Practices](https://aws.amazon.com/financial-services/machine-learning/)

### Tips:
- Save your work frequently (File → Save Notebook)
- Use keyboard shortcuts, such as Shift+Enter to run the current cell
- Explore the SageMaker resources panel on the left
- Check CloudWatch logs for debugging

## Summary

✅ SageMaker Studio environment verified  
✅ Sample banking customer dataset generated (1,000 records)  
✅ Data exploration and visualization completed  
✅ Dataset saved locally (and to S3 when the upload succeeds)  
✅ Banking ML use cases identified  
✅ SageMaker features explored  

**Next**: Feature engineering and model training!

---

**Questions or Issues?**
- Check the error messages carefully
- Verify IAM permissions for SageMaker
- Ensure you're running in a SageMaker environment
- Contact support@greatlearning.com for help