# Day 6 Lab 2: Deploying & Using Existing Endpoints

## 🎯 Learning Objectives
- Deploy a pre-trained model endpoint
- Invoke the endpoint from Python, the AWS CLI, and Lambda
- Implement error handling and retry logic
- Monitor endpoint metrics

## 🏦 Banking Use Case
Deploy a **fraud detection model** and integrate it with multiple banking applications.

## ⏱️ Duration: 35 minutes
## 💰 Cost: ~$0.12

## Setup

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn import SKLearnModel
import json
import time
from botocore.exceptions import ClientError

# Initialize
session = sagemaker.Session()
role = get_execution_role()
region = session.boto_region_name
bucket = session.default_bucket()
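runtime = boto3.client('sagemaker-runtime', region_name=region)

# Name of the endpoint used by the later cells (Parts 5, 7, and 8).
# Assumed to match the examples in this lab; change it if your endpoint is named differently.
endpoint_name = 'fraud-detection-prod'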

print(f"Region: {region}")
print(f"Bucket: {bucket}")

## Part 1: Understanding SageMaker Endpoints

**Key Concept:** An endpoint is a hosted model that can be invoked for real-time predictions.

In [None]:
# Endpoint deployment process
print("📊 SageMaker Endpoint Deployment Process:\n")
print("Step 1: Create Model")
print("  - Upload model artifacts to S3")
print("  - Specify container image (framework)")
print("  - Define IAM role")
print("\nStep 2: Create Endpoint Configuration")
print("  - Choose instance type (ml.t2.medium, ml.m5.large, etc.)")
print("  - Set instance count (1 for dev, 2+ for production)")
print("  - Configure auto-scaling (optional)")
print("\nStep 3: Create Endpoint")
print("  - Deploy model to instances")
print("  - Wait for 'InService' status (5-10 minutes)")
print("  - Get endpoint URL for invocations")
print("\nStep 4: Invoke Endpoint")
print("  - Send requests via API")
print("  - Get predictions in real-time")
print("  - Monitor performance")

# Example endpoint configuration
endpoint_config_example = {
    "EndpointName": "fraud-detection-prod",
    "InstanceType": "ml.m5.large",
    "InitialInstanceCount": 2,
    "ModelName": "fraud-detection-model-v1",
    "Tags": [
        {"Key": "Environment", "Value": "Production"},
        {"Key": "Application", "Value": "FraudDetection"}
    ]
}

print("\n💡 Example Endpoint Configuration:")
print(json.dumps(endpoint_config_example, indent=2))
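
Steps 1 to 3 above can be sketched with the boto3 `sagemaker` client. This is a minimal sketch, not the lab's own deployment code: the image URI, S3 path, role ARN, the `-config` naming suffix, and the `AllTraffic` variant name are placeholder assumptions. The client is passed in as a parameter so the request payloads can be inspected without touching AWS.

```python
import json


def build_deployment_requests(endpoint_name, model_name, image_uri,
                              model_data_url, role_arn,
                              instance_type="ml.m5.large", instance_count=1):
    """Return the request payloads for Steps 1-3 as plain dicts."""
    config_name = f"{endpoint_name}-config"  # hypothetical naming convention
    model_request = {
        "ModelName": model_name,
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
        "ExecutionRoleArn": role_arn,
    }
    config_request = {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",  # assumed variant name
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    }
    endpoint_request = {
        "EndpointName": endpoint_name,
        "EndpointConfigName": config_name,
    }
    return model_request, config_request, endpoint_request


def deploy_endpoint(sm_client, **kwargs):
    """Run Steps 1-3 with a boto3 'sagemaker' client, then wait for InService."""
    model_req, config_req, endpoint_req = build_deployment_requests(**kwargs)
    sm_client.create_model(**model_req)
    sm_client.create_endpoint_config(**config_req)
    sm_client.create_endpoint(**endpoint_req)
    # Blocks until the endpoint is 'InService' or the waiter gives up
    sm_client.get_waiter("endpoint_in_service").wait(
        EndpointName=endpoint_req["EndpointName"]
    )
    return endpoint_req["EndpointName"]


# Inspect the payloads with placeholder values (no AWS call is made here)
_, config_request, _ = build_deployment_requests(
    endpoint_name="fraud-detection-prod",
    model_name="fraud-detection-model-v1",
    image_uri="<container-image-uri>",
    model_data_url="s3://<bucket>/<prefix>/model.tar.gz",
    role_arn="<execution-role-arn>",
    instance_count=2,
)
print(json.dumps(config_request, indent=2))
```

To deploy for real, call `deploy_endpoint(boto3.client('sagemaker'), ...)` with the same keyword arguments, replacing every placeholder with values from your account.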

## Part 2: Invoke from Python (Boto3)

**Most common method for production applications**

In [None]:
# Example: How to invoke a SageMaker endpoint with Python
print("🐍 Python Endpoint Invocation Example:\n")

invocation_code = '''
import boto3
import json

# Initialize SageMaker runtime client
runtime = boto3.client('sagemaker-runtime', region_name='us-east-1')

# Prepare transaction data
transaction = {
    'amount': 5000.00,
    'merchant_category': 'electronics',
    'location': 'international',
    'time_of_day': 'night',
    'card_present': False
}

# Invoke endpoint
response = runtime.invoke_endpoint(
    EndpointName='fraud-detection-prod',
    ContentType='application/json',
    Body=json.dumps(transaction)
)

# Parse response
result = json.loads(response['Body'].read().decode())
print(f"Fraud Probability: {result['fraud_probability']:.2%}")
print(f"Decision: {result['decision']}")
'''

print(invocation_code)

print("\n📊 Expected Response Format:")
expected_response = {
    "fraud_probability": 0.87,
    "decision": "FLAGGED",
    "reason": "High amount + International + Night time",
    "confidence": 0.92
}
print(json.dumps(expected_response, indent=2))

print("\n💡 Key Points:")
print("  - Use 'sagemaker-runtime' client (not 'sagemaker')")
print("  - ContentType must match model's expected format")
print("  - Response body is a StreamingBody object")
print("  - Always decode and parse the response")
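
The key points above can be folded into one small helper. This is a sketch under assumptions: `score_transaction` and `_StubRuntime` are hypothetical names, and the stub stands in for the real `sagemaker-runtime` client so the example runs without AWS access. The response fields checked are the ones shown in the expected response format.

```python
import io
import json


def score_transaction(runtime_client, endpoint_name, transaction):
    """Invoke the endpoint once and return the parsed JSON prediction."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(transaction),
    )
    # The body is a stream: read it once, then decode and parse
    raw = response["Body"].read()
    result = json.loads(raw.decode("utf-8"))
    missing = {"fraud_probability", "decision"} - set(result)
    if missing:
        raise ValueError(f"Response is missing fields: {sorted(missing)}")
    return result


class _StubRuntime:
    """Stand-in for the boto3 client; returns a canned prediction."""

    def invoke_endpoint(self, **kwargs):
        body = json.dumps({"fraud_probability": 0.87, "decision": "FLAGGED"})
        return {"Body": io.BytesIO(body.encode("utf-8"))}


result = score_transaction(_StubRuntime(), "fraud-detection-prod", {"amount": 5000.00})
print(f"{result['decision']} ({result['fraud_probability']:.0%})")  # FLAGGED (87%)
```

With a real endpoint, pass `boto3.client('sagemaker-runtime')` in place of the stub.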

## Part 3: Batch Processing Pattern

In [None]:
# Batch processing example
print("📦 Batch Processing Pattern:\n")

batch_code = '''
# Process multiple transactions
transactions = [
    {'amount': 50, 'merchant': 'grocery', 'location': 'local'},
    {'amount': 10000, 'merchant': 'jewelry', 'location': 'international'},
    {'amount': 200, 'merchant': 'restaurant', 'location': 'local'},
]

results = []
for txn in transactions:
    response = runtime.invoke_endpoint(
        EndpointName='fraud-detection-prod',
        ContentType='application/json',
        Body=json.dumps(txn)
    )
    result = json.loads(response['Body'].read().decode())
    results.append(result)

# Analyze results
flagged = [r for r in results if r['decision'] == 'FLAGGED']
print(f"Flagged: {len(flagged)}/{len(results)} transactions")
'''

print(batch_code)

print("\n⚡ Performance Considerations:")
print("  - Real-time: 1 request at a time (< 100ms latency)")
print("  - Batch: Multiple requests sequentially")
print("  - Offline: Use SageMaker Batch Transform for large batches")
print("  - Parallel: Use threading/multiprocessing for speed")

print("\n💰 Cost Comparison:")
print("  Real-time endpoint: $0.115/hour (always running)")
print("  Batch Transform: $0.115/hour (only when processing)")
print("  Recommendation: Real-time for < 1M requests/day")
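
The "Parallel" option above can be sketched with a thread pool, which suits this workload because each invocation spends most of its time waiting on the network. This is a sketch under assumptions: `score_batch` and `fake_invoke` are hypothetical names, and `fake_invoke` (with its made-up 5000 threshold and `APPROVED` label) only stands in for a function that calls `runtime.invoke_endpoint`.

```python
from concurrent.futures import ThreadPoolExecutor


def score_batch(invoke_fn, transactions, max_workers=8):
    """Call invoke_fn(txn) concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(invoke_fn, transactions))


def fake_invoke(txn):
    """Stand-in for a real endpoint call so the sketch runs offline."""
    decision = "FLAGGED" if txn["amount"] >= 5000 else "APPROVED"
    return {"decision": decision}


transactions = [
    {"amount": 50, "merchant": "grocery", "location": "local"},
    {"amount": 10000, "merchant": "jewelry", "location": "international"},
    {"amount": 200, "merchant": "restaurant", "location": "local"},
]

results = score_batch(fake_invoke, transactions)
flagged = [r for r in results if r["decision"] == "FLAGGED"]
print(f"Flagged: {len(flagged)}/{len(results)} transactions")  # Flagged: 1/3 transactions
```

Keep `max_workers` modest: more concurrent requests than the endpoint's instances can serve will only produce throttling errors.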

## Part 4: AWS CLI Invocation

In [None]:
# AWS CLI invocation example
print("💻 AWS CLI Endpoint Invocation:\n")

cli_example = '''
# Step 1: Create input file
cat > input.json << EOF
{
  "amount": 3000,
  "merchant_category": "travel",
  "location": "international"
}
EOF

# Step 2: Invoke endpoint
# (fileb:// sends the file as raw bytes; AWS CLI v2 otherwise expects base64 for --body)
aws sagemaker-runtime invoke-endpoint \\
    --endpoint-name fraud-detection-prod \\
    --content-type application/json \\
    --body fileb://input.json \\
    output.json

# Step 3: View results
cat output.json
'''

print(cli_example)

print("\n📝 Use Cases for CLI:")
print("  - Quick testing during development")
print("  - Shell scripts for automation")
print("  - CI/CD pipeline integration")
print("  - Debugging endpoint issues")
print("  - One-off predictions")

print("\n💡 CLI vs Python:")
print("  CLI: Good for testing, simple scripts")
print("  Python: Better for production, error handling, complex logic")

## Part 5: Error Handling & Retry Logic

**Production-grade implementation**

In [None]:
import random
import time
from botocore.exceptions import ClientError

def invoke_endpoint_with_retry(endpoint_name, payload, max_retries=3):
    """
    Invoke a SageMaker endpoint, retrying throttled and unexpected errors
    with exponential backoff. Model and validation errors are raised
    immediately because resending the same request will not fix them.
    """
    for attempt in range(max_retries):
        try:
            response = runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType='application/json',
                Body=json.dumps(payload)
            )
            return json.loads(response['Body'].read().decode())

        except ClientError as e:
            error_code = e.response['Error']['Code']

            if error_code == 'ModelError':
                print(f"❌ Model error: {e}")
                raise

            if error_code == 'ValidationError':
                print(f"❌ Invalid input: {e}")
                raise

            if attempt == max_retries - 1:
                # Out of attempts: surface the last error to the caller
                print(f"❌ Giving up after {max_retries} attempts: {e}")
                raise

            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            if error_code == 'ThrottlingException':
                print(f"⚠️ Throttled. Retrying in {wait_time:.2f}s...")
            else:
                print(f"❌ Unexpected error: {e}. Retrying in {wait_time:.2f}s...")
            time.sleep(wait_time)

    # Only reached when max_retries is 0 or negative
    raise RuntimeError(f"Failed after {max_retries} retries")

# Test retry logic
test_transaction = {
    'amount': 1500,
    'merchant_category': 'retail',
    'location': 'local',
    'time_of_day': 'afternoon',
    'card_present': True
}

print("Testing retry logic...")
result = invoke_endpoint_with_retry(endpoint_name, test_transaction)
print(f"✅ Success: {result['decision']}")
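
To see what an exponential backoff schedule looks like in isolation, the delay calculation can be pulled into its own function. This sketch uses the "full jitter" variant (a random delay between zero and a capped exponential ceiling), which is a different choice from the fixed-plus-jitter delay in the cell above. The function name, the 1-second base, and the 30-second cap are assumptions for illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


# Print the ceiling and one random draw for the first few attempts
for attempt in range(6):
    ceiling = min(30.0, 1.0 * 2 ** attempt)
    print(f"attempt {attempt}: ceiling {ceiling:.0f}s, drew {backoff_delay(attempt):.2f}s")
```

The cap keeps late retries from sleeping for minutes, and the random draw spreads out clients that were throttled at the same moment.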

## Part 6: CloudWatch Monitoring

In [None]:
# CloudWatch metrics for endpoints
print("📈 CloudWatch Metrics for SageMaker Endpoints:\n")

metrics_table = {
    'Metric': [
        'ModelLatency',
        'Invocations',
        'Invocation4XXErrors',
        'Invocation5XXErrors',
        'CPUUtilization',
        'MemoryUtilization'
    ],
    'Description': [
        'Time the model takes to respond (reported in microseconds)',
        'Total number of requests',
        'Client errors (bad input)',
        'Server errors (model issues)',
        'CPU usage percentage',
        'Memory usage percentage'
    ],
    'Target': [
        '< 100ms',
        'Monitor trend',
        '< 1%',
        '< 0.1%',
        '50-70%',
        '< 85%'
    ],
    'Alert If': [
        '> 200ms for 5 min',
        'Sudden spike/drop',
        '> 5% for 5 min',
        '> 1% for 5 min',
        '> 80% for 10 min',
        '> 90% for 5 min'
    ]
}

import pandas as pd
df_metrics = pd.DataFrame(metrics_table)
print(df_metrics.to_string(index=False))

print("\n🔔 Recommended CloudWatch Alarms:")
print("  1. High Latency: ModelLatency > 200ms for 5 minutes")
print("  2. High Error Rate: 4XX errors > 5% or 5XX errors > 1% for 5 minutes")
print("  3. High CPU: CPUUtilization > 80% for 10 minutes")
print("  4. High Memory: MemoryUtilization > 90% for 5 minutes")
print("  5. No Traffic: Invocations = 0 for 15 minutes (if expected traffic)")

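print("\n💡 Monitoring Code Example:")
monitoring_code = '''
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')

# Invocation metrics are published per endpoint variant, so both dimensions
# are needed. 'AllTraffic' is the variant name the SageMaker Python SDK uses
# by default; adjust it if your endpoint config names the variant differently.
now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/SageMaker',
    MetricName='ModelLatency',  # reported in microseconds
    Dimensions=[
        {'Name': 'EndpointName', 'Value': 'fraud-detection-prod'},
        {'Name': 'VariantName', 'Value': 'AllTraffic'},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5 minutes
    Statistics=['Average', 'Maximum']
)
'''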
print(monitoring_code)
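
The first recommended alarm (high latency) can be expressed as the keyword arguments for CloudWatch's `put_metric_alarm`. This is a sketch under assumptions: the function name, the alarm name, and the `AllTraffic` variant name are hypothetical. Because `ModelLatency` is reported in microseconds, the 200 ms threshold is converted before it is used.

```python
def build_latency_alarm(endpoint_name, variant_name="AllTraffic",
                        threshold_ms=200, minutes=5):
    """Build put_metric_alarm kwargs: average ModelLatency above the
    threshold for `minutes` consecutive one-minute periods."""
    return {
        "AlarmName": f"{endpoint_name}-high-model-latency",  # hypothetical name
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,                      # seconds per evaluation period
        "EvaluationPeriods": minutes,      # 5 periods of 60s = 5 minutes
        "Threshold": threshold_ms * 1000,  # milliseconds -> microseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not a latency breach
    }


alarm = build_latency_alarm("fraud-detection-prod")
print(alarm["AlarmName"], alarm["Threshold"])  # fraud-detection-prod-high-model-latency 200000
```

To create the alarm, pass the dictionary to a CloudWatch client: `boto3.client('cloudwatch').put_metric_alarm(**alarm)`. Add `AlarmActions` with an SNS topic ARN if someone should be notified.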

## Part 7: Lambda Integration (Conceptual)

**How to integrate with Lambda for API Gateway**

In [None]:
# Lambda function code (for reference)
lambda_code = f"""
import json
import boto3

runtime = boto3.client('sagemaker-runtime')

def lambda_handler(event, context):
    # Parse transaction from API Gateway
    transaction = json.loads(event['body'])
    
    # Invoke SageMaker endpoint
    response = runtime.invoke_endpoint(
        EndpointName='{endpoint_name}',
        ContentType='application/json',
        Body=json.dumps(transaction)
    )
    
    # Parse result
    result = json.loads(response['Body'].read().decode())
    
    # Return to API Gateway
    return {{
        'statusCode': 200,
        'body': json.dumps(result),
        'headers': {{
            'Content-Type': 'application/json',
            'Access-Control-Allow-Origin': '*'
        }}
    }}
"""

print("📝 Lambda Function Code:")
print(lambda_code)
print("\n💡 Architecture: API Gateway → Lambda → SageMaker Endpoint")
print("\n✅ Benefits:")
print("  - Serverless scaling")
print("  - Pay per request")
print("  - Easy API management")
print("  - Built-in authentication")
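
The reference handler above assumes every request is well formed and every invocation succeeds. A slightly more defensive variant is sketched below, under assumptions: `make_handler`, `_response`, the 400/502 status mapping, and the error messages are illustrative choices, and `_StubRuntime` stands in for the real client so the sketch runs locally.

```python
import io
import json


def _response(status_code, payload):
    """Shape a dict the way an API Gateway proxy integration expects."""
    return {
        "statusCode": status_code,
        "body": json.dumps(payload),
        "headers": {"Content-Type": "application/json"},
    }


def make_handler(runtime_client, endpoint_name):
    """Return a Lambda handler bound to one client and one endpoint."""

    def lambda_handler(event, context):
        try:
            transaction = json.loads(event.get("body") or "")
        except (json.JSONDecodeError, TypeError):
            return _response(400, {"error": "Request body must be valid JSON"})

        try:
            resp = runtime_client.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType="application/json",
                Body=json.dumps(transaction),
            )
            result = json.loads(resp["Body"].read().decode("utf-8"))
        except Exception as exc:  # log details, return a generic message
            print(f"Endpoint invocation failed: {exc}")
            return _response(502, {"error": "Fraud model unavailable"})

        return _response(200, result)

    return lambda_handler


class _StubRuntime:
    """Stand-in for the sagemaker-runtime client."""

    def invoke_endpoint(self, **kwargs):
        return {"Body": io.BytesIO(b'{"decision": "FLAGGED"}')}


handler = make_handler(_StubRuntime(), "fraud-detection-prod")
print(handler({"body": '{"amount": 10000}'}, None)["statusCode"])  # 200
print(handler({"body": "not json"}, None)["statusCode"])           # 400
```

In a deployed function, build the handler once at module level, for example `lambda_handler = make_handler(boto3.client('sagemaker-runtime'), os.environ['ENDPOINT_NAME'])`, where `ENDPOINT_NAME` is a hypothetical environment variable.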

## Part 8: IAM Permissions Required

In [None]:
# IAM policy for endpoint invocation
iam_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:InvokeEndpoint"
            ],
            "Resource": f"arn:aws:sagemaker:{region}:*:endpoint/{endpoint_name}"
        }
    ]
}

print("📝 Required IAM Policy:")
print(json.dumps(iam_policy, indent=2))
print("\n💡 Attach this policy to:")
print("  - Lambda execution role")
print("  - EC2 instance role")
print("  - Application service account")
print("\n✅ Principle of least privilege: Only InvokeEndpoint permission")
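
The policy above leaves the account field of the ARN as a wildcard. A tighter, resource-specific version pins it to a single account. The sketch below assumes a hypothetical helper name and uses `123456789012` as a placeholder account ID.

```python
import json


def build_invoke_policy(region, account_id, endpoint_name):
    """Return an IAM policy allowing InvokeEndpoint on exactly one endpoint."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["sagemaker:InvokeEndpoint"],
                "Resource": (
                    f"arn:aws:sagemaker:{region}:{account_id}"
                    f":endpoint/{endpoint_name}"
                ),
            }
        ],
    }


policy = build_invoke_policy("us-east-1", "123456789012", "fraud-detection-prod")
print(json.dumps(policy, indent=2))
```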

## Summary: Endpoint Usage Checklist

In [None]:
print("\n✅ Endpoint Usage Checklist:\n")
print("1. Deployment:")
print("   □ Model artifacts in S3")
print("   □ Endpoint configuration created")
print("   □ Endpoint deployed and 'InService'")
print("   □ Test with sample data")
print("\n2. Invocation Methods:")
print("   □ Python (Boto3) - Most common")
print("   □ AWS CLI - For testing")
print("   □ Lambda - For serverless integration")
print("   □ REST API - For web applications")
print("\n3. Error Handling:")
print("   □ Implement retry logic")
print("   □ Handle ThrottlingException")
print("   □ Handle ModelError")
print("   □ Set appropriate timeouts")
print("\n4. Monitoring:")
print("   □ Track ModelLatency")
print("   □ Monitor Invocations count")
print("   □ Watch for 4XX/5XX errors")
print("   □ Set CloudWatch alarms")
print("\n5. IAM Permissions:")
print("   □ sagemaker:InvokeEndpoint")
print("   □ Least-privilege access")
print("   □ Resource-specific policies")
print("\n6. Production Best Practices:")
print("   □ Use multiple instances (HA)")
print("   □ Enable auto-scaling")
print("   □ Implement caching")
print("   □ Use VPC endpoints")
print("   □ Enable encryption")
print("   □ Regular model updates")

print("\n💡 Remember: Always delete unused endpoints to avoid charges!")
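
The closing reminder can be turned into a cleanup routine. This sketch assumes a standard endpoint whose production variants each reference a model by name. The function name is hypothetical, and the client is passed in so the routine can be tried against a stub first. It looks up the endpoint configuration and models before deleting anything, since those names are harder to recover once the endpoint is gone.

```python
def delete_endpoint_stack(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and the models it references."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [v["ModelName"] for v in config["ProductionVariants"]]

    # The endpoint is what incurs hourly charges, so remove it first
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)

    return {
        "endpoint": endpoint_name,
        "endpoint_config": config_name,
        "models": model_names,
    }
```

Call it as `delete_endpoint_stack(boto3.client('sagemaker'), 'fraud-detection-prod')`. Model artifacts in S3 are left in place and have to be removed separately if they are no longer needed.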

## 🎓 Key Takeaways

1. **Invocation Methods:**
   - ✅ Python (Boto3): Most common, production apps
   - ✅ AWS CLI: Testing, shell scripts
   - ✅ Lambda: Serverless integration
   - ✅ REST API: Web applications

2. **Error Handling:**
   - ✅ Implement retry logic with exponential backoff
   - ✅ Handle ThrottlingException
   - ✅ Handle ModelError and ValidationError
   - ✅ Set appropriate timeouts

3. **Monitoring:**
   - ✅ Track ModelLatency (target: <100ms)
   - ✅ Monitor Invocations count
   - ✅ Watch for 4XX/5XX errors
   - ✅ Set CloudWatch alarms

4. **IAM Permissions:**
   - ✅ sagemaker:InvokeEndpoint required
   - ✅ Use least-privilege principle
   - ✅ Resource-specific policies

5. **Production Best Practices:**
   - ✅ Always implement retry logic
   - ✅ Monitor metrics continuously
   - ✅ Use Lambda for serverless scaling
   - ✅ Set up CloudWatch alarms
   - ✅ Test error scenarios