# Qwen Hebrew Fine-tuning: SageMaker P4/P5 Instance Benchmark

This notebook provides a complete workflow for:
1. Setting up SageMaker infrastructure
2. Running performance benchmarks across P4 and P5 instances (the default `INSTANCE_TYPES` covers P4d, P4de, and P5)
3. Analyzing results and generating recommendations

## Prerequisites
- AWS credentials configured
- SageMaker execution role with appropriate permissions
- S3 bucket for data and model storage
- Docker containers built and pushed to ECR

In [None]:
# Install required packages (tabulate is needed by DataFrame.to_markdown in Step 7)
!pip install boto3 sagemaker pandas matplotlib seaborn wandb tabulate


In [9]:
import boto3
import sagemaker
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import time
from datetime import datetime
import sys
import os

# Add the scripts directory to Python path
sys.path.append('../scripts')
sys.path.append('../infrastructure')

from benchmark_runner import SageMakerBenchmarkRunner
#from containers.training.scripts.sagemaker_jobs import SageMakerJobManager

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline

## Configuration

Update these variables with your AWS configuration:

In [12]:
# AWS Configuration
REGION = 'us-east-1'
BUCKET_NAME = 'gepeta-datasets'  # Update with your bucket name
ROLE_ARN = 'arn:aws:iam::670967753077:role/SageMakerExecutionRole'  # Update with your role

# Dataset Configuration
DATASET_S3_PATH = f's3://{BUCKET_NAME}/processed/wikipedia/'  # Path to your processed Hebrew dataset

# Benchmark Configuration
INSTANCE_TYPES = ['ml.p4d.24xlarge', 'ml.p4de.24xlarge', 'ml.p5.48xlarge']
BENCHMARK_EPOCHS = 1
BENCHMARK_MAX_STEPS = 100  # For quick benchmarking

# W&B Configuration (optional)
WANDB_PROJECT = 'qwen-hebrew-sagemaker-benchmark'
WANDB_ENTITY = 'your-wandb-entity'  # Update with your W&B entity

print(f"Region: {REGION}")
print(f"Bucket: {BUCKET_NAME}")
print(f"Dataset: {DATASET_S3_PATH}")
print(f"Instance types: {INSTANCE_TYPES}")

Region: us-east-1
Bucket: gepeta-datasets
Dataset: s3://gepeta-datasets/processed/wikipedia/
Instance types: ['ml.p4d.24xlarge', 'ml.p4de.24xlarge', 'ml.p5.48xlarge']


## Step 1: Verify Setup

In [13]:
# Initialize AWS clients
sagemaker_client = boto3.client('sagemaker', region_name=REGION)
s3_client = boto3.client('s3', region_name=REGION)
sts_client = boto3.client('sts')

# Get account information
account_id = sts_client.get_caller_identity()['Account']
print(f"Account ID: {account_id}")

# Verify S3 bucket access
try:
    s3_client.head_bucket(Bucket=BUCKET_NAME)
    print(f"✓ S3 bucket {BUCKET_NAME} is accessible")
except Exception as e:
    print(f"✗ S3 bucket {BUCKET_NAME} is not accessible: {e}")

# Check if dataset exists
try:
    dataset_key = DATASET_S3_PATH.replace(f's3://{BUCKET_NAME}/', '')
    response = s3_client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=dataset_key, MaxKeys=1)
    if 'Contents' in response:
        print(f"✓ Dataset found at {DATASET_S3_PATH}")
    else:
        print(f"✗ Dataset not found at {DATASET_S3_PATH}")
        print("Please run data preparation first or update the dataset path")
except Exception as e:
    print(f"✗ Error checking dataset: {e}")

# Verify ECR images
ecr_client = boto3.client('ecr', region_name=REGION)
training_image = f"{account_id}.dkr.ecr.{REGION}.amazonaws.com/qwen-hebrew-training:latest"

try:
    ecr_client.describe_images(
        repositoryName='qwen-hebrew-training',
        imageIds=[{'imageTag': 'latest'}]
    )
    print(f"✓ Training container image found: {training_image}")
except Exception as e:
    print(f"✗ Training container image not found: {e}")
    print("Please build and push the Docker container first")

Account ID: 670967753077
✓ S3 bucket gepeta-datasets is accessible
✓ Dataset found at s3://gepeta-datasets/processed/wikipedia/
✓ Training container image found: 670967753077.dkr.ecr.us-east-1.amazonaws.com/qwen-hebrew-training:latest
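
The checks above cover S3 and ECR but not the execution role. As a rough sketch (not part of the original notebook), the role's trust policy could be inspected before any job is submitted. The helper names `role_trusts_service` and `check_execution_role` are hypothetical. The check ignores `Condition` blocks and only looks for an `Allow` statement that grants `sts:AssumeRole` to `sagemaker.amazonaws.com`.

```python
def role_trusts_service(trust_policy, service="sagemaker.amazonaws.com"):
    """Return True if an Allow statement lets `service` call sts:AssumeRole.

    `trust_policy` is the parsed trust policy document (a dict).
    Condition blocks are not evaluated, so this is a coarse check only.
    """
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if not any(a in ("sts:AssumeRole", "sts:*", "*") for a in actions):
            continue
        principal = stmt.get("Principal", {})
        if not isinstance(principal, dict):  # e.g. Principal: "*"
            continue
        services = principal.get("Service", [])
        if isinstance(services, str):
            services = [services]
        if service in services:
            return True
    return False


def check_execution_role(role_arn, iam_client=None):
    """Fetch the role via IAM GetRole and test its trust policy (hypothetical helper)."""
    if iam_client is None:
        import boto3  # imported lazily so the pure check above has no dependencies
        iam_client = boto3.client("iam")
    role_name = role_arn.split("/")[-1]  # last ARN segment is the role name
    role = iam_client.get_role(RoleName=role_name)["Role"]
    return role_trusts_service(role["AssumeRolePolicyDocument"])
```

Calling `check_execution_role(ROLE_ARN)` needs `iam:GetRole` permission and raises if the role does not exist, which is itself a useful signal before submitting jobs.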


## Step 2: Initialize Benchmark Runner

In [14]:
# Initialize benchmark runner
benchmark_runner = SageMakerBenchmarkRunner(
    role_arn=ROLE_ARN,
    bucket_name=BUCKET_NAME,
    region=REGION
)

print("Benchmark runner initialized successfully")
print(f"Training image URI: {benchmark_runner.get_training_image_uri()}")

Benchmark runner initialized successfully
Training image URI: 670967753077.dkr.ecr.us-east-1.amazonaws.com/qwen-hebrew-training:latest


## Step 3: Run Performance Benchmark

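This cell submits one training job per entry in `INSTANCE_TYPES` and monitors progress. In the recorded run below, no job started: `ml.p4d.24xlarge` hit a service quota of 0, SageMaker could not assume the execution role for `ml.p4de.24xlarge`, and the run ended after about 1.5 seconds with 0 results.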

In [15]:
# Start benchmark
print(f"Starting benchmark across {len(INSTANCE_TYPES)} instance types...")
print("This will take approximately 1-2 hours to complete")
print(f"Instance types: {INSTANCE_TYPES}")

start_time = datetime.now()
print(f"Benchmark started at: {start_time}")

# Run the benchmark
results = benchmark_runner.run_benchmark(
    instance_types=INSTANCE_TYPES,
    dataset_path=DATASET_S3_PATH,
    epochs=BENCHMARK_EPOCHS,
    max_steps=BENCHMARK_MAX_STEPS
)

end_time = datetime.now()
total_time = end_time - start_time
print(f"\nBenchmark completed at: {end_time}")
print(f"Total benchmark time: {total_time}")
print(f"Collected results for {len(results)} jobs")

2025-07-06 20:36:39,893 - INFO - Starting benchmark across 3 instance types
2025-07-06 20:36:39,895 - INFO - Instance types: ['ml.p4d.24xlarge', 'ml.p4de.24xlarge', 'ml.p5.48xlarge']
2025-07-06 20:36:39,895 - INFO - Submitting training job: qwen-benchmark-p4d-24xlarge-20250706-203639 on ml.p4d.24xlarge


Starting benchmark across 3 instance types...
This will take approximately 1-2 hours to complete
Instance types: ['ml.p4d.24xlarge', 'ml.p4de.24xlarge', 'ml.p5.48xlarge']
Benchmark started at: 2025-07-06 20:36:39.893648


2025-07-06 20:36:40,124 - ERROR - Failed to submit job for ml.p4d.24xlarge: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p4d.24xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.
2025-07-06 20:36:40,124 - INFO - Submitting training job: qwen-benchmark-p4de-24xlarge-20250706-203640 on ml.p4de.24xlarge
2025-07-06 20:36:40,414 - ERROR - Failed to submit job for ml.p4de.24xlarge: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Could not assume role arn:aws:iam::670967753077:role/SageMakerExecutionRole. Please ensure that the role exists and allows principal 'sagemaker.amazonaws.com' to assume the role.
2025-07-06 20:36:40,415 - INFO - Submi


Benchmark completed at: 2025-07-06 20:36:41.408575
Total benchmark time: 0:00:01.514927
Collected results for 0 jobs
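
The log above shows the first submission rejected with `ResourceLimitExceeded` because the account quota for `ml.p4d.24xlarge` training jobs is 0. A pre-flight check along these lines could surface that before anything is submitted. This is a sketch, assuming quota names follow the pattern seen in the error message (`<instance type> for training job usage`). `training_quota_limits`, `blocked_instance_types`, and `fetch_sagemaker_quotas` are hypothetical helpers, and the caller needs permission for `servicequotas:ListServiceQuotas`.

```python
def training_quota_limits(quotas, instance_types):
    """Map each instance type to its training-job quota value (None if not listed).

    `quotas` is a list of dicts with "QuotaName" and "Value" keys, as returned
    in the "Quotas" field of the Service Quotas ListServiceQuotas API.
    """
    wanted = {f"{t} for training job usage": t for t in instance_types}
    limits = {t: None for t in instance_types}
    for quota in quotas:
        instance_type = wanted.get(quota.get("QuotaName"))
        if instance_type is not None:
            limits[instance_type] = quota.get("Value")
    return limits


def blocked_instance_types(limits):
    """Return instance types whose quota is missing or below one instance."""
    return [t for t, value in limits.items() if value is None or value < 1]


def fetch_sagemaker_quotas(region, client=None):
    """Collect all SageMaker quotas for the account in `region`."""
    if client is None:
        import boto3  # imported lazily; only needed for the live lookup
        client = boto3.client("service-quotas", region_name=region)
    quotas = []
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas
```

In the notebook this might be used as `blocked_instance_types(training_quota_limits(fetch_sagemaker_quotas(REGION), INSTANCE_TYPES))`, skipping or flagging any instance type it returns.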


## Step 4: Analyze Results

In [None]:
# Generate comparison report
if results:
    df = benchmark_runner.generate_comparison_report(results)
    
    # Display results
    print("BENCHMARK RESULTS SUMMARY")
    print("=" * 80)
    
    # Key metrics to display
    display_columns = [
        'instance_type', 'status', 'training_duration_hours', 
        'actual_cost', 'avg_tokens_per_second', 'avg_gpu_utilization',
        'cost_effectiveness', 'overall_score'
    ]
    
    available_columns = [col for col in display_columns if col in df.columns]
    display_df = df[available_columns].round(3)
    
    print(display_df.to_string(index=False))
    
    # Save results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    results_file = f"benchmark_results_{timestamp}.csv"
    df.to_csv(results_file, index=False)
    print(f"\nDetailed results saved to: {results_file}")
    
else:
    print("No benchmark results available")
    df = pd.DataFrame()
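
The `cost_effectiveness` and `overall_score` columns come from `generate_comparison_report`, whose formulas are not shown in this notebook. Purely as an illustration of one way a performance-per-dollar metric could be defined (not necessarily what the runner computes), the sketch below derives tokens processed per dollar from the `avg_tokens_per_second`, `training_duration_hours`, and `actual_cost` columns. `tokens_per_dollar` and `rank_by_tokens_per_dollar` are hypothetical names.

```python
def tokens_per_dollar(avg_tokens_per_second, training_duration_hours, actual_cost):
    """Tokens processed per USD spent; returns None when cost is not positive."""
    if actual_cost is None or actual_cost <= 0:
        return None
    total_tokens = avg_tokens_per_second * training_duration_hours * 3600.0
    return total_tokens / actual_cost


def rank_by_tokens_per_dollar(rows):
    """Sort result rows (dicts keyed by the column names above), best first.

    Rows whose metric cannot be computed are dropped from the ranking.
    """
    scored = []
    for row in rows:
        score = tokens_per_dollar(
            row["avg_tokens_per_second"],
            row["training_duration_hours"],
            row["actual_cost"],
        )
        if score is not None:
            scored.append((score, row["instance_type"]))
    scored.sort(reverse=True)
    return [instance_type for _, instance_type in scored]
```

With a DataFrame, `rank_by_tokens_per_dollar(df.to_dict("records"))` would give the ordering, provided those three columns are present.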

## Step 5: Visualize Performance Comparison

In [None]:
if not df.empty:
    # Create performance comparison plots
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('SageMaker P4/P5 Instance Performance Comparison', fontsize=16, fontweight='bold')
    
    # 1. Training Speed (Tokens per Second)
    if 'avg_tokens_per_second' in df.columns:
        ax1 = axes[0, 0]
        bars1 = ax1.bar(df['instance_type'], df['avg_tokens_per_second'], 
                       color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
        ax1.set_title('Training Speed (Tokens/Second)', fontweight='bold')
        ax1.set_ylabel('Tokens per Second')
        ax1.tick_params(axis='x', rotation=45)
        
        # Add value labels on bars
        for bar, value in zip(bars1, df['avg_tokens_per_second']):
            ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
                    f'{value:.0f}', ha='center', va='bottom', fontweight='bold')
    
    # 2. Cost Comparison
    if 'actual_cost' in df.columns:
        ax2 = axes[0, 1]
        bars2 = ax2.bar(df['instance_type'], df['actual_cost'], 
                       color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
        ax2.set_title('Training Cost (USD)', fontweight='bold')
        ax2.set_ylabel('Cost (USD)')
        ax2.tick_params(axis='x', rotation=45)
        
        # Add value labels on bars
        for bar, value in zip(bars2, df['actual_cost']):
            ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                    f'${value:.2f}', ha='center', va='bottom', fontweight='bold')
    
    # 3. Cost Effectiveness
    if 'cost_effectiveness' in df.columns:
        ax3 = axes[1, 0]
        bars3 = ax3.bar(df['instance_type'], df['cost_effectiveness'], 
                       color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
        ax3.set_title('Cost Effectiveness (Performance/Dollar)', fontweight='bold')
        ax3.set_ylabel('Cost Effectiveness Score')
        ax3.tick_params(axis='x', rotation=45)
        
        # Add value labels on bars
        for bar, value in zip(bars3, df['cost_effectiveness']):
            ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                    f'{value:.3f}', ha='center', va='bottom', fontweight='bold')
    
    # 4. GPU Utilization
    if 'avg_gpu_utilization' in df.columns:
        ax4 = axes[1, 1]
        bars4 = ax4.bar(df['instance_type'], df['avg_gpu_utilization'], 
                       color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
        ax4.set_title('Average GPU Utilization (%)', fontweight='bold')
        ax4.set_ylabel('GPU Utilization (%)')
        ax4.set_ylim(0, 100)
        ax4.tick_params(axis='x', rotation=45)
        
        # Add value labels on bars
        for bar, value in zip(bars4, df['avg_gpu_utilization']):
            ax4.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                    f'{value:.1f}%', ha='center', va='bottom', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Save the plot
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    plot_file = f"benchmark_comparison_{timestamp}.png"
    fig.savefig(plot_file, dpi=300, bbox_inches='tight')
    print(f"Performance comparison plot saved to: {plot_file}")

else:
    print("No data available for visualization")

## Step 6: Generate Recommendations

In [None]:
if not df.empty:
    print("\n" + "="*80)
    print("RECOMMENDATIONS")
    print("="*80)
    
    # Best overall performance
    if 'overall_score' in df.columns:
        best_overall = df.loc[df['overall_score'].idxmax()]
        print(f"🏆 BEST OVERALL PERFORMANCE: {best_overall['instance_type']}")
        print(f"   • Overall Score: {best_overall['overall_score']:.3f}")
        if 'actual_cost' in best_overall:
            print(f"   • Cost: ${best_overall['actual_cost']:.2f}")
        if 'avg_tokens_per_second' in best_overall:
            print(f"   • Speed: {best_overall['avg_tokens_per_second']:.0f} tokens/sec")
        if 'avg_gpu_utilization' in best_overall:
            print(f"   • GPU Utilization: {best_overall['avg_gpu_utilization']:.1f}%")
    
    # Most cost-effective
    if 'cost_effectiveness' in df.columns:
        best_cost = df.loc[df['cost_effectiveness'].idxmax()]
        print(f"\n💰 MOST COST-EFFECTIVE: {best_cost['instance_type']}")
        print(f"   • Cost Effectiveness: {best_cost['cost_effectiveness']:.3f}")
        if 'actual_cost' in best_cost:
            print(f"   • Cost: ${best_cost['actual_cost']:.2f}")
        if 'avg_tokens_per_second' in best_cost:
            print(f"   • Speed: {best_cost['avg_tokens_per_second']:.0f} tokens/sec")
    
    # Fastest training
    if 'avg_tokens_per_second' in df.columns:
        fastest = df.loc[df['avg_tokens_per_second'].idxmax()]
        print(f"\n⚡ FASTEST TRAINING: {fastest['instance_type']}")
        print(f"   • Speed: {fastest['avg_tokens_per_second']:.0f} tokens/sec")
        if 'training_duration_hours' in fastest:
            print(f"   • Duration: {fastest['training_duration_hours']:.2f} hours")
        if 'actual_cost' in fastest:
            print(f"   • Cost: ${fastest['actual_cost']:.2f}")
    
    # Usage recommendations
    print("\n📋 USAGE RECOMMENDATIONS:")
    print("   • For budget-conscious projects: Use the most cost-effective instance")
    print("   • For time-critical projects: Use the fastest instance")
    print("   • For balanced workloads: Use the best overall performance instance")
    print("   • For production training: Consider the instance with the best GPU utilization")
    
    # Cost projections
    print(f"\n💡 COST PROJECTIONS FOR FULL TRAINING (3 epochs):")
    for _, row in df.iterrows():
        if 'actual_cost' in row and 'training_duration_hours' in row:
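            # Estimate cost for 3 epochs (assuming linear scaling).
            # Caveat: this only holds if the benchmark covered a full epoch. If the run
            # was capped by BENCHMARK_MAX_STEPS before finishing an epoch, the figures
            # below underestimate full-training cost and time.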
            full_cost = (row['actual_cost'] / BENCHMARK_EPOCHS) * 3
            full_time = (row['training_duration_hours'] / BENCHMARK_EPOCHS) * 3
            print(f"   • {row['instance_type']}: ~${full_cost:.2f} (~{full_time:.1f} hours)")

else:
    print("No data available for recommendations")
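
The projection in the cell above scales the benchmark cost by epoch count, which only works if the benchmark covered a full epoch. For a step-capped run, a throughput-based estimate may be closer. The sketch below assumes the dataset's total token count and the hourly instance price are known. Both are inputs this notebook does not provide, and `project_full_training` is a hypothetical helper.

```python
def project_full_training(tokens_per_second, dataset_tokens, epochs, hourly_price_usd):
    """Estimate wall-clock hours and cost from measured throughput.

    Assumes throughput stays constant for the whole run and ignores
    startup, checkpointing, and evaluation overhead.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    total_tokens = dataset_tokens * epochs
    hours = total_tokens / tokens_per_second / 3600.0
    return {"hours": hours, "cost_usd": hours * hourly_price_usd}
```

For each benchmark row, `tokens_per_second` would come from `avg_tokens_per_second`; the remaining arguments have to be supplied from the dataset statistics and current pricing.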

## Step 7: Save Complete Report

In [None]:
if not df.empty:
    # Create comprehensive report
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_file = f"qwen_hebrew_benchmark_report_{timestamp}.md"
    
    with open(report_file, 'w', encoding='utf-8') as f:
        f.write("# Qwen Hebrew Fine-tuning: SageMaker P4/P5 Instance Benchmark Report\n\n")
        f.write(f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
        
        f.write("## Benchmark Configuration\n\n")
        f.write(f"- **Instance Types:** {', '.join(INSTANCE_TYPES)}\n")
        f.write(f"- **Dataset:** {DATASET_S3_PATH}\n")
        f.write(f"- **Epochs:** {BENCHMARK_EPOCHS}\n")
        f.write(f"- **Max Steps:** {BENCHMARK_MAX_STEPS}\n")
        f.write(f"- **Region:** {REGION}\n\n")
        
        f.write("## Results Summary\n\n")
        f.write(df.to_markdown(index=False, floatfmt=".3f"))
        f.write("\n\n")
        
        f.write("## Recommendations\n\n")
        
        if 'overall_score' in df.columns:
            best_overall = df.loc[df['overall_score'].idxmax()]
            f.write(f"### Best Overall Performance: {best_overall['instance_type']}\n")
            f.write(f"- Overall Score: {best_overall['overall_score']:.3f}\n")
            if 'actual_cost' in best_overall:
                f.write(f"- Cost: ${best_overall['actual_cost']:.2f}\n")
            if 'avg_tokens_per_second' in best_overall:
                f.write(f"- Speed: {best_overall['avg_tokens_per_second']:.0f} tokens/sec\n")
            f.write("\n")
        
        if 'cost_effectiveness' in df.columns:
            best_cost = df.loc[df['cost_effectiveness'].idxmax()]
            f.write(f"### Most Cost-Effective: {best_cost['instance_type']}\n")
            f.write(f"- Cost Effectiveness: {best_cost['cost_effectiveness']:.3f}\n")
            if 'actual_cost' in best_cost:
                f.write(f"- Cost: ${best_cost['actual_cost']:.2f}\n")
            f.write("\n")
        
        f.write("## Usage Guidelines\n\n")
        f.write("- **Budget-conscious projects:** Use the most cost-effective instance\n")
        f.write("- **Time-critical projects:** Use the fastest instance\n")
        f.write("- **Balanced workloads:** Use the best overall performance instance\n")
        f.write("- **Production training:** Consider GPU utilization and stability\n")
    
    print(f"\nComprehensive report saved to: {report_file}")
    
    # Upload report to S3
    try:
        s3_key = f"benchmark_reports/{report_file}"
        s3_client.upload_file(report_file, BUCKET_NAME, s3_key)
        print(f"Report uploaded to: s3://{BUCKET_NAME}/{s3_key}")
    except Exception as e:
        print(f"Failed to upload report to S3: {e}")

else:
    print("No data available for report generation")

## Next Steps

Based on the benchmark results, you can now:

1. **Choose the optimal instance type** for your production training
2. **Submit a full training job** using the recommended instance
3. **Scale your training** based on the performance characteristics
4. **Optimize costs** by selecting the most cost-effective option

### Submit Production Training Job

Use the cell below to submit a production training job with your chosen instance type:

In [None]:
# Submit production training job (uncomment and modify as needed)

# # Choose your preferred instance type based on benchmark results
# PRODUCTION_INSTANCE_TYPE = 'ml.p4de.24xlarge'  # Update based on your benchmark results
# PRODUCTION_EPOCHS = 3

# # Initialize job manager (also uncomment the SageMakerJobManager import in the imports cell)
# job_manager = SageMakerJobManager(
#     role_arn=ROLE_ARN,
#     bucket_name=BUCKET_NAME,
#     region=REGION
# )

# # Submit production training job
# timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
# production_job_name = f"qwen-hebrew-production-{timestamp}"

# job_name = job_manager.submit_training_job(
#     job_name=production_job_name,
#     instance_type=PRODUCTION_INSTANCE_TYPE,
#     dataset_path=DATASET_S3_PATH,
#     epochs=PRODUCTION_EPOCHS,
#     wandb_project='qwen-hebrew-production',
#     checkpoint_s3_uri=f's3://{BUCKET_NAME}/checkpoints/{production_job_name}/'
# )

# print(f"Production training job submitted: {job_name}")
# print(f"Instance type: {PRODUCTION_INSTANCE_TYPE}")
# print(f"Epochs: {PRODUCTION_EPOCHS}")
# print(f"Monitor progress in SageMaker console or W&B dashboard")