# Amazon Bedrock Model-as-a-Judge Evaluation Guide

## Introduction

This notebook demonstrates how to use Amazon Bedrock's Model-as-a-Judge feature for systematic model evaluation. The Model-as-a-Judge approach uses a foundation model to score another model's responses and provide explanations for the scores. The guide covers creating evaluation datasets, running evaluations, and comparing different foundation models.

## Contents

1. [Setup and Configuration](#setup)
2. [Dataset Generation](#dataset)
3. [S3 Integration](#s3)
4. [Single Model Evaluation](#single)
5. [Model Selection and Comparison](#comparison)
6. [Monitoring and Results](#monitoring)

## Prerequisites

- An AWS account with Bedrock access
- Appropriate IAM roles and permissions
- Access to supported evaluator models (Claude 3 Haiku, Claude 3.5 Sonnet, Mistral Large, or Meta Llama 3.1)
- An S3 bucket for storing evaluation data

Let's begin by upgrading boto3 to the latest version.

In [None]:
%pip install boto3 --upgrade

## Environment Setup <a name="setup"></a>

In [None]:
import boto3
import json
import random
from datetime import datetime
from typing import List, Dict, Any, Optional

# AWS Configuration
REGION = "<YOUR_REGION>"
ROLE_ARN = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
BUCKET_NAME = "<YOUR_BUCKET_NAME>"
PREFIX = "<YOUR_BUCKET_PREFIX>"

dataset_custom_name = "dummy-data"

# Initialize AWS clients
bedrock_client = boto3.client('bedrock', region_name=REGION)
s3_client = boto3.client('s3', region_name=REGION)

## Dataset Generation <a name="dataset"></a>

We'll create a simple dataset of mathematical reasoning problems. These problems test:
- Basic arithmetic
- Logical reasoning
- Natural language understanding

The dataset follows the required JSONL format for Bedrock evaluation jobs.
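
As a rough illustration of the record layout, the sketch below builds one record by hand and serializes it as a single JSONL line. The numbers (10 items at $2.50 with a 20% discount) are made up for the example; the keys `prompt`, `category`, and `referenceResponse` are the ones `generate_shopping_problems` in this section writes.

```python
import json

# Illustrative fixed values (not taken from the generated dataset)
quantity = 10
price_per_item = 2.50
discount_percent = 20

total_price = quantity * price_per_item                  # 25.0
discount_amount = total_price * discount_percent / 100   # 5.0
final_price = round(total_price - discount_amount, 2)    # 20.0

record = {
    "prompt": (
        f"If pencils cost ${price_per_item:.2f} each and you buy {quantity} "
        f"of them with a {discount_percent}% discount, how much will you pay in total?"
    ),
    "category": "Shopping Math",
    "referenceResponse": f"The total price will be ${final_price:.2f}.",
}

# JSONL means one JSON object per line, with no line breaks inside a record
jsonl_line = json.dumps(record)
print(jsonl_line)
```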

In [None]:
import random
import json

def generate_shopping_problems(num_problems=50):
    """Generate shopping-related math problems with random values."""
    problems = []
    items = ["apples", "oranges", "bananas", "books", "pencils", "notebooks"]
    
    for _ in range(num_problems):
        # Generate random values
        item = random.choice(items)
        quantity = random.randint(3, 20)
        price_per_item = round(random.uniform(1.5, 15.0), 2)
        discount_percent = random.choice([10, 15, 20, 25, 30])
        
        # Calculate the answer
        total_price = quantity * price_per_item
        discount_amount = total_price * (discount_percent / 100)
        final_price = round(total_price - discount_amount, 2)
        
        # Create the problem
        problem = {
            "prompt": f"If {item} cost ${price_per_item:.2f} each and you buy {quantity} of them with a {discount_percent}% discount, how much will you pay in total?",
            "category": "Shopping Math",
            "referenceResponse": f"The total price will be ${final_price:.2f}. Original price: ${total_price:.2f} minus {discount_percent}% discount (${discount_amount:.2f})"
        }
        
        problems.append(problem)
    
    return problems

def save_to_jsonl(problems, output_file):
    """Save the problems to a JSONL file."""
    with open(output_file, 'w') as f:
        for problem in problems:
            f.write(json.dumps(problem) + '\n')

SAMPLE_SIZE = 30
problems = generate_shopping_problems(SAMPLE_SIZE)
save_to_jsonl(problems, f"{dataset_custom_name}.jsonl")
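
Before uploading, it can help to sanity-check the file. The following is a minimal sketch, not an official validator: `find_invalid_jsonl_lines` is a hypothetical helper, and the rule it enforces (every record is a JSON object with a non-empty string `prompt`) is an assumption based on the records generated above.

```python
import json
from typing import Iterable, List


def find_invalid_jsonl_lines(lines: Iterable[str]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {number}: empty line")
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        # Assumption: each record must be an object with a non-empty "prompt"
        if (
            not isinstance(record, dict)
            or not isinstance(record.get("prompt"), str)
            or not record["prompt"].strip()
        ):
            problems.append(f"line {number}: missing or empty 'prompt'")
    return problems


# In-memory sample: one valid record, one without a prompt, one broken line
sample = [
    '{"prompt": "What is 2 + 2?", "referenceResponse": "4"}',
    '{"category": "Shopping Math"}',
    "not json",
]
issues = find_invalid_jsonl_lines(sample)
print(issues)
```

To check the generated file, pass an open file handle instead of the sample list, for example `find_invalid_jsonl_lines(open(f"{dataset_custom_name}.jsonl"))`.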

## S3 Integration <a name="s3"></a>

After generating our sample dataset, we need to upload it to S3 for use in the evaluation job. 
We'll use the boto3 S3 client to upload our JSONL file.

> **Note**: Make sure your IAM role has appropriate S3 permissions (s3:PutObject) for the target bucket.

Make sure the IAM role `AmazonSageMakerServiceCatalogProductsUseRole` has S3 write permissions. You can add them with this inline policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<YOUR_BUCKET_NAME>",
                "arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
            ]
        }
    ]
}
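
If you prefer to generate the policy document for your own bucket rather than editing the JSON by hand, a small sketch like the following may help. `build_s3_access_policy` is a hypothetical helper and `example-bucket` is a placeholder name; the actions are copied from the policy above.

```python
import json


def build_s3_access_policy(bucket_name: str) -> dict:
    """Build the inline policy shown above for a given bucket name."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:PutObjectAcl",
                    "s3:GetObject",
                    "s3:GetObjectAcl",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                ],
                # Bucket-level actions use the bucket ARN, object-level the /* ARN
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


policy_json = json.dumps(build_s3_access_policy("example-bucket"), indent=4)
print(policy_json)
```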

In [None]:
def upload_to_s3(local_file: str, bucket: str, s3_key: str) -> bool:
    """
    Upload a file to S3 with error handling.
    
    Returns:
        bool: Success status
    """
    try:
        s3_client.upload_file(local_file, bucket, s3_key)
        print(f"✓ Successfully uploaded to s3://{bucket}/{s3_key}")
        return True
    except Exception as e:
        print(f"✗ Error uploading to S3: {str(e)}")
        return False

# Upload dataset
s3_key = f"{PREFIX}/{dataset_custom_name}.jsonl"

# Requires S3 write permissions, added via an inline policy on the IAM role AmazonSageMakerServiceCatalogProductsUseRole
upload_success = upload_to_s3(f"{dataset_custom_name}.jsonl", BUCKET_NAME, s3_key)

if not upload_success:
    raise Exception("Failed to upload dataset to S3")
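
As an optional follow-up check, you could confirm that the object actually landed in the bucket. The sketch below assumes a boto3 S3 client is passed in and uses its `head_object` call; `get_s3_object_size` is a hypothetical helper name, and the calling role is assumed to have `s3:GetObject` on the key.

```python
from typing import Any, Optional


def get_s3_object_size(client: Any, bucket: str, key: str) -> Optional[int]:
    """Return the object's size in bytes, or None if the lookup fails."""
    try:
        response = client.head_object(Bucket=bucket, Key=key)
        return response["ContentLength"]
    except Exception as exc:  # with boto3 this is typically a ClientError
        print(f"Could not read s3://{bucket}/{key}: {exc}")
        return None
```

In this notebook the call would look like `get_s3_object_size(s3_client, BUCKET_NAME, s3_key)`; a non-zero size suggests the dataset is in place.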

## Evaluation Job Configuration

Configure the LLM-as-Judge evaluation with comprehensive metrics for assessing model performance:

| Metric Category | Description |
|----------------|-------------|
| Quality | Correctness, Completeness, Faithfulness |
| User Experience | Helpfulness, Coherence, Relevance |
| Instructions | Following Instructions, Professional Style and Tone |
| Safety | Harmfulness, Stereotyping, Refusal |
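
The table can also be expressed as a lookup, which is handy if you only want a subset of metrics. This is a sketch under assumptions: `METRICS_BY_CATEGORY` and `select_metrics` are hypothetical names, the grouping mirrors the table above, and the `Builtin.*` identifiers are the ones used in this notebook's job configuration.

```python
from typing import Dict, Iterable, List

# Category grouping taken from the table above
METRICS_BY_CATEGORY: Dict[str, List[str]] = {
    "Quality": ["Builtin.Correctness", "Builtin.Completeness", "Builtin.Faithfulness"],
    "User Experience": ["Builtin.Helpfulness", "Builtin.Coherence", "Builtin.Relevance"],
    "Instructions": ["Builtin.FollowingInstructions", "Builtin.ProfessionalStyleAndTone"],
    "Safety": ["Builtin.Harmfulness", "Builtin.Stereotyping", "Builtin.Refusal"],
}


def select_metrics(categories: Iterable[str]) -> List[str]:
    """Return the metric names for the requested categories, in order."""
    selected: List[str] = []
    for category in categories:
        if category not in METRICS_BY_CATEGORY:
            raise ValueError(f"Unknown metric category: {category!r}")
        selected.extend(METRICS_BY_CATEGORY[category])
    return selected


print(select_metrics(["Quality", "Safety"]))
```

Note that `create_llm_judge_evaluation` as written in this notebook always submits all eleven metrics; using a subset would mean passing the selected list into that function yourself.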

In [None]:
def create_llm_judge_evaluation(
    client,
    job_name: str,
    role_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    evaluator_model_id: str,
    generator_model_id: str,
    dataset_name: Optional[str] = None,
    task_type: str = "General" # must be General for LLMaaJ
):    
    # All available LLM-as-judge metrics
    llm_judge_metrics = [
        "Builtin.Correctness",
        "Builtin.Completeness", 
        "Builtin.Faithfulness",
        "Builtin.Helpfulness",
        "Builtin.Coherence",
        "Builtin.Relevance",
        "Builtin.FollowingInstructions",
        "Builtin.ProfessionalStyleAndTone",
        "Builtin.Harmfulness",
        "Builtin.Stereotyping",
        "Builtin.Refusal"
    ]

    # Configure dataset
    dataset_config = {
        "name": dataset_name or "CustomDataset",
        "datasetLocation": {
            "s3Uri": input_s3_uri
        }
    }

    try:
        response = client.create_evaluation_job(
            jobName=job_name,
            roleArn=role_arn,
            applicationType="ModelEvaluation",
            evaluationConfig={
                "automated": {
                    "datasetMetricConfigs": [
                        {
                            "taskType": task_type,
                            "dataset": dataset_config,
                            "metricNames": llm_judge_metrics
                        }
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [
                            {
                                "modelIdentifier": evaluator_model_id
                            }
                        ]
                    }
                }
            },
            inferenceConfig={
                "models": [
                    {
                        "bedrockModel": {
                            "modelIdentifier": generator_model_id
                        }
                    }
                ]
            },
            outputDataConfig={
                "s3Uri": output_s3_uri
            }
        )
        return response
        
    except Exception as e:
        print(f"Error creating evaluation job: {str(e)}")
        raise

## Single Model Evaluation <a name="single"></a>

First, let's run a single evaluation job using Claude 3 Haiku as both generator and evaluator.

Make sure the IAM role `AmazonSageMakerServiceCatalogProductsUseRole` has the `iam:PassRole` permission.
You can add it via this inline policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/service-role/Amazon-Bedrock-IAM-Role-*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "bedrock.amazonaws.com"
                }
            }
        }
    ]
}


In [None]:
# Job Configuration
evaluator_model = "anthropic.claude-3-haiku-20240307-v1:0"
generator_model = "anthropic.claude-3-haiku-20240307-v1:0"
job_name = f"llmaaj-{generator_model.split('.')[0]}-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# S3 Paths
input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{dataset_custom_name}.jsonl"
output_path = f"s3://{BUCKET_NAME}/{PREFIX}"

# Create evaluation job

# Requires the iam:PassRole permission on AmazonSageMakerServiceCatalogProductsUseRole
try:
    llm_as_judge_response = create_llm_judge_evaluation(
        client=bedrock_client,
        job_name=job_name,
        role_arn=ROLE_ARN,
        input_s3_uri=input_data,
        output_s3_uri=output_path,
        evaluator_model_id=evaluator_model,
        generator_model_id=generator_model,
        task_type="General"
    )
    print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")
except Exception as e:
    print(f"✗ Failed to create evaluation job: {str(e)}")
    raise
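
The job name above is built from the first dot-separated segment of each model ID, which is `anthropic` here but would be `us` for a cross-region ID such as `us.writer.palmyra-x4-v1:0`. A possible alternative is sketched below. `make_job_name` is a hypothetical helper, and the naming rules it enforces (lowercase letters, digits, and hyphens, at most 63 characters) are an assumption that should be checked against the `CreateEvaluationJob` API reference.

```python
import re
from datetime import datetime

# Assumption: maximum job name length; verify against the API reference
MAX_JOB_NAME_LENGTH = 63


def make_job_name(generator_model: str, evaluator_model: str, now: datetime) -> str:
    """Build a job name from the model names plus a timestamp."""

    def slug(model_id: str) -> str:
        # Drop the ":0"-style suffix and keep the last dot-separated segment
        name = model_id.split(":")[0].split(".")[-1]
        return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

    timestamp = now.strftime("%Y%m%d-%H%M%S")
    prefix = f"llmaaj-{slug(generator_model)}-{slug(evaluator_model)}"
    # Truncate the prefix, never the timestamp, so the time component survives
    room = MAX_JOB_NAME_LENGTH - len(timestamp) - 1
    return f"{prefix[:room].rstrip('-')}-{timestamp}"


print(make_job_name(
    "us.writer.palmyra-x4-v1:0",
    "anthropic.claude-3-haiku-20240307-v1:0",
    datetime(2025, 1, 2, 3, 4, 5),
))
```

Keeping the model name in the job name makes jobs easier to tell apart in the console, at the cost of longer names that may need truncating.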

### Monitoring Job Progress
Track the status of your evaluation job:

In [None]:
# Job ARN returned when the evaluation job was created
evaluation_job_arn = llm_as_judge_response['jobArn']

# Check job status
check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn) 
print(f"Job Status: {check_status['status']}")
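
A single status check only gives a snapshot. If you would rather wait for the job to finish, a simple polling loop is one option. The sketch below assumes a boto3 Bedrock client is passed in; `wait_for_evaluation_job` is a hypothetical helper, and the status strings in `TERMINAL_STATUSES` are an assumption to confirm against the `get_evaluation_job` documentation.

```python
import time
from typing import Any, Callable

# Assumption: statuses after which the job no longer changes
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_evaluation_job(
    client: Any,
    job_arn: str,
    poll_seconds: float = 60.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the job until it reaches a terminal status or max_polls is hit."""
    status = "Unknown"
    for attempt in range(max_polls):
        status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
        print(f"[{attempt + 1}/{max_polls}] Job status: {status}")
        if status in TERMINAL_STATUSES:
            break
        # Skip the pause after the final attempt
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return status
```

In this notebook the call would be `wait_for_evaluation_job(bedrock_client, evaluation_job_arn)`. The function returns the last status it saw, so a non-terminal return value means the polling budget ran out.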

## Model Selection and Comparison <a name="comparison"></a>

Now, let's evaluate multiple generator models to find the optimal model for our use case. We'll compare different foundation models while using a consistent evaluator.

In [None]:
# Available Generator Models

#anthropic.claude-3-haiku-20240307-v1:0
#amazon.nova-micro-v1:0
#us.deepseek.r1-v1:0
#us.writer.palmyra-x4-v1:0
#us.writer.palmyra-x5-v1:0


GENERATOR_MODELS = [
    "us.writer.palmyra-x4-v1:0",
    "us.writer.palmyra-x5-v1:0"
]



# Consistent Evaluator

#EVALUATOR_MODEL = "us.amazon.nova-pro-v1:0"
EVALUATOR_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"

def run_model_comparison(
    generator_models: List[str],
    evaluator_model: str
) -> List[Dict[str, Any]]:
    evaluation_jobs = []
    
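    for generator_model in generator_models:
        # Use the model name rather than the first ID segment: cross-region IDs
        # such as "us.writer.palmyra-x4-v1:0" would otherwise all map to "us"
        # and produce identical job names when created in the same second
        model_slug = generator_model.split(':')[0].split('.')[-1]
        job_name = f"llmaaj-{model_slug}-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"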
        
        try:
            response = create_llm_judge_evaluation(
                client=bedrock_client,
                job_name=job_name,
                role_arn=ROLE_ARN,
                input_s3_uri=input_data,
                output_s3_uri=f"{output_path}/{job_name}/",
                evaluator_model_id=evaluator_model,
                generator_model_id=generator_model,
                task_type="General"
            )
            
            job_info = {
                "job_name": job_name,
                "job_arn": response["jobArn"],
                "generator_model": generator_model,
                "evaluator_model": evaluator_model,
                "status": "CREATED"
            }
            evaluation_jobs.append(job_info)
            
            print(f"✓ Created job: {job_name}")
            print(f"  Generator: {generator_model}")
            print(f"  Evaluator: {evaluator_model}")
            print("-" * 80)
            
        except Exception as e:
            print(f"✗ Error with {generator_model}: {str(e)}")
            continue
            
    return evaluation_jobs

# Run model comparison
evaluation_jobs = run_model_comparison(GENERATOR_MODELS, EVALUATOR_MODEL)

## Monitoring and Results <a name="monitoring"></a>

Track the progress of all evaluation jobs and display their current status.

In [None]:
# function to check job status
def check_jobs_status(jobs, client):
    """Check and update status for all evaluation jobs"""
    for job in jobs:
        try:
            response = client.get_evaluation_job(
                jobIdentifier=job["job_arn"]
            )
            job["status"] = response["status"]
        except Exception as e:
            job["status"] = f"ERROR: {str(e)}"
    
    return jobs

# Check initial status
updated_jobs = check_jobs_status(evaluation_jobs, bedrock_client)

# Display status summary
for job in updated_jobs:
    print(f"Job: {job['job_name']}")
    print(f"Status: {job['status']}")
    print(f"Generator: {job['generator_model']}")
    print(f"Evaluator: {job['evaluator_model']}")
    print("-" * 80)
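
With more than a handful of jobs, a compact count per status can be easier to read than the full listing. The following sketch works on the same list-of-dicts structure used above; `count_jobs_by_status` is a hypothetical helper and the example statuses are made up.

```python
from collections import Counter
from typing import Any, Dict, List


def count_jobs_by_status(jobs: List[Dict[str, Any]]) -> Counter:
    """Count jobs per status, grouping all lookup errors under 'ERROR'."""
    counts: Counter = Counter()
    for job in jobs:
        status = str(job["status"])
        # check_jobs_status stores failures as "ERROR: <message>"
        counts["ERROR" if status.startswith("ERROR") else status] += 1
    return counts


# Made-up example data in the same shape as evaluation_jobs
example_jobs = [
    {"job_name": "job-a", "status": "InProgress"},
    {"job_name": "job-b", "status": "InProgress"},
    {"job_name": "job-c", "status": "ERROR: access denied"},
]
print(count_jobs_by_status(example_jobs))
```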

## Spearman's Correlation Analysis Between Multiple Generator Models

- To calculate Spearman's rank correlation between generator models, first read the evaluation results from S3 using the path structure `s3://[output-path]/[job-name]/[job-uuid]/models/[model-id]/taskTypes/[task-type]/datasets/dataset/[file-uuid]_output.jsonl`.
- Each file contains evaluation scores across the configured metrics (for example Correctness, Completeness, Helpfulness, Coherence, and Faithfulness).
- Use `scipy.stats` to compute the correlation coefficient between pairs of generator models, handling constant score lists (for which the coefficient is undefined) and computation errors separately.
- The resulting correlations show, per metric, whether the judge's scores for two models rise and fall on the same prompts. Coefficients closer to 1.0 indicate stronger agreement in how the two models' responses were ranked.
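
To make the statistic concrete, here is a small pure-Python sketch of Spearman's rank correlation: rank both score lists (ties get the average rank), then compute the Pearson correlation of the ranks. It is for illustration only and is not a replacement for `scipy.stats.spearmanr`; `average_ranks` and `spearman_rho` are hypothetical names and the scores are made up.

```python
from math import sqrt
from typing import List, Optional, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation of the ranks; None if either input is constant."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # constant scores: correlation is undefined
    return cov / sqrt(var_x * var_y)


# Made-up scores for five prompts from two hypothetical models
model_a = [1, 2, 3, 4, 5]
model_b = [2, 1, 4, 3, 5]
print(spearman_rho(model_a, model_b))
```

In this example the two models swap places on two pairs of prompts but otherwise agree, which gives a coefficient of about 0.8. A constant list (for example a model that scores 1.0 on every prompt) has no rank variation, which is why the notebook code reports such cases as undefined.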

In [None]:
import json
import boto3
import numpy as np
from scipy import stats

def read_and_organize_metrics_from_s3(bucket_name, file_key):
    s3_client = boto3.client('s3')
    metrics_dict = {}
    
    try:
        response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
        content = response['Body'].read().decode('utf-8')
        
        for line in content.strip().split('\n'):
            if line:
                data = json.loads(line)
                if 'automatedEvaluationResult' in data and 'scores' in data['automatedEvaluationResult']:
                    for score in data['automatedEvaluationResult']['scores']:
                        metric_name = score['metricName']
                        if 'result' in score:
                            metric_value = score['result']
                            if metric_name not in metrics_dict:
                                metrics_dict[metric_name] = []
                            metrics_dict[metric_name].append(metric_value)
        return metrics_dict
    
    except Exception as e:
        print(f"Error: {e}")
        return None

def get_spearmanr_correlation(scores1, scores2):
    if len(set(scores1)) == 1 or len(set(scores2)) == 1:
        return "undefined (constant scores)", "undefined"
    
    try:
        result = stats.spearmanr(scores1, scores2)
        return round(float(result.statistic), 4), round(float(result.pvalue), 4)
    except Exception as e:
        return f"error: {str(e)}", "undefined"

# Extract metrics
bucket_name = BUCKET_NAME
file_key1 = "<EVALUATION_FILE_KEY1>"
file_key2 = "<EVALUATION_FILE_KEY2>"

metrics1 = read_and_organize_metrics_from_s3(bucket_name, file_key1)
metrics2 = read_and_organize_metrics_from_s3(bucket_name, file_key2)

# read_and_organize_metrics_from_s3 returns None on failure, so stop early
if metrics1 is None or metrics2 is None:
    raise RuntimeError("Could not read evaluation results from S3")

# Calculate correlations for common metrics
common_metrics = set(metrics1.keys()) & set(metrics2.keys())

for metric_name in common_metrics:
    scores1 = metrics1[metric_name]
    scores2 = metrics2[metric_name]
    
    if len(scores1) == len(scores2):
        correlation, p_value = get_spearmanr_correlation(scores1, scores2)
        
        print(f"\nMetric: {metric_name}")
        print(f"Number of samples: {len(scores1)}")
        print(f"Unique values in Model 1 scores: {len(set(scores1))}")
        print(f"Unique values in Model 2 scores: {len(set(scores2))}")
        print(f"Model 1 scores range: [{min(scores1)}, {max(scores1)}]")
        print(f"Model 2 scores range: [{min(scores2)}, {max(scores2)}]")
        print(f"Spearman correlation coefficient: {correlation}")
        print(f"P-value: {p_value}")
    else:
        print(f"\nMetric: {metric_name}")
        print("Error: Different number of samples between models")
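
The file keys above have to be filled in by hand. One way to discover them is to list the objects under a job's output prefix and keep those ending in `_output.jsonl`, as in the path structure described earlier. This is a sketch assuming a boto3 S3 client is passed in (it uses the `list_objects_v2` paginator) and that the role has `s3:ListBucket`; `find_output_keys` is a hypothetical helper.

```python
from typing import Any, List


def find_output_keys(client: Any, bucket: str, prefix: str) -> List[str]:
    """Return all keys under the prefix that end with '_output.jsonl'."""
    keys: List[str] = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matches have no "Contents" entry
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("_output.jsonl"):
                keys.append(obj["Key"])
    return sorted(keys)
```

For a comparison job in this notebook, the call might look like `find_output_keys(s3_client, BUCKET_NAME, f"{PREFIX}/{job_name}/")`, since each comparison job writes to its own `job_name` folder.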

## Next Steps

After running the evaluation job:
1. Monitor the job status in the Bedrock console or through the `get_evaluation_job` API
2. Review the report card for:
   - Score distributions across different metrics
   - Detailed explanations for scoring provided by the judge model
   - Overall performance analysis
3. Access full results in your specified S3 bucket

> **Note**: The evaluation results will help you understand your model's strengths and areas for improvement across multiple dimensions of performance.
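
For a quick look at score distributions outside the console, the per-metric lists returned by `read_and_organize_metrics_from_s3` can be summarized directly. The sketch below is one possible approach; `summarize_scores` is a hypothetical helper, the example scores are made up, and skipping non-numeric entries is an assumption about how missing results might appear.

```python
from statistics import mean
from typing import Any, Dict, List


def summarize_scores(metrics: Dict[str, List[Any]]) -> Dict[str, Dict[str, float]]:
    """Compute count, mean, min, and max per metric, ignoring non-numeric values."""
    summary: Dict[str, Dict[str, float]] = {}
    for name, scores in metrics.items():
        numeric = [
            s for s in scores
            if isinstance(s, (int, float)) and not isinstance(s, bool)
        ]
        if not numeric:
            continue
        summary[name] = {
            "count": len(numeric),
            "mean": round(mean(numeric), 4),
            "min": min(numeric),
            "max": max(numeric),
        }
    return summary


# Made-up scores in the shape produced by read_and_organize_metrics_from_s3
example_metrics = {
    "Builtin.Correctness": [1.0, 0.5, 1.0, None],
    "Builtin.Helpfulness": [0.75, 0.5],
}
for metric, stats_row in summarize_scores(example_metrics).items():
    print(metric, stats_row)
```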