# NIST SP 800-53 Controls Processing Pipeline
## Step 4: Judge Scenarios

**Purpose**: Evaluate LLM performance on NIST compliance scenarios by comparing model judgments against ground truth labels to assess accuracy and price/performance characteristics across different models.

**Author**: NIST Compliance LLM Evaluation Framework  
**Last Updated**: January 2025  
**Environment**: SageMaker Studio, Python 3.11

### Overview

This notebook serves as the fourth and final stage in a four-part pipeline for evaluating LLM performance on NIST compliance scenarios. It processes compliance scenarios generated in Step 3, evaluating various LLMs' ability to correctly assess compliance against organizational policies derived from NIST controls.

### Business Context

Organizations implementing NIST cybersecurity frameworks need reliable automated tools to assess compliance. This pipeline enables systematic evaluation of different LLM models' accuracy, consistency, and cost-effectiveness for compliance automation tasks, supporting data-driven decisions for regulatory technology investments.

### Data

#### Input
- Labeled compliance scenarios from S3 (JSON format with ground truth labels)
- Policy documents retrieved from S3 per scenario

#### Output
- Model evaluation results (judgments and token consumption) stored in S3

### Input JSON Structure

Scenarios loaded from S3 with the following structure:
```json
{
  "scenarios": [
    {
      "scenario-id": "scenario-id-1",
      "scenario-detail": "A new employee, Sarah Johnson, joins the IT department... Policies referenced: POL_AC-2, POL_IA-4",
      "is-compliant": false,
      "non-compliant-reason": "The scenario violates access control policy..."
    }
  ]
}
```

### Output JSON Structure

Judged scenarios with LLM evaluation results:
```json
{
  "scenarios": [
    {
      "scenario-id": "scenario-id-1",
      "scenario-detail": "A new employee, Sarah Johnson, joins the IT department...",
      "is-compliant": false,
      "non-compliant-reason": "The scenario violates...",
      "judged-compliant": true,
      "judged-compliant-reason": "Considered the rules AC... and scenario is not in violation...",
      "llm-judge": "global.anthropic.claude-opus-4-5-20251101-v1:0",
      "llm-judge-temp": 0.1,
      "llm-judge-input-tokens": 1250,
      "llm-judge-output-tokens": 180,
      "llm-judge-total-tokens": 1430,
      "judged-dtm": "2026-02-08T23:15:17.215554"
    }
  ]
}
```

### Processing Steps

1. Retrieves labeled compliance test scenarios from S3 storage (`scenarios/` prefix)
2. Parses "Policies referenced:" from scenario text to identify relevant policies
3. Loads specific policy documents from S3 (`policies/markdown/all-policies-main/` prefix)
4. Tests LLM models against scenarios using AWS Bedrock with structured JSON output
5. Records input, output, and total token counts per scenario evaluation (the basis for later cost analysis)
6. Saves evaluation results to S3 (`scenarios-judged/` prefix) with streaming output
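
Step 2 can be illustrated with a small sketch. It assumes the `Policies referenced:` marker is followed by a comma-separated list on the same line. The helper name `parse_policy_ids` is hypothetical, and stripping a sentence-ending period is an assumption about how scenario text may end, not something the notebook itself does.

```python
import re
from typing import List


def parse_policy_ids(scenario_detail: str) -> List[str]:
    """Return policy IDs listed after 'Policies referenced:' (empty list if absent)."""
    match = re.search(r"Policies referenced:\s*(.+)", scenario_detail)
    if not match:
        return []
    # Split on commas, trim whitespace, and drop a trailing period if present (assumption).
    ids = [part.strip().rstrip(".") for part in match.group(1).split(",")]
    return [policy_id for policy_id in ids if policy_id]


detail = "A new employee joins the IT department... Policies referenced: POL_AC-2, POL_IA-4"
print(parse_policy_ids(detail))  # ['POL_AC-2', 'POL_IA-4']
```

Each returned ID maps to one object key under the policy prefix (for example `POL_AC-2.md`), which is how step 3 locates the documents.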

### Key Features

- Evaluates Claude (Converse API with tools) and Nova (invoke_model with JSON parsing)
- Extracts policy IDs from scenario text and retrieves full policy content from S3
- Directs Claude models to return JSON with compliance reasoning via tool configuration (the Nova path instead parses JSON from the response text)
- Tracks token usage (input/output/total) and a judgment timestamp per scenario
- Saves results incrementally to prevent data loss during long-running evaluations
- Retries throttled API calls with exponential backoff (up to 5 attempts)
- Background thread prevents session timeout during long evaluations
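
The JSON-parsing approach used for Nova can be sketched as below. This is a minimal illustration, assuming the model wraps a single JSON object in free text. `extract_json_object` is a hypothetical helper, and the sample reply is invented for the example.

```python
import json
from typing import Any, Dict, Optional


def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Parse the span from the first '{' to the last '}' in text; None if not valid JSON."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


reply = 'Here is my verdict:\n{"judged-compliant": false, "judged-compliant-reason": "Account was shared."}\nDone.'
print(extract_json_object(reply))
```

Returning `None` on failure, rather than a default verdict, lets the caller decide how an unparseable reply should be recorded.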

### Configuration Parameters

- Token limit: 4096 max tokens per Bedrock call (`MAX_TOKENS`)
- Retries: up to 5 attempts on throttling (`MAX_RETRIES_ON_THROTTLE`), with exponential backoff of 2 × 2^attempt seconds (2, 4, 8, 16, 32)
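
The backoff schedule implied by these settings can be worked out with a short sketch. It assumes the delay is `base_delay * 2**attempt` with `base_delay = 2`, as in the notebook's `bedrock_call_with_retry`. The function name `backoff_delays` is hypothetical.

```python
from typing import List


def backoff_delays(max_attempts: int = 5, base_delay: int = 2) -> List[int]:
    """Delay in seconds slept after each throttled attempt: base_delay * 2**attempt."""
    return [base_delay * (2 ** attempt) for attempt in range(max_attempts)]


delays = backoff_delays()
print(delays)       # [2, 4, 8, 16, 32]
print(sum(delays))  # 62
```

Under these assumptions, a call that is throttled on every attempt waits about a minute in total before giving up.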

### Tool Configuration

For Claude models only, the notebook uses the Bedrock Converse API with two tools (the Nova path does not use tool configuration):
1. `judged_scenario_json` returns the verdict as structured JSON with `judged-compliant` (boolean) and `judged-compliant-reason` (string)
2. `compliance_calculator` handles time/money/data unit comparisons in policy evaluation
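
As a rough sketch of how a verdict is read back from a tool call, the function below walks the content blocks of a Converse response message and returns the first `judged_scenario_json` payload. `extract_verdict` is a hypothetical helper. The sample content is hand-built to mirror the shape of `response["output"]["message"]["content"]` and is not real model output.

```python
from typing import Any, Dict, List, Optional


def extract_verdict(content_blocks: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the first scenario verdict found in judged_scenario_json tool calls, else None."""
    for block in content_blocks:
        tool_use = block.get("toolUse")
        if tool_use and tool_use.get("name") == "judged_scenario_json":
            scenarios = tool_use.get("input", {}).get("scenarios", [])
            if scenarios:
                return scenarios[0]
    return None


# Hand-built example; the toolUseId is a placeholder, not a real identifier.
sample_content = [
    {"text": "Evaluating the scenario against POL_AC-2."},
    {
        "toolUse": {
            "toolUseId": "tooluse_example",
            "name": "judged_scenario_json",
            "input": {
                "scenarios": [
                    {"judged-compliant": False, "judged-compliant-reason": "Access was not revoked."}
                ]
            },
        }
    },
]
print(extract_verdict(sample_content))
```

Calls to `compliance_calculator` appear as `toolUse` blocks of the same shape. Those are answered with a `toolResult` message so the model can continue reasoning before it returns a verdict.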

### Output Artifacts

- JSON files with model judgments, token usage, and timestamps
- Per-scenario token counts for price/performance analysis
- Files named `judged_scenarios_batch-{scenario_file}-{model_name}-temp{temperature}.json`
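
For downstream analysis, accuracy and token totals can be derived from these records. The sketch below is one way to do it, assuming each record carries the `is-compliant`, `judged-compliant`, and `llm-judge-total-tokens` fields shown in the output structure. `summarize_judgments` and the sample records are illustrative only.

```python
from typing import Dict, List


def summarize_judgments(scenarios: List[Dict]) -> Dict[str, float]:
    """Compute accuracy and token totals for a list of judged scenario records."""
    # Only records that actually carry a verdict are counted.
    judged = [s for s in scenarios if "judged-compliant" in s]
    correct = sum(1 for s in judged if s["judged-compliant"] == s["is-compliant"])
    total_tokens = sum(s.get("llm-judge-total-tokens", 0) for s in judged)
    return {
        "judged": len(judged),
        "accuracy": correct / len(judged) if judged else 0.0,
        "total_tokens": total_tokens,
    }


records = [
    {"is-compliant": False, "judged-compliant": True, "llm-judge-total-tokens": 1430},
    {"is-compliant": True, "judged-compliant": True, "llm-judge-total-tokens": 1200},
]
print(summarize_judgments(records))
```

Multiplying the token totals by each model's published per-token price, which is not part of this notebook, would give the cost side of a price/performance comparison.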

### Dependencies

- AWS Bedrock access with model permissions (Claude, Nova)
- AWS S3 read/write access to compliance bucket
- Python libraries: boto3, botocore, datetime, json, re, time, threading, pathlib, typing
- compliance_utils module (COMPLIANCE_JUDGE_PROMPT, compliance_calculator, CALCULATOR_TOOL)
- Compliance scenarios from Step 3 and policy documents from Step 2

### Architecture Notes

- Retrieves policies directly from S3 instead of a Knowledge Base, so the model always receives the exact policy documents a scenario references
- Supports both Claude (tool-based) and Nova (JSON parsing) model types
- Implements streaming file output to handle large scenario batches
- Includes error handling and progress tracking
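
The streaming-output idea can be sketched as follows. This is a minimal illustration rather than the notebook's implementation. `stream_records` is a hypothetical helper, `None` stands in for a scenario that is skipped, and the comma separator is keyed to the number of records written so that a skipped first item cannot leave a leading comma.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Optional


def stream_records(records: Iterable[Optional[Dict]], output_file: Path) -> int:
    """Write records one at a time into {"scenarios": [...]}; return how many were written."""
    output_file.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(output_file, "w") as f:
        f.write('{"scenarios":[')
        for record in records:
            if record is None:  # stand-in for a skipped scenario
                continue
            if written > 0:     # separator depends on records written, not loop index
                f.write(",")
            json.dump(record, f)
            f.flush()           # push each record to disk as soon as it is produced
            written += 1
        f.write("]}")
    return written


out = Path(tempfile.mkdtemp()) / "judged.json"
count = stream_records([None, {"scenario-id": "s-1"}, {"scenario-id": "s-2"}], out)
print(count, json.loads(out.read_text()))
```

If the process stops mid-run, the file still holds every record written so far but lacks the closing `]}`, so a recovery step would need to append it before parsing.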



In [1]:
import boto3
import datetime
import json
import re
import time   # For rate limiting between API calls
import threading
from botocore.exceptions import ClientError
from compliance_utils import compliance_calculator, CALCULATOR_TOOL, COMPLIANCE_JUDGE_PROMPT
from pathlib import Path
from typing import List, Dict, Callable, Any

FOLDER_HOME: Path = Path('/home/sagemaker-user')
FOLDER_JUDGED_SCENARIOS: Path = FOLDER_HOME / 'data/judged_scenarios/'
BUCKET = '183023889407-us-east-1-compliance-rule-generator'
S3_SOURCE_SCENARIOS = 'scenarios/'  # Folder path in S3 where scenarios are stored
S3_SOURCE_POLICY_ALL = 'policies/markdown/all-policies-main/'
S3_JUDGED_SCENARIOS = 'scenarios-judged/'  # Folder path for results
AWS_REGION = 'us-east-1'
# KNOWLEDGE_BASE_ID = 'T8EW10IU3Z' - using s3 to retrieve policies, KB performance unsatisfactory
MAX_TOKENS = 4096
MAX_RETRIES_ON_THROTTLE = 5

# Tool configuration for Bedrock Converse API
# Forces the model to return structured JSON with specific schema, and use calculator tool
TOOL_CONFIG = {
    "tools": [
        {
            "toolSpec": {
                "name": "judged_scenario_json",
                "description": "Return judged compliance scenarios as JSON",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "scenarios": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "judged-compliant": {"type": "boolean"},
                                        "judged-compliant-reason": {"type": "string"}
                                    },
                                    "required": ["judged-compliant", "judged-compliant-reason"]
                                }
                            }
                        },
                        "required": ["scenarios"]
                    }
                }
            }
        },
        {
            "toolSpec": {
                "name": "compliance_calculator",
                "description": "Calculate and compare values with time, money, data, and percentage units",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "expression": {"type": "string", "description": "Expression like '800ms < 1s' or '4m > 3b'"}
                        },
                        "required": ["expression"]
                    }
                }
            }
        }
    ]
}

MODELS = [
        {
            'name': 'claude_3_7_sonnet-t0.0',
            'arn': 'arn:aws:bedrock:us-east-1:183023889407:inference-profile/us.anthropic.claude-3-7-sonnet-20250219-v1:0',
            'temperature': 0.0
        },
        {
            'name': 'claude_3_7_sonnet-t0.1',
            'arn': 'arn:aws:bedrock:us-east-1:183023889407:inference-profile/us.anthropic.claude-3-7-sonnet-20250219-v1:0',
            'temperature': 0.1
        },
        {
            'name': 'claude_4_sonnet-t0.0',
            'arn': 'arn:aws:bedrock:us-east-1:183023889407:inference-profile/us.anthropic.claude-sonnet-4-20250514-v1:0',
            'temperature': 0.0
        },
        {
            'name': 'claude_4_sonnet-t0.1',
            'arn': 'arn:aws:bedrock:us-east-1:183023889407:inference-profile/us.anthropic.claude-sonnet-4-20250514-v1:0',
            'temperature': 0.1
        },
        {
            'name': 'claude_opus_4_5-t0.0',
            'arn': 'arn:aws:bedrock:us-east-1:183023889407:inference-profile/global.anthropic.claude-opus-4-5-20251101-v1:0',
            'temperature': 0.0
        },
        {
            'name': 'claude_opus_4_5-t0.1',
            'arn': 'arn:aws:bedrock:us-east-1:183023889407:inference-profile/global.anthropic.claude-opus-4-5-20251101-v1:0',
            'temperature': 0.1
        },
        {
            'name': 'nova_premier-t0.0',
            'arn': 'arn:aws:bedrock:us-east-1:183023889407:inference-profile/us.amazon.nova-premier-v1:0',
            'temperature': 0.0
        },
        {
            'name': 'nova_premier-t0.1',
            'arn': 'arn:aws:bedrock:us-east-1:183023889407:inference-profile/us.amazon.nova-premier-v1:0',
            'temperature': 0.1
        },
        {
            'name': 'nova_2_lite-t0.0',
            'arn': 'arn:aws:bedrock:us-east-1:183023889407:inference-profile/us.amazon.nova-2-lite-v1:0',
            'temperature': 0.0
        }
    ]

SCENARIOS = ["scenarios-10-policies-each-200.json", "scenarios-8-policies-each.json", "scenarios-6-policies-each.json", "scenarios-4-policies-each.json"]

# CALCULATOR_TOOL["toolSpec"] (imported from compliance_utils) holds the calculator tool definition; TOOL_CONFIG above spells the calculator spec out inline

# Initialize AWS Bedrock clients
# bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name=AWS_REGION)  # For knowledge base retrieval
bedrock_runtime = boto3.client('bedrock-runtime', region_name=AWS_REGION)  # For model inference

In [2]:
def keep_alive():
    while True:
        time.sleep(300)  # 5 minutes
        print(f"Keep alive thread still running... {time.strftime('%Y-%m-%d %H:%M:%S')}")

In [3]:
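def bedrock_call_with_retry(func: Callable[[], Any], max_retries: int = MAX_RETRIES_ON_THROTTLE, base_delay: int = 2) -> Any:
    """
    Call func(), retrying on Bedrock ThrottlingException with exponential backoff.
    Sleeps base_delay * 2**attempt seconds after each throttled attempt; other errors propagate.
    """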
    for attempt in range(max_retries):
        try:
            return func()
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                delay = base_delay * (2 ** attempt) # exponential backoff in the event of throttling
                print(f"Rate limit hit, waiting {delay}s...")
                time.sleep(delay)
            else:
                raise
    raise Exception("Max retries exceeded")

In [4]:
def load_scenarios_from_s3(input_bucket: str = BUCKET, s3_source_scenarios: str = S3_SOURCE_SCENARIOS, object_name: str = "scenarios.json") -> List[Dict]:
    """
    Load scenarios from S3 JSON file.
    """
    s3 = boto3.client('s3')
    response = s3.get_object(Bucket=input_bucket, Key=s3_source_scenarios+object_name)
    json_data = json.loads(response['Body'].read().decode('utf-8'))
    return json_data["scenarios"]


In [5]:
def save_file_to_s3(file_path: Path, output_bucket: str = BUCKET, s3_key: str = None):
    """
    Upload a local file to S3.
    """
    s3 = boto3.client('s3')
    s3.upload_file(str(file_path), output_bucket, s3_key)
    print(f"Uploaded {file_path} to s3://{output_bucket}/{s3_key}")


In [6]:
def retrieve_policies_by_id(bucket:str, folder:str, policy_ids: List[str]) -> str:
    """
    Retrieve specific policy documents from s3.
    """

    s3 = boto3.client('s3')
    policies = []
    for policy_id in policy_ids:
        response = s3.get_object(Bucket=bucket, Key=folder + policy_id + ".md")
        content = response['Body'].read().decode('utf-8')
        policies.append(f"{policy_id}:\n{content}")
    
    return "\n\n".join(policies)
    

In [7]:
def get_policies_for_scenario(scenario: Dict) -> str:
    """Extract policy IDs from scenario and retrieve policy content from S3"""
    
    # Extract policy IDs from scenario text
    policy_match = re.search(r'Policies referenced: (.+)', scenario["scenario-detail"])
    if not policy_match:
        print(f"No policies referenced in scenario: {scenario['scenario-detail']}")
        return ""
    
    # Parse policy IDs
    policy_ids = [p.strip() for p in policy_match.group(1).split(',')]
    
    # Retrieve policy documents from S3 via the shared helper defined above
    return retrieve_policies_by_id(BUCKET, S3_SOURCE_POLICY_ALL, policy_ids)
    

In [8]:
def judge_with_claude(
    prompt: str,
    model_id: str,
    temperature: float
) -> Dict:
    
    messages=[{"role": "user", "content": [{"text": prompt}]}]
    input_tokens = 0
    output_tokens = 0
    result = {}
    
    while True:
        response = bedrock_call_with_retry(
            lambda: bedrock_runtime.converse(modelId=model_id, messages=messages,toolConfig=TOOL_CONFIG,
                inferenceConfig={
                    "maxTokens": MAX_TOKENS,
                    "temperature": temperature 
                }
            )
        )

        # track per-scenario token usage
        usage = response.get('usage', {})
        input_tokens += usage.get('inputTokens', 0)
        output_tokens += usage.get('outputTokens', 0)
        
        if response['stopReason'] == 'tool_use':
            tool_results = []
            for content_block in response['output']['message']['content']:
                if 'toolUse' in content_block:
                    tool_name = content_block['toolUse']['name']
                    tool_use_id = content_block['toolUse']['toolUseId']
                    
                    if tool_name == 'compliance_calculator':
                        expression = content_block['toolUse']['input']['expression']                         
                        calc_result = compliance_calculator(expression)
                        # print("=" * 60)
                        # print(f"Compliance calculator expression: {expression}")
                        # print(f"Compliance calculator result: {calc_result}")
                        # print("=" * 60)
                        tool_results.append({
                            "toolResult": {
                                "toolUseId": tool_use_id,
                                "content": [{"text": str(calc_result)}]  # Converse expects text content to be a string
                            }
                        })
                    elif tool_name == 'judged_scenario_json':
                        tool_result = content_block['toolUse']['input']
                        # Parse JSON if it's a string
                        if isinstance(tool_result, str):
                            tool_result = json.loads(tool_result)
                        result["judged-compliant"] = tool_result['scenarios'][0]['judged-compliant']
                        result["judged-compliant-reason"] = tool_result['scenarios'][0]['judged-compliant-reason']
                        break
            
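            # Continue the tool loop only while no verdict has been captured; a turn that
            # already returned judged_scenario_json ends the conversation.
            if tool_results and "judged-compliant" not in result: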
                messages.append({"role": "assistant", "content": response['output']['message']['content']})
                messages.append({"role": "user", "content": tool_results})
            else:
                break
        else:
            break

    result.update({
        "llm-judge-input-tokens" : input_tokens,
        "llm-judge-output-tokens": output_tokens,
        "llm-judge-total-tokens" : input_tokens + output_tokens
    })
        
    return result
    


In [9]:
def judge_with_nova(prompt: str, model_id: str, temperature: float) -> Dict:
    enhanced_prompt = f"""{prompt}

Respond with a JSON object in this exact format:
{{
  "judged-compliant": true/false,
  "judged-compliant-reason": "detailed explanation"
}}"""
    
    body = {
        "messages": [{"role": "user", "content": [{"text": enhanced_prompt}]}],
        "inferenceConfig": {
            "maxTokens": MAX_TOKENS,
            "temperature": temperature
        }
    }
    
    response = bedrock_call_with_retry(
        lambda: bedrock_runtime.invoke_model(
            modelId=model_id,
            body=json.dumps(body),
            contentType="application/json"
        )
    )
    
    response_body = json.loads(response['body'].read())
    content = response_body['output']['message']['content'][0]['text']
    
    # Token usage is reported by Bedrock whether or not the reply parses as JSON
    usage = response_body.get('usage', {})
    token_counts = {
        "llm-judge-input-tokens": usage.get('inputTokens', 0),
        "llm-judge-output-tokens": usage.get('outputTokens', 0),
        "llm-judge-total-tokens": usage.get('inputTokens', 0) + usage.get('outputTokens', 0)
    }
    
    # Extract JSON from response
    try:
        # Find JSON in response text
        start = content.find('{')
        end = content.rfind('}') + 1
        json_str = content[start:end]
        result = json.loads(json_str)
        
        return {
            "judged-compliant": result["judged-compliant"],
            "judged-compliant-reason": result["judged-compliant-reason"],
            **token_counts
        }
    except (json.JSONDecodeError, KeyError) as e:
        # Parse failures are recorded as non-compliant; the reason text flags them for later filtering
        return {
            "judged-compliant": False,
            "judged-compliant-reason": f"Error parsing Nova response: {str(e)}",
            **token_counts
        }


In [10]:
def judge_scenarios(
    source_scenarios: List[Dict],
    model_arn: str, 
    temperature: float = None,
    num_scenarios: int = None
) -> List[Dict]:
    
    # Extract model ID and detect model type
    model_id = model_arn.split('/')[-1] if '/' in model_arn else model_arn
    is_nova_model = "nova" in model_id.lower()
    
    judged_scenarios = []
    for scenario in source_scenarios[:num_scenarios] if num_scenarios else source_scenarios:
        
        # Get policy context from S3 for this scenario
        retrieved_policies = get_policies_for_scenario(scenario)
        if not retrieved_policies:
            continue  # Skip scenarios without policy references
        
        prompt = COMPLIANCE_JUDGE_PROMPT.format(
            retrieved_policies=retrieved_policies,
            scenario_detail=scenario["scenario-detail"]
        )
        
        if is_nova_model:
            # Nova: Use invoke_model with JSON response
            print("Calling judge_with_nova")
            judged_scenario_detail = judge_with_nova(prompt, model_id, temperature)
        else:
            # Claude: Use converse with tools
            print("Calling judge_with_claude")
            judged_scenario_detail = judge_with_claude(prompt, model_id, temperature)
        
        # Add metadata
        judged_scenario = scenario.copy()
        judged_scenario.update(judged_scenario_detail)
        judged_scenario.update({
            "judged-dtm": datetime.datetime.now().isoformat(),
            "llm-judge": model_id,
            "llm-judge-temp": temperature
        })
        
        judged_scenarios.append(judged_scenario)
    
    return judged_scenarios


In [11]:
def judge_scenarios_streaming(
    source_scenarios: List[Dict],
    model_arn: str, 
    temperature: float = None,
    num_scenarios: int = None,
    start_index: int = 0,
    output_file: Path = None
) -> List[Dict]:
    
    model_id = model_arn.split('/')[-1] if '/' in model_arn else model_arn
    is_nova_model = "nova" in model_id.lower()
    
    # Initialize file with opening bracket
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, 'w') as f:
        f.write('{"scenarios":[')
    
    judged_scenarios = []
    
    if num_scenarios is not None:
        scenarios_to_process = source_scenarios[start_index:start_index + num_scenarios]
    else:
        scenarios_to_process = source_scenarios[start_index:]
    
    for i, scenario in enumerate(scenarios_to_process):
        retrieved_policies = get_policies_for_scenario(scenario)
        if not retrieved_policies:
            continue
        
        prompt = COMPLIANCE_JUDGE_PROMPT.format(
            retrieved_policies=retrieved_policies,
            scenario_detail=scenario["scenario-detail"]
        )
        
        if is_nova_model:
            judged_scenario_detail = judge_with_nova(prompt, model_id, temperature)
        else:
            judged_scenario_detail = judge_with_claude(prompt, model_id, temperature)
        
        judged_scenario = scenario.copy()
        judged_scenario.update(judged_scenario_detail)
        judged_scenario.update({
            "judged-dtm": datetime.datetime.now().isoformat(),
            "llm-judge": model_id,
            "llm-judge-temp": temperature
        })
        
        judged_scenarios.append(judged_scenario)
        
        # Append to file immediately
        with open(output_file, 'a') as f:
            if len(judged_scenarios) > 1:  # Add comma before all but the first scenario actually written (skipped scenarios write nothing)
                f.write(',')
            json.dump(judged_scenario, f)
        
        print(f"Saved scenario {i+1}/{len(scenarios_to_process)}")
    
    # Close the JSON structure
    with open(output_file, 'a') as f:
        f.write(']}')
    
    return judged_scenarios


In [12]:
def save_scenarios_to_file(scenarios: List[Dict], output_path: Path):
    
    # Print scenarios to console for immediate review
    print(json.dumps(scenarios, indent=2))

    # Create parent directories if they don't exist
    output_path.parent.mkdir(parents=True, exist_ok=True)
    
    # Save to file with metadata and statistics
    with open(output_path, 'w') as f:
        json.dump({
            'total_scenarios': len(scenarios),
            'compliant_count': sum(1 for s in scenarios if s['is-compliant']),
            'non_compliant_count': sum(1 for s in scenarios if not s['is-compliant']),
            'judged_compliant_count': sum(1 for s in scenarios if s['judged-compliant']),
            'judged_non_compliant_count': sum(1 for s in scenarios if not s['judged-compliant']),
            'scenarios': scenarios
        }, f, indent=2)

In [13]:
def main(
    models=None,
    scenarios=None,
    num_scenarios: int = None,
    start_index: int = 0,
):

    # Start keep-alive thread
    # daemon=True parameter makes the thread automatically stop when the main program ends
    thread = threading.Thread(target=keep_alive, daemon=True)
    thread.start()
    
    if not models or not scenarios:
        print("Error: Both models and scenarios must be provided")
        return

    # Get model names for validation
    model_names = [m['name'] for m in MODELS]
    
    # Handle "all" and convert to lists
    models = model_names if str(models).lower() == "all" else [models] if isinstance(models, str) else models
    scenarios = SCENARIOS if str(scenarios).lower() == "all" else [scenarios] if isinstance(scenarios, str) else scenarios
    
    # Validate models
    if invalid := [m for m in models if m not in model_names]:
        print(f"Invalid models: {invalid}. Valid: {model_names}")
        return

    
    # Print summary
    print("Processing Summary:")
    print("=" * 100)
    for model_name in models:
        model = next(m for m in MODELS if m['name'] == model_name)
        print(f"Model: {model_name} (temp: {model['temperature']})")
    for scenario_file in scenarios:
        print(f"Scenario file: {scenario_file}")
    print(f"Number of scenarios: {num_scenarios}")
    print(f"Starting at scenario index: {start_index}")
    print("=" * 100)
    
    # Wait for user confirmation
    input("Press Enter to proceed...")

    judged_scenarios_batch = ""
    
     # Run all specified combinations
    for model_name in models:
        for scenario_file in scenarios:
            try:    
                model = next(m for m in MODELS if m['name'] == model_name)
                source_scenarios = load_scenarios_from_s3(BUCKET, S3_SOURCE_SCENARIOS, scenario_file)
                print(f"Processing: {model_name} with {scenario_file}")
            
                judged_scenarios_batch = f"judged_scenarios_batch-{scenario_file.split('.')[0]}-{model['name']}-temp{model['temperature']}"
                print ("=" * 100)
                print(f"Starting batch: {judged_scenarios_batch}")
                print ("=" * 100)

                judged_scenarios_file = f"{judged_scenarios_batch}.json"
                local_file_path: Path = FOLDER_JUDGED_SCENARIOS / judged_scenarios_file
                s3_key = S3_JUDGED_SCENARIOS + judged_scenarios_file
                
                judged_scenarios = judge_scenarios_streaming(
                    source_scenarios = source_scenarios,
                    model_arn = model['arn'],
                    temperature = model['temperature'],
                    num_scenarios = num_scenarios,
                    start_index = start_index,
                    output_file = local_file_path
                )

                save_file_to_s3(local_file_path, BUCKET, s3_key)
                
                print ("=" * 100)
                print(f"Finished batch: {judged_scenarios_batch}: {len(judged_scenarios)} scenarios processed")
                print ("=" * 100)
                
            except Exception as e:
                print(f"Error processing batch {judged_scenarios_batch}: {str(e)}")
                import traceback
                traceback.print_exc()  # show  full error trace


In [None]:
# main("all", "all")
main(["nova_premier-t0.0", "nova_premier-t0.1"], "scenarios-8-policies-each.json", 200, 200)


Processing Summary:
Model: nova_premier-t0.0 (temp: 0.0)
Model: nova_premier-t0.1 (temp: 0.1)
Scenario file: scenarios-8-policies-each.json
Number of scenarios: 200
Starting at scenario index: 200


Press Enter to proceed... 


Processing: nova_premier-t0.0 with scenarios-8-policies-each.json
Starting batch: judged_scenarios_batch-scenarios-8-policies-each-nova_premier-t0.0-temp0.0
Saved scenario 1/200
Saved scenario 2/200
Saved scenario 3/200


In [None]:
bedrock_client = boto3.client('bedrock', region_name=AWS_REGION)
response = bedrock_client.list_inference_profiles()
for profile in response['inferenceProfileSummaries']:
    if 'meta' in profile['inferenceProfileName'].lower():
        print(f"ARN: {profile['inferenceProfileArn']}")
