# Anthropic Scoring Script

This script uses **Anthropic's Claude model** to evaluate the quality of documentation generated by various large language models (LLMs). It produces structured evaluations (scores and reasons) and saves them in individual CSV files per model.

---

## 🧠 What It Does

1. **Loads an input dataset** with:
   - Original source code
   - Documented versions from different LLMs

2. **Sends each documented code snippet** (with its original code) to Claude via the Anthropic API.

3. **Receives a JSON response** containing:
   - `final_score` (numerical rating)
   - `reason` (justification for the score)

4. **Stores the results** in separate CSV files for each documentation model inside the `Anthropic-Scoring` folder.
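
As a rough illustration of step 3, the sketch below pulls the two fields out of a reply that wraps its JSON in a Markdown code fence. The reply text is invented for illustration and is not real Claude output; `FENCE` and `sample_reply` are hypothetical names.

```python
import json
import re

# Build the fence marker programmatically so this example does not nest code fences.
FENCE = "`" * 3

# Invented reply text, shaped like the response the script expects.
sample_reply = (
    "Here is my evaluation:\n"
    f"{FENCE}json\n"
    '{"final_score": 12, "reason": "Thorough docstrings, but no usage examples."}\n'
    f"{FENCE}"
)

# Capture the JSON object inside the fenced block, then parse it.
match = re.search(FENCE + r"json\s*(\{.*?\})\s*" + FENCE, sample_reply, re.DOTALL)
parsed = json.loads(match.group(1)) if match else None

print(parsed)
```

A reply without a fenced block leaves `parsed` as `None`, which is why the script also keeps a fallback search.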

---

## 📂 Input Files

- `documented_code_responses_all_incremental.csv`:  
  The main dataset, containing a `code` column with the original source and one column of documented output per model.

- `engineered_prompt-scoring.md`:  
  The evaluation prompt sent to Claude to guide the scoring process.

---

## 🧪 Evaluated Models

The script evaluates the following models:
- `qwen2.5-coder:32b`
- `codellama:70b`
- `deepseek-coder:33b`
- `codegemma:7b`
- `codestral`
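
The script reads each model's output from a CSV column named after the model, so a missing column only surfaces as a `KeyError` partway through a run. A minimal pre-flight check, assuming that column layout, could look like this (`missing_columns` is a hypothetical helper, and the sample CSV is invented):

```python
import csv
import io

REQUIRED_COLUMNS = [
    "code",
    "qwen2.5-coder:32b",
    "codellama:70b",
    "deepseek-coder:33b",
    "codegemma:7b",
    "codestral",
]


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    header = csv.DictReader(csv_file).fieldnames or []
    return [name for name in required if name not in header]


# Tiny in-memory stand-in for the real dataset (invented content).
sample = io.StringIO('code,codellama:70b,codestral\n"print(1)","# doc","# doc"\n')
print(missing_columns(sample))
```

With a real file, the same helper can be called on `open(path, newline="", encoding="utf-8")` before any API calls are made.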

---

## 🛠 Output Files

Each model gets its own CSV in the `Anthropic-Scoring/` folder:

```
Anthropic-Scoring/
├── evaluation_codellama_70b.csv
├── evaluation_qwen2.5-coder_32b.csv
├── ...
```

Each CSV includes:
- The original columns from input
- `<model>_score`: Claude's score for the documented code (for example, `codellama:70b_score`)
- `<model>_reason`: Claude's reasoning for the score
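
The naming scheme can be sketched as follows. The script replaces `:` with `_` to get a safe file name and names the two new columns after the model being scored; the helper names `output_path` and `result_columns` are hypothetical and exist only for this illustration.

```python
import os


def output_path(model, output_dir="Anthropic-Scoring"):
    """Build the CSV path for a model, replacing ':' so the name is file-system safe."""
    safe_model_name = model.replace(":", "_")
    return os.path.join(output_dir, f"evaluation_{safe_model_name}.csv")


def result_columns(model):
    """Return the names of the score and reason columns added for a model."""
    return f"{model}_score", f"{model}_reason"


print(os.path.basename(output_path("codellama:70b")))
print(result_columns("codellama:70b"))
```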

---

## ⏱ API Considerations

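- The script is written for a Claude API rate limit of **3 calls per minute**
- The script:
  - Makes up to 10 attempts per evaluation when an API call fails, waiting 120 to 130 seconds after a rate-limit error and 5 seconds after any other error
  - Waits **60 seconds** between each evaluation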
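
The retry behaviour can be sketched without touching the API. In the example below, `RateLimited` is a hypothetical stand-in for the client's rate-limit error, and the `sleep` parameter is an assumption added so the demo can record waits instead of actually pausing.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an API client's rate-limit error."""


def call_with_retries(call, max_attempts=10, sleep=time.sleep):
    """Run call() until it succeeds or max_attempts have been used."""
    for _ in range(max_attempts):
        try:
            return call()
        except RateLimited:
            # Long, slightly randomized pause after a rate-limit error.
            sleep(random.uniform(120, 130))
        except Exception:
            # Short pause after any other error.
            sleep(5)
    return None  # every attempt failed


# Demo with a fake call and a fake sleep, so nothing actually waits.
failures = [RateLimited("slow down"), ValueError("boom")]
waits = []


def flaky_call():
    """Fail once per queued error, then succeed."""
    if failures:
        raise failures.pop(0)
    return "ok"


result = call_with_retries(flaky_call, sleep=waits.append)
print(result, len(waits))
```

Passing the sleep function in keeps the waiting policy testable; in real use the default `time.sleep` applies.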

---

## ✅ Summary

This pipeline automates the use of Claude for objective and explainable scoring of LLM-generated documentation. It enables benchmarking of documentation quality across different models, with transparent justification for each evaluation.


In [1]:
import os
import pandas as pd
import anthropic
import json
import re
import time
import random

# Set your Anthropic API key.
# Option 1: Set the ANTHROPIC_API_KEY environment variable outside of Python.
# Option 2: Directly assign your API key below (replace 'YOUR_ANTHROPIC_API_KEY' with your actual key).
if not os.getenv("ANTHROPIC_API_KEY"):
    os.environ["ANTHROPIC_API_KEY"] = 'YOUR_ANTHROPIC_API_KEY'

# Instantiate the Anthropic client.
client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# Define the CSV file path (ensure this file is in your current directory)
CSV_FILE_PATH = 'documented_code_responses_all_incremental.csv'

In [2]:
prompt_file_path = "engineered_prompt-scoring.md"
if os.path.exists(prompt_file_path):
    with open(prompt_file_path, "r", encoding="utf-8") as f:
        evaluation_prompt = f.read()
    print(f"Evaluation prompt successfully loaded from {prompt_file_path}.")
else:
    raise FileNotFoundError(f"{prompt_file_path} not found in the current directory.")

Evaluation prompt successfully loaded from engineered_prompt-scoring.md.


In [3]:
def extract_json(text):
    """
    Attempts to extract a JSON object containing the keys "final_score" and "reason"
    from the provided text. It first looks for a markdown JSON block (```json ... ```).
    Returns the parsed dictionary if found; otherwise, returns None.
    """

    # If the text is not a string, convert it (e.g. if it's a list of objects)
    if not isinstance(text, str):
        # If text is a list, extract each element's 'text' attribute if available.
        if isinstance(text, list):
            text = "\n".join([getattr(item, "text", str(item)) for item in text])
        else:
            text = str(text)
            
    # Try to extract JSON content from a ```json ... ``` code block.
    pattern_block = r"```json\s*(\{.*?\})\s*```"
    match = re.search(pattern_block, text, re.DOTALL)
    if match:
        json_str = match.group(1)
        try:
            obj = json.loads(json_str)
            if "final_score" in obj and "reason" in obj:
                return obj
        except json.JSONDecodeError:
            pass

    # Fallback: use regex to search for any JSON substring that contains the required keys.
    pattern = r'(\{[^}]*"final_score"[^}]*"reason"[^}]*\})'
    matches = re.findall(pattern, text, re.DOTALL)
    for match in matches:
        try:
            obj = json.loads(match)
            if "final_score" in obj and "reason" in obj:
                return obj
        except json.JSONDecodeError:
            continue

    return None

def get_evaluation(original_code, documented_code):
    """
    Combines the evaluation prompt with sections for the original and documented code,
    then calls the Claude API using the Anthropic client and model "claude-3-7-sonnet-20250219".
    Enforces via instructions that the output is a JSON object with exactly two keys: 
    "final_score" and "reason". Implements retry logic for transient errors.
    Returns the parsed JSON as a dictionary.
    """
    full_prompt = f"""{evaluation_prompt}

#Original code:
{original_code}

#Documented code:
{documented_code}
"""
    
    max_attempts = 10  # Maximum number of retries before giving up.
    attempt = 0

    while attempt < max_attempts:
        attempt += 1
        try:
            # Call the Anthropic API with the provided prompt.
            response = client.messages.create(
                model="claude-3-7-sonnet-20250219",
                max_tokens=1024,
                messages=[{"role": "user", "content": full_prompt}]
            )
            output_text = response.content
            
            # If output_text is a list of TextBlocks, extract their text.
            if isinstance(output_text, list):
                output_text = "\n".join([getattr(item, "text", str(item)) for item in output_text])
            
            evaluation = extract_json(output_text)
            if evaluation is None:
                print("Warning: Could not extract valid JSON. Raw response:")
                print(output_text)
                return {"final_score": None, "reason": "No valid JSON output found."}
            return evaluation
        except anthropic.RateLimitError:
            # Rate limited by the API: back off well past the one-minute window.
            wait_time = random.uniform(120, 130)
            print(f"Rate limit encountered on attempt {attempt}/{max_attempts}. Waiting for {wait_time:.2f} seconds...")
            time.sleep(wait_time)
        except Exception as e:
            # Any other failure: log it and retry after a short pause.
            print(f"Error when calling Claude API on attempt {attempt}/{max_attempts}: {e}")
            time.sleep(5)
    return {"final_score": None, "reason": "Max retries exceeded due to rate limit or other API errors."}
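
One limitation worth noting: the fallback pattern in `extract_json` stops at the first `}`, so it cannot match a reply whose `reason` text itself contains a closing brace. A possible alternative, sketched here and not part of the original script, scans for objects with the standard library's `json.JSONDecoder.raw_decode`; `find_scored_object` is a hypothetical name and the reply string is invented.

```python
import json


def find_scored_object(text):
    """Return the first JSON object in text that has both required keys, else None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "final_score" in obj and "reason" in obj:
            return obj
        # Try the next opening brace.
        start = text.find("{", start + 1)
    return None


reply = 'Score: {"final_score": 9, "reason": "Missing closing } in the example."} Done.'
print(find_scored_object(reply))
```

Because the decoder understands string quoting, braces inside the `reason` text do not cut the match short.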


In [4]:
# Create an output directory for the evaluation results
output_dir = "Anthropic-Scoring"
os.makedirs(output_dir, exist_ok=True)

# Define the documentation model columns from your CSV.
documentation_models = [
    "qwen2.5-coder:32b",
    "codellama:70b",
    "deepseek-coder:33b",
    "codegemma:7b",
    "codestral"
]

# Load the CSV into a DataFrame.
df = pd.read_csv(CSV_FILE_PATH)
print(f"Loaded {len(df)} rows from {CSV_FILE_PATH}")

# Process each documentation model.
for model in documentation_models:
    print(f"\nStarting evaluation for model: {model}")
    eval_df = df.copy()  # Create a copy of the DataFrame for storing evaluation results.
    scores = []
    reasons = []
    total_rows = len(eval_df)
    
    for idx, row in eval_df.iterrows():
        original_code = row["code"]
        documented_code = row[model]
        
        # Evaluate the documented code using the Claude API.
        evaluation = get_evaluation(original_code, documented_code)
        score = evaluation.get("final_score")
        reason = evaluation.get("reason")
        scores.append(score)
        reasons.append(reason)
        
        print(f"Row {idx+1}/{total_rows} for {model}: Score = {score}")
        print(f"  Extracted Reason: {reason}")
        time.sleep(60)  # Delay to respect Claude's rate limit of 3 calls per minute.
    
    # Append the evaluation results as new columns in the DataFrame.
    score_col = f"{model}_score"
    reason_col = f"{model}_reason"
    eval_df[score_col] = scores
    eval_df[reason_col] = reasons
    
    # Create a safe filename and save the DataFrame into the output directory.
    safe_model_name = model.replace(":", "_")
    output_filename = os.path.join(output_dir, f"evaluation_{safe_model_name}.csv")
    eval_df.to_csv(output_filename, index=False)
    print(f"Saved evaluation results for model {model} to {output_filename}")

print("\nAll evaluations complete!")

Loaded 125 rows from documented_code_responses_all_incremental.csv

Starting evaluation for model: qwen2.5-coder:32b
Row 1/125 for qwen2.5-coder:32b: Score = 12
  Extracted Reason: The documentation is very thorough, but has two issues: 1) No examples with sample input and expected output are provided for the various test cases - examples would be valuable to understand the expected behavior; 2) The documented code includes an additional 'if __name__ == "__main__":' block that wasn't present in the original code, which violates the criterion of preserving the original code without modifications.
Row 2/125 for qwen2.5-coder:32b: Score = 11
  Extracted Reason: The documentation does not fully meet all criteria. The original code section is not included in the documented code (criterion 13). The documented code has been wrapped in ```python and ``` tags, which alters the original structure (criterion 11). Additionally, there is an 'Explanation of the Transformed Code' section at the end t