# GPT-4o Documentation Scoring Pipeline

This script evaluates the quality of documentation generated by several code LLMs, using **OpenAI's GPT-4o** as the judge. It scores each documented output against a shared rubric and saves the results as one CSV file per model in a dedicated output folder.

---

## 🧠 What It Does

1. **Loads a dataset** of:
   - Original code (`code`)
   - Documentation generated by different LLMs

2. **Combines each pair** of original and documented code with a standardized evaluation prompt from a markdown file.

3. **Sends the prompt to OpenAI's GPT-4o** using the `responses.create()` method.

4. **Parses GPT-4o's output** to extract:
   - `final_score` (0–14)
   - `reason` (explanation for the score)

5. **Saves results per model** in CSV files inside a structured output folder.
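
The parsed values from step 4 can be sanity-checked before they are saved. The sketch below is a minimal example, not part of the notebook: `validate_evaluation` is a hypothetical helper, and it assumes that a score outside the 0–14 range or a blank reason should be treated as missing.

```python
def validate_evaluation(evaluation, max_score=14):
    """Return (score, reason); either is None when it fails the check.

    Hypothetical helper: the notebook stores whatever GPT-4o returns.
    """
    score = evaluation.get("final_score")
    reason = evaluation.get("reason")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= max_score:
        score = None
    if not isinstance(reason, str) or not reason.strip():
        reason = None
    return score, reason


print(validate_evaluation({"final_score": 12, "reason": "Clear summary."}))
print(validate_evaluation({"final_score": 20, "reason": "Out of range."}))
```

The second call returns `None` for the score because 20 is above the rubric maximum.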

---

## 📂 Input Files

- `documented_code_responses_all_incremental.csv`  
  The dataset containing code and documentation outputs to evaluate.

- `engineered_prompt-scoring.md`  
  The evaluation rubric to guide GPT-4o's reasoning.
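
Before spending API calls, it may help to confirm that the dataset has the columns the pipeline reads. The sketch below is a hypothetical pre-flight check (`missing_columns` is not part of the notebook). It assumes the CSV has a `code` column plus one column per evaluated model, named exactly as the model, and it uses a tiny in-memory sample in place of the real file.

```python
import csv
import io

# Columns the pipeline reads: the original code plus one column per model.
REQUIRED_COLUMNS = [
    "code",
    "qwen2.5-coder:32b",
    "codellama:70b",
    "deepseek-coder:33b",
    "codegemma:7b",
    "codestral",
]


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# In-memory stand-in for documented_code_responses_all_incremental.csv.
sample = io.StringIO("code,codestral\nprint(1),# prints 1\n")
print(missing_columns(sample))
```

For the real dataset, the same function can be called on `open(CSV_FILE_PATH, newline="", encoding="utf-8")`.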

---

## 📁 Output Directory: `GPT4o-Scoring/`

Each model gets its own CSV file:

```
GPT4o-Scoring/
├── evaluation_codellama_70b.csv
├── evaluation_qwen2.5-coder_32b.csv
├── ...
```

Each CSV includes:
- Original code + documentation
- `<model>_score`: GPT-4o’s assigned score
- `<model>_reason`: GPT-4o’s explanation for the score
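
Once a per-model file exists, its score column can be summarized with the standard library alone. The sketch below is illustrative: `summarize_scores` is a hypothetical helper, the sample rows are made up, and it assumes that rows without a score appear as empty cells.

```python
import csv
import io
from statistics import mean


def summarize_scores(csv_file, model):
    """Return (scored_rows, unscored_rows, mean_score) for one model's results."""
    score_col = f"{model}_score"
    scores, unscored = [], 0
    for row in csv.DictReader(csv_file):
        value = (row.get(score_col) or "").strip()
        try:
            scores.append(float(value))
        except ValueError:
            unscored += 1  # empty or non-numeric cell: no usable score
    return len(scores), unscored, (mean(scores) if scores else None)


# Made-up rows in the layout described above.
sample = io.StringIO(
    "code,codestral,codestral_score,codestral_reason\n"
    "a = 1,# set a,12,Good summary\n"
    "b = 2,# set b,,No valid JSON output found.\n"
    "c = 3,# set c,13.0,Thorough\n"
)
print(summarize_scores(sample, "codestral"))
```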

---

## 🛠 Evaluated Models

The following models are evaluated:

- `qwen2.5-coder:32b`
- `codellama:70b`
- `deepseek-coder:33b`
- `codegemma:7b`
- `codestral`
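
Model names map to output filenames by replacing `:` with `_`, mirroring what the main loop does. The function name `output_path` is introduced here only for illustration.

```python
import os


def output_path(model, output_dir="GPT4o-Scoring"):
    """Build the per-model CSV path, replacing ':' with '_' in the model name."""
    safe_model_name = model.replace(":", "_")
    return os.path.join(output_dir, f"evaluation_{safe_model_name}.csv")


for model in ["qwen2.5-coder:32b", "codellama:70b", "codestral"]:
    print(output_path(model))
```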

---

## ⚙️ API Usage & Rate Handling

- Utilizes `openai.responses.create()` with `gpt-4o`
- Waits 10 seconds between requests
- Retries each request up to 10 times:
  - Rate limits (HTTP 429): waits a random 60–100 seconds before retrying
  - Other API errors: waits 5 seconds before retrying
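
A generic retry wrapper with exponential backoff and jitter might look like the sketch below. It is illustrative only: `call_with_backoff` and its parameters are hypothetical names, it retries on any exception, and the delay values are assumptions rather than the ones used in the notebook.

```python
import random
import time


def call_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Run call() and retry on any exception, doubling the delay cap each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))  # "full jitter": random wait up to the cap


# Demo with a function that fails twice before succeeding.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated HTTP 429")
    return "ok"


print(call_with_backoff(flaky, sleep=lambda seconds: None))  # no real waiting in the demo
```

Passing `sleep` as a parameter keeps the wrapper easy to test without actually waiting.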

---

In [2]:
# Cell 1: Setup and Import Dependencies
import os
import pandas as pd
import openai
import json
import re
import time
import random

# Set your OpenAI API key.
# Option 1: Set the OPENAI_API_KEY environment variable outside of Python.
# Option 2: Directly assign your API key below (replace 'YOUR_API_SECRET_KEY' with your actual key).
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = "YOUR_API_SECRET_KEY"

# Assign the API key to openai.api_key for consistency.
openai.api_key = os.environ.get("OPENAI_API_KEY")

# Define the CSV file path (ensure this file is in your current directory)
CSV_FILE_PATH = 'documented_code_responses_all_incremental.csv'


In [3]:
# Cell 2: Load Evaluation Prompt from engineered_prompt-scoring.md
prompt_file_path = "engineered_prompt-scoring.md"
if os.path.exists(prompt_file_path):
    with open(prompt_file_path, "r", encoding="utf-8") as f:
        evaluation_prompt = f.read()
    print("Evaluation prompt successfully loaded from prompt.md.")
else:
    raise FileNotFoundError(f"{prompt_file_path} not found in the current directory.")


Evaluation prompt successfully loaded from prompt.md.


In [4]:
# Cell 3: Define Helper Functions

def extract_json(text):
    """
    Attempts to extract a JSON object containing the keys "final_score" and "reason"
    from the provided text. Returns the parsed dictionary if found, else returns None.
    """
    try:
        obj = json.loads(text)
        if "final_score" in obj and "reason" in obj:
            return obj
    except Exception:
        pass
    
    # Use regex to search for a JSON substring.
    pattern = r'(\{[^}]*"final_score"[^}]*"reason"[^}]*\})'
    matches = re.findall(pattern, text, re.DOTALL)
    for match in matches:
        try:
            obj = json.loads(match)
            if "final_score" in obj and "reason" in obj:
                return obj
        except Exception:
            continue
    return None

def get_evaluation(original_code, documented_code):
    """
    Combines the evaluation prompt with "#Original code:" and "#Documented code:" sections,
    then calls the OpenAI Responses API using model "gpt-4o"—enforcing via instructions
    that the output is a JSON object with exactly two keys: "final_score" and "reason".
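    If a rate-limit error (429) occurs, this function waits a random 60 to 100 seconds and retries;
    other API errors wait 5 seconds before retrying, for up to 10 attempts in total.
    Returns the parsed JSON as a dictionary, or a placeholder dictionary with "final_score" set to
    None when no valid JSON is found or all attempts fail.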
    """
    full_prompt = f"""{evaluation_prompt}

#Original code:
{original_code}

#Documented code:
{documented_code}
"""
    
    max_attempts = 10  # Maximum number of retries before giving up.
    attempt = 0

    while attempt < max_attempts:
        attempt += 1
        try:
            # Create the client and call the API.
            client = openai.OpenAI(api_key=openai.api_key)
            response = client.responses.create(
                model="gpt-4o",
                instructions=(
                    "You are a Documentation Evaluation Expert. Evaluate the provided documented code strictly based on the rubric. "
                    "Output a JSON object containing exactly two keys: 'final_score' (an integer from 0 to 14) and 'reason' "
                    "(a detailed explanation). Do not include any additional text."
                ),
                input=full_prompt
            )
            output_text = response.output_text
            evaluation = extract_json(output_text)
            if evaluation is None:
                print("Warning: Could not extract valid JSON. Raw response:")
                print(output_text)
                return {"final_score": None, "reason": "No valid JSON output found."}
            return evaluation
        except Exception as e:
            error_message = str(e)
            if "429" in error_message or "rate_limit" in error_message.lower():
                wait_time = random.uniform(60, 100)
                print(f"Rate limit encountered on attempt {attempt}/{max_attempts}. Waiting for {wait_time:.2f} seconds...")
                time.sleep(wait_time)
            else:
                print(f"Error when calling OpenAI API on attempt {attempt}/{max_attempts}: {error_message}")
                # Wait a short period for non-rate-limit errors as well.
                time.sleep(5)
    return {"final_score": None, "reason": "Max retries exceeded due to rate limit or other API errors."}
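
The regex fallback in `extract_json` uses `[^}]*`, so it stops at the first `}` and cannot parse a reply whose `reason` text itself contains braces. One possible alternative, sketched below, scans for `{` and lets `json.JSONDecoder.raw_decode` find the end of each object. `extract_json_robust` is a hypothetical name and is not used by the notebook.

```python
import json


def extract_json_robust(text, required=("final_score", "reason")):
    """Return the first JSON object in text that has all required keys, else None."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and ignores trailing text.
            obj, _ = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and all(key in obj for key in required):
            return obj
        pos = text.find("{", pos + 1)  # try the next candidate opening brace
    return None


reply = 'Sure! {"final_score": 12, "reason": "Uses {placeholders} well."} Hope that helps.'
print(extract_json_robust(reply))
```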


In [5]:
# Create an output directory for GPT4o evaluation results
output_dir = "GPT4o-Scoring"
os.makedirs(output_dir, exist_ok=True)

# Define the documentation model columns from your CSV
documentation_models = [
    "qwen2.5-coder:32b",
    "codellama:70b",
    "deepseek-coder:33b",
    "codegemma:7b",
    "codestral"
]

# Load the CSV into a DataFrame
df = pd.read_csv(CSV_FILE_PATH)
print(f"Loaded {len(df)} rows from {CSV_FILE_PATH}")

# Process each documentation model
for model in documentation_models:
    print(f"\nStarting evaluation for model: {model}")
    eval_df = df.copy()  # Copy the DataFrame for evaluation results
    scores = []
    reasons = []
    total_rows = len(eval_df)

    for idx, row in eval_df.iterrows():
        original_code = row["code"]
        documented_code = row[model]

        evaluation = get_evaluation(original_code, documented_code)
        score = evaluation.get("final_score")
        reason = evaluation.get("reason")
        scores.append(score)
        reasons.append(reason)

        print(f"Row {idx+1}/{total_rows} for {model}: Score = {score}")
        print(f"  Extracted Reason: {reason}")
        time.sleep(10)  # Delay to avoid rate-limiting

    # Append evaluation results to DataFrame
    score_col = f"{model}_score"
    reason_col = f"{model}_reason"
    eval_df[score_col] = scores
    eval_df[reason_col] = reasons

    # Save to the GPT4o-Scoring folder
    safe_model_name = model.replace(":", "_")
    output_filename = os.path.join(output_dir, f"evaluation_{safe_model_name}.csv")
    eval_df.to_csv(output_filename, index=False)
    print(f"Saved evaluation results for model {model} to {output_filename}")

print("\nAll evaluations complete!")


Loaded 125 rows from documented_code_responses_all_incremental.csv

Starting evaluation for model: qwen2.5-coder:32b
Row 1/125 for qwen2.5-coder:32b: Score = 12
  Extracted Reason: The documented code includes an overall file summary (Presence: Yes, Content: Yes) and explains the purpose of each code block. However, the inputs/arguments and outputs/return values are not explicitly detailed (No for both), although they can be inferred to some extent. There is a step-by-step explanation provided for the code blocks, and the correct syntax and placement for comments are used. Code functionality and structure are untouched. Original code sections are fully preserved.
Row 2/125 for qwen2.5-coder:32b: Score = 13
  Extracted Reason: The documentation is comprehensive and meets most criteria. There is an overall file summary that explains the purpose and functionality but lacks deeper detail on the overall design. Each code block is documented with purpose, inputs, outputs, step-by-step explan