# Gemini-Based Documentation Evaluation

This script evaluates code documentation quality using **Google's Gemini** LLM via the `google-genai` SDK. Each documentation snippet is scored based on a custom evaluation prompt, and results are saved into separate CSV files per model.

---

## 🔍 What the Script Does

1. **Loads a CSV** containing:
   - Original code snippets
   - Generated documentation from several models

2. **Combines the original and documented code** with a pre-written evaluation prompt.

3. **Sends the combined prompt to Gemini (gemini-2.0-flash)** to get:
   - `final_score`: A numerical quality score
   - `reason`: Explanation for the score

4. **Parses the response** and saves the results into model-specific files in the `Gemini-Scoring` directory.
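
Step 2 above can be sketched in a few lines. This is a minimal illustration rather than the script's exact code: `build_prompt` and `as_text` are hypothetical helper names, and the empty-cell handling is an added assumption (pandas reads a missing CSV cell as a float NaN, which would otherwise be sent to the model as the literal text `nan`).

```python
import math


def build_prompt(evaluation_prompt, original_code, documented_code):
    """Join the rubric prompt with the two code sections (hypothetical helper)."""

    def as_text(value):
        # Assumption: a missing cell arrives as None or float NaN; treat it as empty.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return ""
        return str(value)

    return (
        f"{as_text(evaluation_prompt)}\n\n"
        f"#Original code:\n{as_text(original_code)}\n\n"
        f"#Documented code:\n{as_text(documented_code)}\n"
    )


# Example with placeholder strings standing in for the real rubric and snippets.
print(build_prompt("Score the documentation.", "x = 1", "# set x\nx = 1"))
```

The section headers match the ones the script places around the two code bodies, so the rubric can refer to them by name.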

---

## 📂 Input Files

- `documented_code_responses_all_incremental.csv`:  
  The dataset. It holds a `code` column with the original snippets plus one column per model, named after the model, containing that model's documented output.

- `engineered_prompt-scoring.md`:  
  The scoring prompt sent to Gemini. It includes detailed rubric instructions.

---

## 🧪 Models Evaluated

The script processes documentation generated by the following models:

- `qwen2.5-coder:32b`
- `codellama:70b`
- `deepseek-coder:33b`
- `codegemma:7b`
- `codestral`
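
Because the script reads one CSV column per model name plus a `code` column, a header check before a long run can catch a misnamed column early. The sketch below uses only the standard library; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the in-memory sample stands in for the real file.

```python
import csv
import io

# Column names the script looks up: the snippet column plus one per model.
REQUIRED_COLUMNS = [
    "code",
    "qwen2.5-coder:32b",
    "codellama:70b",
    "deepseek-coder:33b",
    "codegemma:7b",
    "codestral",
]


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])  # empty file yields an empty header
    return [name for name in required if name not in header]


# Example on an in-memory CSV that lacks three of the model columns.
sample = io.StringIO("code,codellama:70b,codestral\nprint(1),doc a,doc b\n")
print(missing_columns(sample))
```

For the real dataset, the same function could be called on `open("documented_code_responses_all_incremental.csv", newline="", encoding="utf-8")`.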

---

## 📁 Output Folder

CSV files are saved under:

```
Gemini-Scoring/
├── evaluation_codellama_70b.csv
├── evaluation_qwen2.5-coder_32b.csv
├── ...
```

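Each file contains:
- All original columns from the input CSV
- `<model>_score`: Gemini's rating for that model's documented code (for example, `codellama:70b_score`)
- `<model>_reason`: Gemini's explanation of the score (for example, `codellama:70b_reason`)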
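
The naming scheme can be summarised in a short sketch. `output_names` is a hypothetical helper, not part of the script; it mirrors how the script derives the file name (colons replaced with underscores) and the two result columns (model name used as a prefix, colon kept).

```python
import os


def output_names(model, output_dir="Gemini-Scoring"):
    """Return the output path and result column names for one model (hypothetical helper)."""
    safe_model_name = model.replace(":", "_")  # ':' is not allowed in file names on Windows
    return {
        "path": os.path.join(output_dir, f"evaluation_{safe_model_name}.csv"),
        "score_col": f"{model}_score",
        "reason_col": f"{model}_reason",
    }


for name in ["codellama:70b", "codestral"]:
    print(output_names(name))
```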

---

## 🕓 Rate Limits & Retries

- **Retries** failed API calls, up to 10 attempts per evaluation.
- On **rate-limit** errors, waits a random 30 to 40 seconds before retrying; on any other error, waits 5 seconds.
- Sleeps 20 seconds between requests (adjustable).
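
As a point of comparison, the sketch below shows a generic retry wrapper with exponential backoff and jitter. This is a different technique from the fixed waits the script uses, and `call_with_retries` with all of its parameters is hypothetical. The `sleep` argument is injectable so the example runs without actually pausing.

```python
import random
import time


def call_with_retries(func, max_attempts=10, base_delay=1.0, max_delay=40.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached (hypothetical helper)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:  # a real pipeline would catch narrower exception types
            last_error = exc
            if attempt == max_attempts:
                break
            # Double the delay on each failure, cap it, then add up to 50% jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Example: a function that fails twice, then succeeds. Waits are recorded, not slept.
state = {"calls": 0}


def flaky_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ValueError("temporary failure")
    return "ok"


recorded_waits = []
print(call_with_retries(flaky_call, sleep=recorded_waits.append), len(recorded_waits))
```

Raising after the final attempt, instead of returning a placeholder result, is a design choice of this sketch; the script itself returns a dictionary with a `None` score so the batch run can continue.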

---

## ✅ Purpose

This evaluation pipeline scores LLM-generated documentation against a single fixed rubric and records Gemini's stated reason alongside every score.
It’s suited to benchmarking multiple models on the same snippets with consistent criteria.

In [1]:
import os
import pandas as pd
import json
import re
import time
import random

# Set your Gemini API key here, either via environment variable or directly.
if not os.getenv("GEMINI_API_KEY"):
    os.environ["GEMINI_API_KEY"] = "YOUR_API_KEY"  # replace with your actual key

# Import the Gemini client from the google-genai package.
from google import genai

# Instantiate the Gemini client.
client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY"))

# Define the CSV file path (ensure this file is in your current working directory)
CSV_FILE_PATH = 'documented_code_responses_all_incremental.csv'


In [2]:
prompt_file_path = "engineered_prompt-scoring.md"
if os.path.exists(prompt_file_path):
    with open(prompt_file_path, "r", encoding="utf-8") as f:
        evaluation_prompt = f.read()
    print(f"Evaluation prompt successfully loaded from {prompt_file_path}.")
else:
    raise FileNotFoundError(f"{prompt_file_path} file not found in the current directory.")


Evaluation prompt successfully loaded from engineered_prompt-scoring.md.


In [3]:
def extract_json(text):
    """
    Attempts to extract a JSON object containing the keys "final_score" and "reason"
    from the provided text. It first looks for a markdown JSON block (```json ... ```).
    Before parsing, it flattens the JSON string by joining its lines with spaces
    to avoid issues with literal newlines. Returns the parsed dictionary if found;
    otherwise, returns None.
    """
    import json
    import re

    # Ensure the input is a string.
    if not isinstance(text, str):
        if isinstance(text, list):
            text = "\n".join([getattr(item, "text", str(item)) for item in text])
        else:
            text = str(text)
    
    # Try to extract JSON from a markdown code block: ```json ... ```
    pattern_block = r"```json\s*(\{.*?\})\s*```"
    match = re.search(pattern_block, text, re.DOTALL)
    if match:
        json_str = match.group(1)
        # Flatten the JSON string to remove literal newlines.
        json_str = " ".join(json_str.splitlines())
        try:
            obj = json.loads(json_str)
            if "final_score" in obj and "reason" in obj:
                return obj
        except Exception as e:
            print("Exception during json.loads from markdown block:", e)
    
    # Fallback: search for any JSON substring containing the required keys.
    pattern = r'(\{[^}]*"final_score"[^}]*"reason"[^}]*\})'
    matches = re.findall(pattern, text, re.DOTALL)
    for match_str in matches:
        # Flatten each candidate string.
        candidate = " ".join(match_str.splitlines())
        try:
            obj = json.loads(candidate)
            if "final_score" in obj and "reason" in obj:
                return obj
        except Exception:
            continue

    return None

def get_evaluation_gemini(original_code, documented_code):
    """
    Combines the evaluation prompt with sections for the original and documented code,
    then calls the Gemini API using the Google Gen AI SDK and model "gemini-2.0-flash".
    The output is expected to be a JSON object with exactly two keys:
    "final_score" and "reason". Implements retry logic for transient errors.
    Returns the parsed JSON as a dictionary.
    """
    full_prompt = f"""{evaluation_prompt}

#Original code:
{original_code}

#Documented code:
{documented_code}
"""
    
    max_attempts = 10  # Maximum number of retries before giving up.
    attempt = 0
    while attempt < max_attempts:
        attempt += 1
        try:
            # Call the Gemini API with the prompt.
            response = client.models.generate_content(
                model="gemini-2.0-flash",
                contents=full_prompt
            )
            output_text = response.text
            
            # If output_text is not a string, convert it.
            if not isinstance(output_text, str):
                if isinstance(output_text, list):
                    output_text = "\n".join([str(item) for item in output_text])
                else:
                    output_text = str(output_text)
            
            evaluation = extract_json(output_text)
            if evaluation is None:
                print("Warning: Could not extract valid JSON. Raw response:")
                print(output_text)
                return {"final_score": None, "reason": "No valid JSON output found."}
            return evaluation

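        except Exception as e:
            error_message = str(e)
            lowered = error_message.lower()
            # Detect rate limiting by specific markers. A bare "rate" substring would
            # also match unrelated words such as "generate". The markers below are an
            # assumption about how the API reports quota errors; adjust them as needed.
            rate_limit_markers = ("429", "resource_exhausted", "rate limit", "quota")
            if any(marker in lowered for marker in rate_limit_markers):
                wait_time = random.uniform(30, 40)
                print(f"Rate limit encountered on attempt {attempt}/{max_attempts}. Waiting for {wait_time:.2f} seconds...")
                time.sleep(wait_time)
            else:
                print(f"Error when calling Gemini API on attempt {attempt}/{max_attempts}: {error_message}")
                time.sleep(5)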
    return {"final_score": None, "reason": "Max retries exceeded due to rate limit or other API errors."}
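
The regex fallback in `extract_json` stops at the first `}`, so a `reason` that itself contains braces would not be matched. One possible alternative, sketched here under that assumption and not taken from the script, is to let the standard library's `json.JSONDecoder.raw_decode` try each `{` in turn. `find_score_object` is a hypothetical name; `strict=False` permits literal newlines inside strings, which the script handles by flattening lines instead.

```python
import json


def find_score_object(text, required=("final_score", "reason")):
    """Return the first JSON object in text that has all required keys, else None."""
    decoder = json.JSONDecoder(strict=False)  # tolerate raw newlines inside strings
    pos = text.find("{")
    while pos != -1:
        try:
            obj, _end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            obj = None  # not valid JSON from this brace; try the next one
        if isinstance(obj, dict) and all(key in obj for key in required):
            return obj
        pos = text.find("{", pos + 1)
    return None


# Example reply with braces and a literal newline inside the "reason" string.
reply = 'My rating: {"final_score": 11, "reason": "Uses {braces} and\na newline."} Done.'
print(find_score_object(reply))
```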


In [4]:
# Create an output directory for the Gemini evaluation results.
output_dir = "Gemini-Scoring"
os.makedirs(output_dir, exist_ok=True)

# Define the documentation model columns from your CSV.
documentation_models = [
    "qwen2.5-coder:32b",
    "codellama:70b",
    "deepseek-coder:33b",
    "codegemma:7b",
    "codestral"
]

# Load the CSV into a DataFrame.
df = pd.read_csv(CSV_FILE_PATH)
print(f"Loaded {len(df)} rows from {CSV_FILE_PATH}")

# Process each documentation model.
for model in documentation_models:
    print(f"\nStarting evaluation for model: {model}")
    eval_df = df.copy()  # Copy the DataFrame for evaluation results.
    scores = []
    reasons = []
    total_rows = len(eval_df)
    
    for idx, row in eval_df.iterrows():
        original_code = row["code"]
        documented_code = row[model]
        
        evaluation = get_evaluation_gemini(original_code, documented_code)
        score = evaluation.get("final_score")
        reason = evaluation.get("reason")
        scores.append(score)
        reasons.append(reason)
        
        print(f"Row {idx+1}/{total_rows} for {model}: Score = {score}")
        print(f"Extracted Reason: {reason}")
        # Adjust delay as needed – for Gemini, ensure you adhere to its rate limits.
        time.sleep(20)
    
    # Append evaluation results as new columns.
    score_col = f"{model}_score"
    reason_col = f"{model}_reason"
    eval_df[score_col] = scores
    eval_df[reason_col] = reasons
    
    # Save the DataFrame to the Gemini-Scoring directory.
    safe_model_name = model.replace(":", "_")
    output_filename = os.path.join(output_dir, f"evaluation_{safe_model_name}.csv")
    eval_df.to_csv(output_filename, index=False)
    print(f"Saved evaluation results for model {model} to {output_filename}")

print("\nAll evaluations complete!")


Loaded 125 rows from documented_code_responses_all_incremental.csv

Starting evaluation for model: qwen2.5-coder:32b
Row 1/125 for qwen2.5-coder:32b: Score = 11
Extracted Reason: The documented code includes an overall summary and explains the purpose of the code. It also includes step-by-step explanations for each block. However, it misses detailing the inputs/arguments and outputs/return values for each code block comprehensively. While the original code functionality and structure are preserved, and the documentation uses correct syntax and placement, the inclusion of examples is absent. The original code section is also missing and the complete original code wasn't included. Finally, the completeness of original code preservation is missing.
Row 2/125 for qwen2.5-coder:32b: Score = 11
Extracted Reason: The documentation is mostly good, but it is missing some elements. There is no '# Original Code:' section (criterion 6). Original code is not fully preserved, including the original 