# DeepSeek-Based Evaluation

This script uses **DeepSeek-V3**, accessed through DeepSeek's OpenAI-compatible API, to evaluate code documentation quality. It scores the documentation that several LLMs generated for a set of code snippets and saves the results for each model as a separate CSV file in a dedicated folder.

---

## 🔍 What the Script Does

1. **Loads a CSV dataset** containing:
   - Original code snippets (`code`)
   - One column of documented code per evaluated model

2. **Combines each pair** of original and documented code with the evaluation prompt loaded from `engineered_prompt-scoring.md`.

3. **Sends the prompt to DeepSeek** through its OpenAI-compatible API, using the `deepseek-chat` model.

4. **Parses the returned JSON**, extracting:
   - `final_score`: Numerical score for documentation quality
   - `reason`: Justification for the score

5. **Appends results** to the DataFrame and saves model-specific CSV files.
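
The prompt assembly and parsing steps can be sketched offline as follows. This is a minimal illustration rather than the script's exact code: `build_prompt` and `parse_reply` are hypothetical helper names, the prompt text and the judge reply are made up, and no API call is made.

```python
import json
import re


def build_prompt(evaluation_prompt, original_code, documented_code):
    """Join the scoring prompt with the two code sections (hypothetical helper)."""
    return (
        f"{evaluation_prompt}\n\n"
        f"#Original code:\n{original_code}\n\n"
        f"#Documented code:\n{documented_code}\n"
    )


def parse_reply(reply_text):
    """Return the first JSON object holding both expected keys, or None."""
    decoder = json.JSONDecoder()
    # Try to decode a JSON value starting at every "{" in the reply.
    for brace in re.finditer(r"\{", reply_text):
        try:
            obj, _ = decoder.raw_decode(reply_text, brace.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "final_score" in obj and "reason" in obj:
            return obj
    return None


# Made-up inputs and a made-up judge reply, for illustration only.
prompt = build_prompt("Score the documentation.", "x = 1", "# set x\nx = 1")
reply = 'Verdict: {"final_score": 9, "reason": "Clear."} End.'
result = parse_reply(reply)
```

Decoding from each opening brace lets the parser ignore any text the model writes around the JSON object.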

---

## 📂 Input Files

- `documented_code_responses_all_incremental.csv`  
  Contains original code + LLM-generated documentation.

- `engineered_prompt-scoring.md`  
  Custom evaluation prompt used to guide the scoring process.
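
Before a long run it may help to confirm that the CSV exposes the columns the script reads. The check below is a sketch and not part of the original script: `missing_columns` is a hypothetical helper, and the expected column list assumes a `code` column plus one column per evaluated model, named exactly as the models are listed in this document.

```python
import csv
import io

# Assumed column layout: `code` plus one column per evaluated model.
EXPECTED_COLUMNS = [
    "code",
    "qwen2.5-coder:32b",
    "codellama:70b",
    "deepseek-coder:33b",
    "codegemma:7b",
    "codestral",
]


def missing_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    return [name for name in expected if name not in header]


# Illustrative in-memory CSV that lacks the `codestral` column.
sample = io.StringIO(
    "code,qwen2.5-coder:32b,codellama:70b,deepseek-coder:33b,codegemma:7b\n"
    "x = 1,doc a,doc b,doc c,doc d\n"
)
missing = missing_columns(sample)
```

For the real file, the same function could be given `open("documented_code_responses_all_incremental.csv", newline="", encoding="utf-8")`.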

---

## 🧪 Evaluated Models

This script evaluates the documentation generated by:

- `qwen2.5-coder:32b`
- `codellama:70b`
- `deepseek-coder:33b`
- `codegemma:7b`
- `codestral`

---

## 📁 Output Directory Structure

Evaluations are saved to the `DeepSeek-Scoring/` folder in the format:

```
DeepSeek-Scoring/
├── evaluation_codellama_70b.csv
├── evaluation_qwen2.5-coder_32b.csv
├── ...
```

Each CSV includes:
- All columns from the input CSV
- `<model>_score`: Final numeric score from DeepSeek, where `<model>` is the unmodified model name (for example `codellama:70b_score`)
- `<model>_reason`: Explanation text from DeepSeek
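
The naming scheme can be reproduced in a few lines. `output_names` is a hypothetical helper introduced for illustration; the colon replacement mirrors how the script derives file names, while the column names keep the model name unchanged.

```python
import os


def output_names(model, output_dir="DeepSeek-Scoring"):
    """Return (csv_path, score_column, reason_column) for one model."""
    safe_model_name = model.replace(":", "_")  # the file name avoids ":"
    csv_path = os.path.join(output_dir, f"evaluation_{safe_model_name}.csv")
    return csv_path, f"{model}_score", f"{model}_reason"


path, score_col, reason_col = output_names("codellama:70b")
print(os.path.basename(path), score_col, reason_col)
```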

---

## ⚙️ API Usage and Rate Limits

- Uses the OpenAI-compatible DeepSeek API at `https://api.deepseek.com/v1`
- Retries each request up to 10 times:
  - Rate-limit errors (HTTP 429): waits a random 60 to 100 seconds before retrying
  - Other API errors: waits 5 seconds before retrying
- Sleeps 20 seconds between evaluations to stay within quota
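
The retry behaviour can be sketched independently of the API. In this illustration `call_with_retries` is a hypothetical helper, the request and the sleep function are injected so the example runs offline, and rate limiting is detected by a simple string match on the error message, which is an assumption about how the error is reported.

```python
import random
import time


def call_with_retries(call, max_attempts=10, rate_limit_wait=(60, 100),
                      error_wait=5, sleep=time.sleep):
    """Call `call()` until it succeeds or `max_attempts` is reached.

    Rate-limit errors wait a random time within `rate_limit_wait`;
    any other error waits `error_wait` seconds. Returns None on failure.
    """
    for _ in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            message = str(exc)
            if "429" in message or "rate_limit" in message.lower():
                sleep(random.uniform(*rate_limit_wait))
            else:
                sleep(error_wait)
    return None


# Offline demonstration: fail twice, then succeed.
# Waits are recorded in a list instead of actually sleeping.
outcomes = [RuntimeError("429 Too Many Requests"), RuntimeError("boom"), "ok"]
waits = []


def flaky_call():
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = call_with_retries(flaky_call, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the waiting policy testable without spending minutes in real delays.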

---

In [1]:
import os
import pandas as pd
import openai
import json
import re
import time
import random


# Set your DeepSeek API key.
# Option 1: Set the DEEPSEEK_API_KEY environment variable outside of Python.
# Option 2: Directly assign your API key below.
if not os.getenv("DEEPSEEK_API_KEY"):
    os.environ["DEEPSEEK_API_KEY"] = "YOUR_API_KEY"

# The openai package reads its credentials from openai.api_key.
openai.api_key = os.environ.get("DEEPSEEK_API_KEY")
# Point the openai package at DeepSeek's OpenAI-compatible endpoint.
openai.api_base = "https://api.deepseek.com/v1"

# Define the CSV file path (ensure this file is in your current directory)
CSV_FILE_PATH = 'documented_code_responses_all_incremental.csv'
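
The configuration above relies on the module-level settings of the pre-1.0 `openai` package (`openai.api_key`, `openai.api_base`). With `openai` 1.0 or later, a roughly equivalent setup goes through a client object. The sketch below assumes that newer package is installed; `make_client` and `ask` are hypothetical helper names, and the import is deferred so the block can be loaded without the package.

```python
import os


def make_client():
    """Build an OpenAI-compatible client pointed at DeepSeek (openai>=1.0 assumed)."""
    from openai import OpenAI  # lazy import; requires the openai>=1.0 package

    return OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com/v1",
    )


def ask(client, prompt, model="deepseek-chat"):
    """Send one user message and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=False,
    )
    return response.choices[0].message.content
```

A call would then look like `ask(make_client(), full_prompt)`, where `full_prompt` stands for the assembled evaluation prompt.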


In [2]:
# Cell 2: Load Evaluation Prompt from engineered_prompt-scoring.md

prompt_file_path = "engineered_prompt-scoring.md"
if os.path.exists(prompt_file_path):
    with open(prompt_file_path, "r", encoding="utf-8") as f:
        evaluation_prompt = f.read()
    print("Evaluation prompt successfully loaded from engineered_prompt-scoring.md.")
else:
    raise FileNotFoundError("engineered_prompt-scoring.md file not found in the current directory.")


Evaluation prompt successfully loaded from engineered_prompt-scoring.md.


In [5]:
# Cell 3: Define Helper Functions

def extract_json(text):
    """
    Attempts to extract a JSON object containing the keys "final_score" and "reason"
    from the provided text. It first looks for a markdown JSON block (```json ... ```).
    Before parsing, it flattens the JSON string by joining its lines with spaces
    to avoid issues with literal newlines. Returns the parsed dictionary if found;
    otherwise, returns None.
    """

    # Ensure the input is a string.
    if not isinstance(text, str):
        if isinstance(text, list):
            text = "\n".join([getattr(item, "text", str(item)) for item in text])
        else:
            text = str(text)
    
    # Try to extract JSON from a markdown code block: ```json ... ```
    pattern_block = r"```json\s*(\{.*?\})\s*```"
    match = re.search(pattern_block, text, re.DOTALL)
    if match:
        json_str = match.group(1)
        # Flatten the JSON string to remove literal newlines.
        json_str = " ".join(json_str.splitlines())
        try:
            obj = json.loads(json_str)
            if "final_score" in obj and "reason" in obj:
                return obj
        except Exception as e:
            print("Exception during json.loads from markdown block:", e)
    
    # Fallback: try to decode a JSON object starting at each "{" in the text.
    # Unlike a brace-matching regex, raw_decode copes with "}" inside string
    # values and ignores trailing text; strict=False allows literal newlines.
    decoder = json.JSONDecoder(strict=False)
    for brace in re.finditer(r"\{", text):
        try:
            obj, _ = decoder.raw_decode(text, brace.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "final_score" in obj and "reason" in obj:
            return obj

    return None

def get_evaluation(original_code, documented_code):
    """
    Combines the evaluation prompt with "#Original code:" and "#Documented code:" sections,
    then calls the DeepSeek API using OpenAI's ChatCompletion.create() method.
    The model used here is "deepseek-chat" (invoking DeepSeek-V3). The function enforces via instructions 
    that the output is a JSON object with exactly two keys: "final_score" and "reason".
    
    It implements retry logic for rate limit or API errors.
    Returns the parsed JSON as a dictionary.
    """
    full_prompt = f"""{evaluation_prompt}

#Original code:
{original_code}

#Documented code:
{documented_code}
"""
    
    # Prepare messages following the OpenAI Chat API format.
    messages = [
        {"role": "user", "content": full_prompt}
    ]
    
    max_attempts = 10  # Maximum number of retries.
    attempt = 0

    while attempt < max_attempts:
        attempt += 1
        try:
            response = openai.ChatCompletion.create(
                model="deepseek-chat",  # Invokes DeepSeek-V3.
                messages=messages,
                stream=False
            )
            output_text = response.choices[0].message.content
            evaluation = extract_json(output_text)
            if evaluation is None:
                print("Warning: Could not extract valid JSON. Raw response:")
                print(output_text)
                return {"final_score": None, "reason": "No valid JSON output found."}
            return evaluation
        except Exception as e:
            error_message = str(e)
            if "429" in error_message or "rate_limit" in error_message.lower():
                wait_time = random.uniform(60, 100)
                print(f"Rate limit encountered on attempt {attempt}/{max_attempts}. Waiting for {wait_time:.2f} seconds...")
                time.sleep(wait_time)
            else:
                print(f"Error when calling DeepSeek API on attempt {attempt}/{max_attempts}: {error_message}")
                time.sleep(5)
    return {"final_score": None, "reason": "Max retries exceeded due to rate limit or other API errors."}


In [7]:
# Create an output directory for DeepSeek evaluation results
output_dir = "DeepSeek-Scoring"
os.makedirs(output_dir, exist_ok=True)

# Define the documentation model columns from your CSV
documentation_models = [
    "qwen2.5-coder:32b",
    "codellama:70b",
    "deepseek-coder:33b",
    "codegemma:7b",
    "codestral"
]

# Load the CSV into a DataFrame
df = pd.read_csv(CSV_FILE_PATH)
print(f"Loaded {len(df)} rows from {CSV_FILE_PATH}")

# Process each documentation model
for model in documentation_models:
    print(f"\nStarting evaluation for model: {model}")
    eval_df = df.copy()  # Copy DataFrame for evaluation results
    scores = []
    reasons = []
    total_rows = len(eval_df)

    for idx, row in eval_df.iterrows():
        original_code = row["code"]
        documented_code = row[model]

        evaluation = get_evaluation(original_code, documented_code)
        score = evaluation.get("final_score")
        reason = evaluation.get("reason")
        scores.append(score)
        reasons.append(reason)

        print(f"Row {idx+1}/{total_rows} for {model}: Score = {score}")
        print(f"  Extracted Reason: {reason}")
        time.sleep(20)  # Delay to handle rate limits

    # Append results as new columns
    score_col = f"{model}_score"
    reason_col = f"{model}_reason"
    eval_df[score_col] = scores
    eval_df[reason_col] = reasons

    # Save to the DeepSeek-Scoring directory
    safe_model_name = model.replace(":", "_")
    output_filename = os.path.join(output_dir, f"evaluation_{safe_model_name}.csv")
    eval_df.to_csv(output_filename, index=False)
    print(f"Saved evaluation results for model {model} to {output_filename}")

print("\nAll evaluations complete!")


Loaded 125 rows from documented_code_responses_all_incremental.csv

Starting evaluation for model: qwen2.5-coder:32b
Row 1/125 for qwen2.5-coder:32b: Score = 12
  Extracted Reason: The documented code includes an overall summary and detailed per-code block documentation, but lacks examples (criterion 5) and does not include the original code section (criteria 6 and 7). The documentation is correctly formatted and placed, and the original code functionality and structure are preserved.
Row 2/125 for qwen2.5-coder:32b: Score = 11
  Extracted Reason: The documented code includes an overall summary, explains the purpose of the code block, details inputs and outputs, provides a step-by-step explanation, and includes an example for the function. It uses correct syntax and placement for documentation and preserves the original code functionality and structure. However, it does not include the original code section, and the overall compliance with documentation instructions is not fully met as