# Aggregated Evaluation Scoring

This script merges the scores that four evaluation systems (Anthropic, DeepSeek, Gemini, GPT4o) assigned to code documentation models. It builds a single CSV that holds every model's output alongside the score and reasoning from each scoring system.

## 🔧 How It Works

1. **Folder Setup**
   - Each scoring system (Anthropic, DeepSeek, Gemini, GPT4o) has its own folder:
     ```
     Anthropic-Scoring/
     DeepSeek-Scoring/
     Gemini-Scoring/
     GPT4o-Scoring/
     ```
   - Each folder contains CSV files named like:  
     `evaluation_<model>.csv`  
     Example: `evaluation_codellama_70b.csv`

2. **Model Extraction**
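   - The script derives each documentation model name from the filename: it strips the `evaluation_` prefix and the `.csv` extension, then turns the last underscore into a colon (`evaluation_codellama_70b.csv` → `codellama:70b`). Names without an underscore, such as `codestral`, are kept unchanged.
   - Scores are read from each file's `{model}_score` and `{model}_reason` columns. A file that lacks either column is skipped with a warning.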

3. **Base File Initialization**
   - The first CSV file found in the `Anthropic-Scoring` folder is used as the base template.
   - It provides the common fields, which are assumed to be identical across all files: `index`, `language`, `code`, and the model outputs.

4. **Scoring Merge**
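   - For each scoring system and each model:
     - The score and reasoning columns are renamed to `{model}_{system}_score` and `{model}_{system}_reason`.
     - They are added to the base table with a left join on the `index` column, so every row of the base file is kept.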

5. **Final Column Organization**
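   - Columns are reordered so that each model's documentation is followed by the score and reason columns from every scoring system. Columns that do not exist (for example, because a scoring file was missing) are left out.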

6. **Output**
   - ✅ `aggregated_evaluations.csv`: Final CSV with all documentation and scoring data merged and aligned.
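
The steps above can be sketched on toy data. This is a minimal illustration rather than the script itself: the two-row tables, the scores, and the helper name `doc_model_from_filename` are made up for the example, and only one model and two scoring systems are shown.

```python
import pandas as pd

def doc_model_from_filename(filename):
    """Hypothetical helper: 'evaluation_codellama_70b.csv' -> 'codellama:70b'."""
    stem = filename.replace("evaluation_", "").replace(".csv", "")
    # The last underscore, if any, separates the model name from its size tag
    name, sep, size = stem.rpartition("_")
    return f"{name}:{size}" if sep else stem

model = doc_model_from_filename("evaluation_codellama_70b.csv")

# Toy base table with the common fields and one model output (made-up data)
base = pd.DataFrame({
    "index": [0, 1],
    "language": ["python", "go"],
    "code": ["print(1)", "fmt.Println(1)"],
    model: ["Prints 1.", "Prints 1."],
})

# Toy per-system scoring tables, shaped like the files in each scoring folder
raw_scores = {
    "Anthropic": pd.DataFrame({"index": [0, 1],
                               f"{model}_score": [4, 5],
                               f"{model}_reason": ["clear", "complete"]}),
    "Gemini": pd.DataFrame({"index": [0],
                            f"{model}_score": [3],
                            f"{model}_reason": ["terse"]}),
}

merged = base
for system, df in raw_scores.items():
    # Tag the columns with the scoring system, then left-join on "index"
    renamed = df.rename(columns={
        f"{model}_score": f"{model}_{system}_score",
        f"{model}_reason": f"{model}_{system}_reason",
    })
    merged = merged.merge(renamed, on="index", how="left")

print(merged.columns.tolist())
print(merged[f"{model}_Gemini_score"].isna().tolist())
```

Row 1 has no Gemini score in the toy data, so the left join keeps the row and leaves its Gemini columns empty.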

## 📦 Output File Structure

Each model's section includes:
- Documented Output
- Anthropic Score + Reason
- DeepSeek Score + Reason
- Gemini Score + Reason
- GPT4o Score + Reason

Example:
```
| index | language | code | codellama:70b | codellama:70b_Anthropic_score | codellama:70b_Anthropic_reason | ... |
```
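
The full header follows from the model list and the scoring-system list. As a sketch, the hypothetical helper `expected_columns` below (not part of the script) reproduces that layout. The actual file may contain fewer columns when a scoring file is missing.

```python
def expected_columns(models, systems):
    """Return the column order: common fields, then one block per model."""
    columns = ["index", "language", "code"]
    for model in models:
        columns.append(model)  # the model's documentation output
        for system in systems:
            columns.append(f"{model}_{system}_score")
            columns.append(f"{model}_{system}_reason")
    return columns

models = ["qwen2.5-coder:32b", "codellama:70b", "deepseek-coder:33b",
          "codegemma:7b", "codestral"]
systems = ["Anthropic", "DeepSeek", "Gemini", "GPT4o"]

header = expected_columns(models, systems)
# 3 common fields + 5 models * (1 output + 4 systems * 2 columns) = 48
print(len(header))
print(header[3:6])
```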

In [1]:
import os
import pandas as pd

# Define the scoring systems and the corresponding folder names
scoring_systems = {
    "Anthropic": "Anthropic-Scoring",
    "DeepSeek": "DeepSeek-Scoring",
    "Gemini": "Gemini-Scoring",
    "GPT4o": "GPT4o-Scoring"
}

# List of expected documentation model names in the base (order is important for final CSV)
doc_models_order = ["qwen2.5-coder:32b", "codellama:70b", "deepseek-coder:33b", "codegemma:7b", "codestral"]

# Function to derive the documentation model name from file name.
def get_doc_model(filename):
    # remove the prefix and .csv
    model_str = filename.replace("evaluation_", "").replace(".csv", "")
    # if there's an underscore, assume the last underscore separates model and parameter info
    if "_" in model_str:
        parts = model_str.rsplit("_", 1)
        return parts[0] + ":" + parts[1]
    else:
        return model_str

# Create an empty dictionary to store scoring DataFrames by key (scoring system, doc_model)
scoring_dfs = {}

# Process each scoring system folder
for system, folder in scoring_systems.items():
    # List files in the folder
    files = [f for f in os.listdir(folder) if f.endswith(".csv")]
    for f in files:
        doc_model = get_doc_model(f)
        file_path = os.path.join(folder, f)
        df = pd.read_csv(file_path)
        # Expect scoring columns: {doc_model}_score and {doc_model}_reason
        score_col = f"{doc_model}_score"
        reason_col = f"{doc_model}_reason"
        if score_col not in df.columns or reason_col not in df.columns:
            print(f"Warning: {score_col} or {reason_col} not found in {file_path}")
            continue
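        # Extract only index and scoring columns; keep one row per index so the
        # later merge cannot multiply rows
        df_scoring = df[["index", score_col, reason_col]].drop_duplicates(subset=["index"]).copy()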
        # Rename the columns to include the scoring system prefix
        new_score_col = f"{doc_model}_{system}_score"
        new_reason_col = f"{doc_model}_{system}_reason"
        df_scoring.rename(columns={score_col: new_score_col, reason_col: new_reason_col}, inplace=True)
        # Store in dictionary using key (system, doc_model)
        scoring_dfs[(system, doc_model)] = df_scoring

# Prepare a base DataFrame for common columns
# Assume that the common columns (index, language, code, and all documented outputs) are identical across files.
# We can use one file from one scoring system. For example, use the first file from the Anthropic folder.
anthropic_folder = scoring_systems["Anthropic"]
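# Sort the listing so that the "first" file is the same on every run (os.listdir order is arbitrary)
anthropic_files = sorted(f for f in os.listdir(anthropic_folder) if f.endswith(".csv"))
if not anthropic_files:
    raise FileNotFoundError("No files found in Anthropic scoring folder.")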

# Use the first file to get the base columns
base_file_path = os.path.join(anthropic_folder, anthropic_files[0])
base_df = pd.read_csv(base_file_path)

# We assume base_df contains these columns: index, language, code, and columns for all documented outputs.
required_columns = ["index", "language", "code"]
missing_required = [col for col in required_columns if col not in base_df.columns]
if missing_required:
    raise KeyError(f"Base DataFrame is missing required columns: {missing_required}")

# Documented-output columns are optional: keep whichever ones are present
base_columns = required_columns + doc_models_order
available_base_columns = [col for col in base_columns if col in base_df.columns]

base_df = base_df[available_base_columns].drop_duplicates(subset=["index"]).copy()

# Now, merge scoring information from every scoring system and every documentation model.
# Each scoring DataFrame is keyed by "index". We merge them into the base_df.
for key, df_scoring in scoring_dfs.items():
    # Merge on "index" using a left join to keep all evaluation instances
    base_df = pd.merge(base_df, df_scoring, on="index", how="left")

# At this point, base_df has the common columns plus additional scoring columns.
# Next, we want to reorder columns so that for each documentation model we have a block of columns:
# 1. The documented code column (from the base)
# 2. Followed by scoring columns from each scoring system for that model.
final_columns = ["index", "language", "code"]
for model in doc_models_order:
    # Add the documented output column first
    final_columns.append(model)
    # For each scoring system (in the order defined in scoring_systems), add score and reason columns.
    for system in scoring_systems:
        score_col = f"{model}_{system}_score"
        reason_col = f"{model}_{system}_reason"
        final_columns.extend([score_col, reason_col])

# It is possible that some of these scoring columns are missing if not all files were present.
final_columns = [col for col in final_columns if col in base_df.columns]

# Reorder the DataFrame columns
final_df = base_df[final_columns]

# Save the final aggregated CSV
final_csv_path = "aggregated_evaluations.csv"
final_df.to_csv(final_csv_path, index=False)
print(f"Aggregated CSV has been saved to {final_csv_path}.")


Aggregated CSV has been saved to aggregated_evaluations.csv.


---

# Inference Metrics Summary

This section summarizes the average inference performance of the code documentation models. Each model's metrics are read from its `metrics_<model>.csv` file in the `Documentation Metrics` folder and averaged over all rows in that file.

## Metrics Explained

- **Avg Prompt1 Time (s)**: Mean time taken to process the original input code (`prompt1_eval_duration_sec`).
- **Avg Prompt2 Time (s)**: Mean time taken to generate or evaluate documentation (`prompt2_eval_duration_sec`).
- **Avg Total Time (s)**: Sum of the two average times above.
- **Avg Prompt1 Speed (tokens/s)**: Mean token processing speed for the first prompt (`prompt1_tokens_per_sec`).
- **Avg Prompt2 Speed (tokens/s)**: Mean token processing speed for the second prompt (`prompt2_tokens_per_sec`).
- **Avg Overall Speed (tokens/s)**: Unweighted mean of the two average speeds above.
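
As a worked example on made-up numbers (two runs of a hypothetical model), the averages combine as follows. The column names match those the script reads; the values are invented for illustration.

```python
import pandas as pd

# Two made-up runs of one hypothetical model
runs = pd.DataFrame({
    "prompt1_eval_duration_sec": [2.0, 4.0],
    "prompt2_eval_duration_sec": [10.0, 14.0],
    "prompt1_tokens_per_sec": [50.0, 30.0],
    "prompt2_tokens_per_sec": [20.0, 24.0],
})

avg_prompt1_time = runs["prompt1_eval_duration_sec"].mean()      # (2 + 4) / 2 = 3.0
avg_prompt2_time = runs["prompt2_eval_duration_sec"].mean()      # (10 + 14) / 2 = 12.0
avg_total_time = avg_prompt1_time + avg_prompt2_time             # 3.0 + 12.0 = 15.0

avg_prompt1_speed = runs["prompt1_tokens_per_sec"].mean()        # (50 + 30) / 2 = 40.0
avg_prompt2_speed = runs["prompt2_tokens_per_sec"].mean()        # (20 + 24) / 2 = 22.0
avg_overall_speed = (avg_prompt1_speed + avg_prompt2_speed) / 2  # (40 + 22) / 2 = 31.0

print(avg_total_time, avg_overall_speed)
```

The overall speed is a plain mean of the two per-prompt averages. It is not weighted by how many tokens each prompt processed.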

## File Generated

📄 `Inference_metrics.csv`  
This file contains one row per model with all the above metrics. It can be used for comparison and visualization of model efficiency.
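
One way to compare models is to rank the rows by a single metric. The sketch below uses an in-memory table with placeholder model names and invented values. With real data, the table would instead be loaded with `pd.read_csv("Inference_metrics.csv")`.

```python
import pandas as pd

# Placeholder rows shaped like the summary file (values are made up)
summary = pd.DataFrame({
    "Model": ["model_a", "model_b", "model_c"],
    "Avg Total Time (s)": [15.0, 9.5, 22.1],
    "Avg Overall Speed (tokens/s)": [31.0, 48.2, 19.7],
})

# Rank models from fastest to slowest by overall token throughput
ranked = (
    summary.sort_values("Avg Overall Speed (tokens/s)", ascending=False)
    .reset_index(drop=True)
)
fastest = ranked.loc[0, "Model"]
print(ranked)
```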



In [3]:
import glob
import os

import pandas as pd

# Define the folder path for all metrics CSV files
metrics_folder = "Documentation Metrics"
# Sort the matches so the summary rows appear in a stable order (glob order is arbitrary)
metrics_files = sorted(glob.glob(os.path.join(metrics_folder, "metrics_*.csv")))

# Load and compute inference metrics for each model file in the folder
def compute_metrics_from_file(file_path):
    model_name = os.path.basename(file_path).replace("metrics_", "").replace(".csv", "")
    df = pd.read_csv(file_path)

    avg_prompt1_time = df["prompt1_eval_duration_sec"].mean()
    avg_prompt2_time = df["prompt2_eval_duration_sec"].mean()
    avg_total_time = avg_prompt1_time + avg_prompt2_time

    avg_prompt1_speed = df["prompt1_tokens_per_sec"].mean()
    avg_prompt2_speed = df["prompt2_tokens_per_sec"].mean()
    avg_total_speed = (avg_prompt1_speed + avg_prompt2_speed) / 2

    return {
        "Model": model_name,
        "Avg Prompt1 Time (s)": round(avg_prompt1_time, 2),
        "Avg Prompt2 Time (s)": round(avg_prompt2_time, 2),
        "Avg Total Time (s)": round(avg_total_time, 2),
        "Avg Prompt1 Speed (tokens/s)": round(avg_prompt1_speed, 2),
        "Avg Prompt2 Speed (tokens/s)": round(avg_prompt2_speed, 2),
        "Avg Overall Speed (tokens/s)": round(avg_total_speed, 2),
    }

# Compute metrics for all models
inference_metrics_all = [compute_metrics_from_file(f) for f in metrics_files]
inference_metrics_df = pd.DataFrame(inference_metrics_all)

# Save the summary CSV
inference_output_path = "Inference_metrics.csv"
inference_metrics_df.to_csv(inference_output_path, index=False)

inference_output_path


'Inference_metrics.csv'