# ECD-Eye POC: Model Evaluation

This notebook analyzes the results of the blind evaluation to determine if the fine-tuned model performs better than the baseline model.

In [None]:
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from scipy import stats
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Set plot style
plt.style.use('ggplot')
sns.set_palette("deep")

## 1. Load Evaluation Results

First, let's load the results of the blind evaluation.

In [None]:
# Paths
DATA_DIR = Path("../data")
RESULTS_FILE = DATA_DIR / "evaluation_results.csv"

# Load results
results_df = pd.read_csv(RESULTS_FILE)
print(f"Loaded {len(results_df)} evaluation results")

# Display results
results_df
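Before going further, it may be worth checking that the file contains the columns and labels the later cells rely on. The sketch below is a hypothetical helper (`check_results_schema` is not part of the project). The column names are the ones this notebook uses later; the label `"baseline"` is an assumption, since only `"finetuned"` appears explicitly in the analysis code.

```python
# Hypothetical helper, not part of the original notebook.
REQUIRED_COLUMNS = {
    "brief",
    "baseline_tagline",
    "finetuned_tagline",
    "preferred_model",
    "preferred_tagline",
}
# "finetuned" is used later in this notebook; "baseline" is an assumed label.
ALLOWED_MODELS = {"baseline", "finetuned"}


def check_results_schema(columns, preferred_values):
    """Return a list of problems; an empty list means the data looks usable."""
    problems = []
    missing = REQUIRED_COLUMNS - set(columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    # Collect labels outside the allowed set (missing values are caught here too).
    unexpected = {str(v) for v in preferred_values if v not in ALLOWED_MODELS}
    if unexpected:
        problems.append(f"unexpected preferred_model values: {sorted(unexpected)}")
    return problems
```

In this notebook it could be called as `check_results_schema(results_df.columns, results_df["preferred_model"])`. Taking plain iterables rather than a DataFrame keeps the check easy to test on its own.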

## 2. Analyze Results

Let's analyze the results to see which model performed better.

In [None]:
# Count preferences
model_counts = results_df["preferred_model"].value_counts()
print("Model Preferences:")
print(model_counts)

# Calculate percentages
model_percentages = model_counts / len(results_df) * 100
print("\nModel Preferences (%)")
print(model_percentages)
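With a small number of briefs, a raw percentage can look more precise than it is. As a rough illustration, the sketch below computes a Wilson score interval for the preference rate using only the standard library. The function name `wilson_interval` and the 95% default are assumptions for this example, not part of the original analysis.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, total, confidence=0.95):
    """Wilson score interval for a binomial proportion, as (low, high)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width
```

Here it could be called as `wilson_interval(model_counts.get("finetuned", 0), len(results_df))`. For example, 8 preferences out of 10 gives an interval of roughly 49% to 94%, which shows how wide the uncertainty is at this sample size.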

In [None]:
# Visualize preferences
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=model_counts.index, y=model_counts.values)
plt.title("Model Preferences")
plt.xlabel("Model")
plt.ylabel("Count")

# Add percentage labels
for i, v in enumerate(model_counts.values):
    # Use .iloc for positional access; the Series is indexed by model name
    ax.text(i, v + 0.1, f"{model_percentages.iloc[i]:.1f}%", ha="center")

plt.show()

## 3. Statistical Analysis

Let's perform a one-sided binomial test. Under the null hypothesis the ECD picks either model with probability 0.5; we test whether the fine-tuned model is preferred more often than that.

In [None]:
# Count successes (fine-tuned model preferred)
n_finetuned = model_counts.get("finetuned", 0)
n_total = len(results_df)

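# Perform a one-sided exact binomial test (H0: no preference, p = 0.5).
# stats.binom_test was removed in SciPy 1.12; binomtest returns a result object.
p_value = stats.binomtest(n_finetuned, n_total, p=0.5, alternative="greater").pvalue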

print(f"Fine-tuned model preferred: {n_finetuned}/{n_total} ({n_finetuned/n_total*100:.1f}%)")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("The fine-tuned model is significantly better than the baseline model.")
else:
    print("The difference between models is not statistically significant.")
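To make the p-value less of a black box, the sketch below computes the same one-sided quantity directly from its definition: the probability of seeing at least the observed number of fine-tuned preferences if each choice were a fair coin flip. The helper `binom_p_greater` is hypothetical and illustrates the definition only; it is not a description of how SciPy computes the result internally.

```python
from math import comb


def binom_p_greater(successes, total, p=0.5):
    """P(X >= successes) for X ~ Binomial(total, p)."""
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(successes, total + 1)
    )
```

For instance, `binom_p_greater(8, 10)` is 56/1024, about 0.055, so 8 preferences out of 10 would narrowly miss the 0.05 threshold used above.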

## 4. Analyze Taglines

Let's analyze the taglines to see whether one characteristic, length in words, differs between the models and the taglines the ECD prefers.

In [None]:
# Add word count column
results_df["baseline_word_count"] = results_df["baseline_tagline"].str.split().str.len()
results_df["finetuned_word_count"] = results_df["finetuned_tagline"].str.split().str.len()
results_df["preferred_word_count"] = results_df["preferred_tagline"].str.split().str.len()

# Display word counts
print("Word Count Statistics:")
print(f"Baseline: {results_df['baseline_word_count'].mean():.1f} words on average")
print(f"Fine-tuned: {results_df['finetuned_word_count'].mean():.1f} words on average")
print(f"Preferred: {results_df['preferred_word_count'].mean():.1f} words on average")

In [None]:
# Visualize word counts
plt.figure(figsize=(10, 6))
sns.boxplot(data=results_df[["baseline_word_count", "finetuned_word_count", "preferred_word_count"]])
plt.title("Word Count Distribution")
plt.xlabel("Tagline source")
plt.ylabel("Word Count")
plt.show()
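Averages and boxplots compare the groups overall, but they do not show whether the ECD tends to pick the shorter tagline within each brief. A minimal sketch of that per-brief check follows. The helper `shorter_preferred_rate` is hypothetical, and it assumes `preferred_model` holds either `"baseline"` or `"finetuned"` (only the latter is confirmed by the code above).

```python
def shorter_preferred_rate(rows):
    """Among rows where the two taglines differ in word count, return the
    share where the preferred model produced the shorter one.

    Returns None when no row has taglines of different lengths.
    """
    decided = 0
    shorter_wins = 0
    for row in rows:
        base_len = len(row["baseline_tagline"].split())
        fine_len = len(row["finetuned_tagline"].split())
        if base_len == fine_len:
            continue  # equal lengths say nothing about a length preference
        # "baseline" is an assumed label; "finetuned" is used in this notebook.
        shorter_model = "baseline" if base_len < fine_len else "finetuned"
        decided += 1
        if row["preferred_model"] == shorter_model:
            shorter_wins += 1
    return shorter_wins / decided if decided else None
```

It could be called as `shorter_preferred_rate(results_df.to_dict("records"))`. A value well above 0.5 would hint that the ECD favors brevity, though with few briefs this is only suggestive.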

## 5. Compare Taglines

Let's compare the taglines from both models side by side.

In [None]:
# Create a comparison table
comparison_df = results_df[["brief", "baseline_tagline", "finetuned_tagline", "preferred_model"]]
comparison_df

## 6. Generate Demo Slides Content

Let's generate content for the demo slides.

In [None]:
# Generate slide content
print("# ECD-Eye POC Results\n")
print(f"## Fine-tuned model preferred: {n_finetuned}/{n_total} ({n_finetuned/n_total*100:.1f}%)\n")
print(f"p-value: {p_value:.4f}\n")

if p_value < 0.05:
    print("**The fine-tuned model is significantly better than the baseline model.**\n")
else:
    print("**The difference between models is not statistically significant.**\n")

print("## Example Taglines\n")

for _, row in results_df.iterrows():
    print(f"### Brief: {row['brief']}\n")
    print(f"* Baseline: \"{row['baseline_tagline']}\"")
    print(f"* Fine-tuned: \"{row['finetuned_tagline']}\"")
    print(f"* **Preferred: {row['preferred_model'].capitalize()}**\n")

## 7. Conclusion

Based on the analysis above, we can draw the following conclusions:

1. The fine-tuned model was preferred in X out of Y cases (Z%).
2. The p-value of the binomial test is P, which [is/is not] statistically significant.
3. The fine-tuned model tends to generate taglines with [more/fewer] words on average.
4. [Add any other observations about the taglines here.]

Overall, this POC [demonstrates/does not demonstrate] that fine-tuning ChatGPT on ECD-ranked examples can improve the quality of generated taglines compared to using static prompts.