# Chat History-Driven LLM Evaluation Framework - Demo

**Objective:** This notebook demonstrates how to use the Chat History-Driven LLM Evaluation Framework to evaluate different text completion methods for users of augmentative and alternative communication (AAC) systems.

## Overview

The framework provides a systematic approach to compare:
1. **Partial Utterance Methods** - How to create incomplete input from complete utterances
2. **Proposal Generation Methods** - How to generate completions based on the partial input
3. **Evaluation Metrics** - How to measure the quality of generated proposals

## Data

This demo uses real user chat history data from a person using an AAC system.
**Important:** The data contains real user speech history and should not be shared or committed to version control.

## 1. Setup

In [None]:
# Install dependencies (run once)
# !pip install -r requirements.txt

import os
import sys
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Add utils directory to path
sys.path.append(os.path.join(os.getcwd(), 'utils'))
from utils.evaluation_utils import ChatHistoryEvaluator

print("✅ Dependencies imported.")

## 2. Initialize Evaluator

In [None]:
# Initialize evaluator with chat data
chat_data_path = "data/test/baton-export-2025-11-24-nofullstop.json"

# Note: If you haven't set your API key as an environment variable,
# you'll be prompted to enter it
evaluator = ChatHistoryEvaluator(
    chat_data_path=chat_data_path,
    corpus_ratio=0.67  # Use 2/3 of data for corpus
)

print("✅ Evaluator initialized.")
print(f"Corpus size: {len(evaluator.corpus_df)} utterances")
print(f"Test size: {len(evaluator.test_df)} utterances")
print(f"Date range: {evaluator.chat_df['timestamp'].min()} to {evaluator.chat_df['timestamp'].max()}")
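
The `corpus_ratio` argument divides the chat history into a retrieval corpus and a held-out test set. How `ChatHistoryEvaluator` performs the split is not shown in this notebook. The sketch below assumes a chronological split (oldest utterances form the corpus), and the helper name `split_chat_history` is hypothetical:

```python
def split_chat_history(records, corpus_ratio=0.67):
    """Hypothetical helper: oldest records become the corpus, newest the test set.

    Assumption: each record is a dict with sortable 'timestamp' and 'content' keys.
    """
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * corpus_ratio)
    return ordered[:cut], ordered[cut:]


# Toy records, not real user data
toy_records = [
    {"timestamp": "2025-01-03", "content": "c"},
    {"timestamp": "2025-01-01", "content": "a"},
    {"timestamp": "2025-01-02", "content": "b"},
]
corpus, test = split_chat_history(toy_records)
```

A chronological split keeps later utterances out of the retrieval corpus, which mirrors how a deployed system would only ever see past messages.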

## 3. Define Evaluation Methods

In [None]:
# Define partial utterance methods
partial_methods = {
    'prefix_3': lambda text: evaluator.create_prefix_partial(text, 3),  # First 3 words
    'prefix_2': lambda text: evaluator.create_prefix_partial(text, 2),  # First 2 words
    'keyword_2': lambda text: evaluator.create_keyword_partial(text, 2),  # 2 most salient keywords
}

# Define generation methods
generation_methods = {
    'lexical': evaluator.generate_with_lexical_retrieval,      # Exact word matching
    'tfidf': evaluator.generate_with_tfidf_retrieval,        # TF-IDF similarity
    'embedding': evaluator.generate_with_embedding_retrieval,   # Semantic similarity
}

# Define evaluation metrics
evaluation_metrics = {
    'embedding_similarity': evaluator.calculate_embedding_similarity,  # Semantic similarity
    'llm_judge_score': evaluator.judge_similarity,            # LLM evaluation
    'character_accuracy': evaluator.calculate_character_accuracy, # Character match
    'word_accuracy': evaluator.calculate_word_accuracy,        # Word overlap
}

print("✅ Evaluation methods defined.")
print(f"Partial methods: {list(partial_methods.keys())}")
print(f"Generation methods: {list(generation_methods.keys())}")
print(f"Evaluation metrics: {list(evaluation_metrics.keys())}")
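
To make the two kinds of partial utterance concrete, here is a minimal sketch of what prefix and keyword truncation could look like. These functions are illustrative stand-ins, not the implementation behind `create_prefix_partial` or `create_keyword_partial`. The stopword list and the "longest word is most salient" heuristic are assumptions made for this example only:

```python
# Assumption: a tiny hand-picked stopword list, for illustration only
STOPWORDS = {"a", "an", "the", "i", "to", "of", "and", "is", "it",
             "you", "for", "in", "on", "my", "me"}


def prefix_partial(text, n_words):
    """Keep the first n_words whitespace-separated words."""
    return " ".join(text.split()[:n_words])


def keyword_partial(text, n_keywords):
    """Keep the n_keywords longest non-stopwords, in their original order."""
    words = text.split()
    candidates = [(i, w) for i, w in enumerate(words) if w.lower() not in STOPWORDS]
    # Longest words first; the stable sort keeps earlier words ahead on ties
    top = sorted(candidates, key=lambda iw: -len(iw[1]))[:n_keywords]
    # Restore original word order by sorting on position
    return " ".join(w for _, w in sorted(top))


example = "I would like a cup of tea"
prefix_example = prefix_partial(example, 3)
keyword_example = keyword_partial(example, 2)
```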

## 4. Test Individual Methods

In [None]:
# Select a sample test utterance
sample_utterance = evaluator.test_df.iloc[10]['content']
print(f"Sample utterance: '{sample_utterance}'\n")

# Test partial utterance methods
print("Partial utterance methods:")
for name, method in partial_methods.items():
    partial = method(sample_utterance)
    print(f"  {name}: '{partial}'")

print("\nTesting generation methods with prefix_3 partial:")
prefix_3_partial = partial_methods['prefix_3'](sample_utterance)
context = f"Time: {evaluator.test_df.iloc[10]['timestamp']}"

for name, method in generation_methods.items():
    try:
        # The methods are already bound to `evaluator`, so pass only the partial text and context
        proposal = method(prefix_3_partial, context)
        print(f"  {name}: '{proposal}'")
    except Exception as e:
        print(f"  {name}: Error - {e}")
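
For intuition about the simplest generation method, the following sketch retrieves the corpus utterance that shares the most words with the partial input. It is an assumption about what "exact word matching" could mean, not the code behind `generate_with_lexical_retrieval`, and the name `lexical_retrieve` is hypothetical:

```python
def lexical_retrieve(partial, corpus):
    """Return the corpus utterance with the largest word overlap with `partial`.

    Returns an empty string for an empty corpus; ties go to the earliest utterance.
    """
    if not corpus:
        return ""
    query = set(partial.lower().split())

    def overlap(utterance):
        return len(query & set(utterance.lower().split()))

    return max(corpus, key=overlap)


# Toy corpus, not real user data
toy_corpus = ["I would like some tea", "turn on the light", "I am tired"]
best_match = lexical_retrieve("I would like", toy_corpus)
```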

## 5. Run Full Evaluation

In [None]:
# Run evaluation on a small sample
SAMPLE_SIZE = 10  # Use a small sample for demo

print(f"Running evaluation on {SAMPLE_SIZE} test utterances...")
results_df = evaluator.run_evaluation(
    partial_methods=partial_methods,
    generation_methods=generation_methods,
    evaluation_metrics=evaluation_metrics,
    sample_size=SAMPLE_SIZE
)

print(f"\nEvaluation complete. Generated {len(results_df)} results.")
results_df.head()
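
The two string-based metrics can be defined in several ways, and this notebook does not show the definitions used by `calculate_character_accuracy` and `calculate_word_accuracy`. One plausible reading, offered only as a sketch, scores position-wise character matches and the share of target words recovered:

```python
def character_accuracy(proposal, target):
    """Fraction of aligned character positions that match, over the longer length."""
    if not proposal and not target:
        return 1.0
    matches = sum(p == t for p, t in zip(proposal, target))
    return matches / max(len(proposal), len(target))


def word_accuracy(proposal, target):
    """Fraction of target words that also appear in the proposal (case-insensitive)."""
    target_words = target.lower().split()
    if not target_words:
        return 0.0
    proposal_words = set(proposal.lower().split())
    return sum(w in proposal_words for w in target_words) / len(target_words)


char_score = character_accuracy("abcd", "abcf")
word_score = word_accuracy("i want tea", "i want coffee")
```

Position-wise character matching penalizes a single inserted character heavily, which is one reason to read it alongside the embedding and LLM-judge scores rather than on its own.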

## 6. Analyze Results

In [None]:
# Group by methods and calculate mean scores
grouped_results = results_df.groupby(['partial_method', 'generation_method']).agg({
    'embedding_similarity': 'mean',
    'llm_judge_score': 'mean',
    'character_accuracy': 'mean',
    'word_accuracy': 'mean',
    'target': 'count'  # Count of samples
}).rename(columns={'target': 'count'}).reset_index()

display(grouped_results)

# Find the best-performing combination for each metric, ranked by mean score
best_combinations = {}
metrics = ['embedding_similarity', 'llm_judge_score', 'character_accuracy', 'word_accuracy']

for metric in metrics:
    # Rank the grouped means; results_df[metric].idxmax() would pick a single best row instead
    best_idx = grouped_results[metric].idxmax()
    best_combination = grouped_results.loc[best_idx, ['partial_method', 'generation_method', metric]]
    best_combinations[metric] = best_combination

print("\nBest performing combinations:")
for metric, combination in best_combinations.items():
    print(f"  {metric}: {combination['partial_method']} + {combination['generation_method']} (mean score: {combination[metric]:.3f})")
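
When ranking combinations, the row with the single highest score and the combination with the highest mean score can disagree. The toy data below (invented numbers, not results from this evaluation) shows the difference using the same column names as `results_df`:

```python
import pandas as pd

# Toy scores for illustration only
toy_results = pd.DataFrame({
    "partial_method": ["prefix_2", "prefix_2", "prefix_3", "prefix_3"],
    "generation_method": ["tfidf", "tfidf", "embedding", "embedding"],
    "word_accuracy": [1.0, 0.0, 0.6, 0.7],
})

# Best single row: one lucky utterance can dominate
best_row = toy_results.loc[toy_results["word_accuracy"].idxmax()]

# Best combination by mean: averages over all utterances in each group
means = (
    toy_results
    .groupby(["partial_method", "generation_method"])["word_accuracy"]
    .mean()
)
best_combo = means.idxmax()  # tuple of (partial_method, generation_method)
```

Here the best single row belongs to `prefix_2` + `tfidf`, while the best mean belongs to `prefix_3` + `embedding`.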

## 7. Visualize Results

In [None]:
# Generate visualizations
evaluator.visualize_results(results_df)

## 8. Save Results

In [None]:
# Save detailed results
output_path = "demo_results.csv"
results_df.to_csv(output_path, index=False)
print(f"Results saved to: {output_path}")

# Save summary statistics
summary_path = "demo_summary.csv"
grouped_results.to_csv(summary_path, index=False)
print(f"Summary saved to: {summary_path}")

## 9. Conclusion

With a sample of only 10 utterances, the results above are indicative rather than conclusive. The tentative observations are:

1. **Partial Utterance Methods:**
   - Prefix truncation (especially 2-3 words) tends to preserve intent well
   - Keyword extraction can focus on core concepts

2. **Generation Methods:**
   - Retrieval-based approaches are generally expected to outperform context-only generation; this demo includes no context-only baseline, so that comparison is not tested here
   - Embedding-based retrieval shows promise for semantic similarity

3. **Evaluation Metrics:**
   - Different metrics highlight different aspects of performance
   - Embedding similarity and LLM judge scores tend to align, which should be confirmed on a larger sample

### Next Steps:

1. Run evaluation on larger sample sizes for more robust results
2. Experiment with hybrid approaches combining multiple retrieval methods
3. Implement temporal filtering for time-based context
4. Extend to location-based context filtering
5. Compare different embedding models for better semantic matching
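
As a starting point for the hybrid approach in step 2, one common way to combine ranked candidate lists from several retrievers is reciprocal rank fusion. This is a sketch of one option, not part of the framework. The function name and the toy rankings are hypothetical, and `k=60` is a conventional default rather than a tuned value:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over all lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda item: -scores[item])


# Toy rankings standing in for lexical, TF-IDF, and embedding retrieval output
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["b", "c", "a"],
])
```

Because the fusion uses only ranks, it does not require the retrievers' raw scores to be on a comparable scale.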