# Chat History-Driven LLM Evaluation

**Objective:** Evaluate LLM-based text completion for users of augmentative and alternative communication (AAC), using real chat history data. The framework compares approaches for generating message proposals from partial utterances.

## Overview

This evaluation framework:
1. Splits data chronologically into corpus (for conditioning) and test sets
2. Generates proposals for partial test utterances using different methods
3. Scores proposals against the true full utterances using multiple metrics

The framework is designed to be extensible, supporting:
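- Multiple partial-utterance constructions (prefix truncation, keyword subsets, random word subsets)
- Multiple proposal-generation methods (retrieval-augmented generation with lexical, TF-IDF, or embedding retrieval; context-only prompting)
- Multiple evaluation metrics (embedding similarity, LLM-judge scoring, character- and word-level overlap)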

## 1. Setup and Dependencies

In [None]:
# Install dependencies (run once)
# !pip install llm llm-gemini pandas matplotlib seaborn sentence-transformers scikit-learn tenacity

In [None]:
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import re
from typing import List, Dict, Tuple, Callable
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# LLM and embedding libraries
import llm
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from getpass import getpass

# --- API KEY SETUP ---
if "LLM_GEMINI_KEY" not in os.environ:
    os.environ["LLM_GEMINI_KEY"] = getpass("Enter your Gemini API Key: ")

# Configure models
GENERATIVE_MODEL = llm.get_model("gemini-2.0-flash-exp")
JUDGE_MODEL = llm.get_model("gemini-2.0-flash-exp")
EMBEDDING_MODEL = SentenceTransformer('all-MiniLM-L6-v2')

print("✅ Setup Complete.")

## 2. Data Loading and Preprocessing

In [None]:
# Load the chat history data
# IMPORTANT: This file contains real user speech history and should not be committed to GitHub
try:
    with open('../baton-export-2025-11-24-nofullstop.json', 'r', encoding='utf-8') as f:
        chat_data = json.load(f)
    print(f"✅ Loaded {len(chat_data['sentences'])} utterances from chat history.")
except FileNotFoundError:
    print("❌ ERROR: Chat history file not found. Make sure the path is correct.")
    print("Expected path: ../baton-export-2025-11-24-nofullstop.json")
    raise  # Stop here: every later cell needs chat_data

In [None]:
# Convert to DataFrame for easier manipulation
def preprocess_chat_data(chat_data):
    """
    Convert the raw chat data into a pandas DataFrame with additional features.
    """
    sentences = chat_data['sentences']
    
    # Extract relevant fields
    processed_data = []
    
    for sentence in sentences:
        # Get the primary content and metadata
        content = sentence['content']
        
        # Extract timestamp from metadata
        if sentence.get('metadata') and len(sentence['metadata']) > 0:
            # Some entries have multiple metadata entries, use the first one with a timestamp
            timestamp_str = None
            latitude = longitude = None
            
            for meta in sentence['metadata']:
                if 'timestamp' in meta:
                    timestamp_str = meta['timestamp']
                    latitude = meta.get('latitude')
                    longitude = meta.get('longitude')
                    break
                    
            # Convert timestamp to a timezone-aware datetime
            if timestamp_str:
                try:
                    # Try parsing with timezone
                    timestamp = datetime.fromisoformat(timestamp_str.replace('Z', '+00:00'))
                except ValueError:
                    # Fallback for formats fromisoformat rejects; parse as UTC so the
                    # result stays timezone-aware and sortable with the other rows
                    timestamp = pd.to_datetime(timestamp_str, utc=True).to_pydatetime()
            else:
                timestamp = None
        else:
            timestamp = latitude = longitude = None
        
        # Calculate utterance length in words
        word_count = len(content.split()) if content else 0
        
        processed_data.append({
            'content': content,
            'timestamp': timestamp,
            'latitude': latitude,
            'longitude': longitude,
            'word_count': word_count,
            'uuid': sentence.get('uuid'),
            'anonymous_uuid': sentence.get('anonymousUUID')
        })
    
    df = pd.DataFrame(processed_data)
    
    # Drop rows with missing content or timestamp
    df = df.dropna(subset=['content', 'timestamp'])
    
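    # Normalize to UTC so the column has a single datetime dtype (needed for .dt below)
    df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)

    # Sort by timestamp (chronological order)
    df = df.sort_values('timestamp').reset_index(drop=True)

    # Add time-of-day features (hours are in UTC, not necessarily the user's local time)
    df['hour'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.dayofweek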
    
    return df

# Process the data
chat_df = preprocess_chat_data(chat_data)
print(f"✅ Processed data: {len(chat_df)} utterances with timestamps.")
print(f"Date range: {chat_df['timestamp'].min()} to {chat_df['timestamp'].max()}")
print(f"Average utterance length: {chat_df['word_count'].mean():.1f} words")
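
Timestamps parsed with `strptime` and a literal `Z` come back naive, while `fromisoformat` with an offset returns timezone-aware values, and Python refuses to order the two against each other. The sketch below shows one way to always return UTC-aware datetimes. `parse_utc_timestamp` and the sample strings are illustrative and not part of the export format; treating offset-free timestamps as UTC is an assumption.

```python
from datetime import datetime, timezone

def parse_utc_timestamp(timestamp_str):
    """Parse an ISO 8601 string and return a timezone-aware UTC datetime."""
    try:
        parsed = datetime.fromisoformat(timestamp_str.replace('Z', '+00:00'))
    except ValueError:
        # Same fallback pattern as the preprocessing cell; yields a naive datetime
        parsed = datetime.strptime(timestamp_str, '%Y-%m-%dT%H:%M:%S.%fZ')
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

# Made-up sample values: one 'Z'-suffixed, one with an explicit offset
samples = ["2025-11-24T09:15:00.123Z", "2025-11-23T22:05:10+01:00"]
parsed_samples = sorted(parse_utc_timestamp(s) for s in samples)
for ts in parsed_samples:
    print(ts.isoformat())
```

Because every value carries the same offset, sorting and hour extraction behave consistently across rows.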

## 3. Splitting the Data

In [None]:
# Split the data chronologically
def split_data(df, corpus_ratio=0.67):
    """
    Split the data into corpus (for conditioning) and test sets chronologically.
    
    Args:
        df: DataFrame with chat data
        corpus_ratio: Fraction of data to use as corpus (default: 0.67, roughly 2/3)
    
    Returns:
        corpus_df: First portion of data for retrieval/RAG
        test_df: Final portion of data for evaluation
    """
    split_idx = int(len(df) * corpus_ratio)
    
    corpus_df = df.iloc[:split_idx].reset_index(drop=True)
    test_df = df.iloc[split_idx:].reset_index(drop=True)
    
    return corpus_df, test_df

# Split the data
corpus_df, test_df = split_data(chat_df)
print(f"Corpus (for conditioning): {len(corpus_df)} utterances")
print(f"Test (for evaluation): {len(test_df)} utterances")

# Visualize the split
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(corpus_df['timestamp'], range(len(corpus_df)), 'b-', label='Corpus')
plt.plot(test_df['timestamp'], range(len(test_df)), 'r-', label='Test')
plt.xlabel('Date')
plt.ylabel('Utterance Index')
plt.title('Chronological Distribution of Data')
plt.legend()

plt.subplot(1, 2, 2)
plt.pie([len(corpus_df), len(test_df)], labels=['Corpus', 'Test'], autopct='%1.1f%%')
plt.title('Data Split Ratio')

plt.tight_layout()
plt.show()
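
A chronological split is only useful if no corpus utterance is newer than a test utterance; otherwise retrieval could see the future. The following is a minimal, pandas-free sketch of the same split with an explicit leak check. `chronological_split`, `has_temporal_leak`, and the records are hypothetical and only mirror the logic of `split_data`.

```python
from datetime import datetime, timedelta

def chronological_split(records, corpus_ratio=0.67):
    """Sort (timestamp, text) pairs by time and cut at corpus_ratio."""
    ordered = sorted(records, key=lambda record: record[0])
    split_idx = int(len(ordered) * corpus_ratio)
    return ordered[:split_idx], ordered[split_idx:]

def has_temporal_leak(corpus, test):
    """True if any corpus item is newer than the oldest test item."""
    if not corpus or not test:
        return False
    return max(r[0] for r in corpus) > min(r[0] for r in test)

# Made-up records, deliberately out of order
start = datetime(2025, 1, 1)
records = [(start + timedelta(days=d), f"utterance {d}") for d in (3, 0, 5, 1, 4, 2)]

corpus, test = chronological_split(records)
print(len(corpus), len(test), has_temporal_leak(corpus, test))
```

The same check can be applied to `corpus_df` and `test_df` by comparing the maximum corpus timestamp with the minimum test timestamp.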

## 4. Partial Utterance Generation

In [None]:
# Different methods to create partial utterances
def create_prefix_partial(text, n_words=3):
    """
    Create a partial utterance by taking the first N words.
    
    Args:
        text: Original full utterance
        n_words: Number of words to include in the partial utterance
    
    Returns:
        Partial utterance consisting of the first N words
    """
    words = text.split()
    if len(words) <= n_words:
        return text
    return ' '.join(words[:n_words])

def create_keyword_partial(text, n_keywords=2):
    """
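    Create a partial utterance by extracting keywords.
    Ranks words longer than two characters by how often they occur within the
    utterance itself (this is word frequency, not TF-IDF); ties keep
    first-occurrence order.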
    
    Args:
        text: Original full utterance
        n_keywords: Number of keywords to include in the partial utterance
    
    Returns:
        Partial utterance consisting of the most important keywords
    """
    words = text.split()
    if len(words) <= n_keywords:
        return text
        
    # Simple keyword extraction using word frequency
    word_freq = Counter([word.lower() for word in words if len(word) > 2])
    
    # Get the most frequent words
    top_words = [word for word, count in word_freq.most_common(n_keywords)]
    
    # Find these words in the original order
    partial_words = []
    for word in words:
        if word.lower() in top_words and len(partial_words) < n_keywords:
            partial_words.append(word)
    
    # If we couldn't find enough unique words, fall back to first N words
    if len(partial_words) < n_keywords:
        return create_prefix_partial(text, n_keywords)
        
    return ' '.join(partial_words)

def create_random_partial(text, min_words=1, max_words=3):
    """
    Create a partial utterance by selecting random words.
    
    Args:
        text: Original full utterance
        min_words: Minimum number of words to include
        max_words: Maximum number of words to include
    
    Returns:
        Partial utterance with randomly selected words
    """
    words = text.split()
    n_words = min(len(words), np.random.randint(min_words, max_words + 1))
    
    # Select random indices
    indices = np.random.choice(len(words), size=n_words, replace=False)
    indices = sorted(indices)  # Maintain original order
    
    return ' '.join([words[i] for i in indices])

# Test the partial utterance functions
test_utterance = "I need to adjust my neck brace because it's uncomfortable"
print(f"Original: {test_utterance}")
print(f"Prefix (3 words): {create_prefix_partial(test_utterance, 3)}")
print(f"Keyword (2 words): {create_keyword_partial(test_utterance, 2)}")
print(f"Random (1-3 words): {create_random_partial(test_utterance)}")
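
`create_keyword_partial` ranks words by their frequency inside a single utterance, where most words occur exactly once. One alternative, sketched here under the assumption that words rare in the corpus are the more informative ones, scores each word by inverse document frequency computed over the corpus. `build_idf` and `idf_keyword_partial` are hypothetical helpers, the toy corpus is invented, and the smoothed IDF formula is the one scikit-learn's `TfidfVectorizer` applies by default.

```python
import math
from collections import Counter

def build_idf(corpus_texts):
    """Return (idf_by_word, idf_for_unseen_words) for a list of utterances."""
    n_docs = len(corpus_texts)
    doc_freq = Counter()
    for text in corpus_texts:
        doc_freq.update(set(text.lower().split()))  # count each word once per utterance
    idf = {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}
    unseen_idf = math.log(1 + n_docs) + 1  # score for a document frequency of zero
    return idf, unseen_idf

def idf_keyword_partial(text, idf, unseen_idf, n_keywords=2):
    """Keep the n_keywords highest-IDF words, in their original order."""
    words = text.split()
    if len(words) <= n_keywords:
        return text
    ranked = sorted(range(len(words)),
                    key=lambda i: idf.get(words[i].lower(), unseen_idf),
                    reverse=True)
    keep = sorted(ranked[:n_keywords])  # restore utterance order
    return ' '.join(words[i] for i in keep)

# Invented corpus for illustration only
toy_corpus = [
    "I need to go home",
    "I need water",
    "I want to adjust my pillow",
    "please adjust my brace",
]
idf, unseen_idf = build_idf(toy_corpus)
keywords = idf_keyword_partial("I need to adjust my neck brace", idf, unseen_idf)
print(keywords)
```

Words absent from the corpus receive the highest score, which favors novel content words but also typos, so a real implementation may want to cap or filter them.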

## 5. Corpus-Based Retrieval

In [None]:
# Functions for retrieving relevant examples from the corpus
def retrieve_lexical_examples(corpus_df, partial_text, top_k=3):
    """
    Retrieve corpus examples that share at least one word with the partial text,
    ranked by the number of shared words.
    
    Args:
        corpus_df: DataFrame with corpus utterances
        partial_text: Partial utterance to search for
        top_k: Maximum number of examples to retrieve
    
    Returns:
        List of matching corpus utterances
    """
    matching_examples = []
    partial_words = set(partial_text.lower().split())
    
    for _, row in corpus_df.iterrows():
        content = row['content']
        content_words = set(content.lower().split())
        
        # Check if there's any overlap
        overlap = len(partial_words.intersection(content_words))
        
        if overlap > 0:
            matching_examples.append({
                'content': content,
                'overlap': overlap,
                'timestamp': row['timestamp']
            })
    
    # Sort by overlap count and take top_k
    matching_examples.sort(key=lambda x: x['overlap'], reverse=True)
    return matching_examples[:top_k]

def retrieve_tfidf_examples(corpus_df, partial_text, top_k=3):
    """
    Retrieve corpus examples using TF-IDF similarity.
    
    Args:
        corpus_df: DataFrame with corpus utterances
        partial_text: Partial utterance to search for
        top_k: Maximum number of examples to retrieve
    
    Returns:
        List of similar corpus utterances
    """
    # Create TF-IDF vectorizer
    corpus_contents = corpus_df['content'].tolist()
    
    # Add partial_text to the corpus for vectorization
    all_texts = corpus_contents + [partial_text]
    
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(all_texts)
    
    # Calculate similarity between partial_text and all corpus utterances
    partial_vector = tfidf_matrix[-1]  # Last item is our partial text
    corpus_vectors = tfidf_matrix[:-1]  # All but last are corpus items
    
    # Calculate cosine similarity
    similarities = cosine_similarity(partial_vector, corpus_vectors).flatten()
    
    # Get the top_k most similar examples
    top_indices = similarities.argsort()[-top_k:][::-1]
    
    examples = []
    for idx in top_indices:
        examples.append({
            'content': corpus_contents[idx],
            'similarity': similarities[idx],
            'timestamp': corpus_df.iloc[idx]['timestamp']
        })
    
    return examples

def retrieve_embedding_examples(corpus_df, partial_text, top_k=3):
    """
    Retrieve corpus examples using embedding similarity.
    
    Args:
        corpus_df: DataFrame with corpus utterances
        partial_text: Partial utterance to search for
        top_k: Maximum number of examples to retrieve
    
    Returns:
        List of similar corpus utterances
    """
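    # Generate embeddings for all corpus utterances
    # NOTE: this re-encodes the whole corpus on every call; for larger runs,
    # compute the corpus embeddings once and reuse them across queries
    corpus_contents = corpus_df['content'].tolist()
    corpus_embeddings = EMBEDDING_MODEL.encode(corpus_contents)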
    
    # Generate embedding for partial text
    partial_embedding = EMBEDDING_MODEL.encode([partial_text])
    
    # Calculate similarity
    similarities = cosine_similarity(partial_embedding, corpus_embeddings).flatten()
    
    # Get the top_k most similar examples
    top_indices = similarities.argsort()[-top_k:][::-1]
    
    examples = []
    for idx in top_indices:
        examples.append({
            'content': corpus_contents[idx],
            'similarity': similarities[idx],
            'timestamp': corpus_df.iloc[idx]['timestamp']
        })
    
    return examples

# Test the retrieval functions
test_partial = "need to adjust"
print(f"Testing retrieval with: '{test_partial}'\n")

print("Lexical Retrieval:")
lexical_examples = retrieve_lexical_examples(corpus_df, test_partial)
for example in lexical_examples:
    print(f"  {example['content']} (overlap: {example['overlap']})")

print("\nTF-IDF Retrieval:")
tfidf_examples = retrieve_tfidf_examples(corpus_df, test_partial)
for example in tfidf_examples:
    print(f"  {example['content']} (similarity: {example['similarity']:.3f})")

print("\nEmbedding Retrieval:")
embedding_examples = retrieve_embedding_examples(corpus_df, test_partial)
for example in embedding_examples:
    print(f"  {example['content']} (similarity: {example['similarity']:.3f})")
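
Each retrieval call above vectorizes the corpus again, which becomes the dominant cost once the evaluation loops over many partial utterances. A minimal sketch of the usual remedy follows: encode the corpus once and reuse the vectors for every query. `CachedRetriever`, `CountingEncoder`, and the bag-of-words encoder are hypothetical stand-ins chosen so the example runs without a model; with sentence-transformers the cached value would be the result of a single `EMBEDDING_MODEL.encode(corpus_contents)` call.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Toy stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class CountingEncoder:
    """Wraps the toy encoder and counts how often it is called."""
    def __init__(self):
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        return bag_of_words(text)

class CachedRetriever:
    """Encode the corpus once at construction, then reuse the vectors."""
    def __init__(self, texts, encode):
        self.texts = list(texts)
        self.encode = encode
        self.vectors = [encode(text) for text in self.texts]

    def top_k(self, query, k=3):
        query_vec = self.encode(query)
        scored = [(cosine(query_vec, vec), text)
                  for vec, text in zip(self.vectors, self.texts)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

encoder = CountingEncoder()
retriever = CachedRetriever(
    ["please adjust my brace", "I need water", "turn on the TV"],  # invented corpus
    encoder,
)
first = retriever.top_k("adjust brace", k=1)
second = retriever.top_k("need water", k=1)
print(first[0][1], "|", second[0][1], "|", encoder.calls)
```

Two queries over a three-utterance corpus cost five encoder calls here, instead of the eight that re-encoding the corpus per query would need.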

## 6. Proposal Generation Methods

In [None]:
# Define retry logic for LLM calls
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=4, max=10))
def generate_completion(partial_text, examples=None, context=None):
    """
    Generate a completion for a partial utterance using an LLM.
    
    Args:
        partial_text: The partial utterance to complete
        examples: List of retrieved examples from the corpus
        context: Additional context (time, location, etc.)
    
    Returns:
        Generated completion
    """
    system_prompt = "You are an intelligent AAC text completion system. Complete the user's partial text based on the provided context and examples."
    
    user_prompt = f"Complete the following partial text: '{partial_text}'\n\n"
    
    # Add context if provided
    if context:
        user_prompt += f"Context: {context}\n\n"
    
    # Add examples if provided
    if examples:
        user_prompt += "Here are some examples of similar utterances:\n"
        for i, example in enumerate(examples, 1):
            user_prompt += f"{i}. {example['content']}\n"
        user_prompt += "\n"
    
    user_prompt += "Provide a completion that matches the user's likely intent. Only return the completed text, no explanation."
    
    # Let exceptions propagate: the @retry decorator only retries when the call raises.
    # Catching the error here and returning the partial text would disable the retries
    # and silently score the unchanged partial text as if it were a proposal.
    response = GENERATIVE_MODEL.prompt(user_prompt, system=system_prompt, temperature=0.2)
    return response.text().strip()

In [None]:
def generate_with_lexical_retrieval(corpus_df, partial_text, context=None):
    """
    Generate a completion using lexical retrieval examples.
    """
    examples = retrieve_lexical_examples(corpus_df, partial_text)
    return generate_completion(partial_text, examples, context)

def generate_with_tfidf_retrieval(corpus_df, partial_text, context=None):
    """
    Generate a completion using TF-IDF retrieval examples.
    """
    examples = retrieve_tfidf_examples(corpus_df, partial_text)
    return generate_completion(partial_text, examples, context)

def generate_with_embedding_retrieval(corpus_df, partial_text, context=None):
    """
    Generate a completion using embedding-based retrieval examples.
    """
    examples = retrieve_embedding_examples(corpus_df, partial_text)
    return generate_completion(partial_text, examples, context)

def generate_with_context_only(partial_text, context):
    """
    Generate a completion using only contextual information (no corpus examples).
    """
    return generate_completion(partial_text, examples=None, context=context)

# Test the proposal generation methods
test_partial = "need to adjust"
test_context = "Time: 14:30, Location: Home"

print(f"Testing proposal generation with: '{test_partial}'\n")

print("With Lexical Retrieval:")
print(f"  {generate_with_lexical_retrieval(corpus_df, test_partial, test_context)}\n")

print("With TF-IDF Retrieval:")
print(f"  {generate_with_tfidf_retrieval(corpus_df, test_partial, test_context)}\n")

print("With Embedding Retrieval:")
print(f"  {generate_with_embedding_retrieval(corpus_df, test_partial, test_context)}\n")

print("With Context Only:")
print(f"  {generate_with_context_only(test_partial, test_context)}")
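
A retry decorator only sees exceptions that leave the decorated function. If the function catches the error itself and returns a fallback, the first attempt counts as a success and nothing is retried. The sketch below demonstrates the effect with a small hand-written `retry_calls` decorator standing in for tenacity's `@retry`; `FlakyService` is a made-up service that fails twice before answering.

```python
import functools

def retry_calls(max_attempts):
    """Minimal retry decorator: re-run the function while it raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
        return wrapper
    return decorator

class FlakyService:
    """Made-up service that fails a fixed number of times, then answers."""
    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def call(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("temporary failure")
        return "completed text"

swallowing_service = FlakyService(failures=2)

@retry_calls(max_attempts=5)
def swallowing():
    try:
        return swallowing_service.call()
    except Exception:
        return "fallback"  # the decorator never sees the error

propagating_service = FlakyService(failures=2)

@retry_calls(max_attempts=5)
def propagating():
    return propagating_service.call()  # errors reach the decorator

swallowed_result = swallowing()
propagated_result = propagating()
print(swallowed_result, swallowing_service.calls)
print(propagated_result, propagating_service.calls)
```

The swallowing variant returns its fallback after a single call, while the propagating variant reaches the service three times and returns the real answer.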

## 7. Evaluation Metrics

In [None]:
# Functions to evaluate the quality of generated proposals
def calculate_embedding_similarity(text1, text2):
    """
    Calculate semantic similarity between two texts using embeddings.
    
    Args:
        text1: First text
        text2: Second text
    
    Returns:
        Cosine similarity score; it can range from -1 to 1, with higher meaning more similar
    """
    embeddings = EMBEDDING_MODEL.encode([text1, text2])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return float(similarity)

# Define retry logic for LLM-based evaluation
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=5))
def judge_similarity(target, proposal):
    """
    Use an LLM to judge the semantic similarity between a target and proposal.
    
    Args:
        target: The ground truth utterance
        proposal: The generated proposal
    
    Returns:
        Similarity score from 1 to 10
    """
    system_prompt = "You are an expert evaluator of AAC text completions."
    
    user_prompt = f"""
    Compare these two phrases:
    1. TARGET (what the user actually said): "{target}"
    2. PROPOSAL (what the system predicted): "{proposal}"
    
    Rate the semantic similarity on a scale of 1 to 10:
    1 = Completely unrelated or harmful
    5 = Vaguely related but incorrect
    10 = Perfect match in meaning and intent
    
    Return ONLY the integer score, nothing else.
    """
    
    # Exceptions propagate so the @retry decorator can actually retry the call
    response = JUDGE_MODEL.prompt(user_prompt, system=system_prompt, temperature=0.0)
    # Take the first integer in the reply; joining every digit would turn
    # a reply such as "10/10" into 1010
    match = re.search(r'\d+', response.text())
    if match is None:
        return np.nan  # No score found; NaN is skipped when averaging
    return min(max(int(match.group()), 1), 10)  # Clamp to the 1-10 scale

def calculate_character_accuracy(target, proposal):
    """
    Calculate character-level accuracy between target and proposal.
    
    Args:
        target: The ground truth utterance
        proposal: The generated proposal
    
    Returns:
        Character accuracy score between 0 and 1
    """
    if not target or not proposal:
        return 0.0
    
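    # SequenceMatcher.ratio() returns 2*M/T (matched characters over the combined
    # length of both strings), a similarity in [0, 1]. It is not a Levenshtein distance.
    import difflib
    similarity = difflib.SequenceMatcher(None, target, proposal).ratio()
    return similarity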

def calculate_word_accuracy(target, proposal):
    """
    Calculate word-level accuracy: the fraction of unique target words that also
    appear in the proposal (a recall, so extra proposal words are not penalized).
    
    Args:
        target: The ground truth utterance
        proposal: The generated proposal
    
    Returns:
        Word accuracy score between 0 and 1
    """
    if not target or not proposal:
        return 0.0
    
    target_words = set(target.lower().split())
    proposal_words = set(proposal.lower().split())
    
    if not target_words:
        return 0.0
    
    intersection = target_words.intersection(proposal_words)
    return len(intersection) / len(target_words)

# Test the evaluation metrics
target_text = "I need to adjust my neck brace"
proposal_text = "Need to adjust neck brace"
bad_proposal = "I want to watch TV"

print(f"Testing evaluation metrics with:\n  Target: '{target_text}'\n  Proposal: '{proposal_text}'\n  Bad Proposal: '{bad_proposal}'\n")

print(f"Embedding Similarity (good): {calculate_embedding_similarity(target_text, proposal_text):.3f}")
print(f"Embedding Similarity (bad): {calculate_embedding_similarity(target_text, bad_proposal):.3f}\n")

print(f"LLM Judge Score (good): {judge_similarity(target_text, proposal_text)}")
print(f"LLM Judge Score (bad): {judge_similarity(target_text, bad_proposal)}\n")

print(f"Character Accuracy (good): {calculate_character_accuracy(target_text, proposal_text):.3f}")
print(f"Character Accuracy (bad): {calculate_character_accuracy(target_text, bad_proposal):.3f}\n")

print(f"Word Accuracy (good): {calculate_word_accuracy(target_text, proposal_text):.3f}")
print(f"Word Accuracy (bad): {calculate_word_accuracy(target_text, bad_proposal):.3f}")
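
Judge models do not always reply with a bare integer, so how the score is extracted matters. Joining every digit in the reply misreads answers such as "10/10". A minimal sketch of a stricter parser follows; `parse_judge_score` and the sample replies are illustrative, and returning `None` for unusable replies is one possible convention rather than the notebook's.

```python
import re

def parse_judge_score(response_text, low=1, high=10):
    """Return the first integer in the reply if it lies in [low, high], else None."""
    match = re.search(r'\d+', response_text)
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None

def join_all_digits(response_text):
    """Digit-joining approach, shown only for comparison."""
    digits = "".join(filter(str.isdigit, response_text))
    return int(digits) if digits else 0

# Invented replies covering the common shapes
replies = ["8", "10/10", "Score: 7 out of 10", "I cannot rate this"]
for reply in replies:
    print(repr(reply), parse_judge_score(reply), join_all_digits(reply))
```

Keeping unparseable replies distinct from a genuine low score avoids dragging averages down for reasons unrelated to proposal quality.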

## 8. Evaluation Framework

In [None]:
def run_evaluation(corpus_df, test_df, partial_methods, generation_methods, 
                    evaluation_metrics, sample_size=None):
    """
    Run a comprehensive evaluation of different partial utterance methods,
    generation methods, and evaluation metrics.
    
    Args:
        corpus_df: DataFrame with corpus utterances
        test_df: DataFrame with test utterances
        partial_methods: Dictionary mapping names to functions that create partial utterances
        generation_methods: Dictionary of generation method functions
        evaluation_metrics: Dictionary of evaluation metric functions
        sample_size: Number of test examples to evaluate (None for all)
    
    Returns:
        DataFrame with evaluation results
    """
    # Sample test data if requested
    if sample_size and sample_size < len(test_df):
        eval_df = test_df.sample(sample_size, random_state=42).reset_index(drop=True)
    else:
        eval_df = test_df
    
    results = []
    
    for i, row in eval_df.iterrows():
        target_text = row['content']
        
        # Skip very short utterances
        if len(target_text.split()) < 3:
            continue
        
        print(f"Processing example {i+1}/{len(eval_df)}: '{target_text[:30]}...'")
        
        # Extract context information
        timestamp = row['timestamp']
        latitude = row['latitude']
        longitude = row['longitude']
        context = f"Time: {timestamp}, Location: {latitude}, {longitude}"
        
        # Test each partial utterance method
        for partial_method_name, partial_method in partial_methods.items():
            partial_text = partial_method(target_text)
            
            # Skip if partial text is the same as target
            if partial_text == target_text:
                continue
            
            # Test each generation method
            for gen_method_name, gen_method in generation_methods.items():
                try:
                    # Generate proposal
                    if gen_method_name == "context_only":
                        proposal = gen_method(partial_text, context)
                    else:
                        proposal = gen_method(corpus_df, partial_text, context)
                    
                    # Evaluate with each metric
                    result = {
                        'target': target_text,
                        'partial': partial_text,
                        'proposal': proposal,
                        'partial_method': partial_method_name,
                        'generation_method': gen_method_name,
                    }
                    
                    for metric_name, metric_func in evaluation_metrics.items():
                        result[metric_name] = metric_func(target_text, proposal)
                    
                    results.append(result)
                    
                except Exception as e:
                    print(f"  Error with {partial_method_name} + {gen_method_name}: {e}")
                    
    return pd.DataFrame(results)

# Define the methods to evaluate
partial_methods = {
    'prefix_3': lambda text: create_prefix_partial(text, 3),
    'prefix_2': lambda text: create_prefix_partial(text, 2),
    'keyword_2': lambda text: create_keyword_partial(text, 2),
    'random': lambda text: create_random_partial(text)
}

generation_methods = {
    'lexical': generate_with_lexical_retrieval,
    'tfidf': generate_with_tfidf_retrieval,
    'embedding': generate_with_embedding_retrieval,
    'context_only': generate_with_context_only
}

evaluation_metrics = {
    'embedding_similarity': calculate_embedding_similarity,
    'llm_judge_score': judge_similarity,
    'character_accuracy': calculate_character_accuracy,
    'word_accuracy': calculate_word_accuracy
}

print("Evaluation framework configured.")

## 9. Run the Evaluation

In [None]:
# Run the evaluation on a subset of test data
# Set sample_size to None to evaluate all test data (may take a long time)
SAMPLE_SIZE = 10  # Start with a small sample for testing

# Run the evaluation
results_df = run_evaluation(
    corpus_df, test_df, 
    partial_methods, generation_methods, evaluation_metrics,
    sample_size=SAMPLE_SIZE
)

print(f"\nEvaluation complete. Generated {len(results_df)} results.")
results_df.head()
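
Before raising `SAMPLE_SIZE`, it helps to estimate how many LLM requests a run will issue, since every combination of example, partial method, and generation method triggers one generation call and one judge call. The helper below is a rough sketch; `estimate_llm_calls` is hypothetical, and the result is an upper bound because short utterances and partials identical to the target are skipped, while retries are not counted.

```python
def estimate_llm_calls(n_examples, n_partial_methods, n_generation_methods,
                       n_llm_metrics=1):
    """Upper bound on LLM requests for one evaluation run (retries excluded)."""
    combinations = n_examples * n_partial_methods * n_generation_methods
    generation_calls = combinations               # one completion per combination
    judge_calls = combinations * n_llm_metrics    # one judge call per LLM-based metric
    return generation_calls + judge_calls

# Configuration used in this notebook: 10 examples, 4 partial methods, 4 generation methods
calls_for_sample = estimate_llm_calls(10, 4, 4)
calls_for_thousand = estimate_llm_calls(1000, 4, 4)
print(calls_for_sample, calls_for_thousand)
```

The count grows linearly with the sample size, so doubling the sample doubles the request budget.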

## 10. Analyze Results

In [None]:
# Aggregate results by partial method and generation method
if not results_df.empty:
    # Group by methods and calculate mean scores
    grouped_results = results_df.groupby(['partial_method', 'generation_method']).agg({
        'embedding_similarity': 'mean',
        'llm_judge_score': 'mean',
        'character_accuracy': 'mean',
        'word_accuracy': 'mean',
        'target': 'count'  # Count of samples
    }).rename(columns={'target': 'count'}).reset_index()
    
    # Display the results
    display(grouped_results)
    
    # Visualize the results
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Performance by Partial Method and Generation Method', fontsize=16)
    
    metrics = ['embedding_similarity', 'llm_judge_score', 'character_accuracy', 'word_accuracy']
    titles = ['Embedding Similarity', 'LLM Judge Score', 'Character Accuracy', 'Word Accuracy']
    
    for ax, metric, title in zip(axes.flatten(), metrics, titles):
        pivot = grouped_results.pivot(index='partial_method', columns='generation_method', values=metric)
        sns.heatmap(pivot, annot=True, cmap='YlGnBu', fmt='.3f', ax=ax)
        ax.set_title(title)
        
    plt.tight_layout()
    plt.show()
    
else:
    print("No results to analyze. Please run the evaluation first.")

In [None]:
# Analyze the best performing combinations
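if not results_df.empty:
    metrics = ['embedding_similarity', 'llm_judge_score', 'character_accuracy', 'word_accuracy']

    # Mean score per (partial method, generation method) pair. Ranking single rows
    # would pick the luckiest individual example rather than the best combination.
    mean_scores = results_df.groupby(['partial_method', 'generation_method'])[metrics].mean()

    # Find the best combination for each metric
    best_combinations = {}
    for metric in metrics:
        partial_name, generation_name = mean_scores[metric].idxmax()
        best_combinations[metric] = (partial_name, generation_name, mean_scores[metric].max())

    print("Best performing combinations (by mean score):")
    for metric, (partial_name, generation_name, score) in best_combinations.items():
        print(f"  {metric}: {partial_name} + {generation_name} (mean score: {score:.3f})")

    # Show example outputs for the best combination under one reference metric.
    # The metrics use different scales (the judge score is 1-10, the others are at
    # most 1), so comparing raw scores across metrics would always pick the judge score.
    best_metric = 'embedding_similarity'
    best_partial, best_generation, _ = best_combinations[best_metric]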
    
    print(f"\nExamples for best combination ({best_partial} + {best_generation}):")
    examples = results_df[(results_df['partial_method'] == best_partial) & 
                           (results_df['generation_method'] == best_generation)].head(5)
    
    for _, example in examples.iterrows():
        print(f"  Partial: '{example['partial']}'")
        print(f"  Target:  '{example['target']}'")
        print(f"  Proposal: '{example['proposal']}'")
        print(f"  Scores: Emb={example['embedding_similarity']:.3f}, LLM={example['llm_judge_score']:.1f}\n")
else:
    print("No results to analyze. Please run the evaluation first.")

## 11. Error Analysis

In [None]:
# Analyze common error patterns
if not results_df.empty:
    # Find examples with low similarity scores
    low_similarity = results_df[results_df['embedding_similarity'] < 0.3].sort_values('embedding_similarity')
    
    print("Examples with low embedding similarity:")
    for _, example in low_similarity.head(5).iterrows():
        print(f"  Partial: '{example['partial']}'")
        print(f"  Target:  '{example['target']}'")
        print(f"  Proposal: '{example['proposal']}'")
        print(f"  Method: {example['partial_method']} + {example['generation_method']}")
        print(f"  Score: {example['embedding_similarity']:.3f}\n")
    
    # Analyze partial method performance
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Rescale the judge score (1-10) to 0-1 so both metrics share one axis
    score_cols = ['embedding_similarity', 'llm_judge_score']
    scaled_df = results_df.assign(llm_judge_score=results_df['llm_judge_score'] / 10)

    # Compare partial methods
    partial_perf = scaled_df.groupby('partial_method')[score_cols].mean()
    partial_perf.plot(kind='bar', ax=axes[0])
    axes[0].set_title('Performance by Partial Method')
    axes[0].set_ylabel('Score')
    axes[0].legend(['Embedding Similarity', 'LLM Judge Score / 10'])

    # Compare generation methods
    gen_perf = scaled_df.groupby('generation_method')[score_cols].mean()
    gen_perf.plot(kind='bar', ax=axes[1])
    axes[1].set_title('Performance by Generation Method')
    axes[1].set_ylabel('Score')
    axes[1].legend(['Embedding Similarity', 'LLM Judge Score / 10'])
    
    plt.tight_layout()
    plt.show()
else:
    print("No results to analyze. Please run the evaluation first.")

## 12. Temporal Analysis

In [None]:
# Analyze performance across different times of day
if not results_df.empty:
    # Add time information to results.
    # Matching is by utterance text, so a repeated utterance takes the hour of its
    # first occurrence in the test set.
    hour_by_content = test_df.drop_duplicates('content').set_index('content')['hour']
    enriched_df = results_df.assign(hour=results_df['target'].map(hour_by_content))
    enriched_df = enriched_df.dropna(subset=['hour'])
    enriched_df['hour'] = enriched_df['hour'].astype(int)

    if not enriched_df.empty:
        
        # Group by time of day
        time_performance = enriched_df.groupby('hour')[['embedding_similarity', 'llm_judge_score']].mean()
        
        # Plot performance by time of day
        fig, ax = plt.subplots(figsize=(12, 5))
        time_performance.plot(kind='line', ax=ax, marker='o')
        ax.set_title('Performance by Time of Day')
        ax.set_xlabel('Hour of Day')
        ax.set_ylabel('Score')
        ax.legend(['Embedding Similarity', 'LLM Judge Score'])
        ax.set_xticks(range(0, 24))
        
        plt.tight_layout()
        plt.show()
        
        # Analyze by time periods (morning, afternoon, evening, night)
        def get_time_period(hour):
            if 5 <= hour < 12:
                return 'Morning'
            elif 12 <= hour < 17:
                return 'Afternoon'
            elif 17 <= hour < 22:
                return 'Evening'
            else:
                return 'Night'
        
        enriched_df['time_period'] = enriched_df['hour'].apply(get_time_period)
        period_performance = enriched_df.groupby('time_period')[['embedding_similarity', 'llm_judge_score']].mean()
        
        # Plot performance by time period
        fig, ax = plt.subplots(figsize=(10, 5))
        period_performance.plot(kind='bar', ax=ax)
        ax.set_title('Performance by Time Period')
        ax.set_ylabel('Score')
        ax.legend(['Embedding Similarity', 'LLM Judge Score'])
        
        plt.tight_layout()
        plt.show()
    else:
        print("Could not match results with test data for temporal analysis.")
else:
    print("No results to analyze. Please run the evaluation first.")

## 13. Conclusions and Next Steps

### Key Findings

The points below are provisional: they summarize what the evaluation is expected to show and should be confirmed on a larger sample than `SAMPLE_SIZE = 10`.

1. **Partial Utterance Methods:**
   - Prefix truncation with 2-3 words seems to work well for preserving user intent
   - Keyword extraction can be effective for focusing on core concepts
   - Random word selection performs poorly, as expected

2. **Generation Methods:**
   - Retrieval-based approaches (lexical, TF-IDF, embedding) generally outperform context-only approaches
   - Embedding-based retrieval shows promise for capturing semantic similarity
   - Context-only generation may be useful when no relevant corpus examples are available

3. **Evaluation Metrics:**
   - Embedding similarity and LLM-judge scores provide complementary perspectives
   - Character and word accuracy are stricter measures that may undervalue proposals that are semantically similar but phrased differently

### Next Steps

1. **Extend the Evaluation:**
   - Test with larger sample sizes
   - Incorporate more diverse partial utterance methods
   - Implement additional retrieval methods (e.g., time-based filtering)

2. **Hybrid Approaches:**
   - Combine multiple retrieval methods
   - Weight retrieved examples by relevance
   - Implement contextual filtering based on time and location

3. **Personalization:**
   - Analyze performance on different types of utterances (needs, preferences, etc.)
   - Develop adaptive partial utterance methods based on user behavior
   - Implement temporal patterns for better context awareness

4. **Real-World Testing:**
   - Integrate with existing AAC systems
   - Conduct user studies with actual AAC users
   - Measure impact on communication speed and effort