# ASL Alphabet Recognition: Evaluating LLM Vision Capabilities for Accessibility

## Research Overview
This notebook evaluates how well current Large Language Models (LLMs) with vision input identify the letters of the American Sign Language (ASL) alphabet from images. The goal is to assess the current state of LLMs for accessibility, specifically their vision capabilities for sign language recognition.

### Methodology
- **Task**: Multi-class classification of ASL alphabet images (A-Z), scored against labeled ground truth
- **Approach**: Feed individual ASL alphabet images to various LLMs and compare their predictions with ground truth labels
- **Metrics**: Accuracy, confusion matrix, per-class performance, and response time
- **Models to Evaluate**: GPT-4V, Claude 3, Gemini Pro Vision, and others
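
The metrics listed above can be illustrated on a handful of made-up labels. The snippet below is a minimal sketch using only the standard library; the label values are invented for illustration and are not results from this evaluation.

```python
from collections import Counter

# Toy ground-truth and predicted labels (made-up values, for illustration only).
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "S", "B", "B", "C", "O"]

# Overall accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion counts: how often each (true, predicted) pair occurs.
confusion = Counter(zip(y_true, y_pred))

# Per-class accuracy: correct predictions for a letter / samples of that letter.
totals = Counter(y_true)
per_class = {
    letter: confusion[(letter, letter)] / totals[letter]
    for letter in sorted(totals)
}

print(f"accuracy = {accuracy:.2f}")
print(per_class)
```

The notebook itself computes the same quantities with pandas and scikit-learn; this version only shows what each number means.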


In [31]:
# 1. Import Required Libraries
import os
import json
import time
import base64
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from PIL import Image
from typing import List, Dict, Tuple, Optional
from datetime import datetime
from tqdm import tqdm

from io import BytesIO

# For ML metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

# For API calls
import openai
import anthropic
import google.generativeai as genai

# Set up plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")


Libraries imported successfully!


In [None]:
# 2. Configuration and API Setup
CONFIG = {
    'data_path': './asl_alphabet_dataset',  # Path to your ASL dataset
    'results_path': './evaluation_results',
    'api_keys': {
        'openai': os.getenv('OPENAI_API_KEY', 'your-openai-key-here'),  # placeholder must match the check used when initializing evaluators
        'anthropic': os.getenv('ANTHROPIC_API_KEY', 'your-anthropic-key-here'),
        'google': os.getenv('GOOGLE_API_KEY', 'your-google-key-here')
    },
    'models': {
        'gpt4v': 'gpt-5',
        'claude': 'claude-3-opus-20240229',
        'gemini': 'gemini-pro-vision'
    },
    'evaluation': {
        'max_samples_per_class': 10,  # Number of images per letter to evaluate
        'timeout': 30,  # API timeout in seconds
        'retry_attempts': 3
    }
}

# Create results directory
Path(CONFIG['results_path']).mkdir(parents=True, exist_ok=True)

# ASL Alphabet classes (A-Z)
ASL_CLASSES = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
print(f"ASL Classes: {ASL_CLASSES}")
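
`CONFIG['evaluation']['retry_attempts']` is defined above, but the cells shown in this notebook do not reference it. One possible way to use such a setting is a small retry wrapper with exponential backoff. The sketch below is an assumption about how it could be wired in, not part of the original pipeline; `call_with_retries` and `flaky` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn(); on failure wait base_delay * 2**i seconds and retry.

    Hypothetical helper: `attempts` plays the role of CONFIG's 'retry_attempts'.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: API clients raise many error types
            last_error = exc
            if i < attempts - 1:
                time.sleep(base_delay * (2 ** i))
    raise last_error


# Demonstration with a function that fails twice before succeeding.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = call_with_retries(flaky, attempts=3, base_delay=0.0)
print(result, calls["n"])
```

In the evaluators, the wrapped callable would be the provider request itself, for example a `lambda` around the client call.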


In [None]:
# 3. Data Loading Functions
class ASLDataLoader:
    """Load and manage ASL alphabet images for evaluation"""
    
    def __init__(self, data_path: str):
        self.data_path = Path(data_path)
        self.images = []
        self.labels = []
        
    def load_dataset(self, max_per_class: Optional[int] = None) -> Tuple[List, List]:
        """
        Load ASL images from directory structure
        
        Args:
            max_per_class: Maximum number of images to load per class
            
        Returns:
            Tuple of (image_paths, labels)
        """
        if not self.data_path.exists():
            print(f"Warning: Dataset path {self.data_path} does not exist!")
            print("Creating sample dataset structure...")
            self._create_sample_structure()
            return [], []
        
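        # Reset so repeated calls do not accumulate duplicate entries
        self.images = []
        self.labels = []

        for letter in ASL_CLASSES:
            letter_path = self.data_path / letter
            if letter_path.exists():
                # Sort for a deterministic selection; glob order depends on the filesystem
                image_files = sorted(
                    list(letter_path.glob('*.jpg')) +
                    list(letter_path.glob('*.png')) +
                    list(letter_path.glob('*.jpeg'))
                )
                
                # Limit samples if specified
                if max_per_class:
                    image_files = image_files[:max_per_class]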
                
                for img_file in image_files:
                    self.images.append(str(img_file))
                    self.labels.append(letter)
        
        print(f"Loaded {len(self.images)} images across {len(set(self.labels))} classes")
        return self.images, self.labels
    
    def _create_sample_structure(self):
        """Create sample directory structure for demonstration"""
        for letter in ASL_CLASSES[:5]:  # Create folders for first 5 letters
            (self.data_path / letter).mkdir(parents=True, exist_ok=True)
        print(f"Created sample structure at {self.data_path}")
        print("Please add ASL alphabet images to the respective folders")
    
    def get_sample_distribution(self) -> pd.DataFrame:
        """Get distribution of samples per class"""
        if not self.labels:
            return pd.DataFrame()
        
        distribution = pd.Series(self.labels).value_counts().sort_index()
        return pd.DataFrame({
            'Letter': distribution.index,
            'Count': distribution.values
        })
    
    def encode_image_base64(self, image_path: str) -> str:
        """Encode image to base64 for API calls"""
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')
    
    def visualize_samples(self, n_samples: int = 5):
        """Visualize random samples from dataset"""
        if not self.images:
            print("No images loaded!")
            return
        
        sample_indices = np.random.choice(len(self.images), 
                                        min(n_samples, len(self.images)), 
                                        replace=False)
        
        fig, axes = plt.subplots(1, len(sample_indices), figsize=(15, 3))
        if len(sample_indices) == 1:
            axes = [axes]
        
        for idx, ax in zip(sample_indices, axes):
            img = Image.open(self.images[idx])
            ax.imshow(img)
            ax.set_title(f"Label: {self.labels[idx]}")
            ax.axis('off')
        
        plt.tight_layout()
        plt.show()

# Initialize data loader
data_loader = ASLDataLoader(CONFIG['data_path'])
images, labels = data_loader.load_dataset(max_per_class=CONFIG['evaluation']['max_samples_per_class'])

# Show sample distribution
if images:
    distribution = data_loader.get_sample_distribution()
    print("\nDataset Distribution:")
    print(distribution)
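
The loader accepts `.jpg`, `.jpeg`, and `.png` files, so a request that embeds an image may need the matching media type rather than a fixed one. The sketch below shows one way to return both the media type and the base64 payload using the standard `mimetypes` module. `encode_image_with_media_type` is a hypothetical helper, and falling back to `image/jpeg` for unknown extensions is an assumption.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path
from typing import Tuple


def encode_image_with_media_type(image_path: str) -> Tuple[str, str]:
    """Return (media_type, base64_data) for an image file.

    The media type is guessed from the file extension only; the file
    contents are not inspected.
    """
    guessed, _ = mimetypes.guess_type(image_path)
    media_type = guessed if guessed and guessed.startswith("image/") else "image/jpeg"
    data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return media_type, data


# Demonstration on a temporary file (the bytes are a placeholder, not a real image).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.png"
    sample.write_bytes(b"not a real image")
    media_type, encoded = encode_image_with_media_type(str(sample))

print(media_type, len(encoded))
```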


## LLM API Interfaces

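These classes provide a common interface to the different LLM providers for ASL alphabet classification: each evaluator's `evaluate_image` returns the predicted letter, the raw response, the response time, and the model name.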
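
Because every evaluator returns the same dictionary keys, an offline stand-in can be useful for dry-running the pipeline without API keys. The sketch below is a hypothetical `MockEvaluator` that guesses a random letter; it is not one of the notebook's evaluators and calls no external service. Random guessing over 26 letters also gives a chance-level baseline of 1/26, roughly 3.8% accuracy.

```python
import random
import time
from typing import Dict

LETTERS = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")


class MockEvaluator:
    """Offline stand-in: returns a random letter without calling any API."""

    def __init__(self, model_name: str = "mock-model", seed: int = 0):
        self.model_name = model_name
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def evaluate_image(self, image_path: str, prompt_style: str = "standard") -> Dict:
        start = time.time()
        letter = self._rng.choice(LETTERS)
        # Same keys as the result dictionaries built by the real evaluators.
        return {
            "predicted": letter,
            "raw_response": letter,
            "response_time": time.time() - start,
            "model": self.model_name,
        }


result = MockEvaluator().evaluate_image("A/example.jpg")
print(sorted(result))
```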


In [None]:
# 4. LLM API Interfaces
class BaseLLMEvaluator:
    """Base class for LLM evaluators"""
    
    def __init__(self, api_key: str, model_name: str):
        self.api_key = api_key
        self.model_name = model_name
        self.responses = []
        self.response_times = []
    
    def create_prompt(self, prompt_style: str = 'standard') -> str:
        """Create prompt for ASL classification"""
        prompts = {
            'standard': """Look at this image of an American Sign Language (ASL) hand sign. 
                          Identify which letter of the alphabet (A-Z) is being shown.
                          Respond with ONLY the single letter, nothing else.""",
            
            'detailed': """This image shows a hand gesture representing a letter in American Sign Language (ASL).
                          Please analyze the hand position, finger configuration, and orientation.
                          Identify which letter from A to Z is being signed.
                          Respond with only the letter (e.g., 'A', 'B', 'C', etc.).""",
            
            'few_shot': """You are an ASL alphabet classifier. Given an image of a hand sign, 
                          identify the letter being shown.
                          Examples of ASL signs:
                          - Closed fist with thumb to the side = 'A'
                          - Open palm with fingers together = 'B'
                          - Curved hand in C-shape = 'C'
                          
                          Now look at the provided image and respond with only the letter being signed.""",
            
            'chain_of_thought': """Analyze this ASL hand sign step by step:
                                   1. First, observe the overall hand position
                                   2. Note the finger configuration
                                   3. Check thumb position
                                   4. Identify the letter being signed
                                   
                                   Final answer (letter only):"""
        }
        return prompts.get(prompt_style, prompts['standard'])
    
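    def extract_letter_from_response(self, response: str) -> str:
        """Extract single letter from LLM response"""
        import re

        # Clean the response (None or empty responses count as unknown)
        response = (response or "").strip().upper()

        # If the response is just a single letter, return it
        if len(response) == 1 and response in ASL_CLASSES:
            return response

        # Otherwise look for standalone letters (e.g. "THE LETTER IS B.") and
        # take the last one, since the answer usually closes the sentence.
        # Taking the first letter of the text would turn "THE LETTER IS B" into 'T'.
        # This is a heuristic: articles such as "A" can still cause false matches.
        standalone = re.findall(r'\b[A-Z]\b', response)
        if standalone:
            return standalone[-1]

        return "?"  # Unknown if no letter found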
    
    def evaluate_image(self, image_path: str, prompt_style: str = 'standard') -> Dict:
        """Evaluate a single image - to be implemented by subclasses"""
        raise NotImplementedError


class OpenAIEvaluator(BaseLLMEvaluator):
    """OpenAI GPT-4V evaluator"""
    
    def __init__(self, api_key: str, model_name: str = "gpt-4-vision-preview"):
        super().__init__(api_key, model_name)
        self.client = openai.OpenAI(api_key=api_key)
    
    def evaluate_image(self, image_path: str, prompt_style: str = 'standard') -> Dict:
        """Evaluate ASL image using GPT-4V"""
        try:
            # Encode image and detect its media type (the loader also accepts PNG files)
            import mimetypes
            base64_image = data_loader.encode_image_base64(image_path)
            media_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
            
            # Create message
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": self.create_prompt(prompt_style)},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:{media_type};base64,{base64_image}"
                            }
                        }
                    ]
                }
            ]
            
            # Time the API call
            start_time = time.time()
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=messages,
                max_tokens=50,
                temperature=0
            )
            response_time = time.time() - start_time
            
            # Extract prediction
            raw_response = response.choices[0].message.content
            predicted_letter = self.extract_letter_from_response(raw_response)
            
            return {
                'predicted': predicted_letter,
                'raw_response': raw_response,
                'response_time': response_time,
                'model': self.model_name
            }
            
        except Exception as e:
            return {
                'predicted': '?',
                'raw_response': f"Error: {str(e)}",
                'response_time': 0,
                'model': self.model_name
            }


class AnthropicEvaluator(BaseLLMEvaluator):
    """Anthropic Claude evaluator"""
    
    def __init__(self, api_key: str, model_name: str = "claude-3-opus-20240229"):
        super().__init__(api_key, model_name)
        self.client = anthropic.Anthropic(api_key=api_key)
    
    def evaluate_image(self, image_path: str, prompt_style: str = 'standard') -> Dict:
        """Evaluate ASL image using Claude"""
        try:
            # Encode image and detect its media type (the loader also accepts PNG files)
            import mimetypes
            base64_image = data_loader.encode_image_base64(image_path)
            media_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
            
            # Create message
            start_time = time.time()
            message = self.client.messages.create(
                model=self.model_name,
                max_tokens=50,
                temperature=0,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "image",
                                "source": {
                                    "type": "base64",
                                    "media_type": media_type,
                                    "data": base64_image
                                }
                            },
                            {
                                "type": "text",
                                "text": self.create_prompt(prompt_style)
                            }
                        ]
                    }
                ]
            )
            response_time = time.time() - start_time
            
            # Extract prediction
            raw_response = message.content[0].text
            predicted_letter = self.extract_letter_from_response(raw_response)
            
            return {
                'predicted': predicted_letter,
                'raw_response': raw_response,
                'response_time': response_time,
                'model': self.model_name
            }
            
        except Exception as e:
            return {
                'predicted': '?',
                'raw_response': f"Error: {str(e)}",
                'response_time': 0,
                'model': self.model_name
            }


class GeminiEvaluator(BaseLLMEvaluator):
    """Google Gemini Pro Vision evaluator"""
    
    def __init__(self, api_key: str, model_name: str = "gemini-pro-vision"):
        super().__init__(api_key, model_name)
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel(model_name)
    
    def evaluate_image(self, image_path: str, prompt_style: str = 'standard') -> Dict:
        """Evaluate ASL image using Gemini"""
        try:
            # Load image
            img = Image.open(image_path)
            
            # Generate content
            start_time = time.time()
            response = self.model.generate_content([self.create_prompt(prompt_style), img])
            response_time = time.time() - start_time
            
            # Extract prediction
            raw_response = response.text
            predicted_letter = self.extract_letter_from_response(raw_response)
            
            return {
                'predicted': predicted_letter,
                'raw_response': raw_response,
                'response_time': response_time,
                'model': self.model_name
            }
            
        except Exception as e:
            return {
                'predicted': '?',
                'raw_response': f"Error: {str(e)}",
                'response_time': 0,
                'model': self.model_name
            }


# Initialize evaluators (only if API keys are set)
evaluators = {}

if CONFIG['api_keys']['openai'] != 'your-openai-key-here':
    evaluators['GPT-4V'] = OpenAIEvaluator(
        CONFIG['api_keys']['openai'],
        CONFIG['models']['gpt4v']
    )
    print("✓ OpenAI GPT-4V evaluator initialized")

if CONFIG['api_keys']['anthropic'] != 'your-anthropic-key-here':
    evaluators['Claude'] = AnthropicEvaluator(
        CONFIG['api_keys']['anthropic'],
        CONFIG['models']['claude']
    )
    print("✓ Anthropic Claude evaluator initialized")

if CONFIG['api_keys']['google'] != 'your-google-key-here':
    evaluators['Gemini'] = GeminiEvaluator(
        CONFIG['api_keys']['google'],
        CONFIG['models']['gemini']
    )
    print("✓ Google Gemini evaluator initialized")

if not evaluators:
    print("⚠️ No evaluators initialized. Please set API keys in the CONFIG section.")


In [None]:

# 5. Evaluation Pipeline
class ASLEvaluationPipeline:
    """Main pipeline for evaluating LLMs on ASL alphabet classification"""
    
    def __init__(self, evaluators: Dict, data_loader: ASLDataLoader):
        self.evaluators = evaluators
        self.data_loader = data_loader
        self.results = []
        
    def run_evaluation(self, 
                       images: List[str], 
                       labels: List[str],
                       prompt_styles: List[str] = ['standard'],
                       save_results: bool = True) -> pd.DataFrame:
        """
        Run evaluation across all models and images
        
        Args:
            images: List of image paths
            labels: List of true labels
            prompt_styles: List of prompt styles to test
            save_results: Whether to save results to file
            
        Returns:
            DataFrame with evaluation results
        """
        
        if not images:
            print("No images to evaluate!")
            return pd.DataFrame()
        
        total_evaluations = len(images) * len(self.evaluators) * len(prompt_styles)
        print(f"Starting evaluation: {total_evaluations} total predictions")
        print(f"Images: {len(images)}, Models: {len(self.evaluators)}, Prompt styles: {len(prompt_styles)}")
        print("-" * 50)
        
        with tqdm(total=total_evaluations, desc="Evaluating") as pbar:
            for prompt_style in prompt_styles:
                for img_path, true_label in zip(images, labels):
                    for model_name, evaluator in self.evaluators.items():
                        # Evaluate image
                        result = evaluator.evaluate_image(img_path, prompt_style)
                        
                        # Store result
                        self.results.append({
                            'image_path': img_path,
                            'true_label': true_label,
                            'predicted_label': result['predicted'],
                            'model': model_name,
                            'prompt_style': prompt_style,
                            'raw_response': result['raw_response'],
                            'response_time': result['response_time'],
                            'correct': result['predicted'] == true_label,
                            'timestamp': datetime.now().isoformat()
                        })
                        
                        pbar.update(1)
                        
                        # Small delay to avoid rate limiting
                        time.sleep(0.1)
        
        # Convert to DataFrame
        results_df = pd.DataFrame(self.results)
        
        # Save results
        if save_results and len(results_df) > 0:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            results_path = Path(CONFIG['results_path']) / f"asl_evaluation_{timestamp}.csv"
            results_df.to_csv(results_path, index=False)
            print(f"\nResults saved to: {results_path}")
        
        return results_df
    
    def run_single_image_test(self, image_path: str, true_label: Optional[str] = None) -> pd.DataFrame:
        """Test all models on a single image"""
        results = []
        
        print(f"\nTesting image: {image_path}")
        if true_label:
            print(f"True label: {true_label}")
        print("-" * 30)
        
        for model_name, evaluator in self.evaluators.items():
            result = evaluator.evaluate_image(image_path, 'standard')
            print(f"{model_name}: {result['predicted']} (Time: {result['response_time']:.2f}s)")
            
            results.append({
                'model': model_name,
                'predicted': result['predicted'],
                'response_time': result['response_time'],
                'raw_response': result['raw_response']
            })
        
        return pd.DataFrame(results)
    
    def compare_prompt_strategies(self, sample_images: List[str], sample_labels: List[str]) -> pd.DataFrame:
        """Compare different prompting strategies on a sample of images"""
        prompt_styles = ['standard', 'detailed', 'few_shot', 'chain_of_thought']
        results = []
        
        print("\nComparing prompt strategies...")
        print(f"Testing {len(sample_images)} images with {len(prompt_styles)} prompt styles")
        
        for img_path, true_label in zip(sample_images, sample_labels):
            for prompt_style in prompt_styles:
                for model_name, evaluator in self.evaluators.items():
                    result = evaluator.evaluate_image(img_path, prompt_style)
                    results.append({
                        'true_label': true_label,
                        'predicted_label': result['predicted'],
                        'model': model_name,
                        'prompt_style': prompt_style,
                        'correct': result['predicted'] == true_label,
                        'response_time': result['response_time']
                    })
        
        return pd.DataFrame(results)

# Initialize evaluation pipeline
pipeline = ASLEvaluationPipeline(evaluators, data_loader)

print("Evaluation pipeline ready!")
print(f"Available models: {list(evaluators.keys()) if evaluators else 'None'}")
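
Before starting a full run it can help to estimate its size. `run_evaluation` makes one API call per image, model, and prompt style, and sleeps 0.1 s after each call. The sketch below reproduces that arithmetic; `estimate_run` is a hypothetical helper, and the example numbers assume all 26 letters have 10 images each and three models are configured.

```python
from typing import Tuple


def estimate_run(n_images: int, n_models: int, n_prompt_styles: int,
                 delay_s: float = 0.1) -> Tuple[int, float]:
    """Return (total API calls, seconds spent in the fixed delay between calls)."""
    total_calls = n_images * n_models * n_prompt_styles
    min_sleep_s = total_calls * delay_s
    return total_calls, min_sleep_s


# 26 letters x 10 images each, 3 models, 1 prompt style.
total_calls, sleep_s = estimate_run(n_images=26 * 10, n_models=3, n_prompt_styles=1)
print(f"{total_calls} calls, about {sleep_s:.0f} s of fixed delay")
```

The delay figure is a lower bound on wall-clock time, since it excludes the API response times themselves.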


In [None]:
# 6. Metrics and Visualization
class MetricsAnalyzer:
    """Calculate and visualize evaluation metrics"""
    
    def __init__(self, results_df: pd.DataFrame):
        self.results = results_df
        
    def calculate_overall_metrics(self) -> pd.DataFrame:
        """Calculate overall accuracy metrics per model"""
        if self.results.empty:
            return pd.DataFrame()
        
        metrics = []
        for model in self.results['model'].unique():
            model_results = self.results[self.results['model'] == model]
            
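            # Calculate metrics
            accuracy = model_results['correct'].mean() * 100
            # Failed API calls are recorded with response_time 0; exclude them from the average
            timed = model_results.loc[model_results['response_time'] > 0, 'response_time']
            avg_response_time = timed.mean()
            total_predictions = len(model_results)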
            
            metrics.append({
                'Model': model,
                'Accuracy (%)': round(accuracy, 2),
                'Avg Response Time (s)': round(avg_response_time, 2),
                'Total Predictions': total_predictions
            })
        
        return pd.DataFrame(metrics).sort_values('Accuracy (%)', ascending=False)
    
    def calculate_per_class_accuracy(self, model_name: str) -> pd.DataFrame:
        """Calculate per-class accuracy for a specific model"""
        model_results = self.results[self.results['model'] == model_name]
        
        if model_results.empty:
            return pd.DataFrame()
        
        # Calculate accuracy per class
        class_metrics = []
        for letter in ASL_CLASSES:
            class_results = model_results[model_results['true_label'] == letter]
            if len(class_results) > 0:
                accuracy = class_results['correct'].mean() * 100
                class_metrics.append({
                    'Letter': letter,
                    'Accuracy (%)': round(accuracy, 2),
                    'Samples': len(class_results)
                })
        
        return pd.DataFrame(class_metrics)
    
    def plot_model_comparison(self):
        """Create bar plot comparing model accuracies"""
        metrics = self.calculate_overall_metrics()
        
        if metrics.empty:
            print("No data to plot!")
            return
        
        plt.figure(figsize=(10, 6))
        bars = plt.bar(metrics['Model'], metrics['Accuracy (%)'])
        
        # Color code based on performance
        colors = ['green' if acc >= 80 else 'orange' if acc >= 60 else 'red' 
                 for acc in metrics['Accuracy (%)']]
        for bar, color in zip(bars, colors):
            bar.set_color(color)
        
        plt.xlabel('Model', fontsize=12)
        plt.ylabel('Accuracy (%)', fontsize=12)
        plt.title('ASL Alphabet Classification Accuracy by Model', fontsize=14, fontweight='bold')
        plt.ylim(0, 105)
        
        # Add value labels on bars
        for bar in bars:
            height = bar.get_height()
            plt.text(bar.get_x() + bar.get_width()/2., height + 1,
                    f'{height:.1f}%', ha='center', va='bottom')
        
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()
    
    def plot_confusion_matrix(self, model_name: str):
        """Plot confusion matrix for a specific model"""
        model_results = self.results[self.results['model'] == model_name]
        
        if model_results.empty:
            print(f"No results for model: {model_name}")
            return
        
        # Get true and predicted labels
        y_true = model_results['true_label'].values
        y_pred = model_results['predicted_label'].values
        
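        # Create confusion matrix; '?' is kept as an extra label so that
        # unparsed or failed predictions remain visible instead of being dropped
        cm_labels = ASL_CLASSES + ['?']
        cm = confusion_matrix(y_true, y_pred, labels=cm_labels)
        
        # Plot
        plt.figure(figsize=(20, 18))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                   xticklabels=cm_labels, yticklabels=cm_labels,
                   cbar_kws={'label': 'Count'})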
        plt.xlabel('Predicted Label', fontsize=12)
        plt.ylabel('True Label', fontsize=12)
        plt.title(f'Confusion Matrix - {model_name}', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
    
    def plot_per_class_performance(self):
        """Plot per-class accuracy for all models"""
        if self.results.empty:
            print("No data to plot!")
            return

        models = self.results['model'].unique()
        fig, axes = plt.subplots(1, len(models), 
                                figsize=(15, 6), sharey=True)
        
        if len(models) == 1:
            axes = [axes]
        
        for ax, model in zip(axes, self.results['model'].unique()):
            class_metrics = self.calculate_per_class_accuracy(model)
            
            if not class_metrics.empty:
                bars = ax.bar(class_metrics['Letter'], class_metrics['Accuracy (%)'])
                
                # Color code bars
                for bar, acc in zip(bars, class_metrics['Accuracy (%)']):
                    color = 'green' if acc >= 80 else 'orange' if acc >= 60 else 'red'
                    bar.set_color(color)
                
                ax.set_xlabel('ASL Letter', fontsize=10)
                ax.set_title(model, fontsize=12, fontweight='bold')
                ax.set_ylim(0, 105)
                ax.grid(axis='y', alpha=0.3)
                
                # Rotate x labels for better readability
                ax.tick_params(axis='x', labelrotation=45)
        
        axes[0].set_ylabel('Accuracy (%)', fontsize=12)
        plt.suptitle('Per-Class Accuracy Comparison', fontsize=14, fontweight='bold', y=1.02)
        plt.tight_layout()
        plt.show()
    
    def plot_response_time_comparison(self):
        """Plot response time comparison across models"""
        plt.figure(figsize=(10, 6))
        
        # Create box plot
        models = self.results['model'].unique()
        response_times = [self.results[self.results['model'] == model]['response_time'].values 
                         for model in models]
        
        bp = plt.boxplot(response_times, labels=models, patch_artist=True)
        
        # Color the boxes
        colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']
        for patch, color in zip(bp['boxes'], colors[:len(models)]):
            patch.set_facecolor(color)
        
        plt.xlabel('Model', fontsize=12)
        plt.ylabel('Response Time (seconds)', fontsize=12)
        plt.title('Response Time Distribution by Model', fontsize=14, fontweight='bold')
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()
    
    def plot_prompt_strategy_comparison(self):
        """Compare performance across different prompt strategies"""
        if 'prompt_style' not in self.results.columns:
            print("No prompt strategy comparison data available")
            return
        
        # Calculate accuracy for each model-prompt combination
        pivot_data = self.results.pivot_table(
            values='correct',
            index='prompt_style',
            columns='model',
            aggfunc='mean'
        ) * 100
        
        # Plot
        ax = pivot_data.plot(kind='bar', figsize=(12, 6), rot=45)
        plt.xlabel('Prompt Strategy', fontsize=12)
        plt.ylabel('Accuracy (%)', fontsize=12)
        plt.title('Performance by Prompt Strategy', fontsize=14, fontweight='bold')
        plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()
    
    def generate_classification_report(self, model_name: str):
        """Generate detailed classification report for a model"""
        model_results = self.results[self.results['model'] == model_name]
        
        if model_results.empty:
            print(f"No results for model: {model_name}")
            return
        
        # Get classification report
        report = classification_report(
            model_results['true_label'],
            model_results['predicted_label'],
            labels=ASL_CLASSES,
            zero_division=0
        )
        
        print(f"\nClassification Report - {model_name}")
        print("=" * 50)
        print(report)
        
        return report
    
    def identify_common_mistakes(self, model_name: str, top_n: int = 10):
        """Identify most common misclassifications for a model"""
        model_results = self.results[
            (self.results['model'] == model_name) & 
            ~self.results['correct']
        ]
        
        if model_results.empty:
            print(f"No mistakes found for {model_name}!")
            return pd.DataFrame()
        
        # Create mistake pairs
        mistakes = model_results.groupby(['true_label', 'predicted_label']).size().reset_index(name='count')
        mistakes = mistakes.sort_values('count', ascending=False).head(top_n)
        mistakes['mistake_pair'] = mistakes['true_label'] + ' → ' + mistakes['predicted_label']
        
        print(f"\nTop {top_n} Common Mistakes - {model_name}")
        print("=" * 40)
        for _, row in mistakes.iterrows():
            print(f"{row['mistake_pair']}: {row['count']} times")
        
        return mistakes

# Example usage (will only work if evaluation has been run)
if 'results_df' in locals() and not results_df.empty:
    analyzer = MetricsAnalyzer(results_df)
    
    # Show overall metrics
    print("\nOverall Model Performance:")
    print(analyzer.calculate_overall_metrics())
    
    # Create visualizations
    analyzer.plot_model_comparison()
else:
    print("Run evaluation first to generate metrics and visualizations")
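
The evaluators return `'?'` both when an API call fails and when no letter can be extracted from the response, and the pipeline counts those rows as incorrect. When reading the accuracy figures it may help to separate "no answer" from "wrong letter". The sketch below does this on plain dictionaries with the same `true_label` and `predicted_label` keys as the results table; `summarize_outcomes` is a hypothetical helper and the rows are made up.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_outcomes(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count correct answers, wrong letters, and rows with no usable answer ('?')."""
    outcomes = Counter()
    for row in rows:
        if row["predicted_label"] == "?":
            outcomes["no_answer"] += 1
        elif row["predicted_label"] == row["true_label"]:
            outcomes["correct"] += 1
        else:
            outcomes["wrong_letter"] += 1
    return outcomes


# Made-up rows for illustration only.
rows = [
    {"true_label": "A", "predicted_label": "A"},
    {"true_label": "B", "predicted_label": "?"},
    {"true_label": "C", "predicted_label": "O"},
    {"true_label": "D", "predicted_label": "D"},
]

summary = summarize_outcomes(rows)
answered = summary["correct"] + summary["wrong_letter"]
acc_all = summary["correct"] / len(rows)
acc_answered = summary["correct"] / answered

print(dict(summary))
print(f"accuracy over all rows: {acc_all:.2f}")
print(f"accuracy over answered rows: {acc_answered:.2f}")
```

A large gap between the two accuracy values suggests that API errors or unparseable responses, rather than misrecognized signs, dominate the error count.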


## Summary and Next Steps

### Key Insights from Evaluation
After running the evaluation, you should have insight into:
1. **Overall Accuracy**: Which LLM performs best on ASL alphabet recognition
2. **Per-Class Performance**: Which letters are easiest/hardest for LLMs to identify
3. **Common Confusions**: Which letter pairs are frequently misclassified
4. **Response Times**: Speed vs accuracy trade-offs
5. **Prompt Engineering**: Which prompting strategies work best

### Research Findings
Document your findings regarding:
- Current state of LLM vision capabilities for accessibility
- Specific challenges in ASL recognition
- Recommendations for improving LLM performance on sign language

### Next Steps
1. **Expand Dataset**: Test with more diverse ASL images (different backgrounds, hand positions, lighting)
2. **Test More Models**: Include additional LLMs as they become available
3. **Fine-tuning**: Consider fine-tuning models specifically for ASL recognition
4. **Real-world Testing**: Test with real ASL users and varying hand appearances
5. **Extended Alphabet**: Include numbers, common words, and phrases beyond just letters

### Export Results
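Results are automatically saved as timestamped CSV files (`asl_evaluation_<YYYYMMDD_HHMMSS>.csv`) in the `evaluation_results` folder whenever `run_evaluation` is called with `save_results=True` (the default), so they can be analyzed further later.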
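
To pick up a saved run later, the newest file can be found by name, because the `%Y%m%d_%H%M%S` timestamp in the filename sorts chronologically. The sketch below uses only the standard library; `latest_results_file` and `load_rows` are hypothetical helpers, and the demonstration writes two tiny placeholder files to a temporary folder instead of touching real results.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def latest_results_file(results_dir: str) -> Optional[Path]:
    """Return the newest asl_evaluation_*.csv in results_dir, or None if there is none."""
    files = sorted(Path(results_dir).glob("asl_evaluation_*.csv"))
    return files[-1] if files else None


def load_rows(csv_path: Path) -> List[Dict[str, str]]:
    """Read a results CSV into a list of dicts (all values are strings)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("20240101_090000", "20240102_090000"):
        path = Path(tmp) / f"asl_evaluation_{stamp}.csv"
        path.write_text("model,correct\nmock,True\n", encoding="utf-8")
    newest = latest_results_file(tmp)
    newest_name = newest.name
    loaded = load_rows(newest)

print(newest_name, loaded)
```

With pandas available, `pd.read_csv(newest)` would be the more natural way to reload the file into the same table shape that `MetricsAnalyzer` expects.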
