# WebArena Evaluation - Open Source Models

This notebook compares four open-source LLMs with LCA on WebArena tasks:
- **Llama 3.1 8B Instruct**
- **Gemma 2 9B Instruct**
- **Qwen 2.5 7B Instruct**
- **Phi-3 Mini 4K**
- **LCA** (Multi-agent coordination)

## Setup Instructions

1. **Runtime**: Set to GPU (Runtime → Change runtime type → GPU)
2. **Upload**: Upload `webarena_task.json` to Colab files
3. **Run**: Execute cells in order

In [None]:
# Check GPU
!nvidia-smi

In [None]:
# Install dependencies
!pip install -q transformers accelerate bitsandbytes selenium pandas scipy torch

In [None]:
# Setup Chrome for browser automation
!apt-get update
!apt-get install -y chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
print("✓ Chrome installed")

In [None]:
# Upload webarena_task.json
from google.colab import files
print("Please upload webarena_task.json:")
uploaded = files.upload()
print(f"✓ Uploaded: {list(uploaded.keys())}")
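
Before starting a long run, it can help to confirm that the uploaded file actually parses. The sketch below is only a sanity check under stated assumptions: the helper name `load_tasks` is hypothetical, and the schema of `webarena_task.json` is not documented here, so it checks only that the top-level JSON value is a non-empty list or object.

```python
import json
from pathlib import Path


def load_tasks(path="webarena_task.json"):
    """Parse the task file and return its top-level JSON value.

    Hypothetical helper: it assumes only that the file is valid JSON
    whose top level is a non-empty list or object.
    """
    task_file = Path(path)
    if not task_file.is_file():
        raise FileNotFoundError(f"{path} not found; re-run the upload cell")
    with task_file.open(encoding="utf-8") as fh:
        tasks = json.load(fh)  # raises json.JSONDecodeError on malformed input
    if not isinstance(tasks, (list, dict)) or len(tasks) == 0:
        raise ValueError(f"{path} parsed but appears to contain no tasks")
    return tasks
```

Calling `load_tasks()` in a cell after the upload should raise an error early if the file is missing, malformed, or empty, rather than partway through the evaluation.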

In [None]:
# Download the evaluation script (replace YOUR_USERNAME with the repository owner)
!wget -q https://raw.githubusercontent.com/YOUR_USERNAME/Three-tier-memory/main/webarena_evaluation_opensource.py
print("✓ Script downloaded")

## Alternative: Paste Script Directly

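If the download from GitHub fails, paste the full contents of `webarena_evaluation_opensource.py` into a new cell that begins with `%%writefile webarena_evaluation_opensource.py`, run that cell to save the file, and then run the evaluation: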

In [None]:
# Run the evaluation
!python webarena_evaluation_opensource.py

## Configuration Options

Edit these in the script before running:

```python
N_TRIALS = 10        # Trials per task (10 for full evaluation)
MAX_TASKS = 50       # Number of tasks (50 for paper)
n_agents = 5         # LCA agents (5 for paper)
```
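
These settings multiply, so it is worth estimating the size of a run before launching it. The sketch below is a rough count, assuming every system (the four models plus LCA listed at the top) attempts every task in every trial; `N_SYSTEMS` and `total_runs` are hypothetical names, not settings from the script.

```python
N_TRIALS = 10   # trials per task, as in the configuration above
MAX_TASKS = 50  # number of tasks, as in the configuration above
N_SYSTEMS = 5   # hypothetical name: four open-source models plus LCA


def total_runs(n_trials, max_tasks, n_systems):
    """Number of task attempts, assuming a full trials x tasks x systems grid."""
    return n_trials * max_tasks * n_systems


print(total_runs(N_TRIALS, MAX_TASKS, N_SYSTEMS))  # 2500
```

Under the same assumption, a smoke test with `N_TRIALS = 2` and `MAX_TASKS = 5` would come to 50 attempts, which is a more practical first run on a fresh Colab session.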

## Download Results

After evaluation completes, download the results:

In [None]:
# Download results
from google.colab import files
import os

results_dir = 'webarena_results'
if os.path.exists(results_dir):
    for filename in os.listdir(results_dir):
        filepath = os.path.join(results_dir, filename)
        print(f"Downloading {filename}...")
        files.download(filepath)
    print("✓ All results downloaded")
else:
    print("❌ No results directory found")

## Quick Analysis

View results summary:

In [None]:
import glob
import os

import pandas as pd

# Find the most recently written CSV (by modification time, not by filename)
csv_files = glob.glob('webarena_results/webarena_opensource_*.csv')
if csv_files:
    latest_csv = max(csv_files, key=os.path.getmtime)
    df = pd.read_csv(latest_csv)
    
    print("\n" + "="*60)
    print("RESULTS SUMMARY")
    print("="*60)
    
    summary = df.groupby('agent').agg({
        'success': ['mean', 'std', 'count'],
        'time': ['mean', 'std'],
        'quality': ['mean', 'std']
    }).round(3)
    
    print(summary)
    
    # Plot success rates
    import matplotlib.pyplot as plt
    
    agents = df['agent'].unique()
    success_rates = [df[df['agent'] == agent]['success'].mean() for agent in agents]
    
    plt.figure(figsize=(10, 6))
    plt.bar(agents, success_rates)
    plt.xlabel('Agent')
    plt.ylabel('Success Rate')
    plt.title('WebArena Success Rates by Agent')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
else:
    print("❌ No results found")
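
Mean success rates from a limited number of trials can look further apart than the data supports. As one way to put error bars on them, the sketch below computes a Wilson score interval for a binomial proportion using only the standard library. This is not necessarily the analysis the evaluation script performs; the function name `wilson_interval` is hypothetical, and the sketch treats each run as an independent success/failure outcome.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a success proportion.

    z=1.96 corresponds to an approximate 95% interval.
    Assumes n independent trials with binary outcomes.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

With the `df` loaded above, and assuming `success` holds 0/1 values, the per-agent counts could be obtained with `df.groupby('agent')['success'].agg(['sum', 'count'])` and passed to `wilson_interval` row by row.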

## Troubleshooting

### Out of Memory
- Use fewer agents: `n_agents=3` instead of 5
- Reduce trials: `N_TRIALS=5` instead of 10
- Run models one at a time
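
To judge which of these steps is likely to help, a back-of-the-envelope estimate of weight memory is useful. The sketch below multiplies a nominal parameter count by bytes per parameter. The parameter counts are read off the model names (3.8B for Phi-3 Mini) and are approximate; the estimate covers weights only and ignores activations, the KV cache, and framework overhead. The dependencies cell installs `bitsandbytes`, which supports 8-bit and 4-bit loading, but whether the script uses it is an assumption to verify.

```python
# Approximate bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "nf4": 0.5}

# Nominal sizes in billions of parameters, taken from the model names.
NOMINAL_PARAMS_B = {
    "Llama 3.1 8B Instruct": 8.0,
    "Gemma 2 9B Instruct": 9.0,
    "Qwen 2.5 7B Instruct": 7.0,
    "Phi-3 Mini 4K": 3.8,
}


def weight_memory_gb(params_billion, precision="fp16"):
    """Rough weight footprint in GB (10^9 bytes); excludes activations and cache."""
    return params_billion * BYTES_PER_PARAM[precision]


for name, size in NOMINAL_PARAMS_B.items():
    estimates = {p: round(weight_memory_gb(size, p), 1) for p in BYTES_PER_PARAM}
    print(f"{name}: {estimates}")
```

By this estimate, keeping several of these models resident in half precision at once would exceed a single typical Colab GPU, which is why running models one at a time is usually the most effective of the three steps.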

### Chrome Driver Issues
Re-run the setup commands in a notebook cell (the `!` prefix is Colab shell syntax):

```bash
!apt-get update
!apt-get install -y chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
```
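
After reinstalling, a quick check that the driver is on `PATH` can rule out a silent failure of the copy step. The sketch below uses only the standard library; `find_chromedriver` is a hypothetical helper, and the binary name `chromedriver` is taken from the commands above.

```python
import shutil


def find_chromedriver(candidates=("chromedriver",)):
    """Return the first candidate binary found on PATH, or None if absent."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


driver_path = find_chromedriver()
if driver_path:
    print(f"chromedriver found at {driver_path}")
else:
    print("chromedriver not found on PATH; re-run the setup commands")
```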

### Model Loading Fails
- Check Hugging Face Hub status
- Try a different model (some require access approval)
- Ensure GPU runtime is enabled