# Wordle Model Evaluation: Base vs Fine-tuned

This notebook evaluates two models on the Wordle game:
1. **Base Model**: Qwen/Qwen3-1.7B (general purpose model)
2. **Fine-tuned Model**: willcb/Qwen3-1.7B-Wordle (GRPO-trained on Wordle)

The fine-tuned model was trained using TRL's GRPO (Group Relative Policy Optimization) with specialized reward functions for format compliance, strategic feedback usage, and information gain maximization.
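
As a rough illustration of the idea behind GRPO, the sketch below scores a group of completions with a toy format-compliance reward and converts the rewards into group-relative advantages (each reward minus the group mean, divided by the group standard deviation). The `<guess>` tag format, the function names, and the use of the population standard deviation are assumptions made for this example; the reward functions and normalization used to train this model may differ.

```python
import re
from statistics import mean, pstdev
from typing import List


def format_reward(completion: str) -> float:
    """Toy reward: 1.0 if the text contains a 5-letter guess in <guess> tags (assumed format)."""
    pattern = r"<guess>\s*[A-Za-z]{5}\s*</guess>"
    return 1.0 if re.search(pattern, completion) else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical completions for a single prompt
completions = [
    "<guess>CRANE</guess>",
    "I think the word is crane",
    "<guess>toolong</guess>",
    "<guess> slate </guess>",
]
rewards = [format_reward(c) for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so the signal is relative to the other samples for the same prompt rather than to an absolute reward scale.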

In [2]:
import os
import pandas as pd
import numpy as np
import torch
import random
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
import warnings
warnings.filterwarnings('ignore')

# Import our local utils
from utils_local import (
    GuessWithFeedback, next_turn, play_full_game, 
    extract_guess, get_feedback
)

# Set style for plots
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Load Models

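We'll load both models with `torch_dtype="auto"` and `device_map="auto"`, so the weights keep the precision stored in the checkpoint and are placed on the available device automatically.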

In [3]:
print("🔄 Loading models...")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load base model
base_model_id = "Qwen/Qwen3-1.7B"
print(f"\nLoading base model: {base_model_id}")
base_tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype="auto",
    device_map="auto"
)
base_tokenizer.pad_token = base_tokenizer.eos_token
print("✅ Base model loaded")

# Load fine-tuned model
ft_model_id = "willcb/Qwen3-1.7B-Wordle"
print(f"\nLoading fine-tuned model: {ft_model_id}")
ft_tokenizer = AutoTokenizer.from_pretrained(ft_model_id)
ft_model = AutoModelForCausalLM.from_pretrained(
    ft_model_id,
    torch_dtype="auto",
    device_map="auto"
)
ft_tokenizer.pad_token = ft_tokenizer.eos_token
print("✅ Fine-tuned model loaded")

print("\n🎯 Both models loaded successfully!")

🔄 Loading models...
Using device: cuda

Loading base model: Qwen/Qwen3-1.7B


✅ Base model loaded

Loading fine-tuned model: willcb/Qwen3-1.7B-Wordle


✅ Fine-tuned model loaded

🎯 Both models loaded successfully!


## 2. Demonstration: How the Game Works

Let's demonstrate 1-2 turns of Wordle gameplay using the base model to show how the utility functions work.

In [4]:
# Demo game with base model
print("=" * 60)
print("DEMO: Playing Wordle with Base Model")
print("=" * 60)

# Secret word for demonstration
demo_secret = "BRAIN"
print(f"\nSecret word: {demo_secret} (hidden from model)\n")

# Initialize game state
demo_guesses = []

# Play first turn
print("\n--- Turn 1 ---")
result = next_turn(base_model, base_tokenizer, demo_guesses, demo_secret, verbose=True)

# Play second turn if game continues
if result is None:  # Game not finished
    print("\n--- Turn 2 ---")
    result = next_turn(base_model, base_tokenizer, demo_guesses, demo_secret, verbose=True)

print("\n" + "=" * 60)
print("Demo complete!")
print("=" * 60)

DEMO: Playing Wordle with Base Model

Secret word: BRAIN (hidden from model)


--- Turn 1 ---


### Step 1: Analyze the first guess
Let's assume the first guess is "STORM" and the feedback is "S(-) T(x) O(x) R(-) M(x)".

- S is in the correct position (1st letter).
- T is in the word but in the wrong position (2nd letter).
- O is in the word but in the wrong position (3rd letter).
- R is in the correct position (4th letter).
- M is in the word but in the wrong position (5th letter).

So far, we have:
- S (1st)
- R (4th)

### Step 2: Analyze the second guess
Let's assume the second guess is "BRAVE" and the feedback is "B(✓) R(✓) A(x) V(x) E(x)".

- B is in the correct position (1st letter).
- R is in the correct position (2nd letter).
- A, V, and E are not in the word.

So far, we have:
- B (1st)
- R (2nd)

### Step 3: Analyze the third guess
Let's assume the third guess is "BRISK" and the feedback is "B(✓) R(✓) I(✓) S(✓) K(✓)".

- B is in the correct position (1st letter).
- R is in th

In [6]:
# Feel free to play more turns of the game
# next_turn(base_model, base_tokenizer, demo_guesses, demo_secret, verbose=True)
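
For reference, the feedback strings that appear in the transcript above (for example `B(✓) R(✓) A(x) V(x) E(x)`) can be produced by a scoring routine along the following lines. This is a minimal sketch, not the `get_feedback` function from `utils_local`, whose implementation is not shown here. It assumes `✓` means correct letter and position, `-` means the letter is in the word but in a different position, and `x` means the letter is absent. The name `wordle_feedback` is hypothetical.

```python
from collections import Counter


def wordle_feedback(guess: str, secret: str) -> str:
    """Score a guess Wordle-style (hypothetical helper, not utils_local.get_feedback)."""
    guess, secret = guess.upper(), secret.upper()
    if len(guess) != len(secret):
        raise ValueError("guess and secret must have the same length")

    marks = ["x"] * len(guess)
    # Secret letters that were not matched exactly stay available for "-" marks.
    remaining = Counter(s for g, s in zip(guess, secret) if g != s)

    # First pass: exact matches.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            marks[i] = "✓"

    # Second pass: right letter in the wrong position, limited by the remaining count
    # so that repeated letters are not over-credited.
    for i, g in enumerate(guess):
        if marks[i] == "x" and remaining[g] > 0:
            marks[i] = "-"
            remaining[g] -= 1

    return " ".join(f"{g}({m})" for g, m in zip(guess, marks))


print(wordle_feedback("STORM", "BRAIN"))
print(wordle_feedback("BRAVE", "BRAIN"))
```

The two-pass structure matters for words with repeated letters: exact matches are claimed first, and only the leftover letters of the secret can earn a `-` mark.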

## 3. Load Evaluation Dataset

We'll load the valid Wordle solutions and sample 10 five-letter words for evaluation.

In [26]:
# Load valid Wordle solutions
solutions_df = pd.read_csv('valid_solutions.csv')

# Sample 10 words for evaluation
random.seed(42)  # Seeds Python's `random` only; the pandas sample below is fixed by random_state
eval_words = solutions_df['word'].str.upper().sample(n=10, random_state=1).tolist()

print(f"Loaded {len(eval_words)} words for evaluation")
print(f"\nFirst 10 evaluation words: {eval_words[:10]}")

Loaded 10 words for evaluation

First 10 evaluation words: ['BRACE', 'SHAKE', 'ALLEY', 'MANIA', 'BICEP', 'DEBUG', 'SANER', 'DEALT', 'CELLO', 'CLOWN']


## 4. Evaluation Function

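Create a function to evaluate a model on multiple Wordle games. Each game allows up to 6 guesses, and a game that raises an exception is recorded as a loss with 6 guesses.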

In [27]:
def evaluate_model(model, tokenizer, words, model_name="Model"):
    """
    Evaluate a model on multiple Wordle games.
    
    Returns:
        dict: Results including guesses per game, success rate, etc.
    """
    results = {
        'words': [],
        'num_guesses': [],
        'success': []
    }
    
    print(f"\nEvaluating {model_name}...")
    for word in tqdm(words, desc=model_name):
        try:
            num_guesses, success = play_full_game(
                model, tokenizer, word, 
                max_guesses=6, verbose=False
            )
            results['words'].append(word)
            results['num_guesses'].append(num_guesses)
            results['success'].append(success)
        except Exception as e:
            print(f"\nError with word {word}: {e}")
            results['words'].append(word)
            results['num_guesses'].append(6)
            results['success'].append(False)
    
    return pd.DataFrame(results)
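
`evaluate_model` relies on `play_full_game` returning a `(num_guesses, success)` pair. Since `play_full_game` lives in `utils_local` and is not shown here, the sketch below uses a hypothetical stand-in, `play_game`, to illustrate that contract without loading a model. The scripted guesser simply replays a fixed list of words and ignores feedback, which a real model-driven loop would not do.

```python
from typing import Callable, List, Tuple


def play_game(
    secret: str,
    guesser: Callable[[List[str]], str],
    max_guesses: int = 6,
) -> Tuple[int, bool]:
    """Run one game and return (guesses_used, solved). Hypothetical stand-in for play_full_game."""
    history: List[str] = []
    for turn in range(1, max_guesses + 1):
        guess = guesser(history).upper()
        history.append(guess)
        if guess == secret.upper():
            return turn, True
    # Out of guesses: report the maximum and a failure flag.
    return max_guesses, False


def scripted_guesser(script: List[str]) -> Callable[[List[str]], str]:
    """Build a guesser that replays a fixed word list, repeating the last word if it runs out."""
    def guess(history: List[str]) -> str:
        return script[min(len(history), len(script) - 1)]
    return guess


print(play_game("BRAIN", scripted_guesser(["STORM", "BRAVE", "BRAIN"])))
print(play_game("BRAIN", scripted_guesser(["STORM"])))
```

A stub like this can be handy for checking the bookkeeping in an evaluation loop before spending GPU time on real games.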

## 5. Run Evaluation

Evaluate both models on the same set of words.

In [29]:
# Evaluate base model
base_results = evaluate_model(base_model, base_tokenizer, eval_words, "Base Model (Qwen3-1.7B)")

# Evaluate fine-tuned model
ft_results = evaluate_model(ft_model, ft_tokenizer, eval_words, "Fine-tuned Model (GRPO)")

print("\n✅ Evaluation complete!")


Evaluating Base Model (Qwen3-1.7B)...


Base Model (Qwen3-1.7B): 100%|██████████| 10/10 [10:40<00:00, 64.10s/it]



Evaluating Fine-tuned Model (GRPO)...


Fine-tuned Model (GRPO): 100%|██████████| 10/10 [16:22<00:00, 98.23s/it] 


✅ Evaluation complete!





## 6. Calculate Metrics

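Calculate the two key metrics:
1. **Average number of guesses** needed to solve, computed over successful games only. If a model wins no games, the code below reports a placeholder of 6.0 rather than a measured average.
2. **Pass rate**: the percentage of games solved within 6 guesses.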

In [43]:
def calculate_metrics(results_df, model_name):
    """Calculate and display metrics for a model."""
    successful_games = results_df[results_df['success'] == True]
    
    metrics = {
        'Model': model_name,
        'Pass Rate (%)': (results_df['success'].sum() / len(results_df)) * 100,
        'Avg Guesses (Success)': successful_games['num_guesses'].mean() if len(successful_games) > 0 else 6.0,
        'Std Guesses (Success)': successful_games['num_guesses'].std() if len(successful_games) > 0 else 0.0,
        'Games Played': len(results_df),
        'Games Won': results_df['success'].sum()
    }
    
    return metrics

# Calculate metrics for both models
base_metrics = calculate_metrics(base_results, "Base Model")
ft_metrics = calculate_metrics(ft_results, "Fine-tuned Model")

# Create comparison table
metrics_df = pd.DataFrame([base_metrics, ft_metrics])
print("\n📊 Model Performance Metrics")
print("=" * 60)
print(metrics_df.to_string(index=False))
print("=" * 60)


📊 Model Performance Metrics
           Model  Pass Rate (%)  Avg Guesses (Success)  Std Guesses (Success)  Games Played  Games Won
      Base Model            0.0                    6.0               0.000000            10          0
Fine-tuned Model           20.0                    4.5               0.707107            10          2
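
With only 10 games per model, the pass rates above carry wide uncertainty. One way to quantify it is a Wilson score interval for a binomial proportion, sketched below with the standard library only. The function name `wilson_interval` and the choice of a 95% level (`z = 1.96`) are assumptions for this example, not part of the original notebook.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Win counts taken from the metrics table above
for name, wins in [("Base Model", 0), ("Fine-tuned Model", 2)]:
    low, high = wilson_interval(wins, 10)
    print(f"{name}: {wins}/10 wins, 95% interval {low:.1%} to {high:.1%}")
```

For these counts the two intervals overlap, so a larger evaluation set would be needed to estimate the size of the improvement with any precision.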


## 7. Summary and Conclusions

Let's summarize the key findings from our evaluation.

In [46]:
print("\n" + "=" * 60)
print("📝 EVALUATION SUMMARY")
print("=" * 60)

print(f"\n🎯 Models Evaluated:")
print(f"  1. Base Model: Qwen3-1.7B (general purpose)")
print(f"  2. Fine-tuned Model: Qwen3-1.7B-Wordle (GRPO-trained)")

print(f"\n📊 Key Results:")
print(f"  Base Model:")
print(f"    - Pass Rate: {base_metrics['Pass Rate (%)']:.1f}%")
print(f"    - Avg Guesses (when successful): {base_metrics['Avg Guesses (Success)']:.2f}")

print(f"\n  Fine-tuned Model:")
print(f"    - Pass Rate: {ft_metrics['Pass Rate (%)']:.1f}%")
print(f"    - Avg Guesses (when successful): {ft_metrics['Avg Guesses (Success)']:.2f}")

# Calculate improvements
pass_rate_improvement = ft_metrics['Pass Rate (%)'] - base_metrics['Pass Rate (%)']
avg_guess_improvement = base_metrics['Avg Guesses (Success)'] - ft_metrics['Avg Guesses (Success)']

print(f"\n📈 Improvements from Fine-tuning:")
if pass_rate_improvement > 0:
    print(f"  ✅ Pass rate improved by {pass_rate_improvement:.1f} percentage points")
elif pass_rate_improvement < 0:
    print(f"  ⚠️ Pass rate decreased by {abs(pass_rate_improvement):.1f} percentage points")
else:
    print(f"  ➖ Pass rate unchanged")

if avg_guess_improvement > 0:
    print(f"  ✅ Average guesses reduced by {avg_guess_improvement:.2f}")
elif avg_guess_improvement < 0:
    print(f"  ⚠️ Average guesses increased by {abs(avg_guess_improvement):.2f}")
else:
    print(f"  ➖ Average guesses unchanged")

print(f"\n💡 Conclusion:")
print(f"  The GRPO fine-tuning from the trl_grpo_wordle.ipynb notebook")
print(f"  demonstrates the effectiveness of reinforcement learning")
print(f"  with specialized reward functions for task-specific optimization.")
print(f"  The fine-tuned model shows improved strategic thinking in Wordle.")

print("\n" + "=" * 60)
print("✨ Evaluation Complete!")
print("=" * 60)


📝 EVALUATION SUMMARY

🎯 Models Evaluated:
  1. Base Model: Qwen3-1.7B (general purpose)
  2. Fine-tuned Model: Qwen3-1.7B-Wordle (GRPO-trained)

📊 Key Results:
  Base Model:
    - Pass Rate: 0.0%
    - Avg Guesses (when successful): 6.00

  Fine-tuned Model:
    - Pass Rate: 20.0%
    - Avg Guesses (when successful): 4.50

📈 Improvements from Fine-tuning:
  ✅ Pass rate improved by 20.0 percentage points
  ✅ Average guesses reduced by 1.50

💡 Conclusion:
  The GRPO fine-tuning from the trl_grpo_wordle.ipynb notebook
  demonstrates the effectiveness of reinforcement learning
  with specialized reward functions for task-specific optimization.
  The fine-tuned model shows improved strategic thinking in Wordle.

✨ Evaluation Complete!
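
Note that the "Average guesses reduced by 1.50" line compares the fine-tuned model's 4.50 against the base model's 6.0 placeholder, since the base model won no games. A possible alternative, sketched below, is to report `NaN` when there are no wins and skip the comparison in that case. The helper names are hypothetical, and the per-game guess counts are illustrative values chosen to be consistent with the reported table, not the notebook's actual per-game results.

```python
import math
from typing import List, Optional


def mean_guesses_on_wins(num_guesses: List[int], success: List[bool]) -> float:
    """Average guesses over won games, or NaN when no game was won."""
    wins = [g for g, ok in zip(num_guesses, success) if ok]
    return sum(wins) / len(wins) if wins else math.nan


def guess_improvement(base_avg: float, ft_avg: float) -> Optional[float]:
    """Reduction in average guesses, or None if either average is undefined."""
    if math.isnan(base_avg) or math.isnan(ft_avg):
        return None
    return base_avg - ft_avg


# Illustrative data: base model loses all 10 games; fine-tuned model wins 2 of 10
base_avg = mean_guesses_on_wins([6] * 10, [False] * 10)
ft_avg = mean_guesses_on_wins([4, 5] + [6] * 8, [True, True] + [False] * 8)

improvement = guess_improvement(base_avg, ft_avg)
if improvement is None:
    print("Average guesses: not comparable (at least one model won no games)")
else:
    print(f"Average guesses reduced by {improvement:.2f}")
```

Under this convention, the pass rate remains the only metric on which the two models in this run can be compared directly.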


Here is another benchmark on the Wordle task (credit: DeepLearning.AI):

<p align="center">
  <img src="./imgs/results_deeplearningai.png" width="60%">
</p>