# Wordle Policy Gradient Training with Qwen

This notebook demonstrates training a Qwen language model using policy gradient methods (REINFORCE) to solve Wordle puzzles.

## Overview

- **Algorithm**: REINFORCE (Policy Gradient)
- **Model**: Qwen2.5-0.5B (configurable)
- **Environment**: Wordle game via Prime Intellect/TextArena
- **Objective**: Learn to play Wordle through reinforcement learning

## Setup and Imports

In [1]:
import sys
import os

# Add parent directory to path to import src modules
sys.path.insert(0, os.path.abspath('..'))

import torch
import numpy as np
import matplotlib.pyplot as plt
from src.policy_gradient import PolicyGradientTrainer, TrainingConfig
from src.wordle import load_environment

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.10.0+cu128
CUDA available: False


## Configuration

Configure the training parameters. You can adjust these based on your compute resources and requirements.

In [2]:
# Training configuration
config = TrainingConfig(
    model_name="Qwen/Qwen2.5-0.5B",  # Small model for testing - can use larger models
    learning_rate=1e-5,              # Learning rate for optimizer
    batch_size=4,                     # Batch size for training
    num_epochs=5,                     # Number of training epochs
    max_length=512,                   # Maximum sequence length
    device="cuda" if torch.cuda.is_available() else "cpu",
    clip_grad_norm=1.0,              # Gradient clipping threshold
    num_train_examples=50,            # Number of training examples
    num_eval_examples=10,             # Number of evaluation examples
    seed=42,                          # Random seed for reproducibility
)

print("Training Configuration:")
print(f"  Model: {config.model_name}")
print(f"  Learning Rate: {config.learning_rate}")
print(f"  Batch Size: {config.batch_size}")
print(f"  Epochs: {config.num_epochs}")
print(f"  Training Examples: {config.num_train_examples}")
print(f"  Eval Examples: {config.num_eval_examples}")
print(f"  Device: {config.device}")

Training Configuration:
  Model: Qwen/Qwen2.5-0.5B
  Learning Rate: 1e-05
  Batch Size: 4
  Epochs: 5
  Training Examples: 50
  Eval Examples: 10
  Device: cpu
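
To get a feel for how these settings translate into work, the arithmetic below counts the batches the training loop later in this notebook will produce when it slices the shuffled indices in steps of `batch_size`. It is a minimal sketch: the values are copied from the configuration above, and `total_updates` assumes each call to `train_step` performs exactly one optimizer update, which this notebook does not confirm.

```python
import math

# Values copied from the TrainingConfig above.
num_train_examples = 50
batch_size = 4
num_epochs = 5

# Slicing 50 indices in steps of 4 yields 12 full batches plus one partial batch.
batches_per_epoch = math.ceil(num_train_examples / batch_size)
last_batch_size = num_train_examples - (batches_per_epoch - 1) * batch_size

# Assumption: one optimizer update per batch.
total_updates = batches_per_epoch * num_epochs

print(f"Batches per epoch: {batches_per_epoch}")
print(f"Size of last batch: {last_batch_size}")
print(f"Total updates over {num_epochs} epochs: {total_updates}")
```

Because the last batch holds fewer examples than the others, its reward average and standard deviation are based on a smaller sample and will tend to be noisier.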


## Initialize Trainer

Create the policy gradient trainer with the Qwen model.

In [3]:
# Initialize trainer
trainer = PolicyGradientTrainer(config)
print("\nTrainer initialized successfully!")

Loading model: Qwen/Qwen2.5-0.5B


`torch_dtype` is deprecated! Use `dtype` instead!


2026-02-04 20:40:04 - verifiers.rubrics.rubric.RubricGroup - INFO - Initialized RubricGroup with 2 rubrics



Trainer initialized successfully!


## Test Single Trajectory Generation

Before training, let's test generating a single trajectory to see how the model interacts with the environment.

In [None]:
# Test generating a single trajectory
print("Testing trajectory generation...")
trajectory, reward = trainer.generate_trajectory(example_idx=0)

print("\nGenerated text (first 200 chars):")
print(trajectory["generated_text"][:200])
print(f"\nReward: {reward:.4f}")
print(f"Log probabilities sum: {trajectory['log_probs_sum'].item():.4f}")

Testing trajectory generation...
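
The `log_probs_sum` entry printed above is presumably the sum of the log-probabilities of the tokens the model sampled; that reading is an assumption based on the name, since the trainer's source is not shown here. The sketch below illustrates the idea with made-up per-token probabilities: summing log-probabilities gives the log-probability of the whole sequence.

```python
import math

def sequence_log_prob(token_probs):
    """Sum per-token log-probabilities to get the log-probability of the sequence."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical probabilities the policy assigned to each sampled token.
token_probs = [0.5, 0.25, 0.8]
log_prob_sum = sequence_log_prob(token_probs)

# Exponentiating the sum recovers the product of the token probabilities.
print(f"Sum of log-probs: {log_prob_sum:.4f}")
print(f"Sequence probability: {math.exp(log_prob_sum):.4f}")
```

Working in log space avoids numerical underflow: the product of hundreds of token probabilities would quickly round to zero, while the sum of their logs stays representable.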


## Training

Train the model with the REINFORCE policy gradient algorithm. Each update raises the log-probability of the sampled trajectories in proportion to the reward they received, so high-reward behavior becomes more likely over time.
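
For reference, the textbook form of the REINFORCE gradient estimate is given below. The symbols are introduced here for illustration: $\tau$ is a sampled trajectory, $R(\tau)$ its reward, $\pi_\theta$ the model's token distribution, and $N$ the batch size. Whether `train_step` uses exactly this form (for example, whether it subtracts a baseline or normalizes rewards) is not shown in this notebook.

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right],
\qquad
\log \pi_\theta(\tau) = \sum_{t} \log \pi_\theta(a_t \mid s_t)
$$

$$
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} R(\tau_i)\, \log \pi_\theta(\tau_i)
$$

Minimizing $\mathcal{L}$ with a gradient-based optimizer follows a Monte Carlo estimate of $\nabla_\theta J$, with $R(\tau_i)$ treated as a constant during differentiation.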

In [None]:
# Track training metrics
training_losses = []
training_rewards = []

# Re-implement the training loop here (instead of calling trainer.train)
# so that per-epoch loss and reward can be recorded for plotting

def train_with_tracking():
    """Modified train method that tracks metrics."""
    print(f"Starting training on {trainer.config.num_train_examples} examples")
    print(f"Device: {trainer.device}\n")
    
    for epoch in range(trainer.config.num_epochs):
        print(f"Epoch {epoch + 1}/{trainer.config.num_epochs}")
        
        # Create batches
        example_indices = list(range(trainer.config.num_train_examples))
        np.random.shuffle(example_indices)
        
        epoch_losses = []
        epoch_rewards = []
        
        # Process in batches
        for i in range(0, len(example_indices), trainer.config.batch_size):
            batch_indices = example_indices[i:i + trainer.config.batch_size]
            metrics = trainer.train_step(batch_indices)
            
            epoch_losses.append(metrics["loss"])
            epoch_rewards.append(metrics["avg_reward"])
            
            print(
                f"  Batch {i // trainer.config.batch_size + 1}: "
                f"Loss={metrics['loss']:.4f}, "
                f"Reward={metrics['avg_reward']:.4f} ± {metrics['std_reward']:.4f}"
            )
        
        avg_loss = np.mean(epoch_losses)
        avg_reward = np.mean(epoch_rewards)
        
        training_losses.append(avg_loss)
        training_rewards.append(avg_reward)
        
        print(
            f"Epoch {epoch + 1} Summary: "
            f"Avg Loss={avg_loss:.4f}, "
            f"Avg Reward={avg_reward:.4f}\n"
        )

# Run training
train_with_tracking()

Starting training on 50 examples
Device: cpu

Epoch 1/5


## Visualize Training Progress

Plot the training loss and rewards over epochs.

In [None]:
# Plot training progress
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot loss
ax1.plot(range(1, len(training_losses) + 1), training_losses, 'b-o')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss')
ax1.grid(True)

# Plot rewards
ax2.plot(range(1, len(training_rewards) + 1), training_rewards, 'g-o')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Average Reward')
ax2.set_title('Training Rewards')
ax2.grid(True)

plt.tight_layout()
plt.show()

print(f"Final training loss: {training_losses[-1]:.4f}")
print(f"Final training reward: {training_rewards[-1]:.4f}")

## Evaluation

Evaluate the trained model on the evaluation set to measure its performance.

In [None]:
# Evaluate the model
eval_metrics = trainer.evaluate()

print("\nEvaluation Summary:")
for key, value in eval_metrics.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

## Test Model on Sample Examples

Let's see how the trained model performs on a few sample examples.

In [None]:
# Test on a few examples
num_test_examples = 5
print(f"Testing on {num_test_examples} examples:\n")

for i in range(num_test_examples):
    trajectory, reward = trainer.generate_trajectory(example_idx=i)
    
    print(f"Example {i + 1}:")
    print(f"  Reward: {reward:.4f}")
    print(f"  Generated text preview: {trajectory['generated_text'][:150]}...")
    print()

## Model Information

Display information about the model architecture and parameters.

In [None]:
# Model information
total_params = sum(p.numel() for p in trainer.model.parameters())
trainable_params = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)

print("Model Information:")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"  Model dtype: {next(trainer.model.parameters()).dtype}")
print(f"  Device: {next(trainer.model.parameters()).device}")

## Notes and Next Steps

### What we've accomplished:
- Implemented the REINFORCE policy gradient algorithm
- Trained a Qwen model on the Wordle environment
- Evaluated model performance

### Potential improvements:
- Try larger models (Qwen2.5-1.5B, Qwen2.5-3B)
- Increase number of training examples
- Adjust hyperparameters (learning rate, batch size)
- Implement PPO (Proximal Policy Optimization) for more stable training
- Add value function estimation (Actor-Critic methods)
- Use reward shaping for better learning signal
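
A lighter-weight step toward the value-function item above is a constant baseline: subtract the batch-mean reward before weighting the log-probabilities. This is a stand-in for a learned value function, not an actor-critic method, and it is not necessarily what `PolicyGradientTrainer` does. The function names and the sample numbers below are hypothetical.

```python
from statistics import mean

def baseline_advantages(rewards):
    """Subtract the batch-mean reward from each reward (a constant baseline)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

def reinforce_loss(log_prob_sums, weights):
    """Surrogate loss: negative mean of weight * sequence log-probability."""
    return -mean([w * lp for w, lp in zip(weights, log_prob_sums)])

# Hypothetical batch of four trajectories.
rewards = [1.0, 0.0, 0.5, 0.5]
log_prob_sums = [-2.0, -3.0, -1.0, -4.0]

advantages = baseline_advantages(rewards)
loss = reinforce_loss(log_prob_sums, advantages)

print(f"Advantages: {advantages}")
print(f"Loss with baseline: {loss:.4f}")
```

With the baseline, trajectories that score below the batch average get a negative weight and are pushed down, rather than merely being pushed up less than the others. This usually lowers the variance of the gradient estimate without changing its expected value.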

### Memory considerations:
- Policy gradient training keeps a gradient for every model parameter and, until the backward pass, the activations for each sampled sequence, which can be memory-intensive
- Consider using gradient checkpointing for larger models
- Monitor GPU memory usage during training
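
As a rough starting point for the considerations above, the sketch below estimates the memory needed for weights, gradients, and optimizer state. It rests on assumptions not confirmed by this notebook: 32-bit parameters, an Adam-style optimizer that keeps two extra values per parameter, and a nominal parameter count of 0.5 billion. Activations are ignored, so treat the result as a lower bound. The function name is hypothetical.

```python
def training_memory_bytes(num_params, bytes_per_param=4, optimizer_states_per_param=2):
    """Lower-bound estimate: weights + gradients + optimizer state, no activations."""
    weights = num_params * bytes_per_param
    gradients = num_params * bytes_per_param
    optimizer_state = num_params * bytes_per_param * optimizer_states_per_param
    return weights + gradients + optimizer_state

# Nominal "0.5B"; substitute the exact count printed in the Model Information cell.
nominal_params = 500_000_000
estimate = training_memory_bytes(nominal_params)

print(f"Estimated memory before activations: {estimate / 1e9:.1f} GB")
```

Under these assumptions each parameter costs 16 bytes, so the estimate scales linearly when trying the larger models listed above.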