# Reinforcement Fine-Tuning with GRPO on WORDLE

This is the level 0 notebook in a series of notebooks showcasing the capabilities of Reinforcement Fine-Tuning (RFT). Using a carefully designed reward function and the state-of-the-art GRPO optimization algorithm, this notebook demonstrates how an RFT-tuned model reasons better and shows emerging agentic behaviors, such as context-aware multi-step reasoning and self-correction.

We will use a Wordle `utils` module, which defines the basic structure of the game, together with Hugging Face TRL's GRPO Trainer.

In [1]:
%%capture
!pip install -r requirements.txt --no-cache-dir 

In [2]:
import sys

import torch

# Report the interpreter, PyTorch build, and GPU (if any) used for this run
print("Python:", sys.version)
print("Torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))


Python: 3.11.11 (main, Jun 20 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)]
Torch: 2.7.0+cu126 CUDA available: True
Device: NVIDIA A100-SXM4-40GB


In [3]:
import importlib, utils

# keep existing objects
tok = utils._get_tokenizer()
mdl = utils._get_model()

# reload edited code
importlib.reload(utils)

# reattach cached objects to the reloaded module
utils._tokenizer = tok
utils._model = mdl

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

## Define Wordle

The `utils` module implements the helper functions needed for Wordle.
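To make the feedback notation in the game transcripts concrete, here is a minimal sketch of Wordle scoring. It is not the code in `utils`: the function name `wordle_feedback` and the two-pass handling of repeated letters are assumptions made for illustration, and the `✓` / `-` / `x` markers simply mirror the output printed by `next_turn`.

```python
from collections import Counter


def wordle_feedback(guess: str, secret: str) -> str:
    """Hypothetical helper: score `guess` against `secret`.

    ✓ = right letter, right position
    - = letter is in the secret, but at another position
    x = letter is not in the secret (or all its copies are already used)
    """
    guess, secret = guess.upper(), secret.upper()
    marks = ["x"] * len(guess)

    # Count the secret's letters that were NOT matched exactly; only these
    # can still earn a "-" mark. This keeps repeated letters from being
    # credited more often than they occur in the secret.
    remaining = Counter(
        s for i, s in enumerate(secret) if i >= len(guess) or guess[i] != s
    )

    # Pass 1: exact-position matches.
    for i, g in enumerate(guess):
        if i < len(secret) and g == secret[i]:
            marks[i] = "✓"

    # Pass 2: misplaced letters, consuming the remaining letter budget.
    for i, g in enumerate(guess):
        if marks[i] != "✓" and remaining[g] > 0:
            marks[i] = "-"
            remaining[g] -= 1

    return " ".join(f"{g}({m})" for g, m in zip(guess, marks))


print(wordle_feedback("CRANE", "BRICK"))
```

With the secret word `BRICK`, the guess `CRANE` is scored as `C(-) R(✓) A(x) N(x) E(x)`: `R` is in the right position and `C` appears in the secret at a different position.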

In [11]:
# Imports and helpers
from utils import (
    GuessWithFeedback,
    get_messages,
    render_prompt,
    generate,
    generate_stream,
    next_turn,
    get_feedback,
    train_wordle_grpo,
)

secret_word = "BRICK"  # change as desired
past_guesses = []

print("Secret word set (for reward function):", secret_word)


Secret word set (for reward function): BRICK


Here is an example of our base model playing a few turns of Wordle:

In [10]:
past_guesses = []
adapter_id = "wordle-dlai/2"

In [12]:
next_turn(past_guesses, secret_word, adapter_id)

For the initial guess, I'll choose a common 5-letter word that covers a variety of letters. A good starting point could be "CRANE" since it includes vowels and consonants.</think>

<guess> CRANE </guess>




----------------------------------------------------------------------------------------------------

CRANE → Feedback: C(-) R(✓) A(x) N(x) E(x)


In [13]:
next_turn(past_guesses, secret_word, adapter_id)

From the feedback, we know that 'R' is correct and in the right position. The other letters (C, A, N, E) are not part of the secret word. So, we should avoid these letters in our next guesses. Let's try a guess that includes 'R' and avoids C, A, N, E.</think>

<guess>DRIVE</guess>




----------------------------------------------------------------------------------------------------

CRANE → Feedback: C(-) R(✓) A(x) N(x) E(x)
DRIVE → Feedback: D(x) R(✓) I(✓) V(x) E(x)


In [14]:
next_turn(past_guesses, secret_word, adapter_id)

From the feedback:
- Guess 1 (CRANE): R is correct and in the right place; A, N, and E are not in the word.
- Guess 2 (DRIVE): R is correct and in the right place; I is correct and in the right place; D and V are not in the word.

The word must contain 'R' and 'I'. The letters 'A', 'N', 'E', 'D', and 'V' are excluded. The other letters in the guesses are either wrong or in the wrong positions.

Considering these constraints, let's try a word that includes 'R' and 'I' and excludes the other letters.</think>
<guess> FIRE </guess>




----------------------------------------------------------------------------------------------------

CRANE → Feedback: C(-) R(✓) A(x) N(x) E(x)
DRIVE → Feedback: D(x) R(✓) I(✓) V(x) E(x)
FIRE → Feedback: F(x) I(-) R(-) E(x)


In [15]:
next_turn(past_guesses, secret_word, adapter_id)

From the previous guesses, we know:
- 'R' is in the correct position (2nd).
- 'I' and 'E' are not in the word at all.
- 'F' is also not in the word.
- 'C', 'A', 'N', 'D', 'V' are also not in the word based on their positions or non-inclusion.

This leaves us with letters like 'L', 'M', 'T', 'S', 'B', 'G', etc., which haven't been tested yet. Given that 'R' and 'I' are confirmed, let's test another combination ensuring these constraints.</think>

<guess>BRIDE</guess>




----------------------------------------------------------------------------------------------------

CRANE → Feedback: C(-) R(✓) A(x) N(x) E(x)
DRIVE → Feedback: D(x) R(✓) I(✓) V(x) E(x)
FIRE → Feedback: F(x) I(-) R(-) E(x)
BRIDE → Feedback: B(✓) R(✓) I(✓) D(x) E(x)


In [16]:
next_turn(past_guesses, secret_word, adapter_id)

From the feedback:
1. Guess 1: CRANE (R is correct, in the second position; all other letters are incorrect).
2. Guess 2: DRIVE (R is correct, in the second position; I is correct, in the third position; all other letters are incorrect).
3. Guess 3: FIRE (I is in the word but not in the right position; R is not in the word; E is correct, in the fifth position).
4. Guess 4: BRIDE (B and R are correct, in the first and second positions respectively; I is correct, in the third position; D is not in the word).

Combining these, we can deduce:
- R is in the second position.
- I is in the third position.
- E is in the fifth position.
- B is in the first position.
- We need to determine the fourth letter which is not D (from BRIDE) and should fit with the given constraints.

Possible candidates for the fourth letter could be from the remaining alphabet, but let's test a common word that fits these criteria.</think>
<guess>BRIST</guess>




-----------------------------------------------------

In [17]:
next_turn(past_guesses, secret_word, adapter_id)

 
- From Guess 1, we know R is in the correct position (2nd).
- From Guess 2, we know R is still correct, and I is in the correct position (3rd).
- From Guess 3, E is the only consistent letter, so it's in the correct position (5th).
- From Guess 4, B and R are confirmed as the 1st and 2nd letters respectively, and I is confirmed as the 3rd letter.
- From Guess 5, S is in the correct position (4th).

So far, we have B R I S _.

Now, let's check for any missing letters. The letters used so far are B, R, I, S, and E. We need to consider the remaining letters that could fit the pattern.

The word must contain the letters B, R, I, S, and E, and should not contain any other letters from the previous guesses. Let's try to think of common 5-letter words fitting this pattern.

</think>
<guess> BRISE </guess>




----------------------------------------------------------------------------------------------------

CRANE → Feedback: C(-) R(✓) A(x) N(x) E(x)
DRIVE → Feedback: D(x) R(✓) I(✓) V(x) E

## Train with TRL GRPO

We run a short GRPO session to improve performance on the Wordle-style task. The reward function parses the model's `<guess>...</guess>` and rewards correct-position matches plus an exact-match bonus.
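The reward described above could look roughly like the sketch below. The actual reward lives inside `train_wordle_grpo` and may differ; the names `extract_guess_sketch` and `make_wordle_reward`, the weights `per_letter` and `exact_bonus`, and the choice to score the last `<guess>` tag are all assumptions. The `(completions, **kwargs) -> list[float]` shape follows TRL's convention for custom reward functions on plain-text completions.

```python
import re

# Matches <guess> WORD </guess>, tolerating whitespace around the word.
GUESS_RE = re.compile(r"<guess>\s*([A-Za-z]+)\s*</guess>")


def extract_guess_sketch(text):
    """Hypothetical parser: return the last tagged guess, upper-cased, or None."""
    matches = GUESS_RE.findall(text)
    return matches[-1].upper() if matches else None


def make_wordle_reward(secret, per_letter=0.2, exact_bonus=1.0):
    """Build a reward function for one secret word (weights are illustrative)."""
    secret = secret.upper()

    def reward(completions, **kwargs):
        scores = []
        for completion in completions:
            guess = extract_guess_sketch(completion)
            # No parsable guess, or wrong length: no reward.
            if guess is None or len(guess) != len(secret):
                scores.append(0.0)
                continue
            # Partial credit for each letter in the correct position.
            score = per_letter * sum(g == s for g, s in zip(guess, secret))
            # Extra bonus for solving the puzzle outright.
            if guess == secret:
                score += exact_bonus
            scores.append(score)
        return scores

    return reward


reward_fn = make_wordle_reward("BRICK")
print(reward_fn(["<guess> CRANE </guess>", "<guess>BRICK</guess>", "no tag"]))
```

Giving partial credit per letter keeps the reward signal dense early in training, when exact matches are rare, while the bonus still makes the correct word the clearly best completion.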

Key defaults:
- `loss_type`: `"bnpo"`
- `beta`: `0.0` (no KL penalty, so no separate reference model is needed)
- `mask_truncated_completions`: `True`
- `scale_rewards`: `False` (per recent analysis)

See the [GRPO Trainer docs](https://huggingface.co/docs/trl/main/en/grpo_trainer) for more configuration tips.
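To illustrate what `scale_rewards` controls, here is a small sketch of group-relative advantages: each completion's reward is compared with the mean reward of the completions sampled for the same prompt, and the result is optionally divided by the group's standard deviation. The function name `group_advantages` and the `eps` value are assumptions, and the exact estimator TRL uses internally may differ.

```python
from statistics import mean, stdev


def group_advantages(rewards, scale_rewards=False, eps=1e-4):
    """Hypothetical helper: advantages for one group of sampled completions.

    rewards: one reward per completion generated for the same prompt.
    """
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not scale_rewards:
        # scale_rewards=False: keep the centered rewards as they are.
        return centered
    # scale_rewards=True: normalize by the group's spread. eps avoids a
    # division by zero when every completion received the same reward.
    sigma = stdev(rewards)
    return [c / (sigma + eps) for c in centered]


group = [0.2, 2.0, 0.0, 0.2]  # e.g. rewards for four sampled guesses
print(group_advantages(group))
print(group_advantages(group, scale_rewards=True))
```

Without scaling, a group whose rewards barely differ produces advantages close to zero instead of having small differences magnified by a tiny standard deviation.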


In [None]:
# This may take several minutes depending on steps and GPU
trainer = train_wordle_grpo(
    secret_word=secret_word,
    past_guesses=past_guesses,
    model_id="Qwen/Qwen2.5-7B-Instruct",  # change if you want another compatible chat model
    output_dir="/Users/anxie/Documents/cookbooks/wordle-grpo-output",
    max_steps=200,  # increase for better results on a capable GPU
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_generations_per_prompt=8,
    max_prompt_length=512,
    max_completion_length=64,
    beta=0.0,
    learning_rate=1e-6,
    seed=42,
    generation_batch_size=8,
)

trainer.save_model()  # save adapter or full model depending on backend



In [None]:
# Evaluate after GRPO training
import torch
from transformers import AutoTokenizer
from utils import render_prompt, extract_guess, get_feedback, GuessWithFeedback

# Use the trainer's active model for generation to see improvements.
trained_model = trainer.model

# Resolve the tokenizer once: prefer the one attached to the trainer
# (`processing_class` in newer Trainer versions, `tokenizer` in older ones),
# and fall back to the base model's tokenizer.
trained_tokenizer = (
    getattr(trainer, "processing_class", None)
    or getattr(trainer, "tokenizer", None)
    or AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct", trust_remote_code=True)
)

# Prepare a fresh prompt (same past_guesses context)
prompt = render_prompt(past_guesses)
inputs = trained_tokenizer(prompt, return_tensors="pt").to(trained_model.device)

with torch.no_grad():
    output_ids = trained_model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        eos_token_id=trained_tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt
text = trained_tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(text)

# Extract and score the guess
post_guess = extract_guess(text)
print("\nPost-training guess:", post_guess)
print("Feedback:", GuessWithFeedback.from_secret(post_guess, secret_word))


## Save and reload (optional)

The cell below shows how to reload the model from the output directory for later inference. Depending on your backend and config, `trainer.save_model()` may save adapters or a full model.
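Because the saved artifact can be either an adapter or a full model, it can help to check which one is on disk before reloading. The sketch below is one way to do that; the helper name `checkpoint_kind` is hypothetical, and it relies only on the usual file names (`adapter_config.json` for PEFT adapters, `config.json` for full Transformers checkpoints).

```python
from pathlib import Path


def checkpoint_kind(output_dir):
    """Hypothetical helper: guess what `trainer.save_model()` wrote.

    Returns "adapter", "full", or "unknown".
    """
    path = Path(output_dir)
    # PEFT adapters are saved with an adapter_config.json file.
    if (path / "adapter_config.json").is_file():
        return "adapter"
    # Full Transformers checkpoints carry the model's config.json.
    if (path / "config.json").is_file():
        return "full"
    return "unknown"


print(checkpoint_kind("wordle-grpo-output"))
```

If the result is `"adapter"`, the base model (`Qwen/Qwen2.5-7B-Instruct` in this notebook) is still required at load time, since only the adapter weights are stored in the output directory.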



In [None]:
# Reload for inference (if needed later)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from utils import render_prompt

reload_dir = "/Users/anxie/Documents/cookbooks/wordle-grpo-output"
print("Reloading from:", reload_dir)

reload_tok = AutoTokenizer.from_pretrained(reload_dir, trust_remote_code=True)
reload_model = AutoModelForCausalLM.from_pretrained(reload_dir, device_map="auto", trust_remote_code=True)

prompt = render_prompt(past_guesses)
inputs = reload_tok(prompt, return_tensors="pt").to(reload_model.device)
with torch.no_grad():
    out_ids = reload_model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.95)
print(reload_tok.decode(out_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
