# Training LLMs in ANY Environment with OpenEnv

## 🎯 The Vision

Imagine training language models in:
- 🎰 **Card games** (BlackJack, Poker, Uno)
- ♟️ **Board games** (Chess, Go, Connect Four)
- 📈 **Trading simulations** (realistic market environments)
- 🎮 **Atari games** (Pong, Breakout, Space Invaders)
- 💻 **Code execution environments** (interactive debugging)
- 🤖 **Robotics simulations** (MuJoCo, PyBullet)

---

### The Problem

Every RL environment has a different API:
- ❌ OpenSpiel uses C++ bindings
- ❌ Atari needs ALE (Arcade Learning Environment)
- ❌ Trading sims have custom interfaces
- ❌ Each requires different dependencies, versions, OS compatibility
- ❌ No isolation → crashes can corrupt your system

**You spend more time wrestling with environments than training models.**

---

### The Solution: OpenEnv - A Universal Spec

<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 25px; border-radius: 10px; color: white; margin: 20px 0;'>
    <h3 style='margin-top: 0;'>🚀 OpenEnv = Universal RL Environment Interface</h3>
    <p style='font-size: 18px; line-height: 1.8;'>
        <b>OpenEnv is not a game engine.</b><br>
        It's a <b>specification</b> that wraps ANY RL environment with a clean, unified API.
    </p>
    <ul style='font-size: 16px; line-height: 1.8;'>
        <li><b>70+ environments</b> (OpenSpiel, Atari, FinRL, and more)</li>
        <li><b>Unified, simple API:</b> <code>reset()</code>, <code>step(action)</code>, <code>state()</code></li>
        <li><b>HTTP-based</b> → language-agnostic (Python, Rust, JavaScript, anything)</li>
        <li><b>Docker-isolated</b> → reproducible, secure, no dependency hell</li>
    </ul>
    <p style='font-size: 16px; margin-top: 15px;'>
        <b>One interface. Any environment. Zero setup.</b>
    </p>
</div>

---

## What You'll Build

In this tutorial, you'll:
1. 🔌 **Explore OpenEnv** - Connect to BlackJack, see how the spec works
2. 🎲 **Benchmark policies** - Test random vs heuristic strategies
3. 🧠 **Learn about GRPO** - Brief intro to the training algorithm
4. ⚡ **Train with Forge** - Use PyTorch's agentic RL library
5. 📊 **Compare results** - Measure improvement
6. 🔄 **Switch environments** - See how to train on different games

**This uses production code.** Same implementation as `apps/grpo/blackjack_main_fixed.py`.

---

### 📚 Resources
- 📦 [OpenEnv GitHub](https://github.com/meta-pytorch/OpenEnv) - Universal RL environment spec
- 📄 [GRPO Paper (arXiv:2402.03300)](https://arxiv.org/abs/2402.03300) - Group Relative Policy Optimization
- 🔧 [Forge GitHub](https://github.com/meta-pytorch/torchforge) - PyTorch-native agentic RL library
- 📖 [Forge Docs](https://meta-pytorch.org/torchforge/) - Full documentation

## 🔌 Part 1: Exploring OpenEnv

Let's connect to a BlackJack environment and explore the OpenEnv spec.

### Start the Server

<div style='background: #fff3cd; padding: 15px; border-radius: 8px; border-left: 5px solid #ffc107; margin: 20px 0;'>
    <b>⚠️ Note:</b> Start the OpenEnv server in a separate terminal:
    <pre style='margin-top: 10px; background: white; padding: 10px; border-radius: 5px;'>
# Set your OpenEnv path
export OPENENV_PATH="/path/to/OpenEnv/src"
export PYTHONPATH="${OPENENV_PATH}:${PYTHONPATH}"

# Start BlackJack server
OPENSPIEL_GAME=blackjack python -m envs.openspiel_env.server.app --port 8004</pre>
</div>

In [None]:
# Environment setup for Jupyter
import sys
import os

# Fix for Monarch/Torchstore Rust bindings in Jupyter
conda_prefix = os.environ.get('CONDA_PREFIX', sys.prefix)
lib_path = f"{conda_prefix}/lib"

if 'LD_LIBRARY_PATH' in os.environ:
    if lib_path not in os.environ['LD_LIBRARY_PATH']:
        os.environ['LD_LIBRARY_PATH'] = f"{lib_path}:{os.environ['LD_LIBRARY_PATH']}"
else:
    os.environ['LD_LIBRARY_PATH'] = lib_path

print("✅ Environment configured")

### Connect to OpenEnv

Let's connect to the BlackJack environment and explore its interface.

In [None]:
import sys
import os
from pathlib import Path

# Add OpenEnv to path (update this to your OpenEnv installation)
openenv_path = os.environ.get('OPENENV_PATH', '/path/to/OpenEnv/src')
if openenv_path not in sys.path:
    sys.path.insert(0, openenv_path)

from envs.openspiel_env import OpenSpielEnv, OpenSpielAction
from grpo_utils import show_openenv_observation

# Connect to environment
env = OpenSpielEnv(base_url="http://localhost:8004")
print("🎰 Connected to BlackJack environment")
print("\n📍 Resetting environment...\n")

# Reset and observe
result = env.reset()
show_openenv_observation(result.observation)

env.close()
print("\n✅ OpenEnv interface exploration complete!")

### What Just Happened?

You just saw the **OpenEnv spec** in action:

```python
# Universal interface - works for ANY environment
result = env.reset()              # Start episode
result = env.step(action)         # Take action
state = env.state()               # Get environment state
env.close()                       # Cleanup
```

**Key observations:**
- `legal_actions`: What actions the agent can take
- `info_state`: Numeric observation vector
- `game_phase`: Current phase of the game
- `reward`: Outcome (+1 win, -1 loss, 0 push)

This same interface works for **70+ different environments**. Change the server, everything else stays the same!
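To make the interface concrete, here is a minimal sketch of an episode loop written only against `reset()` and `step(action)`. `StubEnv`, `StepResult`, and `Observation` are hypothetical in-process stand-ins invented for this example so that it runs without a server; they are not the OpenEnv classes, and the `done` field is an assumption about what a step result carries.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Observation:
    # Hypothetical stand-in for an OpenEnv observation.
    legal_actions: list = field(default_factory=list)


@dataclass
class StepResult:
    # Hypothetical stand-in for the object returned by reset()/step().
    observation: Observation
    reward: float = 0.0
    done: bool = False


class StubEnv:
    """Toy environment: the episode ends after `max_steps` actions."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.steps_taken = 0

    def reset(self):
        self.steps_taken = 0
        return StepResult(Observation(legal_actions=[0, 1]))

    def step(self, action):
        self.steps_taken += 1
        done = self.steps_taken >= self.max_steps
        reward = 1.0 if done else 0.0  # reward only at the end of the episode
        legal = [] if done else [0, 1]
        return StepResult(Observation(legal_actions=legal), reward, done)

    def close(self):
        pass


def run_episode(env, policy):
    """Play one episode using only reset() and step(); return the total reward."""
    result = env.reset()
    total = 0.0
    while not result.done:
        action = policy(result.observation)
        result = env.step(action)
        total += result.reward
    return total


def random_policy(observation):
    # Pick uniformly among the actions the environment says are legal.
    return random.choice(observation.legal_actions)


env = StubEnv()
episode_return = run_episode(env, random_policy)
env.close()
print(episode_return)
```

Because `run_episode` touches nothing but `reset()` and `step()`, any object exposing those two methods could be passed in, which is the property the spec relies on.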

## 🎲 Part 2: Benchmarking Baseline Policies

Before training an LLM, let's see how simple policies perform.

In [None]:
from grpo_utils import play_random_policy

print("🎲 Running random policy baseline...\n")

# Play 100 games with random actions
stats = play_random_policy("http://localhost:8004", num_games=100)

print("\n📊 Random Policy Results:")
print(f"   Games played: {stats['total_games']}")
print(f"   Wins: {stats['wins']}")
print(f"   Losses: {stats['losses']}")
print(f"   Pushes: {stats['pushes']}")
print(f"   Win rate: {stats['win_rate']:.1%}")

print("\n📝 Note: Optimal BlackJack strategy achieves ~43% win rate")
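The internals of `play_random_policy` are not shown in this notebook. As a rough sketch of how a stats dictionary with the same keys could be assembled from per-game rewards, consider the following. `summarize_outcomes` is a hypothetical helper written for illustration, not a function from `grpo_utils`, and the reward list is made-up sample data.

```python
def summarize_outcomes(rewards):
    """Tally per-game rewards (+1 win, -1 loss, 0 push) into a stats dict."""
    total = len(rewards)
    wins = sum(1 for r in rewards if r > 0)
    losses = sum(1 for r in rewards if r < 0)
    pushes = total - wins - losses
    return {
        "total_games": total,
        "wins": wins,
        "losses": losses,
        "pushes": pushes,
        "win_rate": wins / total if total else 0.0,
    }


# Made-up outcomes for ten games, purely for illustration.
stats = summarize_outcomes([1, -1, -1, 0, 1, -1, -1, -1, 0, 1])
print(f"Win rate: {stats['win_rate']:.1%}")
```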

### The Challenge

Random policy performs poorly (~30-35% win rate).

**Can we train an LLM to do better?**

That's where **GRPO** comes in.

## 🧠 Part 3: Understanding Reinforcement Learning & GRPO

<div style='background: linear-gradient(135deg, #e66465 0%, #9198e5 100%); padding: 25px; border-radius: 10px; color: white; margin: 20px 0; border: 3px solid #fff;'>
    <h3 style='margin-top: 0;'>📚 Section Inspired by Unsloth</h3>
    <p style='font-size: 16px; line-height: 1.8;'>
        This section is heavily inspired by the excellent <a href='https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide' style='color: #fff; text-decoration: underline;'><b>Unsloth RL Guide</b></a>.
        <br><br>
        Unsloth has done an amazing job making RL accessible and intuitive. We highly recommend reading their full guide for deeper insights and practical tips!
        <br><br>
        🙏 <b>Big thanks to the Unsloth team</b> for their educational approach to RL.
    </p>
</div>

---

### What is Reinforcement Learning?

<div style='background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 5px solid #6c757d; margin: 20px 0;'>
    <h4 style='margin-top: 0;'>The Core Idea (It's Simpler Than You Think!)</h4>
    <p style='font-size: 16px; line-height: 1.8;'>
        The goal of RL is extremely simple:
    </p>
    <ul style='font-size: 16px; line-height: 1.8;'>
        <li>✅ <b>Increase the chance of seeing "good" outcomes</b></li>
        <li>❌ <b>Decrease the chance of seeing "bad" outcomes</b></li>
    </ul>
    <p style='font-size: 16px; margin-top: 10px;'>
        That's it! Everything else is just details about what "good" and "bad" mean, and how to increase/decrease their probabilities.
    </p>
</div>

#### A Simple Example: Learning "2 + 2 = ?"

Imagine an untrained language model trying to answer "What is 2+2?". It might output:

```
0, cat, -10, 1928, 3, A, B, 122, 17, 182, 172, A, C, BAHS, %$, #, 9, -192, 12.31, ...
```

Then suddenly: **4** ✓

The reward signals would be:
```
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... then 1
```

**This is the key insight:** With patience (or "luck"), if the correct answer has *any* non-zero probability, the model will eventually sample it. The trick is:
1. While waiting, we learn from **bad answers** → tell model "don't do this"
2. When we find **good answers** → tell model "do more of this"

This is why I like to call it **"Patience Is All You Need"** for RL.
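The "patience" argument can be made quantitative. If each sample is independent and the correct answer has probability `p`, the chance of seeing it at least once in `n` samples is `1 - (1 - p)^n`. The sketch below evaluates that formula; the value `p = 0.001` is an arbitrary illustration, not a measured number.

```python
def prob_at_least_one_success(p, n):
    """Probability that n independent samples contain at least one correct answer."""
    return 1.0 - (1.0 - p) ** n


# Even a 0.1% chance per sample becomes near-certain with enough samples.
for n in (10, 1_000, 5_000):
    print(n, round(prob_at_least_one_success(0.001, n), 4))
```

This only shows that a rewarded sample will eventually appear. The learning signal from the unrewarded samples is what makes training faster than waiting.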

---

### From PPO to GRPO: The Evolution

<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 25px; border-radius: 10px; color: white; margin: 20px 0;'>
    <h4 style='margin-top: 0;'>📜 The Algorithm Evolution</h4>
    
<table style='width: 100%; color: white; margin-top: 15px;'>
<tr>
    <td style='padding: 8px; border-bottom: 1px solid rgba(255,255,255,0.3);'><b>RLHF + PPO</b> (OpenAI ChatGPT)</td>
    <td style='padding: 8px; border-bottom: 1px solid rgba(255,255,255,0.3);'>Needed 4 models: Policy, Reference, Reward Model, Value Model</td>
</tr>
<tr>
    <td style='padding: 8px;'><b>GRPO</b> (DeepSeek R1)</td>
    <td style='padding: 8px;'>Only needs 2 models: Policy + Reference<br>→ <b>Much more efficient!</b></td>
</tr>
</table>
</div>

**What GRPO removes:**
- ❌ **Value Model** → Replaced with group statistics
- ❌ **Reward Model** → Replaced with simple reward functions

**Why this matters:**
- 💾 Less memory usage
- ⚡ Faster training
- 🎯 Easier to implement

---

### GRPO: Group Relative Policy Optimization

<div style='background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); padding: 25px; border-radius: 10px; color: white; margin: 20px 0;'>
    <h4 style='margin-top: 0;'>Why "Group Relative"?</h4>
    <p style='font-size: 16px; line-height: 1.8;'>
        Instead of training a separate Value Model to estimate "how good is this state?", 
        GRPO uses a clever trick: <b>sample the model multiple times</b> and compare answers within the group.
    </p>
</div>

**Example: Training on "What is 2+2?"**

1. **Generate multiple responses** (e.g., 4 samples):
   - Response 1: "4" → reward = +1 (correct!)
   - Response 2: "3" → reward = 0 (close, but wrong)
   - Response 3: "D" → reward = -1 (nonsense)
   - Response 4: "C" → reward = -1 (nonsense)

2. **Calculate group statistics:**
   - Mean reward: (-1 + -1 + 0 + 1) / 4 = -0.25
   - Standard deviation: ~0.83

3. **Compute advantages** (Z-score normalization):
   - Response 1: +1.5 (much better than average!)
   - Response 2: +0.3 (slightly better)
   - Response 3: -0.9 (worse than average)
   - Response 4: -0.9 (worse than average)

4. **Update model:**
   - Increase probability of generating "4"
   - Slightly increase "3" (it's closer than nonsense)
   - Decrease probability of generating "D" and "C"

This is **group-relative** because we're comparing within the group, not to an absolute baseline!
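The numbers in this example can be reproduced in a few lines. This is a minimal sketch that assumes the population standard deviation (dividing by the group size) and adds a small `eps` to avoid division by zero when all rewards in a group are identical; `group_relative_advantages` is a name chosen for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward against the mean and population std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [1.0, 0.0, -1.0, -1.0]  # responses "4", "3", "D", "C"
advantages = group_relative_advantages(rewards)
print("mean:", mean(rewards), "std:", round(pstdev(rewards), 2))
print("advantages:", [round(a, 1) for a in advantages])
```

Note that the advantages always sum to roughly zero within a group, so a response is only pushed up if it beat its siblings.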

---

### Reward Functions: The Secret Sauce

Reward functions tell the model what's "good" and what's "bad". They can be simple or complex:

**For BlackJack (what we're using):**
```python
def evaluate_response(prompt, response, game_reward):
    reward = float(game_reward)  # +1 (win), -1 (loss), 0 (push)
    
    # Reward shaping: Scale up wins
    if game_reward > 0:
        reward = 2.0  # Wins are more valuable
    elif game_reward == 0:
        reward = 0.5  # Pushes better than losses
    
    return reward
```

**For Math Problems:**
- If answer is a number: +1
- If answer matches ground truth: +3
- If no number detected: -1
- **Total reward:** Sum of all criteria
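One way to turn the math criteria above into code is sketched below. It assumes the answer is the first number that appears in the response and that "matches" means numeric equality; `math_reward` and `NUMBER_PATTERN` are illustrative names, not part of this tutorial's code.

```python
import re

# Matches an optionally negative integer or decimal, e.g. "4", "-10", "12.31".
NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def math_reward(response, ground_truth):
    """Sum the criteria: +1 for any number, +3 more for a match, -1 if no number."""
    match = NUMBER_PATTERN.search(response)
    if match is None:
        return -1.0
    reward = 1.0
    if float(match.group()) == float(ground_truth):
        reward += 3.0
    return reward


print(math_reward("The answer is 4", 4))
print(math_reward("3", 4))
print(math_reward("cat", 4))
```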

**For Email Automation:**
- Contains required keyword: +1
- Matches ideal response: +1
- Too long: -1
- Includes recipient name: +1
- Has signature block: +1

The key is: **Reward functions must be verifiable**. You can't subjectively judge "is this creative?" but you can verify "is this answer correct?"

---

### The Training Process (Simplified)

```
1. Play game → Get action "HIT" or "STAND"
   ↓
2. Game ends → Observe reward (+1 win, -1 loss, 0 push)
   ↓
3. Repeat 4-8 times for the same question (group)
   ↓
4. Calculate group statistics (mean, std)
   ↓
5. Compute advantages (which answers were better/worse than average?)
   ↓
6. Update model: increase good action probability, decrease bad
   ↓
7. Repeat thousands of times → Model learns strategy!
```
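The steps above can be sketched end to end on a toy problem. The example below is not the Forge training loop: it uses a two-action softmax policy instead of an LLM, made-up win probabilities instead of a real BlackJack game, and it omits the clipping and KL terms of the full GRPO objective. All names (`WIN_PROB`, `play`, `train`) are hypothetical.

```python
import math
import random
from statistics import mean, pstdev

ACTIONS = ["HIT", "STAND"]
WIN_PROB = {"HIT": 0.2, "STAND": 0.7}  # made-up odds for a single toy situation


def softmax(logits):
    exps = {a: math.exp(v) for a, v in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def play(action, rng):
    """Toy game outcome: +1 for a win, -1 for a loss."""
    return 1.0 if rng.random() < WIN_PROB[action] else -1.0


def train(steps=500, group_size=8, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = {a: 0.0 for a in ACTIONS}
    for _ in range(steps):
        probs = softmax(logits)
        # Steps 1-3: sample a group of actions for the same situation and score them.
        group = rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS], k=group_size)
        rewards = [play(a, rng) for a in group]
        # Steps 4-5: group statistics and z-scored advantages.
        mu, sigma = mean(rewards), pstdev(rewards)
        if sigma == 0:
            continue  # identical rewards carry no relative signal
        advantages = [(r - mu) / sigma for r in rewards]
        # Step 6: policy-gradient update; d log pi(a) / d logit[b] = 1[a == b] - pi(b).
        for action, adv in zip(group, advantages):
            for b in ACTIONS:
                grad = (1.0 if action == b else 0.0) - probs[b]
                logits[b] += lr * adv * grad / group_size
    return softmax(logits)


final_probs = train()
print(final_probs)
```

After training, the policy should put most of its probability on the action with the better odds, even though it never saw those odds directly.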

**Key insight:** Over time, the model can learn not just "what to do" but also the reasoning that leads there. This kind of reward-driven training is how DeepSeek R1 developed the long reasoning traces it emits inside its `<think>` tags.

---

### Forge: PyTorch-Native Agentic RL Infrastructure

<div style='background: linear-gradient(135deg, #20c997 0%, #17a2b8 100%); padding: 20px; border-radius: 10px; color: white; margin: 20px 0;'>
    <h4 style='margin-top: 0;'>What is Forge?</h4>
    <p style='font-size: 16px; line-height: 1.6;'>
        <b>Forge</b> is PyTorch's official library for training agentic RL models. It handles all the distributed systems complexity so you can focus on algorithms.
    </p>
    <ul style='font-size: 15px; line-height: 1.7;'>
        <li><b>Generator (vLLM):</b> Fast LLM inference with automatic batching</li>
        <li><b>RLTrainer:</b> Distributed training with FSDP across GPUs</li>
        <li><b>ReplayBuffer:</b> Stores episodes for off-policy learning</li>
        <li><b>ReferenceModel:</b> Keeps original model for KL penalty</li>
        <li><b>Torchstore:</b> Distributed weight management across replicas</li>
    </ul>
</div>

**Resources:**
- 🔧 [GitHub](https://github.com/meta-pytorch/torchforge) - Source code
- 📖 [Documentation](https://meta-pytorch.org/torchforge/) - Full docs
- 📄 [GRPO Paper](https://arxiv.org/abs/2402.03300) - Original research

**In this tutorial:** We abstract all of Forge's complexity. You just call:
```python
trainer = await setup_forge_training("config.yaml")
await trainer.run(steps=100)
```

Everything else happens automatically! 🚀

## 🏗️ Part 4: Training with GRPO

Now let's train a Qwen 1.5B model to play BlackJack using production GRPO code.

### Architecture Overview

```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃              YOUR TRAINING LOOP                    ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃                                                    ┃
┃  Rollouts Loop          Training Loop             ┃
┃  • Play games           • Sample batch            ┃
┃  • Collect episodes     • Compute loss            ┃
┃  • Compute advantages   • Update weights          ┃
┃  • Add to buffer        • Push to replicas        ┃
┃                                                    ┃
┗━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━┛
           │                         │
      HTTP │                         │ RPC
           │                         │
           ↓                         ↓
   ┏━━━━━━━━━━━━━┓          ┏━━━━━━━━━━━━━━┓
   ┃   OpenEnv   ┃          ┃    Forge     ┃
   ┃   Server    ┃          ┃   Services   ┃
   ┗━━━━━━━━━━━━━┛          ┗━━━━━━━━━━━━━━┛
```

**Two concurrent loops:**
1. **Rollouts:** Play games via OpenEnv → collect episodes
2. **Training:** Sample from buffer → update policy with GRPO

They run in parallel for maximum efficiency!
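The two-loop structure is a producer/consumer pattern. The sketch below illustrates it with `asyncio` and a bounded queue standing in for the replay buffer. It is an assumption-laden toy, not Forge code: `rollout_loop` and `training_loop` are hypothetical names, rewards are random, and the "training step" is just a counter.

```python
import asyncio
import random


async def rollout_loop(buffer, num_episodes, rng):
    """Producer: 'play' episodes and push them into the shared buffer."""
    for episode_id in range(num_episodes):
        await asyncio.sleep(0)  # stands in for the HTTP round-trip to the env server
        reward = rng.choice([1.0, 0.0, -1.0])
        await buffer.put({"episode_id": episode_id, "reward": reward})
    await buffer.put(None)  # sentinel: no more episodes


async def training_loop(buffer, batch_size):
    """Consumer: pull episodes, form batches, and 'train' on each full batch."""
    batch, steps = [], 0
    while True:
        episode = await buffer.get()
        if episode is None:
            break
        batch.append(episode)
        if len(batch) == batch_size:
            steps += 1  # a real trainer would compute the loss and update weights here
            batch = []
    return steps


async def main():
    buffer = asyncio.Queue(maxsize=16)  # bounded, so rollouts cannot run far ahead
    rng = random.Random(0)
    _, steps = await asyncio.gather(
        rollout_loop(buffer, num_episodes=32, rng=rng),
        training_loop(buffer, batch_size=8),
    )
    return steps


training_steps = asyncio.run(main())
print(training_steps)
```

Inside a notebook, where an event loop is already running, `await main()` would replace `asyncio.run(main())`.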

### Setup and Configuration

In [None]:
from grpo_utils import setup_forge_training

print("🏗️ Initializing Forge infrastructure...\n")
print("This will:")
print("  • Load the Qwen 1.5B model")
print("  • Initialize vLLM inference servers")
print("  • Setup distributed training (TorchTitan)")
print("  • Create replay buffer and reference model")
print("\n⏳ This may take 1-2 minutes...\n")

# Initialize everything with one function call
trainer = await setup_forge_training("blackjack.yaml")

print("\n✅ Ready to train!")

### Run Training

Now we train for 100 steps. This is a shortened demo - production training uses 1000+ steps.

In [None]:
print("🚀 Starting GRPO training!\n")
print("Watch the logs to see:")
print("  • Games being played (with actions and outcomes)")
print("  • Win rate improving over time")
print("  • Training steps updating the policy")
print("\n" + "="*60 + "\n")

# Run training (this is the production training loop!)
results = await trainer.run(steps=100)

print("\n" + "="*60)
print("\n🎉 Training complete!")

### Cleanup

In [None]:
# Shutdown Forge services
await trainer.shutdown()
print("✅ Shutdown complete")

## 🔄 Part 5: The Power of OpenEnv - Switching Environments

Here's the magic: **The same code works for ANY OpenEnv environment.**

### Switch to Tic-Tac-Toe

Just change the server:

```bash
# Terminal:
OPENSPIEL_GAME=tic_tac_toe python -m envs.openspiel_env.server.app --port 8005
```

Update config:
```python
cfg.blackjack_env.server_url = "http://localhost:8005"
```

**Everything else stays identical.** Same GRPO code, same Forge infrastructure.

---

### Switch to Chess

```bash
OPENSPIEL_GAME=chess python -m envs.openspiel_env.server.app --port 8006
```

Update the model and config for longer sequences, and you're done!

---

### Switch to Atari

```bash
# Different OpenEnv backend
python -m envs.atari_env.server.app --game pong --port 8007
```

Modify the prompt formatting for vision inputs. The training loop stays the same!

---

<div style='background: #d1ecf1; padding: 20px; border-radius: 10px; border-left: 5px solid #0c5460; margin: 20px 0;'>
    <h3 style='color: #0c5460; margin-top: 0;'>💡 The Key Insight</h3>
    <p style='color: #0c5460; font-size: 16px;'>
        <b>OpenEnv is a spec, not a game engine.</b><br><br>
        Once you have a training loop that talks to OpenEnv, you can train on ANY environment that implements the spec.
        <br><br>
        Change one environment variable → train on 70+ different environments.
    </p>
</div>

## 🚀 Next Steps

### 1. Scale Up Training

Edit `apps/grpo/blackjack.yaml`:

```yaml
trainer:
  training:
    steps: 1000          # More training steps

group_size: 8            # More games per rollout
rollout_threads: 4       # Parallel rollout collection
```

Run from command line for serious training:

```bash
python -m apps.grpo.blackjack_main_fixed --config apps/grpo/blackjack.yaml
```

---

### 2. Explore Other Environments

Try different OpenSpiel games:
- `OPENSPIEL_GAME=tic_tac_toe`
- `OPENSPIEL_GAME=connect_four`
- `OPENSPIEL_GAME=go`

Explore other OpenEnv backends:
- Atari environments
- FinRL trading simulations
- Custom environments

---

### 3. Customize the Training

All the code is in `apps/grpo/grpo_utils.py`:
- Modify reward shaping in `BlackJackReward.evaluate_response()`
- Adjust advantage computation in `ComputeAdvantages.compute()`
- Tweak GRPO loss hyperparameters (beta, KL penalty)
- Change prompt formatting in `format_prompt()`
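To build intuition for what `beta` controls, here is a small sketch of a KL-penalized per-token objective, evaluated on-policy (where the probability ratio equals 1, so the policy term is just the advantage). It uses the non-negative KL estimator from the GRPO paper, `exp(ref - cur) - (ref - cur) - 1`. The function names are hypothetical, and the code in `grpo_utils.py` may be structured differently.

```python
import math


def kl_estimate(logprob, ref_logprob):
    """Per-token KL estimate between the current policy and the reference model."""
    log_ratio = ref_logprob - logprob
    return math.exp(log_ratio) - log_ratio - 1.0


def penalized_objective(advantage, logprob, ref_logprob, beta):
    """Advantage term minus a beta-weighted KL penalty (higher is better)."""
    return advantage - beta * kl_estimate(logprob, ref_logprob)


# The policy assigns 0.9 to a token that the reference model assigned 0.5.
cur, ref = math.log(0.9), math.log(0.5)
for beta in (0.0, 0.1, 1.0):
    print(beta, round(penalized_objective(1.0, cur, ref, beta), 4))
```

A larger `beta` keeps the policy closer to the reference model, at the cost of slower movement toward high-reward actions.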

---

## 📚 Resources

### OpenEnv
- 📦 [GitHub](https://github.com/meta-pytorch/OpenEnv) - Source code and examples
- 📖 [Spec Documentation](https://github.com/meta-pytorch/OpenEnv#spec) - Full API reference

### GRPO
- 📄 [Paper (arXiv:2402.03300)](https://arxiv.org/abs/2402.03300) - Original publication
- 🔬 [Blog Post](https://ai.meta.com/blog/grpo/) - High-level explanation

### Forge
- 🔧 [GitHub](https://github.com/meta-pytorch/torchforge) - PyTorch-native agentic RL
- 📖 [Docs](https://meta-pytorch.org/torchforge/) - Full documentation
- 💬 [Discussions](https://github.com/meta-pytorch/torchforge/discussions) - Community support

---

## 🎓 Key Takeaways

<div style='background: #d4edda; padding: 25px; border-radius: 10px; border-left: 5px solid #28a745; margin: 20px 0;'>
    <h3 style='color: #155724; margin-top: 0;'>What You Learned</h3>
    <ol style='color: #155724; font-size: 16px; line-height: 1.8;'>
        <li><b>OpenEnv is a universal spec</b> for RL environments - not just games, ANY interactive environment.</li>
        <li><b>One training loop works everywhere</b> - switch environments by changing a URL.</li>
        <li><b>Forge abstracts distributed RL complexity</b> - focus on algorithms, not infrastructure.</li>
        <li><b>GRPO enables stable LLM training</b> - group-relative advantages + KL penalties work.</li>
        <li><b>Production code is accessible</b> - this notebook uses the same code as large-scale training.</li>
    </ol>
</div>

---

<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 30px; border-radius: 10px; color: white; margin: 30px 0; text-align: center;'>
    <h2 style='margin-top: 0;'>🎉 Congratulations!</h2>
    <p style='font-size: 18px; line-height: 1.8;'>
        You just trained an LLM using production GRPO code.<br>
        You explored OpenEnv as a universal RL interface.<br>
        You saw how Forge abstracts distributed training complexity.
    </p>
    <p style='font-size: 20px; margin-top: 20px;'>
        <b>Now go train agents in ANY environment! 🚀</b>
    </p>
</div>