A structured 10-week learning journey from RL fundamentals to frontier agentic RL for large language models.
```
exp_RL/
├── docs/                      # Conceptual notes & Q&A insights
│   ├── 01_foundations/        # MDP, policy gradient, RLHF pipeline
│   ├── 02_algorithms/         # PPO, DPO, GRPO, DAPO deep dives
│   ├── 03_frameworks/         # TRL, verl, SGLang setup guides
│   ├── 04_agentic_rl/         # Multi-turn RL, tool use, POMDPs
│   ├── 05_code_rl/            # Code RL with verifiable rewards
│   └── 06_monitoring/         # W&B dashboards, health signals
├── experiments/               # Hands-on experiments (Weeks 4–10)
│   ├── exp1_grpo_math/        # Hello World GRPO on GSM8K
│   ├── exp2_dapo_math/        # DAPO tricks ablation
│   ├── exp3_code_rl/          # Code RL with test feedback
│   └── exp4_agentic_rl/       # Multi-turn agentic RL
├── scripts/                   # Environment setup & data utils
├── reading_notes/             # Paper reading tracker & notes
├── Learning_map.md            # Full learning roadmap (source doc)
└── Framework_discussion.md    # Framework comparison (source doc)
```
| Week | Focus | Key Deliverable | Docs | Experiment |
|---|---|---|---|---|
| 1–2 | Conceptual foundations | Notes on RLHF/GRPO/DPO/DAPO | docs/01_foundations/, docs/02_algorithms/ | — |
| 3 | Framework setup | Working environment on H100s | docs/03_frameworks/ | — |
| 4 | GRPO on math | First RL training run + W&B dashboard | docs/06_monitoring/ | exp1_grpo_math/ |
| 5–6 | DAPO on math | GRPO vs DAPO comparison report | docs/02_algorithms/dapo.md | exp2_dapo_math/ |
| 6–7 | Agentic RL deep dive | Literature notes | docs/04_agentic_rl/ | — |
| 7–8 | Code RL with test feedback | Code RL pipeline + pass@k curves | docs/05_code_rl/ | exp3_code_rl/ |
| 9–10 | Multi-turn agentic RL | Agentic RL prototype | docs/04_agentic_rl/ | exp4_agentic_rl/ |
| 10 | Synthesis | Final comparison report | — | — |
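Among the DAPO tricks ablated in exp2_dapo_math is "clip-higher": the PPO clip range is decoupled so the upper bound is looser than the lower one (the DAPO paper uses ε_low = 0.2, ε_high = 0.28), and the surrogate loss is averaged per token rather than per sequence. A self-contained sketch of that loss (function name and plain-list inputs are illustrative, not this repo's code):

```python
import math

def dapo_token_loss(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate with DAPO's decoupled ("clip-higher") range.

    logp_new, logp_old: per-token log-probs under the new and old policies.
    adv: per-token advantages. Returns the mean negative surrogate over tokens.
    """
    losses = []
    for ln_new, ln_old, a in zip(logp_new, logp_old, adv):
        ratio = math.exp(ln_new - ln_old)          # importance ratio per token
        clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
        losses.append(-min(ratio * a, clipped * a))  # pessimistic PPO-style bound
    return sum(losses) / len(losses)               # token-level mean, not per-sequence
```

With ε_high > ε_low, positive-advantage (correct) tokens can be upweighted more aggressively before clipping kicks in, which the DAPO paper argues helps keep entropy from collapsing.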
- Phase 1 (Weeks 1–5): TRL. Lowest barrier to entry, best documentation, a working `GRPOTrainer` in ~50 lines.
- Phase 2 (Weeks 5–10): verl + SGLang. Research flexibility, deterministic inference, multi-turn rollout support.

See Framework_discussion.md for the full comparison.
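For Phase 1, TRL's `GRPOTrainer` needs little more than a prompt dataset and a reward function; the reward function receives sampled completions plus dataset columns as keyword arguments and returns one score per completion. Below is a minimal sketch of a verifiable reward for GSM8K-style answers. The regex format, the `answer` column name, and the model name are assumptions for illustration, not this repo's code:

```python
import re

def gsm8k_reward(completions, answer, **kwargs):
    """Return 1.0 if the completion's final '#### <number>' matches the gold
    answer, else 0.0. `answer` is assumed to be the dataset's gold-answer column."""
    rewards = []
    for completion, gold in zip(completions, answer):
        match = re.search(r"####\s*(-?[\d,\.]+)", completion)
        pred = match.group(1).replace(",", "").rstrip(".") if match else None
        rewards.append(1.0 if pred == str(gold) else 0.0)
    return rewards

# Plugging into TRL (sketch; needs a GPU and a model download):
# from trl import GRPOTrainer, GRPOConfig
# trainer = GRPOTrainer(
#     model="Qwen/Qwen2.5-1.5B-Instruct",        # assumed model choice
#     reward_funcs=gsm8k_reward,
#     args=GRPOConfig(output_dir="grpo-gsm8k", num_generations=8),
#     train_dataset=dataset,                      # needs a "prompt" column
# )
# trainer.train()
```

Binary exact-match rewards like this are the simplest form of the "verifiable rewards" used throughout the math experiments.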
```
# 1. Set up environment
bash scripts/setup_env.sh

# 2. Download datasets
bash scripts/download_data.sh

# 3. Run first experiment (Week 4)
cd experiments/exp1_grpo_math
python train_grpo.py
```

All experiments are designed for 1–4× H100 80GB. See each experiment's README for specific GPU requirements.
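The pass@k curves deliverable in exp3_code_rl can be computed with the standard unbiased estimator from the Codex paper: sample n completions per problem, count the c that pass the tests, and estimate the probability that at least one of k draws passes. A small self-contained sketch (function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn, c of them passed the tests.
    Equals 1 - C(n-c, k) / C(n, k), the chance a size-k draw misses all failures."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so every k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k` over all problems at several values of k gives the curve; computing it from a single large n is much lower-variance than re-sampling k completions per point.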