wch18/exp_RL

# RL for LLMs — Learning Repository

A structured 10-week learning journey from RL fundamentals to frontier agentic RL for large language models.

## Repository Structure

```
exp_RL/
├── docs/                          # Conceptual notes & Q&A insights
│   ├── 01_foundations/           # MDP, policy gradient, RLHF pipeline
│   ├── 02_algorithms/            # PPO, DPO, GRPO, DAPO deep dives
│   ├── 03_frameworks/            # TRL, verl, SGLang setup guides
│   ├── 04_agentic_rl/            # Multi-turn RL, tool use, POMDPs
│   ├── 05_code_rl/               # Code RL with verifiable rewards
│   └── 06_monitoring/            # W&B dashboards, health signals
├── experiments/                   # Hands-on experiments (Weeks 4–10)
│   ├── exp1_grpo_math/           # Hello World GRPO on GSM8K
│   ├── exp2_dapo_math/           # DAPO tricks ablation
│   ├── exp3_code_rl/             # Code RL with test feedback
│   └── exp4_agentic_rl/          # Multi-turn agentic RL
├── scripts/                       # Environment setup & data utils
├── reading_notes/                 # Paper reading tracker & notes
├── Learning_map.md                # Full learning roadmap (source doc)
└── Framework_discussion.md        # Framework comparison (source doc)
```

## 10-Week Roadmap

| Week | Focus | Key Deliverable | Docs | Experiment |
|------|-------|-----------------|------|------------|
| 1–2 | Conceptual foundations | Notes on RLHF/GRPO/DPO/DAPO | `docs/01_foundations/`, `docs/02_algorithms/` | |
| 3 | Framework setup | Working environment on H100s | `docs/03_frameworks/` | |
| 4 | GRPO on math | First RL training run + W&B dashboard | `docs/06_monitoring/` | `exp1_grpo_math/` |
| 5–6 | DAPO on math | GRPO vs DAPO comparison report | `docs/02_algorithms/dapo.md` | `exp2_dapo_math/` |
| 6–7 | Agentic RL deep dive | Literature notes | `docs/04_agentic_rl/` | |
| 7–8 | Code RL with test feedback | Code RL pipeline + pass@k curves | `docs/05_code_rl/` | `exp3_code_rl/` |
| 9–10 | Multi-turn agentic RL | Agentic RL prototype | `docs/04_agentic_rl/` | `exp4_agentic_rl/` |
| 10 | Synthesis | Final comparison report | | |
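Before the Week 4 run, it helps to internalize GRPO's core move: advantages are computed relative to a group of completions sampled for the same prompt, with no learned critic. A minimal sketch of the group-relative advantage (the function name and epsilon are illustrative, not from this repo's code):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: z-score each reward within its group.

    `rewards` holds the scalar rewards of all completions sampled for
    one prompt; completion i gets advantage (r_i - mean) / std, so
    completions are only compared against their own group.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With binary math rewards, a prompt whose group is all-correct or all-wrong yields zero advantage for every completion; that degenerate case is exactly what DAPO's dynamic sampling (ablated in `exp2_dapo_math/`) filters out.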

## Framework Strategy

- **Phase 1 (Weeks 1–5): TRL** — lowest barrier to entry, best docs, `GRPOTrainer` in ~50 lines.
- **Phase 2 (Weeks 5–10): verl + SGLang** — research flexibility, deterministic inference, multi-turn support.

See Framework_discussion.md for the full comparison.
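To make the Phase 1 claim concrete, here is a hedged sketch of what a minimal TRL GRPO script looks like. `GRPOTrainer` and `GRPOConfig` are real TRL classes, but the model choice, dataset handling, and reward rule below are illustrative assumptions (exact arguments vary across TRL versions), not this repo's `exp1_grpo_math/` code:

```python
import re

def exact_match_reward(completions, answer, **kwargs):
    """Rule-based reward: 1.0 if the last number in a completion equals
    the reference answer string, else 0.0. (TRL forwards extra dataset
    columns, like `answer` here, to reward functions as keyword args.)"""
    rewards = []
    for completion, ref in zip(completions, answer):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(ref) else 0.0)
    return rewards

def main():
    # Run on a GPU node with `trl` installed; not executed on import.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    # Note: GSM8K's raw `answer` column contains the full solution text;
    # in practice you'd map it to the bare final answer and a `prompt` column.
    train_dataset = load_dataset("openai/gsm8k", "main", split="train")
    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-1.5B-Instruct",  # illustrative model choice
        reward_funcs=exact_match_reward,
        args=GRPOConfig(output_dir="grpo-gsm8k", num_generations=8),
        train_dataset=train_dataset,
    )
    trainer.train()
```

The reward function is the part worth iterating on; the trainer handles sampling the completion groups, computing group-relative advantages, and the clipped policy update.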

## Quick Start

```bash
# 1. Set up the environment
bash scripts/setup_env.sh

# 2. Download datasets
bash scripts/download_data.sh

# 3. Run the first experiment (Week 4)
cd experiments/exp1_grpo_math
python train_grpo.py
```
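The Week 7–8 deliverable includes pass@k curves (`exp3_code_rl/`). These are best computed with the unbiased estimator from the Codex paper (Chen et al., 2021) rather than by literally drawing k samples per point; a small helper worth keeping around:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: given n sampled completions of which
    c passed the tests, the probability that at least one of k draws
    (without replacement) passes is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```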

## Hardware Assumption

All experiments are designed for 1–4× H100 80GB GPUs. See each experiment's README for specific GPU requirements.
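A back-of-envelope way to check whether a config fits: full-parameter training with Adam in mixed precision costs roughly 16 bytes per parameter (2 for bf16 weights, 2 for gradients, ~12 for fp32 master weights plus the two Adam moments), before activations and the rollout engine's KV cache. A rough helper (a heuristic using common rule-of-thumb byte counts, not an exact planner):

```python
def training_memory_gb(n_params_billions, weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Rough floor on GPU memory (GB) for full-parameter training.
    Ignores activations, KV cache, and allocator overhead."""
    per_param = weight_bytes + grad_bytes + optim_bytes
    return n_params_billions * 1e9 * per_param / 1e9

# A 7B model already needs ~112 GB just for weights/grads/optimizer state,
# i.e. more than one H100 80GB; sharding or LoRA changes the math, which is
# why each experiment's README states its own GPU requirement.
```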
