wch18/exp_RL

# RL for LLMs — Learning Repository

A structured 10-week learning journey from RL fundamentals to frontier agentic RL for large language models.

## Repository Structure

```
exp_RL/
├── docs/                          # Conceptual notes & Q&A insights
│   ├── 01_foundations/           # MDP, policy gradient, RLHF pipeline
│   ├── 02_algorithms/            # PPO, DPO, GRPO, DAPO deep dives
│   ├── 03_frameworks/            # TRL, verl, SGLang setup guides
│   ├── 04_agentic_rl/            # Multi-turn RL, tool use, POMDPs
│   ├── 05_code_rl/               # Code RL with verifiable rewards
│   └── 06_monitoring/            # W&B dashboards, health signals
├── experiments/                   # Hands-on experiments (Weeks 4–10)
│   ├── exp1_grpo_math/           # Hello World GRPO on GSM8K
│   ├── exp2_dapo_math/           # DAPO tricks ablation
│   ├── exp3_code_rl/             # Code RL with test feedback
│   └── exp4_agentic_rl/          # Multi-turn agentic RL
├── scripts/                       # Environment setup & data utils
├── reading_notes/                 # Paper reading tracker & notes
├── Learning_map.md                # Full learning roadmap (source doc)
└── Framework_discussion.md        # Framework comparison (source doc)
```

## 10-Week Roadmap

| Week | Focus | Key Deliverable | Docs | Experiment |
|------|-------|-----------------|------|------------|
| 1–2 | Conceptual foundations | Notes on RLHF/GRPO/DPO/DAPO | `docs/01_foundations/`, `docs/02_algorithms/` | |
| 3 | Framework setup | Working environment on H100s | `docs/03_frameworks/` | |
| 4 | GRPO on math | First RL training run + W&B dashboard | `docs/06_monitoring/` | `exp1_grpo_math/` |
| 5–6 | DAPO on math | GRPO vs DAPO comparison report | `docs/02_algorithms/dapo.md` | `exp2_dapo_math/` |
| 6–7 | Agentic RL deep dive | Literature notes | `docs/04_agentic_rl/` | |
| 7–8 | Code RL with test feedback | Code RL pipeline + pass@k curves | `docs/05_code_rl/` | `exp3_code_rl/` |
| 9–10 | Multi-turn agentic RL | Agentic RL prototype | `docs/04_agentic_rl/` | `exp4_agentic_rl/` |
| 10 | Synthesis | Final comparison report | | |
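Before the Week 4 run, it helps to internalize GRPO's core move: advantages are computed relative to a group of completions sampled for the same prompt, with no learned critic. A minimal sketch of the group-relative advantage (the function name and epsilon are illustrative, not from this repo's code):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: z-score each reward within its group.

    `rewards` holds the scalar rewards of all completions sampled for
    one prompt; completion i gets advantage (r_i - mean) / std, so
    completions are only compared against their own group.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With binary math rewards, a prompt whose group is all-correct or all-wrong yields zero advantage for every completion; that degenerate case is exactly what DAPO's dynamic sampling (ablated in `exp2_dapo_math/`) filters out.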

## Framework Strategy

- **Phase 1 (Weeks 1–5): TRL** — lowest barrier to entry, best docs, `GRPOTrainer` in ~50 lines.
- **Phase 2 (Weeks 5–10): verl + SGLang** — research flexibility, deterministic inference, multi-turn support.

See Framework_discussion.md for the full comparison.
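To make the Phase 1 claim concrete, here is a hedged sketch of what a minimal TRL GRPO script looks like. `GRPOTrainer` and `GRPOConfig` are real TRL classes, but the model choice, dataset handling, and reward rule below are illustrative assumptions (exact arguments vary across TRL versions), not this repo's `exp1_grpo_math/` code:

```python
import re

def exact_match_reward(completions, answer, **kwargs):
    """Rule-based reward: 1.0 if the last number in a completion equals
    the reference answer string, else 0.0. (TRL forwards extra dataset
    columns, like `answer` here, to reward functions as keyword args.)"""
    rewards = []
    for completion, ref in zip(completions, answer):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(ref) else 0.0)
    return rewards

def main():
    # Run on a GPU node with `trl` installed; not executed on import.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    # Note: GSM8K's raw `answer` column contains the full solution text;
    # in practice you'd map it to the bare final answer and a `prompt` column.
    train_dataset = load_dataset("openai/gsm8k", "main", split="train")
    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-1.5B-Instruct",  # illustrative model choice
        reward_funcs=exact_match_reward,
        args=GRPOConfig(output_dir="grpo-gsm8k", num_generations=8),
        train_dataset=train_dataset,
    )
    trainer.train()
```

The reward function is the part worth iterating on; the trainer handles sampling the completion groups, computing group-relative advantages, and the clipped policy update.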

## Quick Start

```bash
# 1. Set up the environment
bash scripts/setup_env.sh

# 2. Download datasets
bash scripts/download_data.sh

# 3. Run the first experiment (Week 4)
cd experiments/exp1_grpo_math
python train_grpo.py
```
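The Week 7–8 deliverable includes pass@k curves (`exp3_code_rl/`). These are best computed with the unbiased estimator from the Codex paper (Chen et al., 2021) rather than by literally drawing k samples per point; a small helper worth keeping around:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: given n sampled completions of which
    c passed the tests, the probability that at least one of k draws
    (without replacement) passes is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```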

## Hardware Assumption

All experiments are designed for 1–4× H100 80GB GPUs. See each experiment's README for specific GPU requirements.
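A back-of-envelope way to check whether a config fits: full-parameter training with Adam in mixed precision costs roughly 16 bytes per parameter (2 for bf16 weights, 2 for gradients, ~12 for fp32 master weights plus the two Adam moments), before activations and the rollout engine's KV cache. A rough helper (a heuristic using common rule-of-thumb byte counts, not an exact planner):

```python
def training_memory_gb(n_params_billions, weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Rough floor on GPU memory (GB) for full-parameter training.
    Ignores activations, KV cache, and allocator overhead."""
    per_param = weight_bytes + grad_bytes + optim_bytes
    return n_params_billions * 1e9 * per_param / 1e9

# A 7B model already needs ~112 GB just for weights/grads/optimizer state,
# i.e. more than one H100 80GB; sharding or LoRA changes the math, which is
# why each experiment's README states its own GPU requirement.
```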
