A comprehensive showcase repository for state-of-the-art autonomous driving research, covering end-to-end planning, vision-language-action agents, world models, and reinforcement learning in simulation — built from scratch with real nuScenes data and pretrained model integration.
Vision-Language-Action agent driving in MetaDrive 3D simulation. The agent receives a first-person camera view + a language command, and outputs steering and throttle.
This repo demonstrates 5 different paradigms for autonomous driving, each with working code, training pipelines, and visualization:
| # | Module | Paradigm | Data Source | Status |
|---|---|---|---|---|
| 1 | End-to-End Planner | UniAD-style perception → planning | nuScenes mini | Trained, demo works |
| 2 | VLA Agent (real data) | DriveVLM + Qwen2.5-VL reasoning | nuScenes mini | Trained, demo works |
| 3 | World Model | BEV VAE + Vista pretrained | nuScenes mini + Vista | Working with Vista |
| 4 | RL Simulation | PPO/DQN in highway-env + MetaDrive | Synthetic | Trained, demo works |
| 5 | VLA RL Sim | Vision + Language + Action in 3D sim | Synthetic | Trained, demo + video |
git clone https://github.com/wusimo/ad-world-models.git
cd ad-world-models
pip install -e ".[dev]"
Then jump to any module below — each is self-contained.
A unified transformer that does detection → motion forecasting → planning in a single forward pass on multi-camera images.
Architecture:
6 Cameras → ResNet50 → Lift-Splat-Shoot BEV (256×200×200)
  → DETR Detection (300 queries)
  → Motion Forecasting (6 modes × 6 timesteps per agent)
  → Planning (collision-aware GRU → 6 waypoints)
Training: L1 loss on ego trajectory. The gradient flows back through all heads, giving even the detection module useful supervision from the planning objective.
Results: After training on nuScenes mini (~8 min on RTX 3090), the predicted trajectory (green) closely follows the ground truth (white) ~17m forward.
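To make the planning head concrete, here is a minimal sketch of the idea: a GRU unrolls one step per waypoint from a pooled BEV feature, and an L1 loss on the ego trajectory back-propagates into the BEV backbone. All names and dimensions here are illustrative, not the repo's actual modules.

```python
import torch
import torch.nn as nn

class TinyPlanner(nn.Module):
    """Hypothetical sketch: pooled BEV features -> GRU -> 6 waypoints."""
    def __init__(self, bev_channels=256, hidden=128, horizon=6):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # collapse the BEV grid
        self.proj = nn.Linear(bev_channels, hidden)
        self.gru = nn.GRUCell(hidden, hidden)        # one step per waypoint
        self.head = nn.Linear(hidden, 2)             # (x, y) offset per step
        self.horizon = horizon

    def forward(self, bev):                          # bev: (B, C, H, W)
        ctx = self.proj(self.pool(bev).flatten(1))   # (B, hidden)
        h = torch.zeros_like(ctx)
        waypoints = []
        for _ in range(self.horizon):
            h = self.gru(ctx, h)
            waypoints.append(self.head(h))
        return torch.stack(waypoints, dim=1)         # (B, 6, 2)

bev = torch.randn(2, 256, 200, 200, requires_grad=True)
gt = torch.randn(2, 6, 2)
pred = TinyPlanner()(bev)
loss = nn.functional.l1_loss(pred, gt)   # L1 loss on the ego trajectory
loss.backward()                          # gradient reaches the BEV features
print(pred.shape, bev.grad is not None)
```

Because the loss back-propagates through the shared BEV representation, upstream heads receive supervision from the planning objective, which is the core idea behind the planning-oriented design.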
# Download nuScenes mini (~4GB, requires free registration at nuscenes.org)
python scripts/download_nuscenes_mini.py --dataroot ./data/nuscenes
# Train (~8 min)
python scripts/train_e2e_planner.py
# Demo
python -m src.e2e_planner.demo
Output: outputs/e2e_planner/e2e_planner_output.png — 3 panels showing the GT scene with LiDAR + boxes, model detections + BEV heatmap, and a trajectory comparison.
📖 Reference: UniAD (CVPR 2023 Best Paper)
A Vision-Language-Action agent that uses Qwen2.5-VL-3B (a real pretrained VLM) for Chain-of-Thought scene reasoning, combined with a trained trajectory decoder for waypoint planning.
Three-panel demo:
- Left: Front camera input from nuScenes
- Center: Real Chain-of-Thought reasoning from Qwen2.5-VL β scene description, critical objects, behavior prediction, ego decision
- Right: GT trajectory (white) vs VLA-predicted trajectory (green, ~24m forward, matches GT)
Architecture:
Camera Image → Qwen2.5-VL-3B → CoT Reasoning (text)
                    ↓
Multi-Camera → BEV → Visual Projector → LM hidden state
                    ↓
Trajectory Decoder → 6 waypoints
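The fusion step above can be sketched in a few lines: a visual projector maps the BEV feature into the language model's hidden space, and a small decoder regresses waypoints from the fused state. Dimensions and the additive-fusion choice are illustrative assumptions, not the repo's exact design.

```python
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    """Hypothetical sketch: fuse a VLM hidden state with a pooled BEV
    feature and regress 6 (x, y) waypoints."""
    def __init__(self, lm_dim=2048, bev_dim=256, hidden=256, horizon=6):
        super().__init__()
        self.visual_projector = nn.Linear(bev_dim, lm_dim)  # BEV -> LM space
        self.decoder = nn.Sequential(
            nn.Linear(lm_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * 2),                 # (x, y) per step
        )
        self.horizon = horizon

    def forward(self, lm_hidden, bev_feat):
        fused = lm_hidden + self.visual_projector(bev_feat)  # additive fusion
        return self.decoder(fused).view(-1, self.horizon, 2)

lm_hidden = torch.randn(1, 2048)   # e.g. the VLM's last hidden state
bev_feat = torch.randn(1, 256)     # pooled multi-camera BEV feature
waypoints = TrajectoryDecoder()(lm_hidden, bev_feat)
print(waypoints.shape)             # (1, 6, 2)
```

Only the decoder is trained; the VLM stays frozen and contributes scene reasoning through its hidden state.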
Sample VLM output:
- SCENE: Multi-lane urban street with traffic lights, clear weather, modern buildings
- CRITICAL OBJECTS: White car approaching from left lane, black SUV from right, bus and truck further ahead
- PREDICTIONS: White car likely continues straight or turns left at intersection
- DECISION: Maintain current speed approaching intersection
# Train trajectory decoder (~5 min)
python scripts/train_vla_agent.py
# Demo (downloads Qwen2.5-VL-3B on first run, ~7GB)
python -m src.vla_agent.demo
📖 Reference: DriveVLM
Two world model implementations side-by-side: a small BEV-space VAE trained from scratch and the pretrained Vista (NeurIPS 2024) for photorealistic future prediction.
Each row shows a different driving action β the Vista model generates photorealistic future frames conditioned on the trajectory:
- Go Straight: Scene progresses forward, traffic and road markings evolve naturally
- Turn Left: Scene drifts rightward as ego turns left
- Turn Right: Scene drifts leftward as ego turns right
- Top: input front camera image
- Middle: our BEV world model — abstract 256-channel feature maps (the "star" pattern is the projection of 6 cameras into BEV space)
- Bottom: Vista's photorealistic pixel-space prediction
- Top row: real nuScenes ground-truth frames
- Bottom row: Vista's imagined future from a single input image
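The from-scratch BEV world model is a small VAE over BEV feature maps. A minimal sketch of that component (illustrative channel and latent sizes, not the repo's config):

```python
import torch
import torch.nn as nn

class BEVVAE(nn.Module):
    """Hypothetical sketch of a small VAE over 256-channel BEV maps."""
    def __init__(self, in_ch=256, latent=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 2 * latent, 4, stride=2, padding=1),  # mu, logvar
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(latent, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, in_ch, 4, stride=2, padding=1),
        )

    def forward(self, bev):
        mu, logvar = self.enc(bev).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.dec(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return recon, kl

bev = torch.randn(1, 256, 200, 200)
recon, kl = BEVVAE()(bev)
print(recon.shape)  # reconstruction matches the input BEV shape
```

Training minimizes reconstruction error plus the KL term; a temporal model can then roll the latent forward to predict future BEV states.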
# Train BEV world model (~3 min)
python scripts/precompute_bev.py
python scripts/train_world_model.py
# Set up Vista (clone repo + download 9.4GB weights)
cd ..
git clone https://github.com/OpenDriveLab/Vista.git
python -c "from huggingface_hub import hf_hub_download; hf_hub_download('OpenDriveLab/Vista', 'vista.safetensors', local_dir='Vista/ckpts')"
# Run Vista for action scenarios (each ~5 min on 24GB VRAM)
cd ad-world-models
for action in free straight left right; do
python scripts/run_vista_single.py --action $action \
--output_dir outputs/vista_scenarios/$action --n_steps 5
done
# Run combined demo
python -m src.world_model.demo
📖 Reference: Vista (NeurIPS 2024), GenAD (CVPR 2024)
Two simulators with PPO/DQN training and visualization:
- Top row: random agent crashes early (red ego vehicle)
- Bottom row: trained DQN agent stays in lane through 60+ steps
- Reward: 7.7 (random) → 46.8 (trained)
T-intersection with cross traffic. Random agent crashes immediately (0 reward), trained PPO agent navigates through safely (4.0 reward).
3D top-down rendering showing ego (yellow) navigating through traffic (green vehicles) on a multi-lane highway. PPO with CnnPolicy trained on bird's-eye observation. Reward: 9.8 → 135.9 (50K steps).
# highway-env (lightweight, all 4 scenarios)
python -m src.rl_sim.train --scenario all
SDL_VIDEODRIVER=offscreen python -m src.rl_sim.demo --scenario all
# MetaDrive (3D, more realistic)
python -m src.rl_sim.metadrive_train --scenario highway --timesteps 50000
python -m src.rl_sim.metadrive_demo --scenario highway
📖 References: highway-env, MetaDrive
Combining vision (3D first-person camera) + language (driving commands) + action (steering/throttle) — the VLA agent is trained in MetaDrive's 3D world via Imitation Learning + PPO RL.
- 3D Camera View (green border): What the VLA sees β first-person camera with road, lanes, sky, mountains, traffic
- Top-Down Context (blue border): Ego (green) and surrounding vehicles in the world
- Language Command: "Drive forward and maintain your lane."
- VLA Action: Steering wheel (rotated by steer angle) + throttle gauge (height = acceleration)
- Curves: Cumulative reward + steering/throttle history
Selected frames from the auto-generated video showing the VLA agent driving through 200 steps. The full MP4 (~6MB) and GIF (~22MB) are generated by src/vla_sim/video.py.
Architecture:
3D RGB Camera (180×320) → CNN Encoder (256d)
                               ↓
Language Command (id) → Embedding (64d)
                               ↓
Concatenate → MLP → Action (steer, accel)
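This architecture is small enough to sketch end to end: a CNN encodes the camera frame, an embedding table encodes the command id, and an MLP maps the concatenation to a bounded action. Layer sizes beyond the stated 256d/64d dimensions are illustrative guesses.

```python
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    """Hypothetical sketch of the 3D VLA policy: CNN image encoder plus
    a command embedding, fused by an MLP into (steer, accel)."""
    def __init__(self, n_commands=4, img_dim=256, cmd_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, img_dim), nn.ReLU(),
        )
        self.cmd = nn.Embedding(n_commands, cmd_dim)
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + cmd_dim, 128), nn.ReLU(),
            nn.Linear(128, 2), nn.Tanh(),       # steer, accel in [-1, 1]
        )

    def forward(self, image, command_id):
        feat = torch.cat([self.cnn(image), self.cmd(command_id)], dim=-1)
        return self.mlp(feat)

image = torch.randn(1, 3, 180, 320)   # first-person RGB frame
command = torch.tensor([0])           # e.g. "maintain lane"
action = VLAPolicy()(image, command)
print(action.shape)                   # (1, 2)
```

The `Tanh` output keeps both steering and acceleration in MetaDrive's continuous action range.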
Training pipeline:
- IL from IDM lane-keeping expert (~20 episodes, ~3000 samples)
- 20 epochs of imitation learning (loss 0.0377 → 0.0009)
- Result: reward 66.4 ± 19.9 in highway scenario
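The imitation-learning stage is plain behavior cloning: regress the IDM expert's (steer, accel) with a regression loss over the collected samples. A minimal sketch with placeholder data (the `policy` here stands in for the VLA model above, and the random tensors stand in for the ~3000 expert samples):

```python
import torch
import torch.nn as nn

# Stand-in policy and placeholder expert data, illustration only
policy = nn.Sequential(nn.Linear(320, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

obs = torch.randn(3000, 320)                  # flattened observation features
expert_actions = torch.tanh(torch.randn(3000, 2))  # expert (steer, accel)

for epoch in range(20):                       # 20 epochs of imitation learning
    loss = nn.functional.mse_loss(policy(obs), expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(round(loss.item(), 4))
```

PPO fine-tuning then starts from these cloned weights instead of a random policy, which is what makes the RL stage converge quickly.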
# Train (3 min on RTX 3090)
python scripts/train_vla_3d.py
# Static demo image
python -m src.vla_sim.demo_3d
# Generate MP4 + GIF video
python -m src.vla_sim.video --num_episodes 3 --max_steps 200 --fps 20
# Watch the trained VLA agent drive (opens a 3D Panda3D window)
python scripts/interactive_metadrive.py --mode vla --map SSS
# Drive yourself with W/A/S/D
python scripts/interactive_metadrive.py --mode manual --map SSS
# Watch the IDM expert
python scripts/interactive_metadrive.py --mode idm --map SSS
Available maps: SSS (highway), SCrCSC (curved), O (roundabout), XOX (intersection), 3 or 5 (procedural cities). Requires a display server.
For faster training, we also provide a 2D top-down BEV variant (no first-person camera). Trained with IL + PPO RL fine-tuning, this achieves reward 141.0 vs 5.8 random.
# Train 2D version (IL + RL, ~5 min)
python -m src.vla_sim.train --mode il+rl
python -m src.vla_sim.demo
ad-world-models/
├── configs/                      # YAML configs for each model
├── scripts/                      # Training and data scripts
│   ├── download_nuscenes_mini.py
│   ├── precompute_bev.py         # BEV cache for world model training
│   ├── train_e2e_planner.py
│   ├── train_world_model.py
│   ├── train_vla_agent.py
│   ├── train_vla_3d.py           # 3D camera VLA training
│   ├── run_vista_single.py       # Vista per-scenario inference
│   ├── interactive_metadrive.py  # 3D GUI for VLA / manual driving
│   └── demo_vista.py
├── src/
│   ├── data/                     # nuScenes loader + Lift-Splat-Shoot BEV
│   ├── e2e_planner/              # UniAD-style planner (17M params)
│   ├── vla_agent/                # DriveVLM-style with Qwen2.5-VL
│   ├── world_model/              # BEV VAE + temporal transformer
│   ├── visualization/            # BEV / trajectory rendering
│   ├── rl_sim/
│   │   ├── train.py / demo.py    # highway-env training
│   │   ├── metadrive_train.py    # MetaDrive 3D PPO training
│   │   └── metadrive_demo.py
│   └── vla_sim/
│       ├── env.py / env_3d.py    # Language-conditioned environments
│       ├── model.py / train_3d.py # VLA models (2D BEV + 3D camera)
│       ├── train.py              # IL + RL training
│       ├── demo.py / demo_3d.py  # Visualization demos
│       └── video.py              # MP4 / GIF generator
├── assets/                       # README images
└── notebooks/
    └── full_pipeline_demo.ipynb  # Interactive walkthrough
| Component | Need |
|---|---|
| Python | ≥ 3.9 |
| PyTorch | ≥ 2.1 with CUDA |
| GPU | 8GB+ for training, 24GB for Vista / VLA with Qwen2.5-VL |
| Disk | ~4GB nuScenes mini, +9.4GB Vista weights, +7GB Qwen2.5-VL |
Tested on RTX 3090 (24GB VRAM).
| Module | Metric | Random | Trained | Improvement |
|---|---|---|---|---|
| E2E Planner | Trajectory L1 (m) | 17.5 | ~3 | 5.8× |
| VLA Agent | Trajectory match | random | 23.7m vs 17.5m GT | — |
| World Model VAE | BEV reconstruction corr | 0.0 | 0.925 | — |
| highway-env DQN | Episode reward | 7.7 | 46.8 | 6.1× |
| highway-env intersection PPO | Episode reward | 0.0 | 4.0 | — |
| MetaDrive PPO | Episode reward | 1.9 | 135.9 | 71× |
| VLA 3D (IL) | Episode reward | ~5 | 66.4 | 13× |
| VLA 2D (IL+RL) | Episode reward | 5.8 | 141.0 | 24× |
- UniAD — "Planning-oriented Autonomous Driving" (CVPR 2023 Best Paper)
- DriveVLM — "The Convergence of Autonomous Driving and Large VLMs"
- Vista — "A Generalizable Driving World Model" (NeurIPS 2024)
- GenAD — "Generalized Predictive Model for Autonomous Driving" (CVPR 2024)
- GAIA-1 — "A Generative World Model for Autonomous Driving" (Wayve)
- Lift-Splat-Shoot — "Encoding Images from Arbitrary Camera Rigs" (ECCV 2020)
- Qwen2.5-VL — Alibaba's open-source vision-language model
- OpenDriveLab/UniAD
- OpenDriveLab/Vista
- hustvl/VAD
- Farama-Foundation/HighwayEnv
- metadriverse/metadrive
- DLR-RM/stable-baselines3
- Qwen/Qwen2.5-VL
MIT License — for research and educational purposes.