vibekernels/slam-hands

Experimental annotation pipeline for egocentric robot learning videos. Produces LeRobot v3.0 datasets following the EgoVerse format with:

  • SLAM camera poses (7-DoF: translation + quaternion) via CUDA DROID-SLAM
  • 3D hand poses (21 MANO keypoints per hand) via WiLoR
  • Audio transcription (word-level timestamps) via NVIDIA Parakeet
  • Video conversion (iPhone HEVC HDR to SDR MP4)

Quick start

# Recommended: uses optimized settings (~20s for 62s of 1080p30 on RTX 5090)
./annotate.sh /path/to/video.mov

# Or with explicit output directory:
./annotate.sh /path/to/video.mov /path/to/output

# Visualize results
python3 visualizer.py /path/to/output --port 8888
# Open http://localhost:8888 in a browser

# Or upload a video through the browser:
python3 visualizer.py --port 8888
# Open http://localhost:8888 and drag-and-drop a video

# Service mode: keep models warm for fast repeated processing
python3 visualizer.py --service --port 8888
# First video: ~23s, subsequent: ~20s (vs ~39s cold start each time)

# Or use the service programmatically:
python3 pipeline_service.py --listen
# Send JSON jobs on stdin: {"video_path": "/path/to/video.mov", "output_dir": "/path/to/output"}
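
The stdin job protocol above can be sketched in Python. Only the two fields shown in the README (`video_path`, `output_dir`) are assumed, and the demo writes to an in-memory stream rather than a live service:

```python
import io
import json

def make_job(video_path, output_dir):
    """Build one newline-delimited JSON job for `pipeline_service.py --listen`.

    Only the two fields documented above are assumed; any extra fields the
    service may accept are not modeled here."""
    return json.dumps({"video_path": video_path, "output_dir": output_dir}) + "\n"

def submit(stream, jobs):
    """Write jobs to the service's stdin (here: any writable text stream)."""
    for video, out in jobs:
        stream.write(make_job(video, out))
    stream.flush()

# Demo against an in-memory stream; in practice `stream` would be
# subprocess.Popen([...], stdin=subprocess.PIPE, text=True).stdin
buf = io.StringIO()
submit(buf, [("/path/to/a.mov", "/tmp/a_out"), ("/path/to/b.mov", "/tmp/b_out")])
lines = buf.getvalue().splitlines()
print([json.loads(l)["video_path"] for l in lines])   # ['/path/to/a.mov', '/path/to/b.mov']
```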

Pipeline options

positional arguments:
  input                    Input video file (e.g., IMG_1443.MOV)

options:
  -o, --output-dir         Output dataset directory (default: <input>_lerobot/)
  --skip-video-convert     Skip video format conversion
  --skip-slam              Skip SLAM camera pose estimation
  --skip-hands             Skip hand pose estimation
  --skip-audio             Skip audio transcription
  --asr-model MODEL        NeMo ASR model name (default: parakeet-tdt_ctc-110m)
  --hand-stride N          Process every Nth frame for hands (default: 2)
  --hand-det-conf F        YOLO hand detection confidence threshold (default: 0.3)
  --fast-traj              Use fewer SLAM backend steps for faster trajectory
  --slam-backend-steps N M Backend optimization steps, two values (default: 3 5)
  --device DEVICE          Torch device (default: cuda)

Output format

output_dir/
  meta/
    info.json                                    # LeRobot v3.0 metadata
    audio.json                                   # Full transcript + word/segment timestamps
    tasks.parquet
    episodes/chunk-000/file-000.parquet
  data/
    chunk-000/file-000.parquet                   # Per-frame annotations
  videos/
    observation.video/chunk-000/file-000.mp4     # Converted video

The parquet contains per-frame columns:

Column                                      Shape  Description
observation.slam.pose                           7  Camera pose [tx, ty, tz, qx, qy, qz, qw]
observation.hand.{left,right}.keypoints_2d     42  21 joints x 2 (pixel coords)
observation.hand.{left,right}.keypoints_3d     63  21 joints x 3 (camera frame, meters)
observation.hand.{left,right}.detected          1  Detection flag (0 or 1)
observation.audio.transcript                    1  Active transcript segment (string)
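
Decoding one row of these columns takes a reshape and a quaternion-to-matrix conversion. The sketch below uses a synthetic row (loading via pyarrow is shown in a comment), assumes the quaternion is unit-norm, and does not assume a camera-to-world vs world-to-camera convention:

```python
import numpy as np

# One row of data/chunk-000/file-000.parquet, e.g. loaded with
#   import pyarrow.parquet as pq
#   row = pq.read_table(path).to_pylist()[frame_idx]
# A synthetic row stands in here for illustration.
row = {
    "observation.slam.pose": [0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 1.0],  # identity rotation
    "observation.hand.right.keypoints_3d": [0.0] * 63,
    "observation.hand.right.detected": 1,
}

def pose_to_matrix(pose7):
    """[tx, ty, tz, qx, qy, qz, qw] -> 4x4 rigid transform (unit quaternion assumed)."""
    tx, ty, tz, qx, qy, qz, qw = pose7
    # Standard quaternion-to-rotation-matrix formula
    R = np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = [tx, ty, tz]
    return T

T = pose_to_matrix(row["observation.slam.pose"])
joints = np.asarray(row["observation.hand.right.keypoints_3d"]).reshape(21, 3)
print(T[:3, 3], joints.shape)   # [0.1 0.2 0.3] (21, 3)
```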

Installation

System requirements

  • Linux (tested on Ubuntu 22.04+)
  • NVIDIA GPU with 11+ GB VRAM (sm_80+: Ampere, Ada Lovelace, Blackwell)
  • CUDA 12.1+ with cuDNN, cuBLAS, cuSOLVER
  • Python 3.10+
  • FFmpeg 6.0+ with ffprobe (built with --enable-cuvid for NVDEC hardware decode)

System packages

sudo apt-get update
sudo apt-get install -y \
  build-essential cmake nasm git \
  ffmpeg \
  libavformat-dev libavcodec-dev libswscale-dev libavutil-dev \
  libsuitesparse-dev

Python dependencies

pip install --upgrade pip

# PyTorch (match your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# or for CUDA 12.8:
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Pipeline dependencies
pip install pyarrow numpy opencv-python scipy tqdm pybind11

CUDA DROID-SLAM

The SLAM module is a standalone CUDA binary with no PyTorch dependency:

cd robot-video/cuda_slam

# Export weights from a DROID-SLAM checkpoint (one-time)
python3 export_weights.py --checkpoint /path/to/droid.pth --out data

# Build the binary (requires CUDA 12.1+, cuDNN, cuBLAS, cuSOLVER, FFmpeg dev headers)
make

The DROID-SLAM checkpoint (droid.pth, ~16 MB) can be obtained from the original repo. Weights are exported to cuda_slam/data/weights/ as raw float32 tensors.
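
Since the exported weights are raw float32 tensors with no header, they can be loaded back with a plain `np.fromfile` once the shape is known. The file name and shape below are illustrative, not the repo's actual layout:

```python
import numpy as np

# Hypothetical round-trip: dump a tensor the same way the export scripts do
# (raw little-endian float32, no header) and read it back.
w = np.arange(12, dtype=np.float32).reshape(3, 4)
w.tofile("/tmp/example_weight.bin")

# The reader must know the dtype and shape out-of-band (e.g. from the export script).
loaded = np.fromfile("/tmp/example_weight.bin", dtype=np.float32).reshape(3, 4)
print(np.array_equal(loaded, w))   # True
```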

The binary supports NVDEC hardware video decode (--video <path> --resize <h> <w>) which decodes and resizes frames entirely on the GPU, or stdin mode (--stdin <h> <w>) for piped input.

CUDA Hand Pose

The hand pose module is a standalone CUDA binary (YOLO detection + WiLoR ViT + MANO) with no PyTorch dependency at inference:

cd robot-video/cuda_hand

# Export weights from pretrained models (one-time, requires PyTorch)
python3 export_weights.py

# Build the binary (requires CUDA 12.1+, cuDNN, cuBLAS, FFmpeg dev headers)
make

The export script requires the original model weights:

# YOLO hand detector + WiLoR checkpoint
mkdir -p pretrained_models
wget https://huggingface.co/spaces/rolpotamias/WiLoR/resolve/main/pretrained_models/detector.pt -P pretrained_models/
wget https://huggingface.co/spaces/rolpotamias/WiLoR/resolve/main/pretrained_models/wilor_final.ckpt -P pretrained_models/

# MANO hand model (requires registration at https://mano.is.tue.mpg.de)
mkdir -p mano_data
cp /path/to/MANO_RIGHT.pkl mano_data/

Weights are exported to cuda_hand/data/weights/ as raw binary tensors. After export, PyTorch is only needed for SLAM weight export — the hand pipeline runs entirely in CUDA.

Combined Pipeline (optional, faster)

A single CUDA binary that runs both SLAM and hand processing with shared NVDEC decode, eliminating GPU context switching and process overhead:

cd robot-video/cuda_pipeline
make

Requires the same weights as the individual SLAM and hand binaries above. The combined pipeline is automatically detected and used by pipeline_service.py when present. Falls back to separate binaries if not built.

NeMo (Parakeet ASR)

pip install "nemo_toolkit[asr]"

The Parakeet model (parakeet-tdt_ctc-110m, ~110M parameters, ~440 MB) is downloaded automatically on first use. If the video has no audio stream, transcription is skipped automatically.

Native video decoder (optional, faster)

Builds a GIL-free C++ video decoder for concurrent frame extraction:

cd robot-video/native_decode
pip install -e .

Requires the FFmpeg dev headers installed above. Falls back to OpenCV if not available.

Verify installation

./cuda_slam/cuda_droid --help 2>&1 | head -1 && echo "CUDA DROID-SLAM OK"
./cuda_hand/cuda_hand --help 2>&1 | head -1 && echo "CUDA Hand Pose OK"
./cuda_pipeline/cuda_pipeline --help 2>&1 | head -1 && echo "Combined Pipeline OK"
python3 -c "import nemo.collections.asr; print('NeMo ASR OK')"
python3 -c "import native_decode; print('native_decode OK')"

Visualizer

Browser-based tool for inspecting output datasets. Shows the video with 2D hand keypoint overlay and a Three.js 3D scene with camera trajectory and hand skeletons.

# View an existing dataset
python3 visualizer.py /path/to/output_dataset --port 8888

# Upload mode: drag-and-drop a video in the browser to process + visualize
python3 visualizer.py --port 8888

Controls:

  • Space: play/pause
  • Arrow keys: step 1 frame
  • Shift + arrows: step 10 frames
  • Mouse drag on 3D panel: orbit camera
  • Scroll on 3D panel: zoom
  • Checkboxes: toggle left/right hands, bones, 3D hands, camera follow

Performance

Benchmarked on 1829 frames (61s) of 1920x1080 30fps iPhone video, RTX 5090:

Phase                                Time    Notes
SLAM + hand pose (combined)          ~17.3s  Single CUDA binary, single-pass decode, interleaved SLAM + hands
   YOLO hand detection               ~1.6s   YOLOv8m FP16, 8-frame batches
   WiLoR hand pose                   ~7.0s   ViT-H FP16 + MANO + RefineNet
SLAM backend optimization            ~0.5s   GPU Schur complement + distance-based edge selection
Audio transcription                  ~8s     Parakeet on CPU, fully overlapped
Video conversion (h264_nvenc)        ~3s     NVENC runs concurrently (dedicated HW encoder)
Dataset assembly + Python overhead   ~2.5s   Parquet + info.json + result parsing
Total                                ~20s    Service mode (warm), combined binary, all outputs included

Key optimizations:

  1. Combined CUDA binary (cuda_pipeline) — SLAM and hand processing run in a single process with shared NVDEC video decode. Single-pass architecture: each frame is decoded once, downscaled for SLAM and converted at full resolution for hand processing in the same iteration. Eliminates redundant video decode and GPU context switching. In service mode, SLAM state resets between videos without reloading weights (~2s savings per video).
  2. CUDA DROID-SLAM — Pure CUDA reimplementation of DROID-SLAM (cuDNN convolutions, cuBLAS correlation, GPU bundle adjustment). No PyTorch dependency at inference. Matches PyTorch DROID-SLAM accuracy on TUM RGB-D benchmarks (ATE RMSE 0.021m on fr1_desk, 0.077m on fr1_room). FP16 tensor core encoders (fnet/cnet) and FP16 correlation via cublasGemmEx/cublasGemmBatchedEx with FP32 accumulation. GPU Schur complement assembly (EEt/Ev kernels with CSR index structures) keeps BA at 0.5ms/iteration — only the small Cholesky solve runs on CPU. Backend uses distance-based co-visibility edges for loop closure, matching the original implementation. NVDEC hardware video decode with GPU resize eliminates CPU decode entirely. A sliding frontend window (default 15 keyframes) prevents quadratic BA cost scaling on long videos.
  3. CUDA hand pose — Pure CUDA reimplementation of the full hand pose pipeline: YOLOv8m detection, WiLoR ViT-H, MANO skinning, and RefineNet. No PyTorch dependency at inference. YOLO runs in FP16 with tensor cores and 8-frame batching. WiLoR ViT uses FP16 GEMMs with FP32 accumulation via cublasGemmEx. GPU-side ViT output decode (cuBLAS strided batched GEMMs) eliminates 103MB D2H transfer per batch. All GPU buffers pre-allocated at startup; MANO and RefineNet weights cached on host.
  4. NVDEC hardware video decode — Single NVDEC decode shared between SLAM and hand processing. Decoded frames arrive directly in GPU memory (P010/NV12 format), converted to BGR by custom CUDA kernels at both SLAM resolution and full resolution in the same iteration. Eliminates CPU decode and host-to-device transfer entirely.
  5. Concurrent NVENC video conversion — NVENC uses a dedicated hardware encoder separate from CUDA cores, so it runs concurrently with SLAM and hand detection. Started at the beginning of the pipeline rather than waiting for inference to complete.
  6. Service mode: pipeline_service.py keeps models loaded across videos via a persistent cuda_pipeline --listen worker. One-time ~9s startup (WiLoR weight loading), then each video reuses warm GPU buffers. SLAM state resets in-place without reallocating memory or reloading weights. Use the --service flag with the visualizer or --listen for batch processing.

Pipeline timing (service mode, RTX 5090)

  Model loading:    10.9s  (one-time)
  Run 1 (cold):     23.1s
  Run 2 (warm):     20.1s
  Run 3 (warm):     20.9s

SLAM accuracy (TUM RGB-D benchmark)

CUDA reimplementation matches PyTorch DROID-SLAM:

Sequence  CUDA    PyTorch  Paper
fr1_desk  0.021m  0.021m   0.018m
fr1_room  0.077m  0.072m   0.047m

ATE RMSE after Sim(3) alignment. Both implementations use the same weights (droid.pth), motion threshold (2.5), and backend parameters (distance-based co-visibility edges, radius 2, nms 3, threshold 22.0).
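
For reference, ATE RMSE after Sim(3) alignment is the standard Umeyama procedure; this numpy sketch reproduces the metric on synthetic trajectories and is not the repo's evaluation script:

```python
import numpy as np

def sim3_align(est, gt):
    """Umeyama alignment: scale s, rotation R, translation t minimizing
    ||gt - (s * R @ est + t)||. est, gt: (N, 3) trajectories."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                       # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    s, R, t = sim3_align(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(1).mean()))

# Sanity check: a scaled/rotated/shifted copy aligns back to ~0 error
rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 3))
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
est = (2.0 * (Rz @ gt.T)).T + np.array([1.0, -2.0, 0.5])
print(round(ate_rmse(est, gt), 6))   # 0.0
```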

Bundle adjustment performance

GPU Schur complement assembly vs previous CPU implementation (1812 BA iterations on fr1_desk):

Phase              GPU           CPU (previous)  Speedup
Jacobian           0.10 ms/iter  0.10 ms/iter
Schur complement   0.28 ms/iter  36.9 ms/iter    132x
Cholesky solve     0.01 ms/iter  0.01 ms/iter
Back-substitution  0.09 ms/iter  3.7 ms/iter     42x
Total BA           0.9s          74s             81x
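
The Schur-complement trick behind these numbers is easy to show on a toy dense system: eliminate the cheaply invertible depth block, solve the small pose system, then back-substitute. The real code assembles these blocks on the GPU with sparse CSR structures; dense toy matrices stand in here:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pose, n_depth = 12, 40                     # toy sizes; real BA: 6 DoF/keyframe + per-pixel depths
A = rng.normal(size=(n_pose + n_depth, n_pose + n_depth))
H = A @ A.T + (n_pose + n_depth) * np.eye(n_pose + n_depth)   # SPD Hessian
b = rng.normal(size=n_pose + n_depth)

B = H[:n_pose, :n_pose]                      # pose-pose block
E = H[:n_pose, n_pose:]                      # pose-depth block
C = H[n_pose:, n_pose:]                      # depth-depth block (diagonal in DROID; dense here)
v, w = b[:n_pose], b[n_pose:]

Cinv = np.linalg.inv(C)                      # cheap when C is (block-)diagonal
S = B - E @ Cinv @ E.T                       # Schur complement: poses only
x = np.linalg.solve(S, v - E @ Cinv @ w)     # the small Cholesky-sized solve
z = Cinv @ (w - E.T @ x)                     # back-substitution for depths

full = np.linalg.solve(H, b)                 # direct solve for comparison
print(np.allclose(np.concatenate([x, z]), full))   # True
```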

Video conversion (standalone)

The video converter can also be used independently:

# Fastest (requires NVIDIA GPU with NVDEC + NVENC + scale_cuda):
python3 convert_video.py /path/to/iphone_video.mov --fast

# Default (auto-selects: h264_nvenc if available, else libx264 ultrafast):
python3 convert_video.py /path/to/iphone_video.mov

Output: <input>_lerobot.mp4 in the same directory, or specify -o /path/to/output.mp4.

Video conversion options

-o, --output     Output file path (default: <input>_lerobot.mp4)
--fast           Fastest: NVDEC->scale_cuda->h264_nvenc, all on GPU (10x speedup)
--no-gpu         Disable NVDEC hardware-accelerated decoding
--quality N      CRF value (default: 30, lower = better quality)
--gop N          GOP size (default: 2)
--preset N       libsvtav1 speed preset (default: 12, fastest)
--nvenc          Use NVENC AV1 hardware encoder with zscale tonemapping
--gpu-pipeline   NVDEC->CUDA tonemap->NVENC (requires gpu_convert binary)
--bitrate N      Target bitrate in kbps for NVENC modes (auto-estimated if not set)

LeRobotDataset v3.0 video compatibility

The following are required for LeRobotDataset v3.0 compatibility (lerobot >= 0.4.0):

Requirement   Value    Why
Container     MP4      Required by lerobot's video loading
Pixel format  yuv420p  Required (yuv444p auto-downgraded)
GOP size      2        Every other frame is a keyframe, for fast random frame access during training
Audio         None     Not needed for robot datasets

The following are configurable -- lerobot supports multiple codecs and quality settings:

Parameter      lerobot default  Valid options
Codec          libsvtav1 (AV1)  h264, hevc, libsvtav1, or hardware variants (h264_nvenc, hevc_nvenc, h264_videotoolbox, etc.)
Quality (CRF)  30               Any CRF value
movflags       +faststart       Recommended for streaming

Output must be decodable by PyAV and/or torchcodec (LeRobotDataset's video backends).
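
A lightweight compatibility check of the required fields can be run against ffprobe's JSON output. The sketch below validates a pre-probed metadata dict (the ffprobe invocation is shown in a comment); it does not model the GOP-size check, which needs per-frame inspection:

```python
def check_lerobot_v3(meta):
    """Validate probed video metadata against the v3.0 requirements above.

    `meta` mirrors a subset of `ffprobe -print_format json` output;
    the field names follow ffprobe's JSON schema."""
    problems = []
    video = next(s for s in meta["streams"] if s["codec_type"] == "video")
    if "mp4" not in meta["format"]["format_name"].split(","):
        problems.append("container must be MP4")
    if video["pix_fmt"] != "yuv420p":
        problems.append("pixel format must be yuv420p")
    if any(s["codec_type"] == "audio" for s in meta["streams"]):
        problems.append("audio stream should be stripped")
    return problems

# In practice, probe first with:
#   meta = json.loads(subprocess.run(
#       ["ffprobe", "-v", "quiet", "-print_format", "json",
#        "-show_format", "-show_streams", path],
#       capture_output=True, text=True).stdout)
meta = {"format": {"format_name": "mov,mp4,m4a,3gp,3g2,mj2"},
        "streams": [{"codec_type": "video", "pix_fmt": "yuv420p"}]}
print(check_lerobot_v3(meta))   # []
```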

iPhone-specific handling

iPhone videos are typically HEVC Main 10 with HDR (BT.2020 primaries, HLG transfer, Dolby Vision profile 8). Conversion behavior depends on the codec path:

  • NVENC (default on GPU systems): Skips tonemapping — HLG's backwards-compatible design means the 10-bit→8-bit truncation looks natural on SDR displays. Converts to yuv420p + h264_nvenc.
  • libx264 fallback (no NVENC): Full HDR→SDR tonemapping via zscale (BT.2020/HLG → BT.709), 10-bit→8-bit, color space conversion.
  • --nvenc flag: NVENC AV1 with explicit zscale tonemapping.
  • --fast flag: Full GPU pipeline (NVDEC → scale_cuda → h264_nvenc), no tonemapping.

Non-HDR inputs skip tonemapping in all paths and just do pixel format conversion.
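
The HDR case can be detected from ffprobe's color metadata. This heuristic mirrors the description above (BT.2020 primaries plus an HLG or PQ transfer) and is not the repo's actual detection logic:

```python
def is_hdr(stream):
    """Heuristic HDR check on an ffprobe video-stream dict: BT.2020 primaries
    plus an HLG (arib-std-b67) or PQ (smpte2084) transfer function."""
    return (stream.get("color_primaries") == "bt2020"
            and stream.get("color_transfer") in ("arib-std-b67", "smpte2084"))

# Typical iPhone HEVC HDR stream fields (as reported by ffprobe)
iphone = {"codec_name": "hevc", "pix_fmt": "yuv420p10le",
          "color_primaries": "bt2020", "color_transfer": "arib-std-b67"}
sdr = {"codec_name": "h264", "pix_fmt": "yuv420p",
       "color_primaries": "bt709", "color_transfer": "bt709"}
print(is_hdr(iphone), is_hdr(sdr))   # True False
```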

Video conversion performance

Benchmarked on a 60-second 1080p/30fps iPhone 16 Pro video (HEVC 10-bit HDR, 51 MB input) with a Ryzen 9 9950X (32 threads) and RTX 5090:

Version Wall time Output size Speedup
Naive (CPU decode, two-pass tonemap) 20.5s 19.3 MB 1.0x
+ NVDEC hardware decode 16.8s 19.3 MB 1.2x
+ Parallel filter threads 12.3s 19.2 MB 1.7x
+ Single-pass zscale 9.3s 19.2 MB 2.2x
+ SVT-AV1 AVX-512 8.5s 19.2 MB 2.4x
+ NVENC VBR (--nvenc) 4.5s 19.3 MB 4.6x
+ Full GPU pipeline (gpu_convert --tonemap) 2.5s 19.3 MB 8.2x
+ H.264 NVENC, no tonemap (--fast) 2.0s 18.8 MB 10.2x

See OPTIMIZATION.md for detailed optimization notes.

Optional: AVX-512 SVT-AV1

On CPUs with AVX-512 support (Zen 4/5, Ice Lake+), run sudo bash setup.sh to rebuild SVT-AV1 with AVX-512 enabled for ~10% faster encoding. Requires cmake, nasm, and build-essential.

About

3x realtime DROID-SLAM + WiLoR hand pose egocentric video annotation on a single RTX 5090
