Note: "Lance" here refers to ByteDance Intelligent Creation Lab's unified multimodal model (paper, weights), not Lance/LanceDB (the columnar data format).
MLX port of Lance for Apple Silicon. Lance is a 3B-active / ~12B-total parameter dual-stream Mixture-of-Transformer-Experts model that unifies image and video understanding, generation, and editing in a single framework. This package brings Lance to Apple Silicon via MLX, with weights hosted on the mlx-community HuggingFace organization.
The repos below are grouped in the Lance MLX collection for one-click browsing.
| Repo | Status | Use for |
|---|---|---|
| mlx-community/Lance-3B-bf16 | 🟢 Production | t2i, image_edit, x2t_image (full quality, ~15 GB) |
| mlx-community/Lance-3B-8bit | 🟢 Production | Same as above, 2.7× faster, 16 GB Mac-friendly (~9 GB) |
| mlx-community/Wan2.2-VAE-Lance-bf16 | 🟢 Production | 48-ch Wan2.2 VAE (standalone, shared by image + video pipelines) |
| mlx-community/Lance-3B-Video-bf16 | 🟢 Functional | t2v (painterly output is a port bug tracked as issue #2, not by design), x2t_video, video_edit |
🟡 Image MVP is production; video has a port quality issue under investigation (2026-05-21). All six Lance task families run end-to-end on Apple Silicon. The image tasks (t2i, image_edit, x2t_image) reproduce the bf16 PyTorch reference quality. The video generation and editing pipelines (t2v, video_edit) produce painterly output where the Phase 0 PyTorch reference produces photorealistic, 3D-cinematic output; this is a port-side numerical or routing bug we just identified (issue #2). x2t_video is unaffected. We're documenting this transparently rather than shipping the wrong framing: earlier model cards described the painterly look as "by design," which the oracle data shows is incorrect.
| Capability | Status |
|---|---|
| Convert HF safetensors → MLX bf16 (both checkpoints + Wan2.2 VAE) | ✅ scripts/02_convert.py, scripts/06_convert_wan_vae.py |
| Load Lance_3B + Lance_3B_Video into LanceModel | ✅ 0 missing keys, dummy forward verified |
| x2t_image VQA (image → text answer) | ✅ Production. Content-correct across all 6 oracle cases. |
| KV cache for fast autoregressive decode | ✅ 1.7×–2.8× speedup on long generations |
| t2i (text → image generation) | ✅ Production. Photorealistic, prompt-aligned output. |
| image_edit (instruction-based) | ✅ Production. "Remove hat" preserves identity + style + signature; "Add pearl necklace" leaves rest intact. |
| t2v (text → video) | 🚧 Port quality bug. Runs end-to-end, prompt-aligned content recognizable, but output is painterly where the PyTorch oracle is photorealistic 3D-cinematic. Tracked as issue #2. |
| x2t_video (video VQA) | ✅ Validated against Phase 0 oracle. Cooking video → kitchen+pan+spatula+tomato+meat all content-correct in 17.5 s. (Unaffected by the t2v bug — pure ViT+UND-tower path.) |
| video_edit (instruction-based) | 🚧 Inherits t2v quality issue. End-to-end works ("Change balls to red" recolors); cinematic fidelity blocked on the t2v fix. |
| 8-bit + 4-bit quants + HF community variants | ⏳ Phase 5b |
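The KV-cache row above refers to reusing the attention keys and values of already-decoded tokens instead of recomputing them at every step. The following is a minimal single-head sketch in pure Python, not the port's implementation: `project` is a hypothetical stand-in for the learned Q/K/V projections, and all names are illustrative. It only shows that cached and uncached decoding give the same outputs while the cached path projects each token once.

```python
import math


def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def project(token):
    """Hypothetical stand-in for learned projections; returns (q, k, v)."""
    x = float(token)
    return [x, 1.0], [1.0, x], [x, -x]


class KVCache:
    """Append-only store of keys and values for tokens already decoded."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


def decode_without_cache(tokens):
    outputs = []
    for t in range(len(tokens)):
        # Re-project the whole prefix at every step: O(t) projections per token.
        qkv = [project(tok) for tok in tokens[: t + 1]]
        q = qkv[-1][0]
        outputs.append(attend(q, [x[1] for x in qkv], [x[2] for x in qkv]))
    return outputs


def decode_with_cache(tokens):
    cache = KVCache()
    outputs = []
    for tok in tokens:
        q, k, v = project(tok)  # one projection per token
        outputs.append(cache.step(q, k, v))
    return outputs
```

The saving grows with sequence length, which is consistent with the speedup being reported for long generations.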
Try it:
```bash
# Install
git clone https://github.com/xocialize/lance-mlx && cd lance-mlx && uv sync

# Download production-ready image MVP (~15 GB):
HF_HUB_DISABLE_XET=1 uv run huggingface-cli download mlx-community/Lance-3B-bf16

# t2i: photorealistic text-to-image
HF_HUB_DISABLE_XET=1 uv run python scripts/07_t2i_demo.py \
  --prompt "A photorealistic tabby cat holding a colorful STOP sign." \
  --lance-weights ~/.cache/huggingface/hub/models--mlx-community--Lance-3B-bf16/snapshots/*/ \
  --vae-weights ~/.cache/huggingface/hub/models--mlx-community--Lance-3B-bf16/snapshots/*/vae.safetensors

# image_edit: instruction-based editing
HF_HUB_DISABLE_XET=1 uv run python scripts/13_image_edit_demo.py \
  --input-image my_photo.jpg \
  --instruction "Remove the hat from the painting." \
  --lance-weights .../Lance-3B-bf16 --vae-weights .../vae.safetensors

# x2t_image: image VQA
HF_HUB_DISABLE_XET=1 uv run python scripts/04_x2t_image_demo.py \
  --case 03 \
  --lance-weights .../Lance-3B-bf16 \
  --vit-weights .../Lance-3B-bf16/vit.safetensors
```

See HANDOFF.md for the phased roadmap (start with the ⚠ Verified findings (2026-05-19) section, which supersedes earlier guesses). The Phase 0 parity-oracle capture runbook lives at Docs/RUNPOD_PHASE0.md. Per-phase technical notes are in notes/.
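The demo commands above locate the downloaded weights with a shell glob over `snapshots/*/`. When scripting in Python, the same directory can be resolved explicitly. The sketch below is a hypothetical helper, not part of lance-mlx; it assumes the default hub cache location used in those commands, so a relocated cache has to be passed in as `cache_root`.

```python
from pathlib import Path


def snapshot_dir(repo_id, cache_root=None):
    """Return the most recently modified snapshot directory for repo_id.

    Assumes the cache layout seen in the commands above:
    <cache_root>/models--<org>--<name>/snapshots/<revision>/
    """
    if cache_root is None:
        cache_root = Path.home() / ".cache" / "huggingface" / "hub"
    snapshots = Path(cache_root) / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    candidates = [p for p in snapshots.iterdir() if p.is_dir()] if snapshots.is_dir() else []
    if not candidates:
        raise FileNotFoundError(f"no snapshot for {repo_id} under {snapshots}")
    # If several revisions are cached, prefer the newest one on disk.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

For example, `snapshot_dir("mlx-community/Lance-3B-bf16") / "vae.safetensors"` gives the path that the t2i command passes as `--vae-weights`. If you would rather depend on `huggingface_hub`, its `snapshot_download` function also returns the local snapshot path.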
```bash
uv pip install lance-mlx

# Image generation
lance-mlx generate --task t2i --prompt "..." --weights mlx-community/Lance-3B-bf16

# Image editing
lance-mlx generate --task image_edit --image foo.jpg --instruction "..." --weights mlx-community/Lance-3B-bf16

# Image understanding (VQA)
lance-mlx generate --task x2t_image --image foo.png --prompt "What is this?"

# Video generation (alpha)
lance-mlx generate --task t2v --prompt "..." --weights mlx-community/Lance-3B-Video-bf16
```

- t2i: text-to-image (768²)
- t2v: text-to-video (480p, 12 fps, ≤121 frames)
- image_edit: instruction-based image editing
- video_edit: instruction-based video editing
- x2t_image: image understanding / VQA / captioning
- x2t_video: video understanding / VQA / captioning
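The task list above, together with the repo table, implies a mapping from task to checkpoint and a hard limit on t2v clip length. The following sketch makes that explicit. The helper names are hypothetical and are not part of the lance-mlx CLI; the task-to-repo mapping is read off the repo table in this README.

```python
IMAGE_REPO = "mlx-community/Lance-3B-bf16"
VIDEO_REPO = "mlx-community/Lance-3B-Video-bf16"

# Task -> checkpoint, as listed in the repo table.
TASK_WEIGHTS = {
    "t2i": IMAGE_REPO,
    "image_edit": IMAGE_REPO,
    "x2t_image": IMAGE_REPO,
    "t2v": VIDEO_REPO,
    "video_edit": VIDEO_REPO,
    "x2t_video": VIDEO_REPO,
}

T2V_FPS = 12          # frames per second for t2v output
T2V_MAX_FRAMES = 121  # upper bound on frames per clip


def weights_for(task):
    """Return the checkpoint repo for a task, rejecting unknown task names."""
    if task not in TASK_WEIGHTS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_WEIGHTS)}")
    return TASK_WEIGHTS[task]


def t2v_duration_seconds(num_frames):
    """Clip length in seconds for a t2v request, enforcing the frame limit."""
    if not 1 <= num_frames <= T2V_MAX_FRAMES:
        raise ValueError(f"num_frames must be in 1..{T2V_MAX_FRAMES}, got {num_frames}")
    return num_frames / T2V_FPS
```

At 12 fps the 121-frame ceiling corresponds to a clip of just over 10 seconds.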
- Two expert towers (LLM_UND, LLM_GEN), each initialized from Qwen2.5-VL-3B-Instruct, with per-expert FFN, output projection, and QK-norm
- Modality-deterministic routing: text + Qwen2.5-VL ViT semantic tokens → LLM_UND (autoregressive next-token); Wan2.2 3D causal VAE latent tokens → LLM_GEN (flow-matching velocity prediction)
- MaPE: modality-aware RoPE with per-modality temporal offset
- Wan2.2 3D causal VAE (16× spatial / 4× temporal compression, 48-channel latent). Lance bundles its own VAE; do NOT use the public 16-ch wan2.2_vae.safetensors
- Untied LM head
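Modality-deterministic routing means the expert is chosen by a token's modality rather than by a learned gate. The sketch below illustrates that idea only. The `Modality` enum and `route` function are hypothetical names and are not taken from src/lance_mlx/model/routing.py.

```python
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    VIT = "vit"                # Qwen2.5-VL ViT semantic tokens
    VAE_LATENT = "vae_latent"  # Wan2.2 VAE latent tokens


# Fixed lookup: no learned gate, the modality alone decides the expert.
EXPERT_FOR = {
    Modality.TEXT: "LLM_UND",
    Modality.VIT: "LLM_UND",
    Modality.VAE_LATENT: "LLM_GEN",
}


def route(modalities):
    """Group token positions by the expert tower that should process them."""
    buckets = {"LLM_UND": [], "LLM_GEN": []}
    for position, modality in enumerate(modalities):
        buckets[EXPERT_FOR[modality]].append(position)
    return buckets
```

For a sequence of two text tokens, one ViT token, and two VAE latent tokens, `route` places positions 0 to 2 under LLM_UND and positions 3 and 4 under LLM_GEN. Because the assignment is a pure function of modality, it can be computed once per sequence before the forward pass.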
- Blaizzy/mlx-vlm for the Qwen2.5-VL ViT and autoregressive decode infrastructure
- Blaizzy/mlx-video for the Wan2.2 VAE and flow-matching sampler
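A flow-matching sampler turns a predicted velocity field into a sample by integrating it over time. The following is a minimal Euler-integration sketch, not the mlx-video sampler. It assumes a convention where t runs from 0 (noise) to 1 (data) along a straight-line path, and it uses a closed-form toy velocity in place of the LLM_GEN prediction. The real pipeline may use a different time direction, step schedule, or guidance.

```python
def euler_flow_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target):
    """Toy stand-in for the model: the exact velocity of a straight-line path.

    On the path x_t = (1 - t) * x0 + t * target, the velocity at (x_t, t)
    equals (target - x_t) / (1 - t).
    """
    def velocity_fn(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity_fn
```

With the toy velocity, `euler_flow_sample(toy_velocity(target), x0, 8)` ends at `target` up to floating-point error, because the straight-line path has constant velocity and Euler steps are exact for it. A learned velocity field is only approximately straight, which is why the number of steps affects quality in practice.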
- Minimum: Apple Silicon Mac with 16 GB unified memory (4-bit quantized image only)
- Recommended: 32 GB+ for bf16 image, 64 GB+ for video
- Reference platform: M5 Max 128 GB (macOS 26.2+ for Neural Accelerator support)
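The memory tiers above can be turned into a simple pre-flight check. This is a sketch with a hypothetical function name; the thresholds are the ones listed in the requirements and say nothing about speed. On macOS, total unified memory in bytes can be read with `sysctl -n hw.memsize`.

```python
def supported_workloads(unified_memory_gb):
    """Return the workloads the listed memory tiers allow for a given Mac."""
    workloads = []
    if unified_memory_gb >= 16:
        workloads.append("image (4-bit quantized)")  # stated minimum
    if unified_memory_gb >= 32:
        workloads.append("image (bf16)")             # recommended for bf16 image
    if unified_memory_gb >= 64:
        workloads.append("video")                    # recommended for video
    return workloads
```

An empty list means the machine is below the stated minimum.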
```
.
├── HANDOFF.md                      phased port plan (this is the spec)
├── pyproject.toml                  uv-managed
├── src/lance_mlx/
│   ├── __init__.py
│   ├── __main__.py                 CLI entry point
│   ├── bench.py                    Timer + RunRecord + JSONL logging
│   ├── io.py                       image/video IO + muxing
│   ├── model/
│   │   ├── lance_llm.py            dual-expert MoT backbone
│   │   ├── mape.py                 modality-aware RoPE
│   │   ├── flow_head.py            velocity prediction head
│   │   └── routing.py              token modality routing
│   ├── pipeline/
│   │   ├── t2i.py                  text-to-image flow loop
│   │   ├── t2v.py                  text-to-video flow loop
│   │   ├── image_edit.py
│   │   ├── video_edit.py
│   │   └── understanding.py        x2t_image + x2t_video AR decode
│   └── convert.py                  HF → MLX weight conversion
├── scripts/
│   ├── 00_capture_oracle.py        Phase 0 PyTorch reference capture (runs on cloud GPU)
│   ├── 01_inspect_keys.py          Phase 1a weight topology audit
│   ├── 02_convert.py               Phase 1e weight conversion
│   ├── 03_run_understanding.py     Phase 2 x2t pipeline
│   ├── 04_run_t2i.py               Phase 3 T2I
│   ├── 05_quantize.py              Phase 5a quantization
│   └── 06_publish_hf.py            Phase 5c HF upload (dry-run default)
├── prompts/
│   ├── t2i_eval.json
│   ├── t2v_eval.json
│   └── understanding_eval.json
├── tests/
│   ├── fixtures/                   Phase 0 PyTorch reference outputs
│   ├── test_routing.py
│   ├── test_mape.py
│   ├── test_vae_roundtrip.py
│   └── test_parity_t2i.py
├── notes/                          phase-by-phase educational notes
└── vendor/                         read-only reference clones
```
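The layout describes bench.py as "Timer + RunRecord + JSONL logging". The sketch below shows one way such a benchmarking helper could look. It is not the contents of bench.py: the field names, the `timed` context manager, and `append_jsonl` are all assumptions made for illustration.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """One benchmark run; the fields are illustrative, not the repo's schema."""
    task: str
    weights: str
    seconds: float = 0.0
    extra: dict = field(default_factory=dict)


@contextmanager
def timed(record):
    """Measure wall-clock time of the enclosed block into record.seconds."""
    start = time.perf_counter()
    try:
        yield record
    finally:
        record.seconds = time.perf_counter() - start


def append_jsonl(path, record):
    """Append the record as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

A run would wrap the pipeline call in `with timed(RunRecord(task="t2i", weights="mlx-community/Lance-3B-bf16")) as rec:` and then call `append_jsonl("runs.jsonl", rec)`. One object per line keeps the log appendable and easy to load for later comparison across variants.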
This MLX port: Apache 2.0.
Lance model weights: Apache 2.0 (ByteDance Intelligent Creation Lab). Wan2.2 VAE: Apache 2.0 (Alibaba). Qwen2.5-VL: Apache 2.0 (Alibaba).
See LICENSE and NOTICE for full attribution.
```bibtex
@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
```