
AMOE: Agglomerative Mixture-of-Experts Vision Foundation Models

Project Website · arXiv · Hugging Face Models · Hugging Face Datasets

A vision encoder distilled from DINOv3 and SigLIP2 teachers, supporting multi-resolution image understanding with a Mixture-of-Experts (MoE) architecture.

Main fig

Installation

# Download the pretrained checkpoint
wget https://github.com/tiiuae/amoe/releases/download/AMoE-checkpoint/amoe_ckpt.pt

# Install dependencies and the AMoE package in editable mode
pip install -r requirements.txt
pip install -e .

Quick Start

from amoe import load_amoe_model
from PIL import Image
import torch

# Load model
model, processor = load_amoe_model(
    checkpoint_path="path/to/checkpoint.pt",
    device="cuda",
)
model = model.to(torch.bfloat16)

# Load and preprocess image
image = Image.open("image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Get features
with torch.no_grad():
    pixel_values = inputs["pixel_values"].to("cuda", dtype=torch.bfloat16)
    spatial_shapes = inputs["spatial_shape"].to("cuda")
    padding_mask = inputs["padding_mask"].to("cuda")
    
    outputs = model(
        pixel_values=pixel_values,
        spatial_shapes=spatial_shapes,
        padding_mask=padding_mask
    )
    
    # DINOv3-style patch features
    patch_features = outputs["patch_features"]["dinov3"]  # (N, L, 1024)
    
    # SigLIP2-style pooled features
    pooled_features = outputs["summary_features"]["siglip2"]  # (N, 1152)
    
    # Native model features
    amoe_features = outputs["patch_features"]["amoe"]  # (N, L, 768)
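
The pooled features can be used directly for image-to-image similarity. The snippet below is a minimal sketch that reuses the model and processor loaded above; the embed helper, the file names, and the cosine-similarity usage are illustrative and not part of the AMoE API.

import torch.nn.functional as F

def embed(image):
    # Illustrative helper: reuses the processor/model from the Quick Start above
    inputs = processor(image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(
            pixel_values=inputs["pixel_values"].to("cuda", dtype=torch.bfloat16),
            spatial_shapes=inputs["spatial_shape"].to("cuda"),
            padding_mask=inputs["padding_mask"].to("cuda"),
        )
    # L2-normalize the SigLIP2-style pooled embedding so dot products are cosine similarities
    return F.normalize(outputs["summary_features"]["siglip2"].float(), dim=-1)

emb_a = embed(Image.open("image_a.jpg").convert("RGB"))
emb_b = embed(Image.open("image_b.jpg").convert("RGB"))
print((emb_a @ emb_b.T).item())  # cosine similarity in [-1, 1]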

PCA Visualization

To visualize the principal components of the features:

python pca_maps.py \
    --ckpt_path path/to/checkpoint.pt \
    --input_dir path/to/images/ \
    --output_path ./output_viz/ \
    --num_samples 10

Sample output:

PCA visualization sample 1
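
If you prefer to build such maps yourself, the sketch below shows one way to project the patch features from the Quick Start example onto their first three principal components and reshape them into an image-like map. Recovering the patch grid from spatial_shapes is an assumption about the processor output; see pca_maps.py for the exact logic used by the script.

from sklearn.decomposition import PCA

# patch_features from the Quick Start example: (1, L, D), bfloat16 on GPU
feats = patch_features[0].float().cpu().numpy()       # (L, D)

# Project every patch feature onto the first three principal components
pca = PCA(n_components=3)
components = pca.fit_transform(feats)                 # (L, 3)

# Rescale each component to [0, 1] so the map can be viewed as an RGB image
components -= components.min(axis=0)
components /= components.max(axis=0) + 1e-8

# Reshape to the patch grid; taking (h, w) from spatial_shapes is an assumption
h, w = spatial_shapes[0].tolist()
pca_map = components.reshape(int(h), int(w), 3)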

Hugging Face Usage

import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

# Load model and processor
model_id = "tiiuae/amoe" 
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to("cuda", dtype=torch.bfloat16)
processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

# Preprocess image
image = Image.open("image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt").to("cuda")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# Access specialized features
# Options: 'amoe' (768d), 'siglip2' (1152d), 'dinov3' (1024d)
patch_features = outputs["patch_features"]["amoe"]    # (Batch, Tokens, 768)
summary_features = outputs["summary_features"]["siglip2"] # (Batch, 1152)
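
Beyond pooled embeddings, the patch features support dense, patch-level comparisons. The snippet below is a minimal sketch that continues from the outputs above and scores every patch against an arbitrarily chosen query patch by cosine similarity; the query index and top-k printout are purely illustrative.

import torch.nn.functional as F

# Dense comparison with the DINOv3-style patch features:
# score every patch against a chosen query patch (index 0) by cosine similarity.
patches = F.normalize(outputs["patch_features"]["dinov3"][0].float(), dim=-1)  # (Tokens, 1024)
query = patches[0]                 # illustrative query patch
similarity = patches @ query       # (Tokens,) cosine similarities
print(similarity.topk(5).indices)  # the 5 patches most similar to the query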

Citation

If you use AMoE in your research, please cite:

@article{chaybouti2025amoe,
  title={AMOE: Agglomerative Mixture-of-Experts Vision Foundation Models},
  author={Chaybouti, Sofian and Narayan, Sanath and Dahou, Yasser and Le Khac, Phuc H. and Singh, Ankit and Huynh, Ngoc Dung and Para, Wamiq Reyaz and Kuehne, Hilde and Hacid, Hakim},
  journal={arXiv preprint arXiv:2512.20157},
  year={2025}
}
