AMoE is a vision encoder distilled from DINOv3 and SigLIP2 teachers, supporting multi-resolution image understanding with a Mixture-of-Experts (MoE) architecture.
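To make the multi-teacher setup concrete, here is a minimal sketch of what such a distillation objective can look like, assuming cosine losses against DINOv3 patch features and SigLIP2 pooled features. The function name and loss weighting are illustrative assumptions, not the actual training code:

```python
import torch.nn.functional as F

def distillation_loss(student_patch, student_pooled, dino_patch, siglip_pooled):
    """Illustrative multi-teacher objective (an assumption, not the AMoE recipe):
    align student projections with each teacher's feature space."""
    # Patch-level alignment against the DINOv3 teacher, shapes (N, L, 1024)
    patch_loss = 1 - F.cosine_similarity(student_patch, dino_patch, dim=-1).mean()
    # Image-level alignment against the SigLIP2 teacher, shapes (N, 1152)
    pooled_loss = 1 - F.cosine_similarity(student_pooled, siglip_pooled, dim=-1).mean()
    return patch_loss + pooled_loss
```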
Download the released checkpoint and install the package:

```bash
wget https://github.com/tiiuae/amoe/releases/download/AMoE-checkpoint/amoe_ckpt.pt
pip install -r requirements.txt
pip install -e .
```

Then load the model and extract features:

```python
from amoe import load_amoe_model
from PIL import Image
import torch
# Load model
model, processor = load_amoe_model(
    checkpoint_path="path/to/checkpoint.pt",
    device="cuda",
)
model = model.to(torch.bfloat16)
# Load and preprocess image
image = Image.open("image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")
# Get features
with torch.no_grad():
    pixel_values = inputs["pixel_values"].to("cuda", dtype=torch.bfloat16)
    spatial_shapes = inputs["spatial_shape"].to("cuda")
    padding_mask = inputs["padding_mask"].to("cuda")
    outputs = model(
        pixel_values=pixel_values,
        spatial_shapes=spatial_shapes,
        padding_mask=padding_mask,
    )
# DINOv3-style patch features
patch_features = outputs["patch_features"]["dinov3"] # (N, L, 1024)
# SigLIP2-style pooled features
pooled_features = outputs["summary_features"]["siglip2"] # (N, 1152)
# Native model features
amoe_features = outputs["patch_features"]["amoe"]  # (N, L, 768)
```
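The per-patch features can be used directly for dense tasks. As a small example (assuming a single image in the batch and the DINOv3-space features from above), this computes the cosine similarity between one reference patch and every other patch:

```python
import torch.nn.functional as F

feats = F.normalize(patch_features[0], dim=-1)  # (L, 1024), unit-norm patch features
ref = feats[0]                                  # pick a reference patch
similarity = feats @ ref                        # (L,) cosine similarity to every patch
print(similarity.topk(5).indices)               # indices of the most similar patches
```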
To visualize the principal components of the patch features, run:

```bash
python pca_maps.py \
    --ckpt_path path/to/checkpoint.pt \
    --input_dir path/to/images/ \
    --output_path ./output_viz/ \
    --num_samples 10
```
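Such PCA maps are typically built by projecting the patch features onto their top principal components and rendering the result as RGB. A minimal sketch of this idea (an assumption about the script's approach, not its actual code), where `h` and `w` are the patch-grid dimensions with `h * w == L`:

```python
import torch

def pca_rgb(patch_features, h, w):
    """Project (L, D) patch features onto their top-3 principal components
    and rescale to [0, 1] so they can be displayed as an (h, w, 3) RGB map."""
    x = patch_features.float()                # SVD needs full precision
    x = x - x.mean(dim=0, keepdim=True)       # center the features
    _, _, v = torch.pca_lowrank(x, q=3)       # v: (D, 3) principal directions
    proj = x @ v                              # (L, 3) projections
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-6)
    return proj.reshape(h, w, 3)
```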
The model can also be loaded directly from the Hugging Face Hub:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor
# Load model and processor
model_id = "tiiuae/amoe"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to("cuda", dtype=torch.bfloat16)
processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)
# Preprocess image
image = Image.open("image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt").to("cuda")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
# Inference
with torch.no_grad():
    outputs = model(**inputs)
# Access specialized features
# Options: 'amoe' (768d), 'siglip2' (1152d), 'dinov3' (1024d)
patch_features = outputs["patch_features"]["amoe"] # (Batch, Tokens, 768)
summary_features = outputs["summary_features"]["siglip2"]  # (Batch, 1152)
```
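The SigLIP2-space summary features are suited to image-level comparison. For example, a sketch (assuming the model and processor loaded above) that scores the similarity of two images:

```python
import torch.nn.functional as F

def embed(path):
    image = Image.open(path).convert("RGB")
    inputs = processor(image, return_tensors="pt").to("cuda")
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
    with torch.no_grad():
        out = model(**inputs)
    return F.normalize(out["summary_features"]["siglip2"], dim=-1)

# Cosine similarity between two images' pooled features
score = (embed("image_a.jpg") @ embed("image_b.jpg").T).item()
```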
If you use AMoE in your research, please cite:

```bibtex
@article{chaybouti2025amoe,
title={AMOE: Agglomerative Mixture-of-Experts Vision Foundation Models},
author={Chaybouti, Sofian and Narayan, Sanath and Dahou, Yasser and Le Khac, Phuc H. and Singh, Ankit and Huynh, Ngoc Dung and Para, Wamiq Reyaz and Kuehne, Hilde and Hacid, Hakim},
journal={arXiv preprint arXiv:2512.20157},
year={2025}
}
```
