a tiny vision language model that kicks ass and runs anywhere
Moondream is a highly efficient open-source vision language model that combines powerful image understanding capabilities with a remarkably small footprint. It's designed to be versatile and accessible, capable of running on a wide range of devices and platforms.
The project offers two model variants:
- Moondream 2B: The primary model with 2 billion parameters, offering robust performance for general-purpose image understanding tasks including captioning, visual question answering, and object detection.
- Moondream 0.5B: A compact 500-million-parameter model intended as a distillation target for edge devices, enabling efficient deployment on resource-constrained hardware while retaining the core capabilities of the larger model.
These are the latest bleeding-edge versions of both models, with all new features and improvements:
| Model | Precision | Download Size | Memory Usage | Best For | Download Link |
|---|---|---|---|---|---|
| Moondream 2B | int8 | 1,733 MiB | 2,624 MiB | General use, best quality | Download |
| Moondream 0.5B | int8 | 593 MiB | 996 MiB | Edge devices, faster speed | Download |
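As a rough way to act on the table above, the following sketch picks the largest variant whose listed memory usage fits within a given budget. The `pick_variant` helper and its `headroom_mib` parameter are hypothetical names used for illustration and are not part of the Moondream client. The memory figures are copied from the table and should be treated as approximate.

```python
from typing import Optional

# Memory usage in MiB, copied from the table above (int8 checkpoints).
MODEL_MEMORY_MIB = {
    "Moondream 2B": 2624,
    "Moondream 0.5B": 996,
}


def pick_variant(available_mib: int, headroom_mib: int = 0) -> Optional[str]:
    """Return the largest variant that fits in the budget, or None if neither fits.

    headroom_mib is memory reserved for the rest of the application
    (an illustrative knob, not a Moondream setting).
    """
    budget = available_mib - headroom_mib
    fitting = [(mem, name) for name, mem in MODEL_MEMORY_MIB.items() if mem <= budget]
    if not fitting:
        return None
    # max() compares on memory first, so the larger model wins when both fit.
    return max(fitting)[1]
```

For example, `pick_variant(4096)` selects the 2B model, while a device with about 2 GiB free falls back to the 0.5B model.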
First, install the client library:
pip install moondream==0.0.5
With the client installed, load a local model file and run it. This is the recommended way to use the latest version of Moondream:
import moondream as md
from PIL import Image
# Initialize with local model path. Can also read .mf.gz files, but we recommend decompressing
# up-front to avoid decompression overhead every time the model is initialized.
model = md.vl(model="path/to/moondream-2b-int8.mf")
# Load and process image
image = Image.open("path/to/image.jpg")
encoded_image = model.encode_image(image)
# Generate caption
caption = model.caption(encoded_image)["caption"]
print("Caption:", caption)
# Ask questions
answer = model.query(encoded_image, "What's in this image?")["answer"]
print("Answer:", answer)
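The comment in the example above recommends decompressing `.mf.gz` files up-front. A minimal sketch of that step, using only the Python standard library, could look like the following. The `decompress_model` helper is a hypothetical name for illustration, not part of the Moondream client, and the sketch assumes the decompressed file should sit next to the archive.

```python
import gzip
import shutil
from pathlib import Path


def decompress_model(gz_path: str) -> Path:
    """Decompress a .gz model file once and return the path to the result.

    If the path does not end in .gz it is returned unchanged, and an
    existing decompressed file is reused rather than rewritten.
    """
    src = Path(gz_path)
    if src.suffix != ".gz":
        return src  # Already decompressed
    dst = src.with_suffix("")  # e.g. "model.mf.gz" -> "model.mf"
    if not dst.exists():
        # Stream the data so the whole file is never held in memory.
        with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
    return dst
```

The returned path can then be passed as the `model` argument to `md.vl`, so that later initializations skip the decompression step.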
For complete documentation of the Python client, including cloud API usage and additional features, see the Python Client README.
For JavaScript/TypeScript developers, we offer a full-featured Node.js client library. See the Node.js Client README for installation and usage instructions.
The Hugging Face hub version tracks the last official release of the 2B model. While more stable, it doesn't include the latest features or support for the 0.5B model. Use this if you need GPU acceleration or prefer the transformers ecosystem:
First, install the required packages:
pip install transformers torch einops
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-08-26"  # Pin to a specific version

# trust_remote_code=True is needed because the model code is loaded from the Hub repository
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

# Encode the image, then ask a question about it
image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
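When asking several questions about the same image, the encoded image from the example above can be reused instead of re-encoding each time. The sketch below assumes only the two calls shown in that example, `encode_image` and `answer_question`. The `answer_questions` helper itself is a hypothetical name for illustration, not part of the model's API.

```python
from typing import Any, Dict, List


def answer_questions(
    model: Any, tokenizer: Any, image: Any, questions: List[str]
) -> Dict[str, str]:
    """Encode the image once and answer each question against that encoding."""
    enc_image = model.encode_image(image)  # Single encoding pass
    return {
        question: model.answer_question(enc_image, question, tokenizer)
        for question in questions
    }
```

Called as `answer_questions(model, tokenizer, image, ["Describe this image.", "What colors stand out?"])`, it returns a dictionary mapping each question to its answer.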
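For GPU acceleration, load the model in half precision with Flash Attention 2 instead (this requires a CUDA GPU and the flash-attn package):
import torch  # Needed for torch.float16

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")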