🌔 moondream

a tiny vision language model that kicks ass and runs anywhere

Examples

What is the girl doing?
The girl is sitting at a table and eating a large hamburger.

What color is the girl's hair?
The girl's hair is white.

What is this?
This is a computer server rack, which is a device used to store and manage multiple computer servers. The rack is filled with various computer servers, each with their own dedicated space and power supply. The servers are connected to the rack via multiple cables, indicating that they are part of a larger system. The rack is placed on a carpeted floor, and there is a couch nearby, suggesting that the setup is in a living or entertainment area.

What is behind the stand?
Behind the stand, there is a brick wall.

About

Moondream is a highly efficient open-source vision language model that combines powerful image understanding capabilities with a remarkably small footprint. It's designed to be versatile and accessible, capable of running on a wide range of devices and platforms.

The project offers two model variants:

- Moondream 2B: The primary model with 2 billion parameters, offering robust performance for general-purpose image understanding tasks including captioning, visual question answering, and object detection.
- Moondream 0.5B: A compact 500 million parameter model optimized as a distillation target for edge devices, enabling efficient deployment on resource-constrained hardware while maintaining impressive capabilities.

Getting Started

Latest Model Checkpoints

These are the latest bleeding-edge versions of both models, with all new features and improvements:

| Model | Precision | Download Size | Memory Usage | Best For | Download Link |
| --- | --- | --- | --- | --- | --- |
| Moondream 2B | int8 | 1,733 MiB | 2,624 MiB | General use, best quality | Download |
| Moondream 0.5B | int8 | 593 MiB | 996 MiB | Edge devices, faster speed | Download |
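
As a rough illustration of how the memory figures above might guide the choice of checkpoint, here is a minimal sketch. The `pick_checkpoint` helper and its `headroom_mib` parameter are hypothetical and not part of Moondream; the numbers are simply copied from the table.

```python
# Memory usage figures copied from the checkpoint table (int8 builds, in MiB).
# Ordered from largest to smallest so the first fit is the highest-quality option.
CHECKPOINTS = [
    ("Moondream 2B", 2624),
    ("Moondream 0.5B", 996),
]


def pick_checkpoint(available_mib, headroom_mib=0):
    """Return the largest listed model whose memory usage fits, or None.

    Hypothetical helper: `headroom_mib` reserves memory for everything else
    running on the device.
    """
    budget = available_mib - headroom_mib
    for name, memory_mib in CHECKPOINTS:
        if memory_mib <= budget:
            return name
    return None


print(pick_checkpoint(4096))  # Moondream 2B
print(pick_checkpoint(2048))  # Moondream 0.5B
print(pick_checkpoint(512))   # None
```

Actual memory use will also depend on image size and the rest of the application, so treat the table values as a starting point rather than a guarantee.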

Python Client Library

The recommended way to use the latest version of Moondream is through our Python client library. First, install it:

pip install moondream==0.0.5

Then load a model and run it on an image:

import moondream as md
from PIL import Image

# Initialize with local model path. Can also read .mf.gz files, but we recommend decompressing
# up-front to avoid decompression overhead every time the model is initialized.
model = md.vl(model="path/to/moondream-2b-int8.mf")

# Load and process image
image = Image.open("path/to/image.jpg")
encoded_image = model.encode_image(image)

# Generate caption
caption = model.caption(encoded_image)["caption"]
print("Caption:", caption)

# Ask questions
answer = model.query(encoded_image, "What's in this image?")["answer"]
print("Answer:", answer)
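
The comment in the example above recommends decompressing `.mf.gz` files up-front. One way to do that with only the Python standard library is sketched below; `decompress_checkpoint` is a hypothetical helper, not part of the client library, and it assumes the download is an ordinary gzip file.

```python
import gzip
import shutil
from pathlib import Path


def decompress_checkpoint(gz_path):
    """Decompress `<name>.mf.gz` to `<name>.mf` next to it and return the new path.

    Hypothetical helper. If the decompressed file already exists it is reused,
    so the decompression cost is paid only once.
    """
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    if not out_path.exists():
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams in chunks instead of loading the whole file
    return out_path


# Usage (paths are placeholders, as in the example above):
# model_path = decompress_checkpoint("path/to/moondream-2b-int8.mf.gz")
# model = md.vl(model=str(model_path))
```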

⚠️ Note: The Python client currently only supports CPU inference. CUDA (GPU) and MPS (Apple Silicon) optimization is coming soon. For GPU support, use the Hugging Face transformers implementation below.

For complete documentation of the Python client, including cloud API usage and additional features, see the Python Client README.

Node.js Client Library

For JavaScript/TypeScript developers, we offer a full-featured Node.js client library. See the Node.js Client README for installation and usage instructions.

Hugging Face Transformers Integration

The Hugging Face hub version tracks the last official release of the 2B model. While more stable, it doesn't include the latest features or support for the 0.5B model. Use this if you need GPU acceleration or prefer the transformers ecosystem.

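First, install the required packages (pillow provides the PIL module used to load images):

pip install transformers torch einops pillow

Then load the model and ask a question about an image: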

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-08-26"  # Pin to specific version
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
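
Because the image is encoded separately from the question, the encoded image can plausibly be reused for several questions. A minimal sketch, assuming `answer_question` does not modify `enc_image` between calls; `ask_all` is a hypothetical helper, not part of the model's API.

```python
def ask_all(model, tokenizer, enc_image, questions):
    """Answer each question against one already-encoded image.

    Hypothetical helper: it only calls `model.answer_question` with the same
    argument order as the example above, so the image is encoded once.
    """
    return {q: model.answer_question(enc_image, q, tokenizer) for q in questions}


# Usage, with `model`, `tokenizer`, and `enc_image` from the example above:
# answers = ask_all(model, tokenizer, enc_image,
#                   ["What is the girl doing?", "What color is the girl's hair?"])
# for question, answer in answers.items():
#     print(question, "->", answer)
```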

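For GPU acceleration, load the model in half precision and move it to the GPU. This assumes a CUDA device is available; the flash_attention_2 option additionally requires the flash-attn package to be installed:

import torch

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")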
