On-device Multi-modal Retrieval: SigLIP2 in Unity Sentis
This repository provides a lightweight inference implementation of SigLIP2 (specifically siglip2-base-patch16-224) built on Unity Sentis. It runs multi-modal retrieval directly inside Unity, enabling asset search and semantic retrieval across text and images without an internet connection.
- ✅ Text-to-Image Search
- ✅ Image-to-Image Search
- ✅ Text-to-Text Search
- ✅ Image-to-Text Search
- Unity: 6000.2.10f1
- Unity Sentis: 2.4.1 (com.unity.ai.inference)
The project utilizes the ONNX version of the siglip2-base-patch16-224 model. It requires both the text encoder and the vision encoder to compute embeddings for multi-modal tasks.
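Because both encoders map their inputs into the same embedding space, all four retrieval tasks reduce to comparing vectors. The following is a minimal Python sketch of that comparison, not this repository's code: it assumes cosine similarity as the scoring function, and the 4-dimensional vectors and file names are toy values made up for illustration rather than real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Toy vectors standing in for a text-encoder output and two
# vision-encoder outputs (hypothetical values and file names).
text_embedding = [0.2, 0.9, 0.1, 0.0]
image_embeddings = {
    "red_dress.jpg": [0.1, 0.8, 0.2, 0.1],
    "black_boots.jpg": [0.9, 0.0, 0.1, 0.3],
}

scores = {
    name: cosine_similarity(text_embedding, emb)
    for name, emb in image_embeddings.items()
}
best_match = max(scores, key=scores.get)
```

Swapping which encoder produces the query and which produces the candidates gives the other three tasks (Image-to-Image, Text-to-Text, Image-to-Text) with the same scoring step.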
Text input processing is handled by the Google SentencePieceTokenizer, implemented using the Microsoft.ML.Tokenizers library. This ensures that text queries are correctly tokenized and encoded to match the SigLIP2 model's expected input format.
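As a rough illustration of what "expected input format" can involve, the Python sketch below pads or truncates a list of token IDs to a fixed length before it is handed to the text encoder. The sequence length of 64, the padding ID of 0, and the example token IDs are all assumptions for illustration; check them against the tokenizer and model configuration actually shipped with the ONNX files.

```python
MAX_LENGTH = 64  # assumed fixed sequence length; verify against the model config
PAD_ID = 0       # assumed padding token ID; verify against the tokenizer


def to_fixed_length(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Truncate to max_length, then right-pad with pad_id up to max_length."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


# Hypothetical token IDs, as a SentencePiece tokenizer might produce.
example_ids = [1205, 48, 9917]
input_ids = to_fixed_length(example_ids)
```

The resulting fixed-length list is what would be copied into the text encoder's input tensor.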
The system generates a local database (image_embeddings.bin) by processing the images located in the StreamingAssets folder. This precomputed index allows for real-time similarity search at runtime.
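The idea of a precomputed index can be sketched in Python as follows. The binary layout used here (two little-endian int32 values for count and dimension, followed by raw float32 vectors) is a hypothetical format chosen for the example and is not necessarily how image_embeddings.bin is laid out; the function names are likewise made up, and the search is a plain brute-force cosine ranking.

```python
import io
import math
import struct


def write_index(stream, embeddings):
    """Write count and dimension as int32, then each vector as float32."""
    count = len(embeddings)
    dim = len(embeddings[0]) if count else 0
    stream.write(struct.pack("<ii", count, dim))
    for vec in embeddings:
        stream.write(struct.pack(f"<{dim}f", *vec))


def read_index(stream):
    """Read back the vectors written by write_index."""
    count, dim = struct.unpack("<ii", stream.read(8))
    return [
        list(struct.unpack(f"<{dim}f", stream.read(4 * dim)))
        for _ in range(count)
    ]


def normalize(vec):
    """Scale to unit length; a zero vector is left as-is."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def top_k(query, index, k=2):
    """Return the positions of the k stored vectors most similar to query."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), i)
        for i, vec in enumerate(index)
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Round-trip three toy 2-dimensional embeddings through an in-memory buffer.
buffer = io.BytesIO()
write_index(buffer, [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
buffer.seek(0)
index = read_index(buffer)
ranking = top_k([0.9, 0.1], index, k=2)
```

Generating the vectors once and reading them back at startup means only the query needs to pass through an encoder at search time.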
- Download text_model.onnx and vision_model.onnx from onnx-community/siglip2-base-patch16-224-ONNX
- Place both files into the /Assets/SigLIP2 directory in your project
- Clone or download this repository
- Unzip the provided StreamingAssets.zip file
- Note: The demo images included in this zip are sourced from the Fashion Product Images (Small) dataset.
- Place the unzipped contents into the /Assets/StreamingAssets directory
- Ensure that your .jpg or .png files are located inside /Assets/StreamingAssets/Images
- Open the /Assets/Scenes/SigLip2Scene.unity scene in the Unity Editor
- Select the ImageSearchManager object in the hierarchy
- Click the "Generate And Save Embeddings (Create Index)" button in the Inspector
- This process will read images from StreamingAssets and generate the image_embeddings.bin file
- Play the scene to see the retrieval in action
- Input keywords to perform image retrieval, or explore other tasks like Image-to-Image or Text-to-Text search
Experience SigLIP2 in Unity in action! Check out our demo showcasing the retrieval capabilities:
- Google SigLIP2
- Onnx Community: SigLIP2-Base-Patch16-224
- Dataset: Fashion Product Images (Small)
- Unity Sentis Documentation
This project uses the SigLIP2 model, which is licensed under the Apache 2.0 License.
