Gemma 3 is Google's new iteration of open-weight LLMs. It comes in four sizes: 1 billion, 4 billion, 12 billion, and 27 billion parameters, with base (pre-trained) and instruction-tuned versions. The 4B, 12B, and 27B models can process both images and text, while the 1B variant is text-only.
The input context window has been increased from Gemma 2's 8K tokens to 32K for the 1B variant and 128K for all others. As with other vision-language models (VLMs), Gemma 3 generates text in response to user inputs, which may consist of text and, optionally, images. Example uses include question answering, analyzing image content, and document summarization.
The three core enhancements in Gemma 3 over Gemma 2 are:
- Longer context length
- Multimodality
- Multilinguality
You can find more details about the model in the blog post.
In this tutorial, we show how to convert and optimize the Gemma 3 model for creating a multimodal chatbot using Optimum Intel. Additionally, we demonstrate how to apply model optimization techniques such as weight compression using NNCF.
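Weight compression stores each weight as a low-bit integer plus a per-group scale, shrinking model size and memory traffic. The toy NumPy sketch below illustrates the idea behind INT4 compression; the group size and simple symmetric scheme are simplifying assumptions for illustration, and NNCF's actual algorithm is more sophisticated.

```python
import numpy as np

def compress_int4(weights: np.ndarray, group_size: int = 8):
    """Symmetric 4-bit quantization with one scale per group of weights."""
    groups = weights.reshape(-1, group_size)
    # One FP scale per group; 7 is the max magnitude representable in our
    # symmetric int4 range [-7, 7] (a simplification of real schemes).
    scales = np.maximum(np.abs(groups).max(axis=1, keepdims=True) / 7.0, 1e-12)
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

def decompress(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP weights from int4 codes and scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
q, s = compress_int4(w)
w_hat = decompress(q, s)
# Reconstruction error is bounded by half a quantization step per group
print(np.abs(w - w_hat).max())
```

In exchange for this small reconstruction error, INT4 storage cuts weight memory roughly 4x versus FP16, which is why it is a common choice for multi-billion-parameter models.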
The tutorial consists of the following steps:
- Install requirements
- Convert and Optimize model
- Run OpenVINO model inference
- Launch Interactive demo
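The convert-and-optimize step can typically be performed with a single Optimum Intel CLI call; a sketch is shown below. The model ID, output directory, and flag choices here are assumptions for illustration — the notebook itself defines the exact models and options used.

```shell
# Export Gemma 3 to OpenVINO IR with INT4 weight compression (via NNCF)
# Model ID and output path are placeholders
optimum-cli export openvino \
    --model google/gemma-3-4b-it \
    --weight-format int4 \
    gemma-3-4b-it-ov
```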
In this demonstration, you'll create an interactive chatbot that can answer questions about the content of a provided image.
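A chatbot like this passes each user turn to the model as a structured message combining the image and the question. The sketch below mirrors the common Hugging Face multimodal chat format consumed by a processor's `apply_chat_template`; the exact schema and field names can differ per model, and the helper name here is hypothetical.

```python
# Hypothetical helper: build one user turn for a multimodal chat.
# The image tensor itself is passed to the processor separately;
# the {"type": "image"} entry marks where it belongs in the prompt.
def build_user_message(question: str, with_image: bool = True) -> dict:
    content = []
    if with_image:
        content.append({"type": "image"})
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}

messages = [build_user_message("What is unusual in this picture?")]
```

The resulting `messages` list is what a processor's chat template would turn into the model's prompt string with the appropriate image placeholder tokens.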
The image below illustrates an example input prompt and the model's answer.
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to Installation Guide.