Visual-language assistant with Gemma3 and OpenVINO

Gemma 3 is Google's new iteration of open-weight LLMs. It comes in four sizes (1, 4, 12, and 27 billion parameters), each with base (pre-trained) and instruction-tuned versions. The 4, 12, and 27 billion parameter models can process both images and text, while the 1B variant is text-only.

The input context window length has been increased from Gemma 2's 8k tokens to 32k for the 1B variant, and 128k for all others. As with other VLMs (vision-language models), Gemma 3 generates text in response to user input, which may consist of text and, optionally, images. Example uses include question answering, analyzing image content, and summarizing documents.

The three core enhancements in Gemma 3 over Gemma 2 are:

  • Longer context length
  • Multimodality
  • Multilinguality

You can find more details about the model in the blog post.

In this tutorial we consider how to convert and optimize the Gemma 3 model for creating a multimodal chatbot using Optimum Intel. Additionally, we demonstrate how to apply model optimization techniques such as weight compression using NNCF.
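The conversion and compression step can be sketched with the Optimum Intel command-line exporter. The model ID, weight format, and output directory below are illustrative assumptions; the notebook may select a different Gemma 3 variant or compression preset:

```shell
# Export Gemma 3 to OpenVINO IR, compressing weights to INT4 with NNCF.
# "google/gemma-3-4b-it" and the output directory are examples only;
# substitute the variant you actually want to run.
optimum-cli export openvino \
    --model google/gemma-3-4b-it \
    --weight-format int4 \
    gemma-3-4b-it-int4-ov
```

INT4 weight compression trades a small amount of accuracy for a much smaller memory footprint, which matters for the 12B and 27B variants in particular.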

Notebook contents

The tutorial consists of the following steps:

  • Install requirements
  • Convert and Optimize model
  • Run OpenVINO model inference
  • Launch Interactive demo

In this demonstration, you'll create an interactive chatbot that can answer questions about the content of a provided image. The image below illustrates an example input prompt and model answer. example.png

Installation instructions

This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to the Installation Guide.
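A typical virtual-environment setup for this notebook might look like the following. The exact package set is an assumption based on the tools named above (Optimum Intel, NNCF, and an interactive demo UI); the notebook's own install cell is authoritative:

```shell
# Create and activate an isolated environment, then install the
# assumed dependencies: Optimum with the OpenVINO backend, NNCF for
# weight compression, Gradio for the demo, and a Jupyter server.
python -m venv openvino_env
source openvino_env/bin/activate
pip install --upgrade pip
pip install "optimum[openvino]" nncf gradio jupyterlab
```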