SmolVLM2 represents a fundamental shift in how we think about video understanding - moving from massive models that require substantial computing resources to efficient models that can run anywhere.
Its goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.
Compared with the previous SmolVLM family, the SmolVLM2 2.2B model is better at solving math problems with images, reading text in photos, understanding complex diagrams, and tackling scientific visual questions.
You can find more details about the model in the model card and the Hugging Face blog post.
In this tutorial we consider how to convert and optimize the SmolVLM2 model for creating a multimodal chatbot using Optimum Intel. Additionally, we demonstrate how to apply model optimization techniques like weight compression using NNCF.
The tutorial consists of the following steps:
- Install requirements
- Convert and Optimize model
- Run OpenVINO model inference
- Launch Interactive demo
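The weight-compression step in the list above can be illustrated with a toy sketch of group-wise symmetric 4-bit quantization, which is the core idea behind NNCF's int4 weight compression. Note this is a standalone NumPy illustration of the technique, not NNCF's actual implementation or API:

```python
import numpy as np

def quantize_weights_int4(w: np.ndarray, group_size: int = 8):
    """Symmetric group-wise 4-bit quantization of a flat weight array.

    Each group of `group_size` consecutive weights shares one scale,
    chosen so the largest magnitude in the group maps to the int4 limit (7).
    """
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights from int4 codes and scales."""
    return (q * scales).reshape(-1)

w = np.array([0.5, -0.25, 0.1, 0.0, 1.0, -1.0, 0.75, 0.2])
q, s = quantize_weights_int4(w, group_size=4)
w_hat = dequantize(q, s)
print(np.abs(w - w_hat).max())  # small reconstruction error
```

Storing 4-bit codes plus one scale per group is what shrinks the model footprint; the trade-off is the small reconstruction error shown above.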
In this demonstration, you'll create an interactive chatbot that can answer questions about the content of provided images or videos.
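Chatbot inputs for such vision-language models are typically expressed as chat messages whose content mixes media and text. The sketch below shows the general shape of one user turn; the helper name and the `path` key are illustrative, and the exact content schema expected by the processor may differ:

```python
def build_chat_message(question: str, image_path: str) -> list:
    """Build one user turn pairing an image reference with a text question
    (illustrative structure in the style of multimodal chat templates)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "path": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_chat_message("What is shown in this picture?", "cat.png")
print(messages[0]["role"])  # user
```

A processor's chat template would render such a structure into the model's prompt format alongside the pixel inputs.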
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to get started.
For details, please refer to Installation Guide.