Cognitive AI for the Future: Multimodal Models and RAG in Vision Language Applications, from Training to Deployment.
Zhuo Wu, Tiep Le, Gustavo A. Lujan, Adrian Boguszewski, Raymond Lo, Vasudev Lal & Yury Gorbachev
This tutorial demonstrates how to build cognitive AI systems using multimodal models [1], Retrieval-Augmented Generation (RAG) [2], and agentic workflows [3]. Attendees will learn how to prepare an instruction dataset [4] for model fine-tuning, with LLMs deployed and accelerated with OpenVINO on their local machines to protect data privacy. They will also learn to fine-tune embedding models such as BridgeTower [5] and vision language models (VLMs) such as Llama-3.2-11B-Vision-Instruct [6], store embeddings in vector databases, and integrate a VLM for video-based chat. Using OpenVINO [7], models are optimized for low-latency inference on AI PCs. Tools such as LlamaIndex and LangChain enable efficient retrieval, while agentic workflows add reasoning and dynamic tool use. Through hands-on examples, participants will see how to apply this pipeline to real-world scenarios, such as video Q&A, personalized shopping, and education, showcasing the potential of scalable, context-aware cognitive AI.
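As a taste of the local-deployment workflow, the sketch below loads a small instruction-tuned LLM through the Optimum Intel integration and runs it with OpenVINO on a local machine. This is a minimal sketch, assuming `optimum[openvino]` is installed; the model name, prompt, and generation settings are illustrative placeholders rather than the exact models used in the tutorial.

```python
# Minimal OpenVINO-accelerated text generation via Optimum Intel (pip install optimum[openvino]).
# The model id below is a small placeholder chosen for a quick local test, not the tutorial's model.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative placeholder

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly;
# inference runs locally (CPU by default), so prompts and data never leave the machine.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Summarize the key steps of a multimodal RAG pipeline."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```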
- Fundamentals: Cognitive AI: Multimodal RAG in Vision Language Applications.
- Module 1: Accelerate your dataset preparation locally: OpenVINO Fundamentals.
- Module 2: Fine-tuning Embedding Models and VLMs.
- Module 3: Optimize and Deploy the Multimodal RAG Pipeline.
- Module 4: Build Your Own AI assistant with Agentic Multimodal RAG.
All modules will be evaluated and deployed on an AI PC or edge system, covering multiple computer vision tasks and generative AI pipelines on a wide range of hardware; a minimal retrieval sketch for the RAG modules is shown below.
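To make the retrieval step of the RAG modules concrete, here is a minimal text-only sketch using LangChain with an in-memory FAISS index. The embedding model, documents, and query are illustrative assumptions; in the full pipeline the embeddings would instead come from a fine-tuned BridgeTower-style multimodal encoder over video frames and transcripts, and the retrieved context would be passed to the OpenVINO-accelerated VLM.

```python
# Minimal RAG retrieval sketch (pip install langchain-community langchain-huggingface faiss-cpu).
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# Illustrative "documents": short captions standing in for video-frame/transcript chunks.
docs = [
    "Frame 00:12 - the presenter introduces the OpenVINO toolkit.",
    "Frame 01:05 - a diagram of the multimodal RAG pipeline is shown.",
]

# Placeholder text embedding model; the tutorial uses a fine-tuned multimodal embedder instead.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.from_texts(docs, embeddings)

# Retrieve the most relevant chunk for a user question, then hand it to the VLM as context.
hits = index.similarity_search("What toolkit is introduced?", k=1)
print(hits[0].page_content)
```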
[1] H. Liu, et al., "Improved Baselines with Visual Instruction Tuning", arXiv:2310.03744. Available online: arXiv [Accessed 11 December 2024].
[2] Y. Gao, et al., "Retrieval-Augmented Generation for Large Language Models: A Survey", arXiv:2312.10997. Available online: arXiv [Accessed 11 December 2024].
[3] Z. Durante, et al., "Agent AI: Surveying the Horizons of Multimodal Interaction", arXiv:2401.03568. Available online: arXiv [Accessed 10 December 2024].
[4] C. Li, et al., "LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day", arXiv:2306.00890. Available online: arXiv [Accessed 12 December 2024].
[5] X. Xu, et al., "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning", arXiv:2206.08657. Available online: arXiv [Accessed 9 December 2024].
[6] Llama-3.2-11B-Vision-Instruct: https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct.
[7] OpenVINO: openvino.ai

