# Prismer: A Vision-Language Model with Multi-Task Experts

*Vishal*

---

## 🧠 Motivation

As deep learning systems scale, so does the need for efficient multimodal reasoning. Prismer offers a promising direction: instead of training a huge monolithic architecture on massive datasets, it integrates pre-trained, task-specific vision experts into a unified vision-language model, aiming for both strong performance and efficiency. I found this shift interesting and relevant to current trends in efficient deep learning and compositionality.

---

## 🔍 Connection with Past & Current Work in Multimodal Learning

Historically, vision-language models (VLMs) like UNITER, ViLT, and BLIP have relied on large-scale image-text pairs and massive compute. Prismer challenges this by leveraging frozen, pre-trained experts across multiple vision tasks (e.g., depth estimation, OCR, segmentation) and combining their outputs with a lightweight encoder-decoder transformer. It draws inspiration from works like Perceiver, Flamingo, and the concept of Socratic models, while distinguishing itself with strong fine-tuned performance and low compute requirements.

---

## 📚 Key Learnings

- Multi-task vision signals improve generalization without increasing trainable parameters significantly.
- The model stays robust when an expert is noisy, which makes it resilient and flexible to deploy.
- Fine-tuning only ~20% of the model achieves performance comparable to full fine-tuning.
- Leveraging diverse visual experts enriches semantic understanding in downstream tasks like VQA and captioning.
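
To make the "~20% of the model" point concrete, here is a minimal sketch of the parameter accounting behind it. The component names and parameter counts are hypothetical placeholders chosen so the arithmetic is easy to follow; they are not taken from the Prismer paper or codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    num_params: int  # hypothetical count, NOT a figure from the paper
    trainable: bool  # False means the weights stay frozen during training


def trainable_fraction(components):
    """Return the share of all parameters that receive gradient updates."""
    total = sum(c.num_params for c in components)
    if total == 0:
        raise ValueError("model has no parameters")
    trainable = sum(c.num_params for c in components if c.trainable)
    return trainable / total


# Illustrative sizes only (assumption): most weights are inherited and frozen,
# and a small set of new layers is trained.
components = [
    Component("frozen_vision_experts", 500_000_000, trainable=False),
    Component("frozen_backbone_weights", 300_000_000, trainable=False),
    Component("trainable_fusion_layers", 200_000_000, trainable=True),
]

fraction = trainable_fraction(components)
print(f"trainable share: {fraction:.0%}")  # prints "trainable share: 20%"
```

In a real framework the same ratio would come from counting parameters that have gradients enabled versus all parameters; the sketch only illustrates the bookkeeping.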

---

## 🧪 Code / Experiments

The code and experiments are in the [prismer-blog-explained repository](https://github.com/vishalsai0234/prismer-blog-explained).

---



## 💭 Reflections

What surprised me?

- Prismer, with its experts kept frozen, outperforms many fully trainable VLMs that were trained on 10–100× more data.

- The model’s resilience to poor-quality (even noisy) experts is counterintuitive and inspiring.

Scope for improvement:

- Extend Prismer to work with dynamic or missing expert inputs.

- Explore sequential reasoning with text-based expert representations (e.g., using Pix2Seq).

- Improve adaptability to new experts without full retraining.

---

## 🔗 References
[Prismer GitHub Repo](https://github.com/NVlabs/prismer)

[Prismer Paper (2024)](https://arxiv.org/abs/2303.02506)

[Visual Haystacks Blog Example](https://bair.berkeley.edu/blog/2024/07/20/visual-haystacks/)

---