# Lenna: Language Enhanced Reasoning Detection Assistant
**Author: Ujwal Kirsan**

## Motivation
This project explores multimodal reasoning with vision-language models. As multimodal large language models (MLLMs) mature, there is growing demand for systems that can reason over images rather than only caption or classify them. The Lenna framework addresses this with a reasoning-based detection system that interprets implicit language cues.

## Connection with Past and Current Multimodal Learning Work
Multimodal learning has evolved from early image captioning systems to models that handle tasks such as VQA (Visual Question Answering) and REC (Referring Expression Comprehension). Recent advances include DetGPT, BLIP-2, and MiniGPT-v2, which use transformers for joint vision-language understanding. Lenna builds on these by introducing a `<DET>` token that signals detection intent. The token's enhanced embedding is meant to improve implicit reasoning and localization, and LoRA (Low-Rank Adaptation) keeps training efficient.
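
To make the LoRA efficiency point concrete, the sketch below compares trainable parameter counts for a single linear layer. It relies only on the general LoRA formulation, in which the weight update is factored as `B @ A` with a small rank. The hidden size and rank are assumed values for illustration, not Lenna's actual settings.

```python
# Illustrative arithmetic only: the layer size and rank are assumed, not Lenna's settings.


def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a LoRA update B @ A, with A: rank x d_in and B: d_out x rank."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in = d_out = 4096  # assumed hidden size
    rank = 8             # assumed LoRA rank
    full = full_finetune_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Because the count grows with `rank * (d_in + d_out)` instead of `d_in * d_out`, a small rank leaves only a small fraction of the layer's weights trainable.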

## Learning from the Work
From implementing Lenna and analyzing its architecture, I learned:
- How a simple token injection (like `<DET>`) can guide multimodal models in specialized tasks.
- The value of using aligned embeddings for semantic and positional features.
- The significance of training data design (REC, VQA, ReasonDet) in guiding the model's reasoning capabilities.
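
The token-injection idea in the first point can be illustrated with a toy example: register `<DET>` as a special token, find where it appears in a sequence, and pull out the vector at that position for a downstream detector. This is a minimal sketch under stated assumptions. The whitespace tokenizer, the helper names, and the stand-in hidden states are all hypothetical and are not taken from the Lenna codebase.

```python
# Toy sketch: names, vocabulary, and vectors are hypothetical, not Lenna's actual code.

DET_TOKEN = "<DET>"


def build_vocab(words, special_tokens=(DET_TOKEN,)):
    """Map each word to an id, then append special tokens at the end of the vocabulary."""
    vocab = {}
    for token in list(words) + list(special_tokens):
        vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Whitespace tokenization; raises KeyError for unknown tokens."""
    return [vocab[token] for token in text.split()]


def det_positions(token_ids, vocab):
    """Indices in the sequence where the <DET> token occurs."""
    det_id = vocab[DET_TOKEN]
    return [i for i, tid in enumerate(token_ids) if tid == det_id]


def select_det_states(hidden_states, positions):
    """Pick the per-token vectors at the <DET> positions.

    In this sketch these vectors stand in for what a detector would receive.
    """
    return [hidden_states[i] for i in positions]


if __name__ == "__main__":
    vocab = build_vocab(["the", "racket", "is", "here"])
    ids = encode("the racket is here <DET>", vocab)
    # Stand-in for model outputs: one small vector per token.
    hidden = [[float(i), float(i) * 0.5] for i in range(len(ids))]
    positions = det_positions(ids, vocab)
    print(positions, select_det_states(hidden, positions))
```

In a real model the tokenizer and embedding table would be extended through the library's own mechanisms, and the hidden states would come from the language model's final layer. The sketch only shows the bookkeeping that a special token makes possible.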

## Code / Notebook
The Lenna implementation for this project is available on GitHub: [GitHub Repo](https://github.com/ujwalkirsan/Lenna_MMDP_Project)

```python
# Usage placeholder: to be replaced with actual demo code once Lenna is integrated locally.
# from lenna import LennaModel
# model = LennaModel()
# result = model.detect_reason(image_path="example.jpg", question="What shows he is playing sports?")
# print(result)
```


## Reflections
**What surprised me?**
- How simple and effective a special token (`<DET>`) is for switching the model into object detection mode.
- How strongly reasoning-based prompts affect model performance compared with direct object queries.

**Scope for improvement:**
- Expand reasoning capabilities to video and real-time applications.
- Improve generalization to unseen types of implicit reasoning in natural language.


## References
- Lenna Paper: *Lenna: Language Enhanced Reasoning Detection Assistant*
- [GitHub Repository](https://github.com/ujwalkirsan/Lenna_MMDP_Project)
- DetGPT, MiniGPT-v2, Grounding-DINO
- [BLIP-2 Paper](https://arxiv.org/abs/2301.12597)
- [LLaVA: Visual Instruction Tuning](https://arxiv.org/abs/2304.08485)
