Visual question answering asks a model to answer a given question about a particular image. This project tackles the problem with a CLIP + LSTM architecture. CLIP's image and text encoders encode the image and the question, respectively; the two feature vectors are concatenated, and the resulting image-text representation is combined with the embedded answer tokens via a Hadamard (element-wise) product before being fed to an LSTM, which generates the answer autoregressively. The VizWiz-VQA dataset is used to train, validate, and test the model: its training set is split 99:1 into training and validation portions, and its validation set is held out for testing. Model performance is measured quantitatively with the SQuAD metrics (exact match and F1) and BLEU. At inference time, the VizWiz-VQA test set is leveraged.
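The sketch below illustrates this fusion scheme in PyTorch. It is a minimal, illustrative implementation rather than the notebook's actual code: the class name `ClipLstmVqa`, the `openai/clip-vit-base-patch32` checkpoint, and the embedding/hidden sizes are all assumptions.

```python
# Minimal sketch of the CLIP + LSTM fusion described above (assumptions:
# class/parameter names, the CLIP checkpoint, and the 512-d feature sizes).
import torch
import torch.nn as nn
from transformers import CLIPModel


class ClipLstmVqa(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        for p in self.clip.parameters():  # keep the CLIP encoders frozen
            p.requires_grad = False
        # Project the concatenated image + question features (512 + 512) to embed_dim.
        self.fuse = nn.Linear(2 * 512, embed_dim)
        self.answer_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, pixel_values, input_ids, attention_mask, answer_ids):
        img = self.clip.get_image_features(pixel_values=pixel_values)      # (B, 512)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)   # (B, 512)
        fused = self.fuse(torch.cat([img, txt], dim=-1))                   # (B, embed_dim)
        ans = self.answer_embed(answer_ids)                                # (B, T, embed_dim)
        # Hadamard product of the fused CLIP features with each answer-token embedding.
        mixed = ans * fused.unsqueeze(1)                                   # (B, T, embed_dim)
        out, _ = self.lstm(mixed)                                          # (B, T, hidden_dim)
        return self.head(out)                                              # next-token logits
```

During training the answer tokens are teacher-forced; at inference the LSTM is run one step at a time, feeding each predicted token back in until an end-of-answer token is produced.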
Give yourself a delightful excursion by walking through the lines of code for the experiment provided in this notebook.
Here are the model's evaluation results.

| Metric | Score |
|---|---|
| BLEU (1-gram) | 44.67% |
| Exact Match (SQuAD) | 44.43% |
| F1-score (SQuAD) | 44.83% |
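Scores like these could be computed with the Hugging Face `evaluate` library, as in the hedged sketch below; whether the notebook uses `evaluate` or `sacrebleu` is an assumption, and the prediction/reference strings are placeholders.

```python
# Assumed evaluation setup: Hugging Face `evaluate` with placeholder data.
import evaluate

squad = evaluate.load("squad")
bleu = evaluate.load("bleu")

predictions = ["a red coffee mug"]   # model answers (placeholders)
references = ["a red mug"]           # ground-truth answers (placeholders)

squad_preds = [{"id": str(i), "prediction_text": p} for i, p in enumerate(predictions)]
squad_refs = [{"id": str(i), "answers": {"text": [r], "answer_start": [0]}}
              for i, r in enumerate(references)]

print(squad.compute(predictions=squad_preds, references=squad_refs))  # exact_match, f1
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references],
                   max_order=1))                                      # 1-gram BLEU
```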
Loss curves of the CLIP + LSTM model on the training and validation sets.
The following image shows a collage of qualitative results from the VQA model.
A collection of qualitative results containing question-answer-image triads.
- An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Learning Transferable Visual Models From Natural Language Supervision
- Less Is More: Linear Layers on CLIP Features as Powerful VizWiz Model
- Long Short-Term Memory
- SQuAD: 100,000+ Questions for Machine Comprehension of Text
- A Call for Clarity in Reporting BLEU Scores
- CLIP
- LLaMA 2 from scratch 🦙
- yousefkotp's Visual Question Answering
- aladdinpersson's Image Captioning
- VizWiz-VQA
- PyTorch Lightning