Visual Question Answering Using CLIP + LSTM

The CLIP + LSTM architecture.

The visual question-answering problem can be described as "asking our computer to answer questions about a particular image." In this project, a CLIP + LSTM architecture lends a helping hand in solving the problem. The image and text encoders of CLIP process the given image and question, respectively. The concatenated image-text representation from CLIP is then combined with the vectorized answer text via the Hadamard product before being fed to the LSTM. The answer to the question is finally generated in an autoregressive fashion. Here, the VizWiz-VQA dataset is utilized to train, validate, and test the model. The training set of the dataset is used in the training and validation phases, split by a ratio of 99:1. The validation set of the dataset is employed for testing, and the test set of VizWiz-VQA is leveraged at inference time. The SQuAD and BLEU metrics are used to gauge the performance of the model quantitatively.
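
As a rough illustration of the pipeline described above, the sketch below fuses CLIP's image and question embeddings, applies the Hadamard product with the embedded answer tokens, and decodes with an LSTM. This is a minimal sketch, not the repository's code: the module name CLIPLSTMVQA, the ViT-B/32 backbone, the use of the OpenAI clip package, and the layer sizes are assumptions for illustration.

```python
# Illustrative CLIP + LSTM VQA model (names and sizes are assumptions,
# not taken from the repository).
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package (assumed)


class CLIPLSTMVQA(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.clip_model, _ = clip.load("ViT-B/32")   # image + text encoders
        for p in self.clip_model.parameters():       # keep CLIP frozen
            p.requires_grad = False

        # Project the concatenated image-text feature to the answer embedding size.
        self.fusion = nn.Linear(2 * embed_dim, embed_dim)
        self.answer_embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, question_tokens, answer_tokens):
        with torch.no_grad():
            img_feat = self.clip_model.encode_image(image).float()           # (B, 512)
            txt_feat = self.clip_model.encode_text(question_tokens).float()  # (B, 512)

        # Concatenate the image and question representations, then project.
        fused = self.fusion(torch.cat([img_feat, txt_feat], dim=-1))         # (B, 512)

        # Hadamard (element-wise) product with each answer-token embedding.
        ans_emb = self.answer_embedding(answer_tokens)                       # (B, T, 512)
        conditioned = ans_emb * fused.unsqueeze(1)                           # (B, T, 512)

        # The LSTM predicts the next answer token at each step (teacher forcing
        # during training); at inference time the answer is generated
        # autoregressively, one token at a time.
        out, _ = self.lstm(conditioned)
        return self.classifier(out)                                          # (B, T, vocab)
```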

Experiment

Give yourself a delightful excursion through the lines of code of the experiment provided in this notebook.

Result

Quantitative Result

Here are the evaluation metric results of the model.

Metric               Score
------               ------
BLEU 1-gram          44.67%
SQuAD Exact Match    44.43%
SQuAD F1-score       44.83%
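
For context on how such scores could be produced, the snippet below computes BLEU-1 and the SQuAD Exact Match/F1 with torchmetrics. The use of torchmetrics and the example strings are assumptions; the repository's actual evaluation code may differ.

```python
# Minimal sketch of computing the reported metrics with torchmetrics
# (assumed here; not necessarily the repository's evaluation code).
from torchmetrics.text import BLEUScore, SQuAD

predictions = ["a bottle of water"]   # hypothetical model outputs
references = ["a water bottle"]       # hypothetical ground-truth answers

# BLEU restricted to 1-grams, matching the "BLEU 1-gram" row above.
bleu = BLEUScore(n_gram=1)
bleu_1 = bleu(predictions, [[ref] for ref in references])

# The SQuAD metric reports Exact Match and F1 over answer strings.
squad = SQuAD()
preds = [{"prediction_text": p, "id": str(i)} for i, p in enumerate(predictions)]
target = [
    {"answers": {"text": [r], "answer_start": [0]}, "id": str(i)}
    for i, r in enumerate(references)
]
squad_scores = squad(preds, target)

print("BLEU-1:", bleu_1.item())
print("Exact Match:", squad_scores["exact_match"].item())
print("F1-score:", squad_scores["f1"].item())
```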

Loss Curve

Loss curves of the CLIP + LSTM model on the train and validation sets.

Qualitative Result

The following image exhibits the collated results of the VQA model.

A collection of qualitative results containing question-answer-image triads.

Credit