VisionQA-Llama2-OWLViT

Introduction

This is a multimodal model designed for the Vision Question Answering (VQA) task. It integrates the Llama2 13B, OWL-ViT, and YOLOv8 models, using hard prompt tuning.

Features:

  1. Llama2 13B handles language understanding and answer generation.
  2. OWL-ViT identifies objects in the image that are relevant to the question.
  3. YOLOv8 efficiently detects and annotates objects within the image.

Combining these models leverages their strengths for precise and efficient VQA, ensuring accurate object recognition and context understanding from both language and visual inputs.
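The glue between the three models is the hard prompt: detector output is serialized into text and prepended to the question before it reaches Llama2. Below is a minimal sketch of that flow, assuming the Hugging Face OWL-ViT checkpoint google/owlvit-base-patch32 and the ultralytics YOLOv8 API; the function names, prompt template, and thresholds are illustrative assumptions, not the repository's actual code.

# Minimal sketch (assumptions, not the repo's code): detections -> hard prompt for Llama2.
from PIL import Image
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection
from ultralytics import YOLO

def detect_with_owlvit(image, text_queries, threshold=0.1):
    """Return the query labels that OWL-ViT finds in the image."""
    processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
    model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
    inputs = processor(text=[text_queries], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes)[0]
    return [text_queries[i] for i in results["labels"].tolist()]

def detect_with_yolo(image_path, weight="yolov8n.pt"):
    """Return the class names of every object YOLOv8 detects."""
    result = YOLO(weight)(image_path)[0]
    return [result.names[int(c)] for c in result.boxes.cls]

def build_hard_prompt(question, owl_labels, yolo_labels):
    """Fold the detected objects into a text prompt for Llama2 13B."""
    context = ", ".join(sorted(set(owl_labels + yolo_labels)))
    return (f"Objects detected in the image: {context}.\n"
            f"Question: {question}\nAnswer briefly:")

The resulting prompt string is then passed to Llama2 13B for answer generation; no visual features are fed to the language model directly, only the textual description of the detections.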

Requirements

pip install -r requirements.txt

Data

Evaluation is performed on the testing data of the GQA dataset.
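For reference, the loader below is only an assumed sketch of how GQA data is commonly laid out (question JSON files such as testdev_balanced_questions.json keyed by question ID, with imageId, question, and answer fields); the exact file names and fields used by this repository may differ.

# Assumed GQA layout; adjust names to your local copy of the dataset.
import json
import os

def load_gqa_questions(dataroot, mode="testdev"):
    """Yield (image path, question, answer) records from a GQA question file."""
    path = os.path.join(dataroot, f"{mode}_balanced_questions.json")
    with open(path) as f:
        questions = json.load(f)
    for qid, rec in questions.items():
        yield {
            "question_id": qid,
            "image_path": os.path.join(dataroot, "images", rec["imageId"] + ".jpg"),
            "question": rec["question"],
            "answer": rec.get("answer"),
        }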

Eval

python val_zero_shot.py 

--imgs_path: Path to the directory containing the GQA images
--dataroot: Path to the GQA dataset root
--mode: One of ['testdev', 'val', 'train']
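For example (the paths below are placeholders; point them at your local GQA copy):

python val_zero_shot.py --imgs_path ./GQA/images --dataroot ./GQA --mode testdev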

Run

python zero_shot.py

--img_path: Path to the question image
--yolo_weight: Path to the pre-trained YOLOv8 weights
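For example (the image path and weight file below are placeholders):

python zero_shot.py --img_path ./examples/question.jpg --yolo_weight ./weights/yolov8n.pt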

Prediction result

  1. The GQA accuracy score is 0.52.

