<a href="https://colab.research.google.com/github/j-min/VL-T5/blob/main/inference_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# VL-T5 inference on custom images with Huggingface Faster R-CNN

## (Update) Difference in Faster R-CNN features
The Faster R-CNN (FRCNN) used in this repo is adapted from the [Huggingface LXMERT demo](https://github.com/huggingface/transformers/tree/master/examples/research_projects/lxmert).
While this Huggingface FRCNN implementation is easy to use with custom images, we recently found that it produces slightly different features from the FRCNN features used in LXMERT and VL-T5.

The Huggingface FRCNN is adequate for this demo, but the pretrained VL-T5 sometimes fails on its features, and evaluation with these features would yield degraded performance.

To use exactly the same feature extractor that was used in VL-T5 pretraining, see the
[LXMERT GitHub repo](https://github.com/airsplay/lxmert) or the [Detectron2-based FRCNN](https://github.com/airsplay/py-bottom-up-attention). Both repos are written by [Hao Tan](https://github.com/airsplay), the author of LXMERT.
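
One way to gauge how far two extractors diverge is to compare their region features for the same image. The sketch below is only an illustration: it assumes both feature sets are already available as equal-length lists of vectors with regions matched by index (the two detectors may propose different boxes, so that matching is itself an assumption), and `mean_cosine_similarity` is a hypothetical helper, not part of VL-T5 or LXMERT.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_cosine_similarity(feats_a, feats_b):
    """Average per-region cosine similarity between two index-aligned feature sets."""
    if len(feats_a) != len(feats_b) or not feats_a:
        raise ValueError("feature sets must be non-empty and have the same number of regions")
    sims = [cosine_similarity(u, v) for u, v in zip(feats_a, feats_b)]
    return sum(sims) / len(sims)
```

A value close to 1 would suggest the two extractors agree on that image; noticeably lower values would be consistent with the degraded behavior described above.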


## Download code and install dependencies

In [None]:
!git clone https://github.com/j-min/VL-T5

Cloning into 'VL-T5'...
remote: Enumerating objects: 184, done.
remote: Counting objects: 100% (184/184), done.
remote: Compressing objects: 100% (106/106), done.
remote: Total 184 (delta 95), reused 154 (delta 73), pack-reused 0
Receiving objects: 100% (184/184), 897.61 KiB | 22.44 MiB/s, done.
Resolving deltas: 100% (95/95), done.


In [None]:
cd VL-T5

/content/VL-T5


In [None]:
!pip uninstall param -y  # avoid a name conflict with src/param.py
!pip install -r requirements.txt
!python download_backbones.py

Uninstalling param-1.10.1:
  Successfully uninstalled param-1.10.1
Collecting git+git://github.com/j-min/language-evaluation@master (from -r requirements.txt (line 12))
  Cloning git://github.com/j-min/language-evaluation (to revision master) to /tmp/pip-req-build-kkr_ggfz
  Running command git clone -q git://github.com/j-min/language-evaluation /tmp/pip-req-build-kkr_ggfz
Collecting torch==1.6.0
  Downloading https://files.pythonhosted.org/packages/5d/5e/35140615fc1f925023f489e71086a9ecc188053d263d3594237281284d82/torch-1.6.0-cp37-cp37m-manylinux1_x86_64.whl (748.8MB)
     |████████████████████████████████| 748.8MB 15kB/s
Collecting transformers==4.2.1
  Downloading https://files.pythonhosted.org/packages/cd/40/866cbfac4601e0f74c7303d533a9c5d4a53858bd402e08e3e294dd271f25/transformers-4.2.1-py3-none-any.whl (1.8MB)
     |████████████████████████████████| 1.8MB 36.2MB/s
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/a

## Download the pretrained checkpoint

In [None]:
import gdown

In [None]:
!mkdir -p VL-T5/snap/pretrain/VLT5

In [None]:
gdown.download('https://drive.google.com/uc?id=100qajGncE_vc4bfjVxxICwz3dwiAxbIZ', 'VL-T5/snap/pretrain/VLT5/Epoch30.pth', quiet=False)

Downloading...
From: https://drive.google.com/uc?id=100qajGncE_vc4bfjVxxICwz3dwiAxbIZ
To: /content/VL-T5/VL-T5/snap/pretrain/VLT5/Epoch30.pth
898MB [00:10, 87.7MB/s]


'VL-T5/snap/pretrain/VLT5/Epoch30.pth'

## Add source code path

In [None]:
import sys

In [None]:
sys.path.append('/content/VL-T5/VL-T5/src')
sys.path.append('/content/VL-T5/VL-T5/inference')

In [None]:
cd VL-T5

/content/VL-T5/VL-T5


## Build a model and load weights from the pretrained checkpoint

In [None]:
!pip uninstall param -y
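
The uninstall matters because `from param import parse_args` has to resolve to the repository's `src/param.py`, not to the unrelated `param` package that Colab ships. A minimal way to check which file a module name resolves to, using only the standard library (`module_origin` is a hypothetical helper introduced for this illustration):

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module name resolves to, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# After the sys.path changes above, this should point into the repo's src/ directory
# rather than into site-packages.
print(module_origin("param"))
```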



In [None]:
from param import parse_args

In [None]:
args = parse_args(
    parse=False,
    backbone='t5-base',
    load='snap/pretrain/VLT5/Epoch30'
)
args.gpu = 0
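
Note that `load` is given without a file extension; the log further down shows the weights being read from `snap/pretrain/VLT5/Epoch30.pth`. The sketch below checks up front that the file exists, assuming the `.pth` suffix is simply appended to the prefix; `checkpoint_file` is a hypothetical helper, not a function from the VL-T5 code base.

```python
import os


def checkpoint_file(load_prefix, ext=".pth"):
    """Map a checkpoint prefix to its file path, failing early if the file is missing."""
    path = load_prefix + ext
    if not os.path.isfile(path):
        raise FileNotFoundError(f"checkpoint not found: {path}")
    return path
```

Calling `checkpoint_file('snap/pretrain/VLT5/Epoch30')` from the inner `VL-T5` directory would surface a wrong working directory or a failed download before the model is built.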

In [None]:
from vqa import Trainer

In [None]:
trainer = Trainer(args,
                  train=False
                  )



Building Model at GPU 0


Some weights of VLT5VQA were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.visual_embedding.feat_embedding.0.weight', 'encoder.visual_embedding.feat_embedding.0.bias', 'encoder.visual_embedding.feat_embedding.1.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.bias', 'encoder.visual_embedding.absolute_vis_pos_embedding.1.weight', 'encoder.visual_embedding.obj_order_embedding.weight', 'encoder.visual_embedding.img_order_embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded from  snap/pretrain/VLT5/Epoch30.pth
_IncompatibleKeys(missing_keys=[], unexpected_keys=['encoder.visual_embedding.layer_norm.weight'])
Model Launching at GPU 0
It took 10.4s
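
The `_IncompatibleKeys` line lists parameter names the model expects but the checkpoint lacks (`missing_keys`) and names present in the checkpoint that the model has no parameter for (`unexpected_keys`). Here nothing is missing and a single key, `encoder.visual_embedding.layer_norm.weight`, is unexpected. The comparison amounts to two set differences; the sketch below illustrates the idea on plain key lists and is not PyTorch's implementation (`diff_state_dict_keys` is a hypothetical helper).

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, mirroring the two fields reported above."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # model has them, checkpoint does not
    unexpected = sorted(checkpoint_keys - model_keys)   # checkpoint has them, model does not
    return missing, unexpected
```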


## Faster R-CNN inference script (from [Huggingface transformers LXMERT demo](https://github.com/huggingface/transformers/tree/master/examples/research_projects/lxmert))

In [None]:
from IPython.display import clear_output, Image, display
import PIL.Image
import io
import json
import torch
import numpy as np
from inference.processing_image import Preprocess
from inference.visualizing_image import SingleImageViz
from inference.modeling_frcnn import GeneralizedRCNN
from inference.utils import Config, get_data

import wget
import pickle
import os


URL = "https://raw.githubusercontent.com/airsplay/py-bottom-up-attention/master/demo/data/images/input.jpg"
OBJ_URL = "https://raw.githubusercontent.com/airsplay/py-bottom-up-attention/master/demo/data/genome/1600-400-20/objects_vocab.txt"
ATTR_URL = "https://raw.githubusercontent.com/airsplay/py-bottom-up-attention/master/demo/data/genome/1600-400-20/attributes_vocab.txt"
GQA_URL = "https://raw.githubusercontent.com/airsplay/lxmert/master/data/gqa/trainval_label2ans.json"
VQA_URL = "https://raw.githubusercontent.com/airsplay/lxmert/master/data/vqa/trainval_label2ans.json"

# Load the Visual Genome object/attribute vocabularies and the GQA/VQA answer lists
objids = get_data(OBJ_URL)
attrids = get_data(ATTR_URL)
gqa_answers = get_data(GQA_URL)
vqa_answers = get_data(VQA_URL)

# Load the pretrained Faster R-CNN and its matching image preprocessor
frcnn_cfg = Config.from_pretrained("unc-nlp/frcnn-vg-finetuned")
frcnn = GeneralizedRCNN.from_pretrained("unc-nlp/frcnn-vg-finetuned", config=frcnn_cfg)
image_preprocess = Preprocess(frcnn_cfg)

# for visualizing output
def showarray(a, fmt='jpeg'):
    a = np.uint8(np.clip(a, 0, 255))
    f = io.BytesIO()
    PIL.Image.fromarray(a).save(f, fmt)
    display(Image(data=f.getvalue()))

%s not found in cache or force_download set to True, downloading to %s https://s3.amazonaws.com/models.huggingface.co/bert/unc-nlp/frcnn-vg-finetuned/config.yaml /root/.cache/torch/transformers/tmp2hiwish2




loading configuration file cache
%s not found in cache or force_download set to True, downloading to %s https://cdn.huggingface.co/unc-nlp/frcnn-vg-finetuned/pytorch_model.bin /root/.cache/torch/transformers/tmpeamu3hf1




loading weights file https://cdn.huggingface.co/unc-nlp/frcnn-vg-finetuned/pytorch_model.bin from cache at /root/.cache/torch/transformers/57f6df6abe353be2773f2700159c65615babf39ab5b48114d2b49267672ae10f.77b59256a4cf8343ae0f923246a81489fc8d82f98d082edc2d2037c977c0d9d0
All model checkpoint weights were used when initializing GeneralizedRCNN.

All the weights of GeneralizedRCNN were initialized from the model checkpoint at unc-nlp/frcnn-vg-finetuned.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GeneralizedRCNN for predictions without further training.


In [None]:
image_filename = wget.download(URL)

In [None]:
image_dirname = image_filename
frcnn_visualizer = SingleImageViz(image_filename, id2obj=objids, id2attr=attrids) 

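# Preprocess the image for Faster R-CNN (returns the image tensor, sizes, and y/x scale factors)
images, sizes, scales_yx = image_preprocess(image_filename)

# Detect regions, padding the output to a fixed number of detections
output_dict = frcnn(
    images,
    sizes,
    scales_yx=scales_yx,
    padding='max_detections',
    max_detections=frcnn_cfg.max_detections,
    return_tensors='pt'
)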

# add boxes and labels to the image 
frcnn_visualizer.draw_boxes(
    output_dict.get("boxes"), 
    output_dict.get("obj_ids"),
    output_dict.get("obj_probs"),
    output_dict.get("attr_ids"), 
    output_dict.get("attr_probs"),
)

showarray(frcnn_visualizer._get_buffer())

normalized_boxes = output_dict.get("normalized_boxes") 
features = output_dict.get("roi_features") 

<IPython.core.display.Image object>
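
The model consumes `normalized_boxes` rather than pixel coordinates. As a rough illustration of what such a normalization typically looks like, the sketch below rescales `(x1, y1, x2, y2)` pixel boxes into the `[0, 1]` range by dividing by the image width and height. This is an assumption made for illustration; it has not been checked against how the Huggingface FRCNN computes `normalized_boxes`, and `normalize_boxes` is a hypothetical helper.

```python
def normalize_boxes(boxes, width, height):
    """Scale (x1, y1, x2, y2) pixel boxes to [0, 1] by image width and height."""
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")
    return [
        [x1 / width, y1 / height, x2 / width, y2 / height]
        for x1, y1, x2, y2 in boxes
    ]
```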

## Load Tokenizer

In [None]:
from tokenization import VLT5TokenizerFast

In [None]:
tokenizer = VLT5TokenizerFast.from_pretrained('t5-base')

## Inference

In [None]:
questions = ["vqa: What is the main doing?", 
             "vqa: What color is the clothing the man wears?", 
             "vqa: What color is the horse?",] 

In [None]:
for question in questions:
    # Tokenize the prefixed question into T5 input ids
    input_ids = tokenizer(question, return_tensors='pt', padding=True).input_ids

    # Pair the text with the Faster R-CNN region features and box coordinates
    batch = {
        'input_ids': input_ids,
        'vis_feats': features,
        'boxes': normalized_boxes,
    }

    result = trainer.model.test_step(batch)
    print(f"Q: {question}")
    print(f"A: {result['pred_ans'][0]}")

Q: vqa: What is the main doing?
A: riding
Q: vqa: What color is the clothing the man wears?
A: blue
Q: vqa: What color is the horse?
A: black
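
Each question above carries the `vqa: ` task prefix as part of the input text. To ask your own questions about the image, a small helper can add that prefix consistently. This is a sketch only: `make_prompts` is a hypothetical helper, and the prefix format is taken from the examples in this notebook.

```python
def make_prompts(questions, task_prefix="vqa"):
    """Prepend the task prefix (e.g. 'vqa: ') to each raw question string."""
    return [f"{task_prefix}: {q.strip()}" for q in questions]


custom_questions = make_prompts(["How many people are in the image?"])
```

The resulting list can be passed through the same tokenizer and `test_step` loop shown above.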
