## VLBart PyReft Integration
This is a preliminary attempt at integrating VLBart with PyReft.
### Instructions
1. Use Pyvene's peterwz-llava branch and PyReft's peterwz-llava branch.
2. Go to pyreft/examples/vlbart/DoRA/image_video_text_understanding and install the packages at the versions pinned in the requirements.txt there. Note that DoRA requires a much older version of transformers.
3. Download the dataset by following the instructions in pyreft/examples/vlbart/DoRA/image_video_text_understanding/README.md: go to the Google Drive link, download the processed CLIP features, and put them in pyreft/examples/vlbart/DoRA/datasets/. This notebook only uses the VQA features.
4. In image_video_text_understanding/download_backbones.py, change the cache directory to your directory storing the models.
5. Try running image_video_text_understanding/VL-T5/scripts/image/dora.sh to check that DoRA (the VLBart model) is installed correctly.
6. Run this notebook.
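
Before running the notebook, it can help to confirm that the files and directories named in the steps above are where the notebook expects them. The sketch below is only a convenience check: `missing_paths` is a hypothetical helper (not part of PyReft or DoRA), and it assumes the paths are relative to the directory that contains the pyreft checkout.

```python
from pathlib import Path

# Paths mentioned in the instructions, relative to the directory that
# contains the pyreft checkout (an assumption of this sketch).
EXPECTED_PATHS = [
    "pyreft/examples/vlbart/DoRA/image_video_text_understanding/requirements.txt",
    "pyreft/examples/vlbart/DoRA/image_video_text_understanding/README.md",
    "pyreft/examples/vlbart/DoRA/image_video_text_understanding/download_backbones.py",
    "pyreft/examples/vlbart/DoRA/image_video_text_understanding/VL-T5/scripts/image/dora.sh",
    "pyreft/examples/vlbart/DoRA/datasets",
]


def missing_paths(root, relative_paths=EXPECTED_PATHS):
    """Return the entries of relative_paths that do not exist under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]


if __name__ == "__main__":
    # Print whatever is missing when run from the checkout's parent directory.
    for path in missing_paths("."):
        print("missing:", path)
```

An empty result only means the paths exist; it does not check package versions or the contents of the dataset directory.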
### Known Issues
1. The generated outputs may be incorrect, as the generation experiments below show. The intervention locations are all untested.
2. I removed the "+1" padding offset when performing PyReft interventions (note that pad_mode is "none" instead of "left").
3. Training is fast for the first few steps, then becomes very slow. I suspect this is related to the data-loading cache behavior. Batching the dataset loading, instead of the lazy loading currently done through ReftDataloaderDataset, may be a better option.
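
One way to test the suspicion in item 3 is to read every example once, up front, and serve later accesses from memory. The wrapper below is a minimal sketch of that idea, not something PyReft provides: `EagerCachedDataset` is a hypothetical name, and it only assumes the wrapped object supports `len()` and integer indexing, as map-style datasets do.

```python
class EagerCachedDataset:
    """Map-style wrapper that reads every item once, at construction time.

    Hypothetical helper for experimentation; it assumes the wrapped dataset
    supports len() and integer indexing.
    """

    def __init__(self, dataset, max_n_example=None):
        n = len(dataset)
        if max_n_example is not None:
            n = min(n, max_n_example)
        # Pay the loading cost once here instead of on every training step.
        self._items = [dataset[i] for i in range(n)]

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Served from memory; the wrapped dataset is not touched again.
        return self._items[index]
```

The trade-off is memory: every example, including its visual features, stays resident, so this is only practical for subsets that fit in RAM.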

In [1]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), 'DoRA/image_video_text_understanding/VL-T5/src')))

In [2]:
import vqa_clip_data

In [3]:
vqa_args = {'RefCOCO_BUTD': False,
 'RefCOCO_GT': False,
 'adam_beta1': 0.9,
 'adam_beta2': 0.999,
 'adam_eps': 1e-06,
 'add_adapter_cross_attn': True,
 'add_layer_norm_after_adapter': False,
 'add_layer_norm_before_adapter': False,
 'additional_visual_embedding_layers': 0,
 'answer_normalize': False,
 'backbone': 'facebook/bart-base',
 'batch_size': 512,
 'caption_cocoonly': True,
 'caption_only': False,
 'classifier': False,
 'clip_grad_norm': 5.0,
 'cls_task': 'tinyimagenet',
 'coco_only': False,
 'comment': '',
 'decoder_prompt_len': 0,
 'deepspeed': None,
 'distributed': False,
 'do_lower_case': False,
 'dora_simple': False,
 'downsample': True,
 'dropout': 0.1,
 'dry': False,
 'efficient_unique_hyper_net': False,
 'encoder_prompt_len': 0,
 'epochs': 20,
 'expand_vis_embedding': False,
 'factorized_phm': True,
 'feat_dim': 2048,
 'feature_type': 'RN101',
 'fp16': False,
 'freeze_bn_statistics': False,
 'freeze_ln_statistics': False,
 'from_scratch': False,
 'full_determinism': False,
 'gen_max_length': 20,
 'gpu': 0,
 'gradient_accumulation_steps': 1,
 'ground_upsample': 1,
 'ground_weight': 1,
 'hypercomplex_division': 4,
 'image_size': '(224,224)',
 'individual_vis_layer_norm': True,
 'itm_cocoonly': True,
 'lambda_z': 0.001,
 'load': None,
 'load_lxmert_qa': None,
 'local_rank': 0,
 'log_train_accuracy': False,
 'lora_alpha': 32,
 'lora_dim': 128,
 'lora_settings': True,
 'losses': 'lm,obj,attr,feat',
 'low_rank_rank': 1,
 'lr': 0.001,
 'max_n_boxes': 36,
 'max_text_length': 20,
 'mid_dim': 768,
 'multiGPU': True,
 'multitask_sampling': 'roundrobin',
 'n_boxes': 36,
 'n_ground': 1,
 'n_image_tokens': 4,
 'no_prefix': False,
 'num_beams': 5,
 'num_workers': 4,
 'obj_mask_rate': 0.15,
 'oneddownsample': False,
 'optim': 'adamw',
 'optimizer': 'adamw',
 'oscar_tags': False,
 'output': 'snap/VLBart_multitask/tune+lr1e-3_plzplz2',
 'phm_init_range': 0.01,
 'phm_rank': 1,
 'pos_dim': 4,
 'post_prompt': '',
 'prefix': None,
 'project_name': 'RN101_LMsingle_dora_128_bs300_image224_lora_settings',
 'projected_task_embedding_dim': -1,
 'prompt': 'vqa: ',
 'raw_label': False,
 'reduction_factor': 16,
 'remove_bn_vis_adapter': False,
 'run_name': 'tune+lr1e-3_plzplz2',
 'seed': 9595,
 'share_down_sampler': False,
 'share_up_sampler': False,
 'share_vis_lang_layer_norm': False,
 'shared_phm_rule': True,
 'shared_phm_rule_over_tasks': False,
 'shuffle_boxes': False,
 'single_vqa_prefix': False,
 'sparse_sample': False,
 'submit': False,
 'tasks': 'vqa',
 'test': None,
 'test_answerable': False,
 'test_only': False,
 'testing': False,
 'tokenizer': None,
 'track_z': False,
 'train': 'train',
 'train_topk': -1,
 'unfreeze_batch_norms': False,
 'unfreeze_bias': True,
 'unfreeze_decoder_layer_norms': False,
 'unfreeze_encoder_layer_norms': False,
 'unfreeze_language_model': False,
 'unfreeze_layer_norms': True,
 'unfreeze_lm_head': False,
 'unfreeze_vis_encoder': False,
 'unfreeze_vis_last_layer': False,
 'unique_hyper_net': False,
 'use_adam_for_visual': False,
 'use_adapter': False,
 'use_attn_prefix': False,
 'use_compacter': False,
 'use_data_augmentation': False,
 'use_dora': True,
 'use_hyperformer': False,
 'use_lm_head_adapter': False,
 'use_lora': False,
 'use_lradapter': False,
 'use_separate_optimizer_for_visual': False,
 'use_single_adapter': False,
 'use_single_lora': False,
 'use_single_prompt': False,
 'use_tasks_prompts': True,
 'use_vis_adapter': False,
 'use_vis_layer_norm': True,
 'use_vis_order_embedding': True,
 'use_vision': True,
 'valid': 'valid',
 'valid_batch_size': 512,
 'valid_topk': -1,
 'vis_adapter_type': 'middle-bottleneck',
 'vis_lr': 0.0001,
 'vis_pointer': False,
 'vis_pooling_output': False,
 'vis_reduction_factor': 2,
 'vis_use_transformer': False,
 'vis_weight_decay': 0.01,
 'warmup_ratio': 0.1,
 'weight_decay': 0.01,
 'word_mask_rate': 0.15,
 'world_size': 1}

In [4]:
from types import SimpleNamespace
args = SimpleNamespace(**vqa_args)

In [5]:
train_loaders = []
vqa_train_loader = vqa_clip_data.get_loader(
    args,
    split='karpathy_train', mode='train', batch_size=args.batch_size,
    distributed=args.distributed, gpu=0,
    workers=args.num_workers,
    topk=args.train_topk,
)
train_loaders.append(vqa_train_loader)

Load 605102 data from split(s) karpathy_train.
# Answers: 3129
Data sources:  ['karpathy_train']
Loaded 605102 data from karpathy_train
# all sentences: 605102




In [6]:
from pyreft.dataset import ReftDataset, ReftDataloaderDataset
from pyreft import (
    ReftTrainerForCausalLM, 
    ReftDataCollator,
    LoreftIntervention,
    TaskType,
    ReftConfig,
    get_reft_model,
)
import torch

In [7]:
class VLBartDataset(ReftDataloaderDataset):
    """
    A ReftDataloaderDataset that wraps the dataset behind an existing
    VL-BART dataloader (here, the VQA CLIP-feature loader).

    Items arrive already tokenized, so tokenize() passes them through
    and only reports the last input position. input_field and
    label_field are set in preprocess().
    """
    def load_dataset(self):
        """Load the dataset (or a portion of it) from HF or a local file."""

        self.task_dataset = self.dataloader.dataset
        self.collate_fn = self.task_dataset.collate_fn
        self.pad_mode = "none"

        # keep only the first max_n_example examples, if specified
        if self.max_n_example is not None:
            self.task_dataset = torch.utils.data.Subset(self.task_dataset, list(range(self.max_n_example)))

        # keep a pointer to the raw dataset (non-train splits only) to access raw strings
        self.raw_dataset = self.task_dataset if self.data_split != "train" else None
        return self.task_dataset

    def preprocess(self, kwargs):
        self.input_field = "input_ids"
        self.label_field = "target_ids"

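    def tokenize(self, data_item):
        # Items from the wrapped dataset are already tokenized; copy them through.
        result = {**data_item}

        # Index of the last input token, used to place the intervention locations.
        last_position = len(data_item[self.input_field]) - 1

        return result, last_position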

In [8]:
from transformers import BartTokenizer, TrainingArguments
tokenizer = BartTokenizer.from_pretrained(
    args.backbone,
    max_length=args.max_text_length,
    do_lower_case=args.do_lower_case
)

In [9]:
layers = [1,2,3]
position = "f1+l1"
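
To make the position string concrete: "f1+l1" asks for the first token and the last token of the input. The sketch below is a simplified illustration of that mapping and is not PyReft's own implementation; `split_position_spec`, `token_positions`, and the `offset` argument are hypothetical names, with `offset` standing in for the "+1" shift mentioned under Known Issues.

```python
import re


def split_position_spec(position):
    """Split a spec such as "f1+l1" into (first_n, last_n)."""
    first_n, last_n = 0, 0
    for part in position.split("+"):
        match = re.fullmatch(r"([fl])(\d+)", part)
        if match is None:
            raise ValueError(f"unrecognized position spec: {part!r}")
        if match.group(1) == "f":
            first_n = int(match.group(2))
        else:
            last_n = int(match.group(2))
    return first_n, last_n


def token_positions(seq_len, first_n, last_n, offset=0):
    """Indices of the first first_n and last last_n tokens, shifted by offset."""
    first = list(range(min(first_n, seq_len)))
    last = list(range(max(seq_len - last_n, 0), seq_len))
    return [i + offset for i in first + last]


first_n, last_n = split_position_spec("f1+l1")
print(token_positions(6, first_n, last_n))            # [0, 5]
print(token_positions(6, first_n, last_n, offset=1))  # [1, 6]
```

Comparing the two printed lists shows why the pad mode matters: with a prepended pad token every location moves by one, so an unshifted location would intervene on the wrong token.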

In [10]:
train_dataset = VLBartDataset(
    "vqa", 
    tokenizer, data_split="train", 
    dataloader=vqa_train_loader,
    max_n_example=1000,
    **{"num_interventions": len(layers), "position": position, 
       "share_weights": True, "test_split": "validation"}
)
# NOTE: the eval dataset reuses vqa_train_loader, so evaluation runs on training examples.
eval_dataset = VLBartDataset(
    "vqa", 
    tokenizer, data_split="val", 
    dataloader=vqa_train_loader,
    max_n_example=1000,
    **{"num_interventions": len(layers), "position": position, 
       "share_weights": True, "test_split": "validation"}
)

In [11]:
from multitask import Trainer
trainer = Trainer(args, vqa_train_loader, None, None, train=True)

Building Model at GPU 0
Model Launching at GPU 0
model.encoder.visual_embedding.feat_embedding.0.weight is trainable...
model.encoder.visual_embedding.feat_embedding.0.bias is trainable...
model.encoder.visual_embedding.feat_embedding.1.weight is trainable...
model.encoder.visual_embedding.feat_embedding.1.bias is trainable...
model.encoder.visual_embedding.absolute_vis_pos_embedding.0.weight is trainable...
model.encoder.visual_embedding.absolute_vis_pos_embedding.0.bias is trainable...
model.encoder.visual_embedding.absolute_vis_pos_embedding.1.weight is trainable...
model.encoder.visual_embedding.absolute_vis_pos_embedding.1.bias is trainable...
model.encoder.visual_embedding.img_order_embedding.weight is trainable...
apply dora tuning
model.encoder.layers.0.self_attn.k_proj.bias is trainable...(768)
model.encoder.layers.0.self_attn.v_proj.bias is trainable...(768)
model.encoder.layers.0.self_attn.q_proj.bias is trainable...(768)
model.encoder.layers.0.self_attn.out_proj.bias is tra



In [12]:
model = trainer.model

In [13]:
print(model.config)

BartConfig {
  "RefCOCO_BUTD": false,
  "RefCOCO_GT": false,
  "_name_or_path": "facebook/bart-base",
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "adam_beta1": 0.9,
  "adam_beta2": 0.999,
  "adam_eps": 1e-06,
  "adapter_config": null,
  "add_adapter_cross_attn": true,
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "add_layer_norm_after_adapter": false,
  "add_layer_norm_before_adapter": false,
  "additional_visual_embedding_layers": 0,
  "answer_normalize": false,
  "architectures": [
    "BartModel"
  ],
  "attention_dropout": 0.1,
  "backbone": "facebook/bart-base",
  "batch_size": 512,
  "bos_token_id": 0,
  "caption_cocoonly": true,
  "caption_only": false,
  "classif_dropout": 0.1,
  "classifier": false,
  "classifier_dropout": 0.0,
  "clip_grad_norm": 5.0,
  "cls_task": "tinyimagenet",
  "coco_only": false,
  "comment": "",
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layer

In [14]:
train_dataset.collate_fn

<bound method VQAFineTuneDataset.collate_fn of <vqa_clip_data.VQAFineTuneDataset object at 0x7fdf5355f490>>

In [15]:
print(vqa_train_loader.dataset[0].keys())

dict_keys(['args', 'img_id', 'vis_feats', 'boxes', 'question_id', 'sent', 'input_ids', 'input_length', 'is_topk_optimal', 'label', 'answer', 'score', 'all_answers', 'target_ids', 'target_length'])


In [16]:
import numpy as np
import transformers

def keep_intervention_locations(datum):
    # Keep only the fields that DataCollatorForSeq2Seq is asked to pad.
    return {
        "input_ids": datum["input_ids"],
        "intervention_locations": datum["intervention_locations"],
        "attention_mask": datum["attention_mask"],
    }

def custom_collate_fn(data):
    # The VQA dataset's own collate_fn builds the main batch.
    collate_fn_1 = train_dataset.collate_fn
    # The seq2seq collator is only used for intervention_locations and attention_mask.
    collate_fn_2 = transformers.DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
        label_pad_token_id=-100,
        padding="longest"
    )
    output = collate_fn_1(data)
    custom_data = [keep_intervention_locations(item) for item in data]
    output_2 = collate_fn_2(custom_data)
    output["intervention_locations"] = output_2["intervention_locations"]
    output["attention_mask"] = output_2["attention_mask"]

    output["id"] = np.array([d["id"] for d in data])

    # Move the loader's original "labels" to "logits" and train against target_ids.
    output["logits"] = output["labels"]
    output["labels"] = output["target_ids"]

    return output

data_collator = ReftDataCollator(data_collator=custom_collate_fn)

In [17]:
rank = 1
dropout = 0.05


In [18]:
representations = [{
    "layer": l, "component": "block_output",
    "low_rank_dimension": rank,
    "intervention": LoreftIntervention(
        embed_dim=model.config.d_model, low_rank_dimension=rank,
        dropout=dropout, dtype=torch.float32, act_fn=None, device="cuda",
        add_bias=True
    )
} for l in layers]
task_type=TaskType.CAUSAL_LM

reft_config = ReftConfig(representations=representations)

In [19]:
reft_model = get_reft_model(model, reft_config)
reft_model.print_trainable_parameters()

trainable intervention params: 4,611 || trainable model params: 0
model params: 148,262,400 || trainable%: 0.0031100265475265476
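
The 4,611 figure can be reproduced by hand. The sketch below assumes each LoReFT intervention holds a low-rank rotation matrix of shape embed_dim x rank plus a linear projection of the same shape with a bias of size rank; that breakdown is an assumption about the intervention's internals, although it does match the reported total. `loreft_param_count` is a hypothetical helper.

```python
def loreft_param_count(embed_dim, rank):
    """Parameter count of one intervention under the assumed layout."""
    rotation = embed_dim * rank    # low-rank rotation matrix
    projection = embed_dim * rank  # weight of the learned linear projection
    bias = rank                    # bias of that projection
    return rotation + projection + bias


embed_dim, rank, n_layers = 768, 1, 3   # d_model, rank, len(layers) from the cells above
intervention_params = n_layers * loreft_param_count(embed_dim, rank)
model_params = 148_262_400              # taken from the printed output
trainable_percent = 100 * intervention_params / model_params

print(intervention_params)  # 4611
print(trainable_percent)
```

Each layer contributes 768 + 768 + 1 = 1,537 parameters, and three layers give 4,611, which is about 0.0031% of the frozen model.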


In [20]:
training_args = TrainingArguments(
    output_dir="random",
    run_name="random",
    num_train_epochs=1,
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=args.batch_size,
    gradient_accumulation_steps=1,
    evaluation_strategy="epoch",
    save_strategy="no",
    metric_for_best_model=None,
    load_best_model_at_end=False,
    logging_strategy="steps",
    save_total_limit=1, # for GLUE, it will save 2 at max.
    logging_steps=1,
    learning_rate=1e-3,
    warmup_ratio=0.05,
    optim="adamw_torch",
    weight_decay=0,
    report_to="none",
    use_cpu=False,
    seed=42,
    # until HF supports ReFT, this remains False! :)
    remove_unused_columns=False
)

In [21]:
# from torch.utils.data import DataLoader

# class MyTrainer(ReftTrainerForCausalLM):
#     def get_train_dataloader(self) -> DataLoader:
#         return make_dataloader(self.train_dataset, self._train_batch_size, self.data_collator, shuffle=True)


In [22]:
trainer = ReftTrainerForCausalLM(
    model=reft_model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    compute_metrics=None,
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [23]:
print(reft_model.model)

VLBartMultiTask(
  (model): VLBartModel(
    (shared): Embedding(50465, 768)
    (encoder): JointEncoder(
      (embed_tokens): Embedding(50465, 768)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768, padding_idx=1)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): DoraLinear(in_features=768, out_features=768, bias=True, lora_dim=128, lora_scale=1.0)
            (q_proj): DoraLinear(in_features=768, out_features=768, bias=True, lora_dim=128, lora_scale=1.0)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
         

In [24]:
trainer.train()

{'loss': 0.0005, 'learning_rate': 0.001, 'epoch': 0.5}
{'loss': 0.0009, 'learning_rate': 0.0, 'epoch': 1.0}
{'eval_loss': 3.827400360023603e-05, 'eval_runtime': 1.6025, 'eval_samples_per_second': 624.033, 'eval_steps_per_second': 1.248, 'epoch': 1.0}
{'train_runtime': 3.8445, 'train_samples_per_second': 260.11, 'train_steps_per_second': 0.52, 'train_loss': 0.0006809134501963854, 'epoch': 1.0}


TrainOutput(global_step=2, training_loss=0.0006809134501963854, metrics={'train_runtime': 3.8445, 'train_samples_per_second': 260.11, 'train_steps_per_second': 0.52, 'train_loss': 0.0006809134501963854, 'epoch': 1.0})
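
The numbers in this log follow directly from the settings above. As a quick check, assuming a single device and no gradient accumulation (as configured in training_args), 1,000 examples at a batch size of 512 give two optimizer steps, which matches global_step=2:

```python
import math

n_examples = 1000        # max_n_example passed to VLBartDataset
batch_size = 512         # args.batch_size
train_runtime = 3.8445   # seconds, from the log above

steps_per_epoch = math.ceil(n_examples / batch_size)
samples_per_second = n_examples / train_runtime
steps_per_second = steps_per_epoch / train_runtime

print(steps_per_epoch)               # 2
print(round(samples_per_second, 2))  # 260.11
print(round(steps_per_second, 2))    # 0.52
```

So this run updates the interventions only twice, which is worth keeping in mind when reading the generation results that follow.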

In [25]:
tokenizer("tree")

{'input_ids': [0, 21512, 2], 'attention_mask': [1, 1, 1]}

In [26]:
reft_model.model.eval()
for k,v in reft_model.interventions.items():
    _ = v[0].eval()


In [27]:
from compute_metrics import compute_metrics
generations, stats = compute_metrics(
    "vqa", "vqa", reft_model, tokenizer, eval_dataset, eval_dataset,
    '', 'test', args.batch_size, 
    data_collator,
    split=False, greedy_decoding=True, temperature=1.0, top_p=None, top_k=None
)


  0%|                                                                                                                        | 0/2 [00:16<?, ?it/s, em=0]


000000000300000000:000000/000000.000000100000
AtIStoppelisisislamisperiamsperiamisis
I'm a professional baseball player?
I/I/Em/impimpac//m/i/m
I/I/Em/impimpac//m/i/m
At the persons headwear?
000000000aditalitalitaliammer's handEl000000300000000
”” event event at event event event� event event moment event event –
South Africa’s tennis ball or frisbee?
”” event event at event event event� event event moment event event –
I (I)I (Limpitalitalitali000000300000000
I/I/Em/impimpac//m/i/m
”” event event at event event event� event event moment event event –
RestRestRest RestRest Rest Rest RestRestRest of RestRest ofRest Rest of
I/I/Em/impimpac//m/i/m
”” moment” event event event moment moment moment event moment event event
”” event event at event event event� event event moment event event –
is the business man doing in the picture?
Prostoppet well with his suit?
I/I/Em/impimpac//m/i/m
I/I/Em/impimpac//m/i/m
isis wearing a plain tie?
vqa: Judging from the dress, was this taken in a Latin 

  0%|                                                                                                                        | 0/2 [00:16<?, ?it/s, em=0]

”” event event at event event event� event event moment event event –
I am in the picture?
”” event event at event event event� event event moment event event –


I/I/Em/impimpac//m/i/m
TheRestRest RestRestRestRestfastRestRestrestRestResteventualRestRest
is closer to the camera?
I/I/Em/impimpac//m/i/m
I (not) in this photo?
I’m the man standing next to?
At a cell phone screen?
000000000adendendendammeri000000amperamperi
100000000300000000000:000000/0000001000000100

I.I.E.000000000.000I.000/000000
”” event event at event event event� event event moment event event –
is visible?
”” event event at event event event� event event moment event event –
I/I/Em/impimpac//m/i/m
Theoppeloppoppelappeloppelimpeloppetoppelfast
is on the man's shirt?
”” event event at event event event� event event moment event event –
I’m going to share the bananas?
I/I/Em/impimpac//m/i/m
I/I/Em/impimpac//m/i/m
is painted on the vase?
Stoppits match the real ones?
”” event event at event event event� event event mo

  0%|                                                                                                                        | 0/2 [00:16<?, ?it/s, em=0]

”” moment” event event event moment moment moment event moment event event
Ikendendendtendenddendendiendendisendend
isputputput on the stomach on the pizza?
I was eaten for breakfast?
peri from a delicatessen?
is on the potatoes?
““I’t“t’s’�
”” event event at event event event� event event moment event event –
Stoppeloppelis the was the fries?
”” event event at event event event� event event moment event event –
IRestRestRestitalitalitalisendendendelisisislamis
000000000amamis are these?
the pattern on the vase?
Eloppoppoppeloppelkundertendendendred.000000


On the toilet tank tank tank
GETGETTY: You can’t get the outside?
I/I/Em/impimpac//m/i/m
I have a glass door?
I.I.Julululmulmurturtiememem
I/I/Em/impimpac//m/i/m
I/I/Em/impimpac//m/i/m
I/I/Em/impimpac//m/i/m
RestRestRest RestRestRestrestRest Rest Rest RestRestrestoppeloppel
000000000isis coming from?
I/I/Em/impimpac//m/i/m
I am in this cake?
 100000000000.000000,000000ElisElispublicisEl
At the time of the event at this event?
00000

 50%|████████████████████████████████████████████████████████                                                        | 1/2 [00:16<00:16, 16.74s/it, em=0]

I/I/Em/impimpac//m/i/m
Iammer at the top?
I (I)I (Prostoppel )(ProstiProst
isis well-groomed?

At the top of the building?
I/I/Em/impimpac//m/i/m
turlurlurlputputputsputput
Eloppoppoppeloppelitalitalitalier,000000000,000
I/I/Em/impimpac//m/i/m
”” event event at event event event� event event moment event event –

Eloppoppoppeloppelitalitalitalisendendendisendis
At the bottom of the tower?
lurti – high high high rise?
At the time of this tower made of wood?
I/I/Em/impimpac//m/i/m
I/I/Em/impimpac//m/i/m
I/I/Em/impimpac//m/i/m
I/I/Eburtis/ElisElisispublicispublic
tendelasputput in this photo?
000000000meris standing on?000000300000000E000000ad

000000000/000000ad000000am000000.000000m000
is' fleece dirty?
appelappeloppeloppetoppel(eventful)000000000
I/I/Em/impimpac//m/i/m
is on the plates?
is?
”” moment” event event event moment moment moment event moment event event
Stoppeloppelappel.com.auknotnotElknot
?
putputputnotputputsputputgetputputtputput
I/I/Em/impimpac//m/i/m
”” moment” event e

 50%|████████████████████████████████████████████████████████                                                        | 1/2 [00:29<00:16, 16.74s/it, em=0]

I/I/Em/impimpac//m/i/m
I/I/Em/impimpac//m/i/m
is on the plate?
I/I/Em/impimpac//m/i/m
Stoppoppoppelisislammeriaditalismeriammer
000000000 at a game?
I/I/Em/impimpac//m/i/m
Iamam a movie?
I/RestRestRestendendendis /I /I/I/L
 the person wearing?
I/I/Em/impimpac//m/i/m
appendendenddisdisappenditaliamputs eating?
I am in the picture?
”” moment” event event event moment moment moment event moment event event
000000000300000000ad000000s down?0000001000000
RestRestRest Rest RestRest RestRestRestrestRestRestStopposedRestRest
At the end of the day?
”” event event at event event event� event event moment event event –
I/I/Em/impimpac//m/i/m
Stoppendendendadendenddendendtendendiendend
Ladendendend in Dendendi Denditali Stadend

I/I/Em/impimpac//m/i/m
”” event event at event event event� event event moment event event –
”” moment” event event event moment moment moment event moment event event
isisisendendendisenditalisendisismeriendend
000000000i000000adis (sadadis000000is000
AtI.I.000000000.000,

 50%|████████████████████████████████████████████████████████                                                        | 1/2 [00:30<00:16, 16.74s/it, em=0]

(public)public.publicpublicpublic.compublic.000.000publicpublic

I/I/Em/impimpac//m/i/m
”” event event at event event event� event event moment event event –
I/I/Em/impimpac//m/i/m
is the man cooking so many hot dogs?
eventevent new or ancient clock?
stoppoppel clock read?
IJSESESEJSEJIJEJSEpublicispublicis
ElisElisI have made personality?
The rest of them are in the picture?
000000000adurt taken at at?000000/000000urt at?
isntoppeloppelammlammeriammeriamtoppelesc
is this?
On the man doing in the picture?
I’m the name of the cartoon man presenting to the cartoon woman?
I/I/Em/impimpac//m/i/m
At the end of the day with the white arrow say?
EladisEladelisElisElaseEladElisisEl
What is above the "No Left Turn" sign?
000000000 at a dress?
I/I/Em/impimpac//m/i/m
Eladurtoppel fall forward?
I am a person and a woman?
I/I/Em/impimpac//m/i/m


I/I/Em/impimpac//m/i/m
I get the sun in the picture?
II was an infantElElIIIElElEl000000000I
Thedisdisdisappelas on the babies shirt?
”” moment” event eve

 50%|████████████████████████████████████████████████████████                                                        | 1/2 [00:30<00:16, 16.74s/it, em=0]

000000000adadel batter from?III/I/000000
is jugs are visible?
Stoppel the picket fence?
I/I/Em/impimpac//m/i/m
disappear (disdisappevent?
000000000.000000,000,100,000.100.000.
”” event event at event event event� event event moment event event –
At least one person on TV wearing a tie?
appear above the TV set?
”” event event at event event event� event event moment event event –
At this woman's dress?
I’tIStammer“I“Lammeri
AtRestRestRestInRestRestElisElisRestRestisElElis
Stoppelis/Stadelis / Stoppeli/Stopp
is on the man's bike?
I/I/Em/impimpac//m/i/m
”” event event at event event event� event event moment event event –
Southwest police official's plane?
”” event event at event event event� event event moment event event –
I/I/Em/impimpac//m/i/m
000000000 at a game?
”” moment” event event event moment moment moment event moment event event
000000000I was made made of?
is in a field?
is there?

I had been to this intersection?

RestRestRest RestRestRestoppelis /StoppeliRestResti
I’m one 

 50%|████████████████████████████████████████████████████████                                                        | 1/2 [00:30<00:16, 16.74s/it, em=0]

”” event event at event event event� event event moment event event –
AtRestRestRestendendendisisis/Stoppadisendend
”” event event at event event event� event event moment event event –
I/I/Em/impimpac//m/i/m


What is sitting next to the food?
I/I/Em/impimpac//m/i/m
badgetgettammertamput food?
On the items on the truck?
I/I/Em/impimpac//m/i/m
I/I/Em/impimpac//m/i/m
is on the flatbed?
I am the house is this?
100000000 made made out of?
I (I) (I/I/000000000(I)000000
NoRestRestRest RestRestRestrestoppoppeloppendendendElEl
”” event event at event event event� event event moment event event –

disrespectfuldisrespectrespectfulrespectrespectrespectsrespectrespectablerespectrespectably
Stoppoppoppeloppelastoppeladeloppetoppeli
I/I/Em/impimpac//m/i/m
”” event event at event event event� event event moment event event –
I (not) in this photo?
At the time of the event at her neck?
I/I/Em/impimpac//m/i/m
I/I/Em/impimpac//m/i/m

I am in the picture?
I/I/Em/impimpac//m/i/m
Eladisendendendisendiams

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:30<00:00, 15.29s/it, em=0]

Eloppoppoppeloppel(public)Eloppelvisit(s

I/I/Em/impimpac//m/i/m

Iwilliamsiamsundertiamsrespectiamswilliamtiam
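
The progress bars above report em=0 throughout. Assuming em stands for exact match, the metric can be sketched as the fraction of generations that equal their reference answer after light normalization. This is an illustration only: `exact_match` is a hypothetical helper, and the compute_metrics function used above may normalize or score differently.

```python
def exact_match(predictions, references):
    """Fraction of predictions equal to their reference, ignoring case and outer whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(predictions)


print(exact_match(["yes", "2"], ["Yes ", "3"]))  # 0.5
```

Under this definition, a score of 0 means that no generation matched its reference answer.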





### Next Steps

1. Speed up data loading (an open-ended performance problem).
2. Verify the intervention locations for VL-BART.
3. Evaluate the fine-tuned model on the VQA eval/test splits.
4. Manually validate the fine-tuned model's outputs.