## VLBart PyReft Integration
This is a preliminary attempt at integrating VLBart with PyReft.
### Instructions
1. Use the peterwz-llava branch of both Pyvene and PyReft.
2. Go to pyreft/examples/vlbart/DoRA/image_video_text_understanding and install the package versions pinned in the requirements.txt there. Note that DoRA requires a much lower transformers version.
3. Download the dataset following the instructions in pyreft/examples/vlbart/DoRA/image_video_text_understanding/README.md: go to the Google Drive link, download the processed CLIP features, and put them in pyreft/examples/vlbart/DoRA/datasets/. This notebook only uses the VQA features.
4. In image_video_text_understanding/download_backbones.py, change the cache directory to the directory where you store the models.
5. Try running image_video_text_understanding/VL-T5/scripts/image/dora.sh to check that your DoRA (VLBart model) setup works.
6. Run this notebook.
### Known Issues
1. Directly plugging the DoRA VLBart model in here gives a VQA score of only about 0.20.
2. Training is fast for the first few steps, then becomes very slow. I suspect this is related to data-loading cache behavior. Loading the dataset in batches, instead of the lazy loading we currently do with ReftDataloaderDataset, may be a better option.
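
One way to test the caching hypothesis in issue 2 is to build each example once and reuse it on later epochs. The sketch below is a hypothetical wrapper (the name `CachedDataset` is not part of PyReft or DoRA). It assumes the wrapped dataset is map-style (`__len__` and `__getitem__`), that its items are deterministic, and that the cached examples fit in memory, which is plausible for the 100-example subset used here but probably not for the full split.

```python
class CachedDataset:
    """Hypothetical wrapper that memoizes the items of a map-style dataset."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._cache = {}

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        # Build the example on first access only; later epochs reuse the cached copy.
        if index not in self._cache:
            self._cache[index] = self.dataset[index]
        return self._cache[index]

    def warm_up(self):
        """Eagerly load every example up front instead of lazily during training."""
        for index in range(len(self)):
            self[index]
        return self
```

If training speed stays flat after wrapping the task dataset this way, the slowdown more likely comes from somewhere other than per-item loading.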

In [1]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), 'DoRA/image_video_text_understanding/VL-T5/src')))

In [2]:
import vqa_clip_data

In [3]:
vqa_args = {'RefCOCO_BUTD': False,
 'RefCOCO_GT': False,
 'adam_beta1': 0.9,
 'adam_beta2': 0.999,
 'adam_eps': 1e-06,
 'add_adapter_cross_attn': True,
 'add_layer_norm_after_adapter': False,
 'add_layer_norm_before_adapter': False,
 'additional_visual_embedding_layers': 0,
 'answer_normalize': False,
 'backbone': 'facebook/bart-base',
 'batch_size': 1,
 'caption_cocoonly': True,
 'caption_only': False,
 'classifier': False,
 'clip_grad_norm': 5.0,
 'cls_task': 'tinyimagenet',
 'coco_only': False,
 'comment': '',
 'decoder_prompt_len': 0,
 'deepspeed': None,
 'distributed': False,
 'do_lower_case': False,
 'dora_simple': False,
 'downsample': True,
 'dropout': 0.00,
 'dry': False,
 'efficient_unique_hyper_net': False,
 'encoder_prompt_len': 0,
 'epochs': 20,
 'expand_vis_embedding': False,
 'factorized_phm': True,
 'feat_dim': 2048,
 'feature_type': 'RN101', # RN101
 'fp16': False,
 'freeze_bn_statistics': False,
 'freeze_ln_statistics': False,
 'from_scratch': False,
 'full_determinism': False,
 'gen_max_length': 20,
 'gpu': 0,
 'gradient_accumulation_steps': 1,
 'ground_upsample': 1,
 'ground_weight': 1,
 'hypercomplex_division': 4,
 'image_size': '(224,224)',
 'individual_vis_layer_norm': True,
 'itm_cocoonly': True,
 'lambda_z': 0.001,
 'load': None,
 'load_lxmert_qa': None,
 'local_rank': 0,
 'log_train_accuracy': False,
 'lora_alpha': 32,
 'lora_dim': 128,
 'lora_settings': True,
 'losses': 'lm,obj,attr,feat',
 'low_rank_rank': 1,
 'lr': 0.01,
 'max_n_boxes': 36,
 'max_text_length': 20,
 'mid_dim': 768,
 'multiGPU': True,
 'multitask_sampling': 'roundrobin',
 'n_boxes': 36,
 'n_ground': 1,
 'n_image_tokens': 4,
 'no_prefix': False,
 'num_beams': 5,
 'num_workers': 4,
 'obj_mask_rate': 0.15,
 'oneddownsample': False,
 'optim': 'adamw',
 'optimizer': 'adamw',
 'oscar_tags': False,
 'output': 'snap/VLBart_multitask/tune+lr1e-2_plzplz2',
 'phm_init_range': 0.01,
 'phm_rank': 1,
 'pos_dim': 4,
 'post_prompt': '',
 'prefix': None,
 'project_name': 'RN101_LMsingle_dora_128_bs300_image224_lora_settings',
 'projected_task_embedding_dim': -1,
 'prompt': 'vqa: ',
 'raw_label': False,
 'reduction_factor': 16,
 'remove_bn_vis_adapter': False,
 'run_name': 'tune+lr1e-2_plzplz2',
 'seed': 9595,
 'share_down_sampler': False,
 'share_up_sampler': False,
 'share_vis_lang_layer_norm': False,
 'shared_phm_rule': True,
 'shared_phm_rule_over_tasks': False,
 'shuffle_boxes': False,
 'single_vqa_prefix': False,
 'sparse_sample': False,
 'submit': False,
 'tasks': 'vqa',
 'test': None,
 'test_answerable': False,
 'test_only': False,
 'testing': False,
 'tokenizer': None,
 'track_z': False,
 'train': 'train',
 'train_topk': -1,
 'unfreeze_batch_norms': False,
 'unfreeze_bias': False,
 'unfreeze_decoder_layer_norms': False,
 'unfreeze_encoder_layer_norms': False,
 'unfreeze_language_model': False,
 'unfreeze_layer_norms': False,
 'unfreeze_lm_head': False,
 'unfreeze_vis_encoder': False,
 'unfreeze_vis_last_layer': False,
 'unique_hyper_net': False,
 'use_adam_for_visual': False,
 'use_adapter': False,
 'use_attn_prefix': False,
 'use_compacter': False,
 'use_data_augmentation': False,
 'use_dora': False,
 'use_hyperformer': False,
 'use_lm_head_adapter': False,
 'use_lora': False,
 'use_lradapter': False,
 'use_separate_optimizer_for_visual': False,
 'use_single_adapter': False,
 'use_single_lora': False,
 'use_single_prompt': False,
 'use_tasks_prompts': True,
 'use_vis_adapter': False,
 'use_vis_layer_norm': True,
 'use_vis_order_embedding': True,
 'use_vision': True,
 'valid': 'valid',
 'valid_batch_size': 1,
 'valid_topk': -1,
 'vis_adapter_type': 'middle-bottleneck',
 'vis_lr': 0.0001,
 'vis_pointer': False,
 'vis_pooling_output': False,
 'vis_reduction_factor': 2,
 'vis_use_transformer': False,
 'vis_weight_decay': 0.01,
 'warmup_ratio': 0.1,
 'weight_decay': 0.01,
 'word_mask_rate': 0.15,
 'world_size': 1}

In [4]:
from types import SimpleNamespace
args = SimpleNamespace(**vqa_args)

In [5]:
train_loaders = []
vqa_train_loader = vqa_clip_data.get_loader(
    args,
    split='karpathy_train', mode='train', batch_size=args.batch_size,
    distributed=args.distributed, gpu=0,
    workers=args.num_workers,
    topk=args.train_topk,
)
train_loaders.append(vqa_train_loader)

Load 605102 data from split(s) karpathy_train.
# Answers: 3129
Data sources:  ['karpathy_train']
Loaded 605102 data from karpathy_train
# all sentences: 605102




In [6]:
from multitask import Trainer
trainer = Trainer(args, vqa_train_loader, None, None, train=True)
# trainer.train()

Building Model at GPU 0
Model Launching at GPU 0
model.encoder.visual_embedding.feat_embedding.0.weight is trainable...
model.encoder.visual_embedding.feat_embedding.0.bias is trainable...
model.encoder.visual_embedding.feat_embedding.1.weight is trainable...
model.encoder.visual_embedding.feat_embedding.1.bias is trainable...
model.encoder.visual_embedding.absolute_vis_pos_embedding.0.weight is trainable...
model.encoder.visual_embedding.absolute_vis_pos_embedding.0.bias is trainable...
model.encoder.visual_embedding.absolute_vis_pos_embedding.1.weight is trainable...
model.encoder.visual_embedding.absolute_vis_pos_embedding.1.bias is trainable...
model.encoder.visual_embedding.img_order_embedding.weight is trainable...
VLBartMultiTask(
  (model): VLBartModel(
    (shared): Embedding(50465, 768)
    (encoder): JointEncoder(
      (embed_tokens): Embedding(50465, 768)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768, padding_idx=1)
      (layers): ModuleList(
        (



In [7]:
from pyreft.dataset import ReftDataset, ReftDataloaderDataset
from pyreft import (
    ReftTrainerForCausalLM, 
    ReftDataCollator,
    LoreftIntervention,
    TaskType,
    ReftConfig,
    get_reft_model,
)
import torch

In [8]:
from transformers import BartTokenizer, TrainingArguments
tokenizer = trainer.tokenizer

In [9]:
class VLBartDataset(ReftDataloaderDataset):
    """
    A ReFT dataset that wraps the dataset behind an existing VL-BART
    (DoRA) dataloader instead of loading one from HF or a local file.
    Items are already tokenized; `tokenize` only adds the decoded
    instruction string and reports the last text position.
    """
    def load_dataset(self):
        """Reuse the dataset of the wrapped dataloader (or its first max_n_example items)."""

        self.task_dataset = self.dataloader.dataset
        self.collate_fn = self.task_dataset.collate_fn
        self.fields_to_pad = ["input_ids", "target_ids"]
        self.pad_mode = "none"

        # keep the first max_n_example examples if specified (not a random sample)
        if self.max_n_example is not None:
            self.task_dataset = torch.utils.data.Subset(self.task_dataset, list(range(self.max_n_example)))

        # keep a pointer to the raw dataset for access to raw strings (non-train splits only)
        self.raw_dataset = self.task_dataset if self.data_split != "train" else None
        return self.task_dataset

    def preprocess(self, kwargs):
        self.input_field = "input_ids"
        self.label_field = "target_ids"

    def tokenize(self, data_item):
        result = {**data_item}
        # result["input_length"] += 1
        # result["target_length"] += 1
        result["instruction"] = tokenizer.decode(result["input_ids"], skip_special_tokens=True)

        # TODO: whether to add "-1"?
        last_position = len(data_item[self.input_field]) 
        return result, last_position

In [10]:
layers = [0]
position = "f1+l1"
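
The position string "f1+l1" asks for interventions on the first 1 and last 1 tokens. The sketch below shows how such a string can map to token indices. It is a simplified stand-in written for illustration, not PyReft's own helper: the function names are hypothetical, and it ignores padding and weight sharing.

```python
def parse_position(position):
    """Split a string such as "f1+l1" into (first_n, last_n)."""
    first_n, last_n = 0, 0
    for part in position.split("+"):
        if part.startswith("f"):
            first_n = int(part[1:])
        elif part.startswith("l"):
            last_n = int(part[1:])
    return first_n, last_n


def text_positions(position, sequence_length):
    """Return the token indices selected by `position` for one sequence."""
    first_n, last_n = parse_position(position)
    first = list(range(min(first_n, sequence_length)))
    last = list(range(max(sequence_length - last_n, 0), sequence_length))
    return first + last


print(text_positions("f1+l1", 8))  # [0, 7]
```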

In [11]:
train_dataset = VLBartDataset(
    "vqa", 
    tokenizer, data_split="train", 
    dataloader=vqa_train_loader,
    max_n_example=100,
    **{"num_interventions": len(layers), "position": position, 
       "share_weights": True, "test_split": "validation"}
)
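# Note: the eval dataset reuses vqa_train_loader, so it holds the same 100 examples as train_dataset.
eval_dataset = VLBartDataset(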
    "vqa", 
    tokenizer, data_split="val", 
    dataloader=vqa_train_loader,
    max_n_example=100,
    **{"num_interventions": len(layers), "position": position, 
       "share_weights": True, "test_split": "validation"}
)

In [12]:
model = trainer.model

In [13]:
print(model.config)

BartConfig {
  "RefCOCO_BUTD": false,
  "RefCOCO_GT": false,
  "_name_or_path": "facebook/bart-base",
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "adam_beta1": 0.9,
  "adam_beta2": 0.999,
  "adam_eps": 1e-06,
  "adapter_config": null,
  "add_adapter_cross_attn": true,
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "add_layer_norm_after_adapter": false,
  "add_layer_norm_before_adapter": false,
  "additional_visual_embedding_layers": 0,
  "answer_normalize": false,
  "architectures": [
    "BartModel"
  ],
  "attention_dropout": 0.0,
  "backbone": "facebook/bart-base",
  "batch_size": 1,
  "bos_token_id": 0,
  "caption_cocoonly": true,
  "caption_only": false,
  "classif_dropout": 0.1,
  "classifier": false,
  "classifier_dropout": 0.0,
  "clip_grad_norm": 5.0,
  "cls_task": "tinyimagenet",
  "coco_only": false,
  "comment": "",
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers"

In [14]:
train_dataset.collate_fn

<bound method VQAFineTuneDataset.collate_fn of <vqa_clip_data.VQAFineTuneDataset object at 0x7fcb812c4e20>>

In [15]:
print(vqa_train_loader.dataset[0].keys())
print(vqa_train_loader.dataset[0]["all_answers"])
print(tokenizer.decode(vqa_train_loader.dataset[0]["target_ids"], skip_special_tokens=True))

dict_keys(['args', 'img_id', 'vis_feats', 'boxes', 'question_id', 'sent', 'input_ids', 'input_length', 'is_topk_optimal', 'label', 'answer', 'score', 'all_answers', 'target_ids', 'target_length'])
['net']
net


In [16]:
# from transformers import DataCollatorForSeq2Seq
# data_collator_fn = DataCollatorForSeq2Seq(
#     tokenizer=tokenizer,
#     model=model,
#     label_pad_token_id=-100,
#     padding="longest"
# )
import transformers
def keep_intervention_locations(datum):
    new_data = {}
    new_data["input_ids"] = datum["input_ids"]
    # new_data["instruction"] = datum["instruction"]
    new_data["intervention_locations"] = datum["intervention_locations"]
    new_data["attention_mask"] = datum["attention_mask"]
    # print(new_data["input_ids"].shape, new_data["attention_mask"])
    return new_data

def custom_collate_fn(data):
    collate_fn_1 = train_dataset.collate_fn
    collate_fn_2 = transformers.DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
        label_pad_token_id=-100,
        padding="longest"
    )
    # for item in data:
    #     print(item["input_ids"].shape)
    output_1 = collate_fn_1(data)
    custom_data = [keep_intervention_locations(item) for item in data]
    output_2 = collate_fn_2(custom_data)
    output = output_1
    output["intervention_locations"] = output_2["intervention_locations"]
    # print(output["intervention_locations"].shape)
    # print(torch.max(output["intervention_locations"]))
    # Offset image tokens' concatenation
    output["intervention_locations"][:,:,-1] += args.n_boxes
    # output["intervention_locations"] -= 1
    # print(torch.max(output["intervention_locations"]))
    # print(output["intervention_locations"])

    # output["id"] = output_2["id"]
    # output["labels"] = output_2["labels"]
    
    # output["attention_mask"] = output_2["attention_mask"]
    # del output["attention_mask"]

    ids = []
    instructions = []
    for d in data:
        ids.append(d["id"])
        instructions.append(d["instruction"])
    import numpy as np
    output["id"] = np.array(ids)
    output["instruction"] = instructions
    
    output["logits"] = output["labels"]
    output["labels"] = output["target_ids"]
    # output["instruction"] = tokenizer.batch_decode(output["input_ids"], skip_special_tokens=True)
    # print("Output Keys:", output.keys())
    
    # print("Input IDs:", output["input_ids"], tokenizer.batch_decode(output["input_ids"], skip_special_tokens=True))
    # print("Labels:", output["labels"].shape)
    # labels = [[token for token in sequence if token != -100] for sequence in output["labels"].tolist()]
    # print("Labels:", tokenizer.batch_decode(labels, skip_special_tokens=True))
    # print("Question IDs:", output["question_ids"])
    # print("Answers:", output["answers"])
    # print("All answers:", output["all_answers"])
    # print("Scores:", output["scores"])

    return output

data_collator = ReftDataCollator(data_collator=custom_collate_fn)
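
The line `output["intervention_locations"][:,:,-1] += args.n_boxes` shifts only the last position of each intervention by the number of image boxes. The sketch below reproduces that arithmetic on nested Python lists shaped [batch, interventions, positions]; the function name is hypothetical. Whether the shifted index lands on the intended token depends on how the encoder orders text and visual tokens, which the notebook still lists as something to check.

```python
def shift_last_location(intervention_locations, n_boxes):
    """Add n_boxes to the last position of every intervention."""
    shifted = []
    for example in intervention_locations:  # batch dimension
        shifted.append([
            positions[:-1] + [positions[-1] + n_boxes]  # only the last index moves
            for positions in example  # intervention dimension
        ])
    return shifted


# One example, one intervention, positions for the first and last text token.
locations = [[[0, 7]]]
print(shift_last_location(locations, n_boxes=36))  # [[[0, 43]]]
```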

In [17]:
rank = 1
dropout=0.00


In [18]:
representations = [{
    "layer": l, "component": "block_output",
    "low_rank_dimension": rank,
    "intervention": LoreftIntervention(
        embed_dim=model.config.d_model, low_rank_dimension=rank,
        dropout=dropout, dtype=torch.float32, act_fn=None, device="cuda",
        add_bias=True
    )
} for l in layers]
task_type=TaskType.CAUSAL_LM

reft_config = ReftConfig(representations=representations)
empty_reft_config = ReftConfig(representations=[])

In [19]:
reft_model = get_reft_model(model, reft_config)
empty_reft_model = get_reft_model(model, empty_reft_config)
empty_reft_model.print_trainable_parameters()
reft_model.print_trainable_parameters()

trainable intervention params: 0 || trainable model params: 0
model params: 141,156,864 || trainable%: 0.0
trainable intervention params: 1,537 || trainable model params: 0
model params: 141,156,864 || trainable%: 0.0010888595541482134


In [20]:
training_args = TrainingArguments(
    output_dir="random",
    run_name="random",
    num_train_epochs=100,
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=args.batch_size,
    gradient_accumulation_steps=1,
    evaluation_strategy="no",
    # evaluation_strategy="epoch",
    save_strategy="no",
    metric_for_best_model=None,
    load_best_model_at_end=False,
    logging_strategy="epoch",
    save_total_limit=1, # for GLUE, it will save 2 at max.
    logging_steps=1,
    learning_rate=1e-3,
    warmup_ratio=0.1,
    optim="adamw_torch",
    weight_decay=0.01,
    # lr_scheduler="none",
    lr_scheduler_type="constant",
    report_to="none",
    use_cpu=False,
    seed=42,
    # until HF supports ReFT, this remains False! :)
    remove_unused_columns=False
)

In [21]:
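from pyvene import IntervenableModel
# from overrides import overrides

# apex is optional; `amp` is only needed below when the Trainer runs with use_apex=True.
try:
    from apex import amp
except ImportError:
    amp = None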

class MyTrainer(ReftTrainerForCausalLM):
    # @overrides
    def training_step(self, model, batch):
        # print("My trainer step")
        batch = self._prepare_inputs(batch)

        # print("Batch:", batch.keys())
        device = batch['input_ids'].device

        batch = model.model.vis_forward(batch, device)
        task = batch["task"]

        vis_feats = batch['vis_feats']
        input_ids = batch['input_ids']
        vis_pos = batch['boxes']

        lm_labels = batch["target_ids"].to(device)

        inputs = {**batch}
        inputs["return_dict"] = True
        inputs["reduce_loss"] = False
        inputs["vis_inputs"] = (vis_feats, vis_pos)
        # print(inputs.keys())

        with self.compute_loss_context_manager():
            loss = self.compute_loss(model, inputs)

        del inputs
        torch.cuda.empty_cache()

        if self.args.n_gpu > 1:
            loss = loss.mean()  # mean() to average on multi-gpu parallel training

        if self.use_apex:
            with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                scaled_loss.backward()
        else:
            self.accelerator.backward(loss)

        return loss.detach() / self.args.gradient_accumulation_steps

    def compute_loss(
        self,
        intervenable: IntervenableModel,
        inputs,
        return_outputs=False
    ):
        
        lm_labels = inputs["target_ids"]
        # print("KEYS:", inputs.keys())
        # print("LABELS:", lm_labels)
        # print("SCORES:", inputs["scores"])
        # print("LOCS:", inputs["intervention_locations"])
        # print("INPUT_IDS:", inputs["input_ids"])
        # print("VIS_INPUTS:", inputs["vis_inputs"][0].shape, inputs["vis_inputs"][1].shape)
        
        _, cf_outputs = intervenable(
            {
                "input_ids": inputs["input_ids"],
                # "attention_mask": inputs["attention_mask"],
                "vis_inputs": inputs["vis_inputs"],
                "task": "vqa",
                
            },
            unit_locations={"sources->base": (
                None,
                inputs["intervention_locations"].permute(1, 0, 2).tolist()
            )},
            labels=inputs["target_ids"],
            subspaces=None,
        )
        # return
        loss = cf_outputs.loss
        # print("CF OUTPUTS:", cf_outputs.keys(), len(cf_outputs["loss"]))
        
        
        lm_mask = (lm_labels != -100).float()
        # print("LM MASK:", lm_mask)
        # print("SCORES:", inputs["scores"])
        B, L = lm_labels.size()

        loss = loss.view(B, L) * lm_mask

        loss = loss.sum(dim=1) / lm_mask.sum(dim=1).clamp(min=1)  # B

        loss = loss * inputs["scores"]

        loss = loss.mean()
        return loss
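
The reduction at the end of `compute_loss` masks padded target positions, averages the remaining token losses per example, weights each example by its VQA score, and then averages over the batch. The sketch below mirrors that arithmetic on plain Python lists so it can be checked by hand; the function name and the numbers are illustrative only.

```python
IGNORE_INDEX = -100  # label value marking padded target positions


def masked_weighted_loss(token_losses, labels, scores):
    """Mirror the reduction in compute_loss on plain Python lists."""
    per_example = []
    for losses, label_row, score in zip(token_losses, labels, scores):
        mask = [1.0 if label != IGNORE_INDEX else 0.0 for label in label_row]
        total = sum(loss * m for loss, m in zip(losses, mask))
        count = max(sum(mask), 1.0)  # same role as .clamp(min=1)
        per_example.append(total / count * score)
    return sum(per_example) / len(per_example)


token_losses = [[2.0, 4.0, 9.0], [1.0, 3.0, 5.0]]
labels = [[10, 11, -100], [12, 13, 14]]
scores = [1.0, 0.5]
print(masked_weighted_loss(token_losses, labels, scores))  # 2.25
```

The first example averages to 3.0 over its two unmasked tokens, the second averages to 3.0 and is halved by its score, so the batch mean is (3.0 + 1.5) / 2.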
        


In [22]:
trainer = MyTrainer(
    model=reft_model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    compute_metrics=None,
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [23]:
print(reft_model.model)

VLBartMultiTask(
  (model): VLBartModel(
    (shared): Embedding(50465, 768)
    (encoder): JointEncoder(
      (embed_tokens): Embedding(50465, 768)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768, padding_idx=1)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine

In [24]:
trainer.train()

{'loss': 5.2297, 'learning_rate': 0.001, 'epoch': 1.0}
{'loss': 4.2623, 'learning_rate': 0.001, 'epoch': 2.0}
{'loss': 3.377, 'learning_rate': 0.001, 'epoch': 3.0}
{'loss': 2.6658, 'learning_rate': 0.001, 'epoch': 4.0}
{'loss': 2.1588, 'learning_rate': 0.001, 'epoch': 5.0}
{'loss': 1.8549, 'learning_rate': 0.001, 'epoch': 6.0}
{'loss': 1.8008, 'learning_rate': 0.001, 'epoch': 7.0}
{'loss': 1.8097, 'learning_rate': 0.001, 'epoch': 8.0}
{'loss': 1.5434, 'learning_rate': 0.001, 'epoch': 9.0}
{'loss': 1.5932, 'learning_rate': 0.001, 'epoch': 10.0}
{'loss': 1.48, 'learning_rate': 0.001, 'epoch': 11.0}
{'loss': 1.4633, 'learning_rate': 0.001, 'epoch': 12.0}
{'loss': 1.4447, 'learning_rate': 0.001, 'epoch': 13.0}
{'loss': 1.3312, 'learning_rate': 0.001, 'epoch': 14.0}
{'loss': 1.3842, 'learning_rate': 0.001, 'epoch': 15.0}
{'loss': 1.3231, 'learning_rate': 0.001, 'epoch': 16.0}
{'loss': 1.3341, 'learning_rate': 0.001, 'epoch': 17.0}
{'loss': 1.3629, 'learning_rate': 0.001, 'epoch': 18.0}
{'lo

TrainOutput(global_step=10000, training_loss=1.2925294776916505, metrics={'train_runtime': 282.7325, 'train_samples_per_second': 35.369, 'train_steps_per_second': 35.369, 'train_loss': 1.2925294776916505, 'epoch': 100.0})

In [25]:
# tokenizer("tree")

In [26]:
# import pyreft
# reft_model = pyreft.ReftModel.load(
#     "temp-outputs", model
# )
# reft_model.set_device("cuda")

In [27]:
reft_model.model.eval()
for k,v in reft_model.interventions.items():
    _ = v[0].eval()


In [28]:
from compute_metrics import compute_metrics
# Note: train_dataset is passed as the evaluation data, so this scores the 100 training examples.
generations, stats = compute_metrics(
    "vqa", "vqa", reft_model, tokenizer, train_dataset, train_dataset,
    '', 'test', 1, # batch_size
    data_collator,
    split=False, greedy_decoding=True, temperature=1.0, top_p=None, top_k=None
)


  2%|██▏                                                                                                          | 2/100 [00:00<00:04, 19.76it/s, em=0]

yes
vqa: What is this photo taken looking through?  |  yes  |  net
yes
vqa: What position is this man playing?  |  yes  |  catcher


  5%|█████▎                                                                                                     | 5/100 [00:00<00:04, 20.70it/s, em=0.4]

white
vqa: What color is the players shirt?  |  white  |  orange
yes
white


  5%|█████▎                                                                                                   | 5/100 [00:00<00:04, 20.70it/s, em=0.333]

white
vqa: What is the person doing?  |  white  |  skiing


  5%|█████▎                                                                                                   | 5/100 [00:00<00:04, 20.70it/s, em=0.286]

white
vqa: What color is the persons headwear?  |  white  |  red


  8%|████████▍                                                                                                | 8/100 [00:00<00:06, 14.42it/s, em=0.333]

yes
vqa: What is in the person's hand?  |  yes  |  frisbee
yes


 11%|███████████▍                                                                                            | 11/100 [00:00<00:05, 17.21it/s, em=0.417]

yes
vqa: Is the dog looking at a tennis ball or frisbee?  |  yes  |  frisbee
yes
yes


 14%|██████████████▌                                                                                         | 14/100 [00:00<00:04, 19.14it/s, em=0.357]

yes
vqa: What is the white streak?  |  yes  |  snow
yes
vqa: Is the window open?  |  yes  |  no


 17%|█████████████████▋                                                                                      | 17/100 [00:00<00:04, 20.52it/s, em=0.353]

yes
vqa: What color is the toothbrush?  |  yes  |  white
yes
vqa: What is the child doing?  |  yes  |  brushing teeth
yes


 17%|█████████████████▋                                                                                      | 17/100 [00:00<00:04, 20.52it/s, em=0.368]

no
yes
vqa: What is the business man doing in the picture?  |  yes  |  standing


 20%|████████████████████▊                                                                                   | 20/100 [00:01<00:03, 21.59it/s, em=0.409]

yes
vqa: Does his tie pair well with his suit?  |  yes  |  no
no
no


 23%|███████████████████████▉                                                                                | 23/100 [00:01<00:03, 22.16it/s, em=0.375]

no
vqa: Is the man wearing a plain tie?  |  no  |  yes
yes
vqa: Judging from the dress, was this taken in a Latin American country?  |  yes  |  no


 26%|███████████████████████████                                                                             | 26/100 [00:01<00:03, 23.06it/s, em=0.407]

no
vqa: What colors are shown in this picture?  |  no  |  black and white
no
no


 29%|██████████████████████████████▏                                                                         | 29/100 [00:01<00:03, 23.53it/s, em=0.414]

yes
no
vqa: What is this man riding on?  |  no  |  skateboard


 32%|█████████████████████████████████▎                                                                      | 32/100 [00:01<00:02, 24.40it/s, em=0.424]

2
yes
vqa: What color is his hat?  |  yes  |  backwards
no
yes
vqa: What color is the jacket?  |  yes  |  green and black


 35%|████████████████████████████████████▍                                                                   | 35/100 [00:01<00:02, 24.44it/s, em=0.429]

yes
yes
vqa: What is the man riding on?  |  yes  |  motorcycle


 38%|███████████████████████████████████████▌                                                                | 38/100 [00:01<00:02, 24.57it/s, em=0.395]

yes
vqa: What is on the pillow?  |  yes  |  nothing
yes
vqa: How many pieces of furniture which are used for sleeping are featured in this picture  |  yes  |  2
yes
vqa: Are the walls done in a summery color?  |  yes  |  no


 41%|███████████████████████████████████████████                                                              | 41/100 [00:01<00:02, 24.86it/s, em=0.39]

yes
vqa: Is the curtain patterned?  |  yes  |  no
yes
vqa: What is sitting on the bench?  |  yes  |  purse
yes


 44%|█████████████████████████████████████████████▊                                                          | 44/100 [00:01<00:02, 25.02it/s, em=0.364]

yes
vqa: What is this person doing?  |  yes  |  skiing
yes
vqa: How many people are in this image?  |  yes  |  1
yes
vqa: Is there a shadow of a tree in the foreground?  |  yes  |  no


 47%|█████████████████████████████████████████████████▎                                                       | 47/100 [00:02<00:02, 24.96it/s, em=0.34]

yes
vqa: What is the man doing?  |  yes  |  skiing
yes
vqa: What color is the sky?  |  yes  |  gray
yes
vqa: What is the person wearing?  |  yes  |  skis


 50%|████████████████████████████████████████████████████▌                                                    | 50/100 [00:02<00:01, 25.48it/s, em=0.34]

yes
vqa: Did someone forget his luggage in the snow?  |  yes  |  no
blue
vqa: What color is his coat?  |  blue  |  blue and white
yes


 53%|███████████████████████████████████████████████████████                                                 | 53/100 [00:02<00:01, 25.54it/s, em=0.321]

yes
vqa: What is she holding?  |  yes  |  poles
yes
vqa: Is the person wearing a hat?  |  yes  |  no
yes
vqa: What is the dog riding on?  |  yes  |  surfboard


 53%|███████████████████████████████████████████████████████                                                 | 53/100 [00:02<00:01, 25.54it/s, em=0.345]

yes
yes


 56%|██████████████████████████████████████████████████████████▏                                             | 56/100 [00:02<00:01, 24.77it/s, em=0.345]

yes
yes
vqa: What is in the water?  |  yes  |  dog
yes
vqa: What does the green light, on the TV, indicate?  |  yes  |  power


 59%|█████████████████████████████████████████████████████████████▉                                           | 59/100 [00:02<00:01, 24.36it/s, em=0.35]

no
laptop
vqa: What room of the house is this?  |  laptop  |  living room


 62%|████████████████████████████████████████████████████████████████▍                                       | 62/100 [00:02<00:01, 24.11it/s, em=0.333]

laptop
vqa: What is the size of the TV?  |  laptop  |  large
yes
vqa: Is the room messy?  |  yes  |  no
no
vqa: Is this a TV screen?  |  no  |  yes


 65%|███████████████████████████████████████████████████████████████████▌                                    | 65/100 [00:02<00:01, 23.77it/s, em=0.323]

laptop
vqa: How big is the TV?  |  laptop  |  big
laptop
vqa: What companion object to the TV can be seen in the bottom right of the  |  laptop  |  remote


 68%|██████████████████████████████████████████████████████████████████████▋                                 | 68/100 [00:02<00:01, 23.34it/s, em=0.324]

laptop
laptop
vqa: What is above the TV?  |  laptop  |  ceiling
laptop
vqa: What is on the display?  |  laptop  |  website


 68%|██████████████████████████████████████████████████████████████████████▋                                 | 68/100 [00:03<00:01, 23.34it/s, em=0.314]

laptop
vqa: Is there a laptop in the image?  |  laptop  |  yes
no
vqa: Is it a monitor or a screen projection?  |  no  |  monitor


 71%|█████████████████████████████████████████████████████████████████████████▊                              | 71/100 [00:03<00:01, 23.26it/s, em=0.301]

laptop
vqa: What is on the TV screen?  |  laptop  |  computer
laptop
vqa: What is the title of the presentation in the picture?  |  laptop  |  can't see
white
vqa: What color is the bear?  |  white  |  black


 74%|████████████████████████████████████████████████████████████████████████████▉                           | 74/100 [00:03<00:01, 23.63it/s, em=0.307]

yes
yes
vqa: What is this?  |  yes  |  bear


 77%|████████████████████████████████████████████████████████████████████████████████                        | 77/100 [00:03<00:00, 23.94it/s, em=0.316]

yes
yes
vqa: Has the sheep recently been shaved?  |  yes  |  no
yes
vqa: How many sheeps are this?  |  yes  |  3
yes


 80%|███████████████████████████████████████████████████████████████████████████████████▏                    | 80/100 [00:03<00:00, 24.56it/s, em=0.321]

yes
vqa: What is the man playing?  |  yes  |  wii
yes


 83%|███████████████████████████████████████████████████████████████████████████████████████▏                 | 83/100 [00:03<00:00, 24.77it/s, em=0.31]

yes
vqa: What does the man have on his face?  |  yes  |  smile
yes
vqa: What is in front of the giraffes?  |  yes  |  tree
yes
vqa: What do these giraffes have in common?  |  yes  |  eating
yes


 86%|█████████████████████████████████████████████████████████████████████████████████████████▍              | 86/100 [00:03<00:00, 24.60it/s, em=0.326]

yes


 89%|████████████████████████████████████████████████████████████████████████████████████████████▌           | 89/100 [00:03<00:00, 24.64it/s, em=0.308]

yes
vqa: Where is the giraffe?  |  yes  |  zoo
yes
vqa: Is there a zebra?  |  yes  |  no
yes
vqa: What is the giraffe standing behind?  |  yes  |  log
yes
vqa: Is the giraffe eating the tree?  |  yes  |  no
yes
vqa: Are both giraffes standing?  |  yes  |  no


 92%|███████████████████████████████████████████████████████████████████████████████████████████████▋        | 92/100 [00:03<00:00, 25.03it/s, em=0.304]

yes
vqa: Are they at a zoo?  |  yes  |  no


 95%|██████████████████████████████████████████████████████████████████████████████████████████████████▊     | 95/100 [00:04<00:00, 25.23it/s, em=0.309]

no
vqa: What is on the ground next to the giraffe on the right?  |  no  |  log
yes
yes
vqa: Are any of the animals eating?  |  yes  |  1
yes
vqa: Is the giraffe in the shade?  |  yes  |  no
yes


 98%|█████████████████████████████████████████████████████████████████████████████████████████████████████▉  | 98/100 [00:04<00:00, 25.71it/s, em=0.306]

yes
vqa: How many giraffes are there?  |  yes  |  1


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:04<00:00, 23.51it/s, em=0.3]

yes
vqa: Is there a rock near the giraffe?  |  yes  |  no
yes
vqa: How many animals are in this photo?  |  yes  |  2





In [29]:
# eval_dataset[3]["answer"]

In [30]:
# generations[3]

In [31]:
stats

{'eval/vqa': 0.3}
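
For reference, an exact-match score like the `em` value in the progress bar can be sketched as follows. This assumes a case-insensitive, whitespace-trimmed string comparison; the actual `compute_metrics` may normalize differently, and the prediction/answer pairs here are illustrative rather than taken from the run.

```python
def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after light normalization."""
    if not predictions:
        return 0.0
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(predictions)


# Illustrative values only.
predictions = ["yes", "white", "no", "yes"]
references = ["yes", "orange", "no", "net"]
print(exact_match(predictions, references))  # 0.5
```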

In [32]:
# reft_model.save('temp-outputs')

### Next Steps:

1. Speed up data loading (open-ended performance problem).
2. Check the intervention locations for VL-BART.
3. Measure the fine-tuned model's performance on the eval/test VQA splits.
4. Manually validate the fine-tuned model's outputs.