# Vieira on GQA

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vieira-artifact/vieira-artifact-aaai24/blob/main/gqa_main.ipynb)

In this notebook we explore using Vieira to tackle the GQA visual question answering dataset.
We run our Vieira program on a small dataset of 500 data points randomly sampled from the original GQA dataset.
To make the experiment easy to rerun, we have prepared a dataset that already contains query programs pre-generated with GPT-4.
For a full view of the program, please refer to the other GQA file in our repository.

In [1]:
# Checking python version
!python --version

Python 3.10.12


# Download our mini dataset

The dataset contains 500 samples.

In [None]:
# Dataset
!wget https://github.com/vieira-artifact/vieira-artifact-aaai24/releases/download/dataset/gqa_mini_data.zip
!unzip gqa_mini_data.zip

# Download Vieira

In [3]:
# Download and install Vieira
# The default wheel targets Python 3.10; change the link to match your Python version.
!wget https://github.com/vieira-artifact/vieira-artifact-aaai24/releases/download/v0.2.2/vieira-0.2.2-cp310-cp310-manylinux_2_31_x86_64.whl
!wget https://github.com/vieira-artifact/vieira-artifact-aaai24/releases/download/v0.2.2/vieira_ext-0.2.2-py3-none-any.whl
!wget https://github.com/vieira-artifact/vieira-artifact-aaai24/releases/download/v0.2.2/vieira_gpu-0.0.1-py3-none-any.whl
!wget https://github.com/vieira-artifact/vieira-artifact-aaai24/releases/download/v0.2.2/vieira_gpt-0.0.1-py3-none-any.whl
!wget https://github.com/vieira-artifact/vieira-artifact-aaai24/releases/download/v0.2.2/vieira_transformers-0.0.1-py3-none-any.whl
!wget https://github.com/vieira-artifact/vieira-artifact-aaai24/releases/download/v0.2.2/vieira_opencv-0.0.1-py3-none-any.whl
!pip install vieira-0.2.2-cp310-cp310-manylinux_2_31_x86_64.whl
!pip install vieira_ext-0.2.2-py3-none-any.whl
!pip install vieira_gpu-0.0.1-py3-none-any.whl
!pip install vieira_gpt-0.0.1-py3-none-any.whl
!pip install vieira_transformers-0.0.1-py3-none-any.whl
!pip install vieira_opencv-0.0.1-py3-none-any.whl

Successfully installed huggingface-hub-0.17.3 safetensors-0.4.0 tokenizers-0.14.1 transformers-4.35.0 vieira-transformers-0.0.1
Processing ./vieira_opencv-0.0.1-py3-none-any.whl
Installing collected packages: vieira-opencv
Successfully installed vieira-opencv-0.0.1


# Import Vieira!

In [1]:
# Import vieira and related plugins
import vieira
import vieira_ext

# Set up Vieira plugins

This application enables the GPU, GPT, Transformers, and OpenCV plugins.

In [2]:
# Configure Vieira plugins
import argparse
plugins = vieira_ext.PluginRegistry()

parser = argparse.ArgumentParser()
plugins.setup_argument_parser(parser)
known_args, unknown_args = parser.parse_known_args()
plugins.configure(known_args, unknown_args)

[vieira_openai] `OPENAI_API_KEY` not found, consider setting it in the environment variable
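
The configuration cell calls `parse_known_args` rather than `parse_args`, presumably because a notebook kernel is started with command-line arguments of its own that the plugins do not recognize. The standard-library sketch below illustrates the difference; the `--demo-flag` option and the simulated argument list are hypothetical and are not Vieira plugin options.

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical option, standing in for whatever the plugins register
parser.add_argument("--demo-flag", default="off")

# Simulated argv: one recognized option plus kernel-style extras
argv = ["--demo-flag", "on", "-f", "/tmp/kernel.json"]
known_args, unknown_args = parser.parse_known_args(argv)

print(known_args.demo_flag)  # on
print(unknown_args)          # ['-f', '/tmp/kernel.json']
```

Unrecognized arguments are returned in the second value instead of raising an error, which is what lets the same setup code run both from a script and from a notebook.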


# Loading the dataset and GPT-4-generated programs

In [3]:
# Get dataset
import pickle
import io
import tokenize

GQA_PATH = '/content/gqa_mini_data/'

# Used to parse each data point's input program
class Program:
  def __init__(self, prog_str, init_state=None):
    self.prog_str = prog_str
    self.state = init_state if init_state is not None else dict()
    self.instructions = self.prog_str.split('\n')
    self.progs = [parse_step(i) for i in self.instructions]

  def __repr__(self):
    return self.prog_str

def parse_step(step_str, partial=False):
  tokens = list(tokenize.generate_tokens(io.StringIO(step_str).readline))
  output_var = tokens[0].string
  step_name = tokens[2].string
  parsed_result = dict(
    output_var=output_var,
    step_name=step_name)
  if partial:
    return parsed_result

  arg_tokens = [token for token in tokens[4:-3] if token.string not in [',','=']]
  num_tokens = len(arg_tokens) // 2
  args = dict()
  for i in range(num_tokens):
    args[arg_tokens[2*i].string] = arg_tokens[2*i+1].string
  parsed_result['args'] = args
  return parsed_result

def get_dataset():
  with open(GQA_PATH + 'mini_question_no_crop.pkl', 'rb') as f:
    return pickle.load(f)
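
To see what `parse_step` extracts, the following self-contained sketch tokenizes one program line copied from the run log and applies the same token slicing. It is an illustration of the parsing logic, not a replacement for the cell above. Note that string arguments keep their surrounding quotes.

```python
import io
import tokenize

line = "BOX0=LOC(image=IMAGE,object='bird')"
tokens = list(tokenize.generate_tokens(io.StringIO(line).readline))

output_var = tokens[0].string  # name left of the first '='
step_name = tokens[2].string   # function name right of the first '='

# Skip 'NAME = FUNC (' at the front and ') NEWLINE ENDMARKER' at the back,
# then discard separators so that keys and values alternate.
arg_tokens = [t.string for t in tokens[4:-3] if t.string not in (",", "=")]
args = dict(zip(arg_tokens[0::2], arg_tokens[1::2]))

print(output_var, step_name, args)
```

Because a quoted question or expression is a single `STRING` token, commas and equals signs inside it are not mistaken for argument separators.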

# The Vieira Program

In [4]:
# Create Vieira context with Vieira code
def create_context():
  ctx = vieira.Context(provenance="topkproofs")
  plugins.load_into_ctx(ctx)
  ctx.set_iter_limit(100)

  ctx.add_program("""
  @owl_vit(output_fields=["bbox-x", "bbox-y", "bbox-w", "bbox-h"], limit=5)
  type find_object(bound img: Tensor, bound object: String, id: u32, x: u32, y: u32, w: u32, h: u32)

  @vilt(top=1)
  type vqa(bound img: Tensor, bound question: String, answer: String)

  @py_eval
  type $py_eval_string(s: String) -> String

  // all args are variable names
  type Expr = IMAGE(String)
            | CROP(String, String)
            | CROP_ABOVE(String, String)
            | CROP_BELOW(String, String)
            | CROP_LEFTOF(String, String)
            | CROP_RIGHTOF(String, String)
            | LOC(String, String)
            | VQA(String, String)
            | COUNT(String)
            | EVAL(String)
            | RESULT(String)

  type process_eval_string(bound id: u32, bound str: String, processed: String)
  rel process_eval_string(n, s, s) = n == 0
  rel process_eval_string(n + 1, str, $string_replace(str_to_eval, $format("{{}}", var), val as String)) = var_value_int(var, val, n) and process_eval_string(n, str, str_to_eval)
  rel process_eval_string(n + 1, str, $string_replace(str_to_eval, $format("{{}}", var), val as String)) = var_value_string(var, val, n) and process_eval_string(n, str, str_to_eval)
  rel process_eval_string(n + 1, str, str_to_eval) = step(n, _, e) and case e is LOC(_, _) and process_eval_string(n, str, str_to_eval)
  rel process_eval_string(n + 1, str, str_to_eval) = var_value_tensor(_, _, n) and process_eval_string(n, str, str_to_eval)

  rel eval_image(e, $load_image(img_path)) = case e is IMAGE(img_path)
  rel eval_image(e, $crop_image(img, x, y, w, h)) = case e is CROP(img_var, box_var) and var_value_tensor(img_var, img, _) and var_value_bbox(box_var, _, x, y, w, h, _)
  rel eval_image(e, $crop_image(img, x, y, w, h, "above")) = case e is CROP_ABOVE(img_var, box_var) and var_value_tensor(img_var, img, _) and var_value_bbox(box_var, _, x, y, w, h, _)
  rel eval_image(e, $crop_image(img, x, y, w, h, "below")) = case e is CROP_BELOW(img_var, box_var) and var_value_tensor(img_var, img, _) and var_value_bbox(box_var, _, x, y, w, h, _)
  rel eval_image(e, $crop_image(img, x, y, w, h, "left")) = case e is CROP_LEFTOF(img_var, box_var) and var_value_tensor(img_var, img, _) and var_value_bbox(box_var, _, x, y, w, h, _)
  rel eval_image(e, $crop_image(img, x, y, w, h, "right")) = case e is CROP_RIGHTOF(img_var, box_var) and var_value_tensor(img_var, img, _) and var_value_bbox(box_var, _, x, y, w, h, _)
  rel eval_bbox(e, id, x, y, w, h) = case e is LOC(img_name, object) and var_value_tensor(img_name, image, _) and find_object(image, object, id, x, y, w, h)

  type eval_string(bound id: u32, bound e: Expr, s: String)
  rel eval_string(id, e, answer) = case e is VQA(img_name, question) and var_value_tensor(img_name, image, _) and vqa(image, question, answer)
  rel eval_string(id, e, $py_eval_string(str_to_eval)) = case e is EVAL(str) and process_eval_string(id, str, str_to_eval)
  rel eval_string(id, e, result) = case e is RESULT(var_name) and var_value_string(var_name, result, _)
  rel eval_string(id, e, result as String) = case e is RESULT(var_name) and var_value_int(var_name, result, _)

  type eval_int(e: Expr, n: usize)
  rel eval_int(e, cnt) = cnt := count(bid: var_value_bbox(box_name, bid, _, _, _, _, _) where e: case e is COUNT(box_name))

  rel var_value_tensor(var_name, image, id) = step(id, var_name, expr) and eval_image(expr, image)
  rel var_value_bbox(var_name, id, x, y, w, h, sid) = step(sid, var_name, expr) and eval_bbox(expr, id, x, y, w, h)
  rel var_value_string(var_name, str, id) = step(id, var_name, expr) and eval_string(id, expr, str)
  rel var_value_int(var_name, val, id) = step(id, var_name, expr) and eval_int(expr, val)

  rel final_result(answer) = var_value_string("FINAL_RESULT", answer, _)

  query final_result
  """)

  ctx.set_non_probabilistic("step")
  return ctx
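
The `process_eval_string` rules appear to walk through the steps in order and replace each `{VAR}` placeholder in an `EVAL` expression with the value of that variable, after which the filled-in string is passed to `$py_eval_string`. The plain-Python sketch below approximates this for integer values only. `fill_placeholders` is a hypothetical helper that is not part of Vieira, and the built-in `eval` merely stands in for the `py_eval` plugin.

```python
def fill_placeholders(expr, int_values):
    """Replace each {NAME} placeholder with its integer value."""
    for name, value in int_values.items():
        expr = expr.replace("{" + name + "}", str(value))
    return expr

# Expression template copied from the run log
template = "'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'"
filled = fill_placeholders(template, {"ANSWER0": 0, "ANSWER1": 2})

print(filled)        # 'yes' if 0 + 2 > 0 else 'no'
print(eval(filled))  # yes
```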

# Testing Scripts

In [7]:
from tqdm import tqdm

# For parsing program input
ARG_ORDER = {
  "COUNT": ("box",),
  "CROP": ("image", "box"),
  "CROP_ABOVE": ("image", "box"),
  "CROP_BELOW": ("image", "box"),
  "CROP_LEFTOF": ("image", "box"),
  "CROP_RIGHTOF": ("image", "box"),
  "EVAL": ("expr",),
  "LOC": ("image", "object"),
  "RESULT": ("var",),
  "VQA": ("image", "question"),
}

# Accuracy is evaluated by substring matching against the top-k predictions
def check_prediction(predictions, ground_truth, k):
  return any(ground_truth in pred or pred in ground_truth for pred in predictions[:k])

# Test a single data point
def test_one(context, id, testcase):
  image_path = GQA_PATH + "images/" + testcase["imageId"] + ".jpg"   # FIX IMAGE PATH
  step_facts = [(0, "IMAGE", f'IMAGE("{image_path}")')]

  for i, var_dict in enumerate(testcase["prog"].progs):
    function = var_dict["step_name"]
    if function not in ARG_ORDER:
      continue
    arg_str_list = []
    for arg in ARG_ORDER[function]:
      if arg == "expr":
        arg_str_list.append(var_dict["args"][arg])
      elif arg == "object" or arg == "question":
        arg_str_list.append(var_dict["args"][arg].replace("'", '"'))
      else:
        arg_str_list.append('"' + var_dict["args"][arg] + '"')
    expr_str = function + "(" + ",".join(arg_str_list) + ")"
    step_facts.append((i + 1, var_dict["output_var"], expr_str))

  ctx = context.clone()
  ctx.add_facts("step", step_facts)
  ctx.run()
  result = list(ctx.relation("final_result"))

  if result:
    result.sort(key=lambda x: x[0], reverse=True)
    return [tup[1][0] for tup in result[:5]]
  else:
    return ["no"]
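
`test_one` turns each parsed step into a `(step_id, output_var, expr)` fact whose third field is the text of an `Expr` term. The sketch below reproduces that string construction for two hand-written steps. `to_expr` is a hypothetical helper, the trimmed `ARG_ORDER` covers only the functions used here, and the dictionaries mimic what `parse_step` returns.

```python
ARG_ORDER = {
    "LOC": ("image", "object"),
    "COUNT": ("box",),
    "EVAL": ("expr",),
}

def to_expr(step):
    parts = []
    for arg in ARG_ORDER[step["step_name"]]:
        raw = step["args"][arg]
        if arg == "expr":
            parts.append(raw)                    # already a double-quoted string
        elif arg in ("object", "question"):
            parts.append(raw.replace("'", '"'))  # single-quoted literal -> double-quoted
        else:
            parts.append('"' + raw + '"')        # bare variable name -> quoted name
    return step["step_name"] + "(" + ",".join(parts) + ")"

steps = [
    {"output_var": "BOX0", "step_name": "LOC",
     "args": {"image": "IMAGE", "object": "'bird'"}},
    {"output_var": "ANSWER0", "step_name": "COUNT", "args": {"box": "BOX0"}},
]
# Step 0 is reserved for the IMAGE fact, so parsed steps start at 1
facts = [(i + 1, s["output_var"], to_expr(s)) for i, s in enumerate(steps)]
print(facts)
```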

# Running the Experiment!

Please check the log below for details of the experimental results.
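
Each log entry reports running counts of correct answers at k = 1, 3, and 5, where a prediction counts as correct if it contains, or is contained in, the ground truth. The sketch below replays that tally on three (ground truth, predictions) pairs copied from the log. The helper mirrors `check_prediction`, and the `accuracy` dictionary is an added convenience that the notebook itself does not compute.

```python
def check_prediction(predictions, ground_truth, k):
    return any(ground_truth in p or p in ground_truth for p in predictions[:k])

# (ground truth, predictions) pairs copied from the log
samples = [
    ("no", ["yes", "no"]),
    ("dining table", ["table"]),
    ("cabinet", ["kitchen", "bed", "sink", "couch", "chair"]),
]

correct = {1: 0, 3: 0, 5: 0}
for truth, preds in samples:
    for k in correct:
        if check_prediction(preds, truth, k):
            correct[k] += 1

# Fraction of samples answered correctly within the top k predictions
accuracy = {k: hits / len(samples) for k, hits in correct.items()}
print(correct)  # {1: 1, 3: 2, 5: 2}
```

The first sample is wrong at k = 1 but right at k = 3, and the second is accepted because "table" is a substring of "dining table", so the metric is more lenient than exact matching.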

In [8]:
# Run experiment on the dataset
context = create_context()

data = get_dataset()
items = tqdm(list(data.items()))
results = {}
correct = {1: 0, 3: 0, 5: 0}
match_substring = lambda s1, s2: s1 in s2 or s2 in s1
total = 0
for id, testcase in items:
  ground_truth = testcase["answer"]
  predictions = test_one(context, id, testcase)
  results[id] = predictions + [ground_truth]

  for k in (1, 3, 5):
    if check_prediction(predictions, ground_truth, k):
      correct[k] += 1
  total += 1

  print(f"ground truth: {ground_truth}, predictions: {predictions}")
  print(f"total: {total}, correct: {correct}")
  print(testcase["prog"].prog_str)

print(total)
print(correct)
print(results)

 62%|██████▏   | 310/501 [48:06<33:54, 10.65s/it]

ground truth: no, predictions: ['yes']
total: 310, correct: {1: 182, 3: 202, 5: 203}
BOX0=LOC(image=IMAGE,object='bird')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='young elephant')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 62%|██████▏   | 311/501 [48:07<24:24,  7.71s/it]

ground truth: no, predictions: ['no']
total: 311, correct: {1: 183, 3: 203, 5: 204}
ANSWER0=VQA(image=IMAGE,question='Is that man running?')
FINAL_RESULT=RESULT(var=ANSWER0)


 62%|██████▏   | 312/501 [51:41<3:39:25, 69.66s/it]

ground truth: no, predictions: ['no', 'yes']
total: 312, correct: {1: 184, 3: 204, 5: 205}
BOX0=LOC(image=IMAGE,object='bicycle')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='person')
IMAGE1=CROP_LEFTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='boxes')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 62%|██████▏   | 313/501 [51:44<2:35:45, 49.71s/it]

ground truth: no, predictions: ['no']
total: 313, correct: {1: 185, 3: 205, 5: 206}
BOX0=LOC(image=IMAGE,object='umbrella')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 63%|██████▎   | 314/501 [51:46<1:50:38, 35.50s/it]

ground truth: no, predictions: ['no']
total: 314, correct: {1: 186, 3: 206, 5: 207}
BOX0=LOC(image=IMAGE,object='lamp')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 63%|██████▎   | 315/501 [51:49<1:19:11, 25.55s/it]

ground truth: right, predictions: ['no']
total: 315, correct: {1: 186, 3: 206, 5: 207}
BOX0=LOC(image=IMAGE,object='tent')
IMAGE0=CROP_NEARBY(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='backpack')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'right' if {ANSWER0} > 0 else 'left'")
FINAL_RESULT=RESULT(var=ANSWER1)


 63%|██████▎   | 316/501 [51:53<59:20, 19.25s/it]  

ground truth: no, predictions: ['no']
total: 316, correct: {1: 187, 3: 207, 5: 208}
BOX0=LOC(image=IMAGE,object='bag')
BOX1=LOC(image=IMAGE,object='woman')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} > 0 and {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 63%|██████▎   | 317/501 [51:57<44:37, 14.55s/it]

ground truth: no, predictions: ['no']
total: 317, correct: {1: 188, 3: 208, 5: 209}
BOX0=LOC(image=IMAGE,object='bag')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 63%|██████▎   | 318/501 [52:02<35:41, 11.70s/it]

ground truth: cabinet, predictions: ['kitchen', 'bed', 'sink', 'couch', 'chair']
total: 318, correct: {1: 188, 3: 208, 5: 209}
BOX0=LOC(image=IMAGE,object='LEFT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What kind of furniture is this?')
FINAL_RESULT=RESULT(var=ANSWER0)


 64%|██████▎   | 319/501 [52:04<26:58,  8.89s/it]

ground truth: no, predictions: ['no']
total: 319, correct: {1: 189, 3: 209, 5: 210}
BOX0=LOC(image=IMAGE,object='MIDDLE')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='man')
IMAGE1=CROP_LEFTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='woman')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 64%|██████▍   | 320/501 [52:19<32:16, 10.70s/it]

ground truth: no, predictions: ['no', 'yes']
total: 320, correct: {1: 190, 3: 210, 5: 211}
BOX0=LOC(image=IMAGE,object='cup')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='book')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 64%|██████▍   | 321/501 [52:25<27:44,  9.24s/it]

ground truth: yes, predictions: ['no']
total: 321, correct: {1: 190, 3: 210, 5: 211}
BOX0=LOC(image=IMAGE,object='flags')
BOX1=LOC(image=IMAGE,object='helmets')
IMAGE0=CROP(image=IMAGE,box=BOX0)
IMAGE1=CROP(image=IMAGE,box=BOX1)
ANSWER0=VQA(image=IMAGE0,question='What color are the flags?')
ANSWER1=VQA(image=IMAGE1,question='What color are the helmets?')
ANSWER2=EVAL(expr="'yes' if {ANSWER0} == 'blue' or {ANSWER1} == 'blue' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 64%|██████▍   | 322/501 [52:26<20:00,  6.71s/it]

ground truth: dining table, predictions: ['table']
total: 322, correct: {1: 191, 3: 211, 5: 212}
ANSWER0=VQA(image=IMAGE,question='What is the piece of furniture that that napkin is on called?')
FINAL_RESULT=RESULT(var=ANSWER0)


 64%|██████▍   | 323/501 [52:35<22:18,  7.52s/it]

ground truth: yes, predictions: ['no']
total: 323, correct: {1: 191, 3: 211, 5: 212}
BOX0=LOC(image=IMAGE,object='sky')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the sky?')
ANSWER1=VQA(image=IMAGE0,question='Is the sky light?')
ANSWER2=EVAL(expr="'yes' if {ANSWER0} == 'blue' and {ANSWER1} == 'yes' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 65%|██████▍   | 324/501 [52:41<20:57,  7.10s/it]

ground truth: yes, predictions: ['no']
total: 324, correct: {1: 191, 3: 211, 5: 212}
BOX0=LOC(image=IMAGE,object='fence')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='truck')
IMAGE1=CROP(image=IMAGE0,box=BOX1)
ANSWER0=VQA(image=IMAGE1,question='Is the truck made of metal?')
FINAL_RESULT=RESULT(var=ANSWER0)


 65%|██████▍   | 325/501 [52:44<16:39,  5.68s/it]

ground truth: top, predictions: ['no']
total: 325, correct: {1: 191, 3: 211, 5: 212}
BOX0=LOC(image=IMAGE,object='TOP')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='glasses')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'top' if {ANSWER0} > 0 else 'bottom'")
FINAL_RESULT=RESULT(var=ANSWER1)


 65%|██████▌   | 326/501 [52:44<12:17,  4.21s/it]

ground truth: metal, predictions: ['metal']
total: 326, correct: {1: 192, 3: 212, 5: 213}
ANSWER0=VQA(image=IMAGE,question='What makes up this fence, metal or wood?')
FINAL_RESULT=RESULT(var=ANSWER0)


 65%|██████▌   | 327/501 [52:47<11:13,  3.87s/it]

ground truth: yes, predictions: ['yes']
total: 327, correct: {1: 193, 3: 213, 5: 214}
BOX0=LOC(image=IMAGE,object='window')
ANSWER0=VQA(image=IMAGE,question='Are the windows large?')
FINAL_RESULT=RESULT(var=ANSWER0)


 65%|██████▌   | 328/501 [52:55<14:18,  4.96s/it]

ground truth: yes, predictions: ['no']
total: 328, correct: {1: 193, 3: 213, 5: 214}
BOX0=LOC(image=IMAGE,object='shirt')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE,object='telephone')
IMAGE1=CROP(image=IMAGE,box=BOX1)
ANSWER0=VQA(image=IMAGE0,question='What color is the shirt?')
ANSWER1=VQA(image=IMAGE1,question='What color is the telephone?')
ANSWER2=EVAL(expr="'yes' if {ANSWER0} != {ANSWER1} else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 66%|██████▌   | 329/501 [52:56<10:41,  3.73s/it]

ground truth: yes, predictions: ['yes']
total: 329, correct: {1: 194, 3: 214, 5: 215}
ANSWER0=VQA(image=IMAGE,question='Does the man look happy and wet?')
FINAL_RESULT=RESULT(var=ANSWER0)


 66%|██████▌   | 330/501 [53:00<11:21,  3.99s/it]

ground truth: no, predictions: ['no']
total: 330, correct: {1: 195, 3: 215, 5: 216}
BOX0=LOC(image=IMAGE,object='train')
BOX1=LOC(image=IMAGE,object='clock')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 66%|██████▌   | 331/501 [53:03<10:31,  3.72s/it]

ground truth: no, predictions: ['no']
total: 331, correct: {1: 196, 3: 216, 5: 217}
BOX0=LOC(image=IMAGE,object='hat')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color are the hats?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'black' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 66%|██████▋   | 332/501 [53:05<08:12,  2.91s/it]

ground truth: countertop, predictions: ['plate']
total: 332, correct: {1: 196, 3: 216, 5: 217}
ANSWER0=VQA(image=IMAGE,question='This pizza is on what?')
FINAL_RESULT=RESULT(var=ANSWER0)


 66%|██████▋   | 333/501 [53:08<08:30,  3.04s/it]

ground truth: black, predictions: ['no']
total: 333, correct: {1: 196, 3: 216, 5: 217}
BOX0=LOC(image=IMAGE,object='guy')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the jacket the guy is wearing?')
FINAL_RESULT=RESULT(var=ANSWER0)


 67%|██████▋   | 334/501 [53:09<06:39,  2.39s/it]

ground truth: yes, predictions: ['yes']
total: 334, correct: {1: 197, 3: 217, 5: 218}
ANSWER0=VQA(image=IMAGE,question='Is the dog running?')
FINAL_RESULT=RESULT(var=ANSWER0)


 67%|██████▋   | 335/501 [53:15<09:39,  3.49s/it]

ground truth: no, predictions: ['yes', 'no']
total: 335, correct: {1: 197, 3: 218, 5: 219}
BOX0=LOC(image=IMAGE,object='clock')
BOX1=LOC(image=IMAGE,object='picture')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 67%|██████▋   | 336/501 [53:20<11:15,  4.09s/it]

ground truth: yes, predictions: ['yes']
total: 336, correct: {1: 198, 3: 219, 5: 220}
BOX0=LOC(image=IMAGE,object='grass')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Is the grass bushy?')
FINAL_RESULT=RESULT(var=ANSWER0)


 67%|██████▋   | 337/501 [53:29<15:08,  5.54s/it]

ground truth: yes, predictions: ['no']
total: 337, correct: {1: 198, 3: 219, 5: 220}
BOX0=LOC(image=IMAGE,object='stop sign')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE,object='traffic sign')
IMAGE1=CROP(image=IMAGE,box=BOX1)
ANSWER0=VQA(image=IMAGE0,question='What color is the stop sign?')
ANSWER1=VQA(image=IMAGE1,question='What color is the traffic sign?')
ANSWER2=EVAL(expr="'yes' if {ANSWER0} == 'white' or {ANSWER1} == 'white' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 67%|██████▋   | 338/501 [53:35<15:19,  5.64s/it]

ground truth: no, predictions: ['no', 'yes']
total: 338, correct: {1: 199, 3: 220, 5: 221}
BOX0=LOC(image=IMAGE,object='oranges')
BOX1=LOC(image=IMAGE,object='cigars')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 68%|██████▊   | 339/501 [53:41<15:07,  5.60s/it]

ground truth: green, predictions: ['green']
total: 339, correct: {1: 200, 3: 221, 5: 222}
BOX0=LOC(image=IMAGE,object='building')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color does the door of the building have?')
FINAL_RESULT=RESULT(var=ANSWER0)


 68%|██████▊   | 340/501 [53:45<14:14,  5.31s/it]

ground truth: no, predictions: ['no']
total: 340, correct: {1: 201, 3: 222, 5: 223}
BOX0=LOC(image=IMAGE,object='fence')
BOX1=LOC(image=IMAGE,object='boy')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 68%|██████▊   | 341/501 [53:49<12:46,  4.79s/it]

ground truth: yes, predictions: ['no']
total: 341, correct: {1: 201, 3: 222, 5: 223}
BOX0=LOC(image=IMAGE,object='blue')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='hat')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 68%|██████▊   | 342/501 [54:00<17:30,  6.61s/it]

ground truth: blue, predictions: ['silver', 'black']
total: 342, correct: {1: 201, 3: 222, 5: 223}
BOX0=LOC(image=IMAGE,object='woman')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='phone')
IMAGE1=CROP(image=IMAGE0,box=BOX1)
ANSWER0=VQA(image=IMAGE1,question='What color is the phone?')
FINAL_RESULT=RESULT(var=ANSWER0)


 68%|██████▊   | 343/501 [54:01<13:18,  5.05s/it]

ground truth: outdoors, predictions: ['outdoors']
total: 343, correct: {1: 202, 3: 223, 5: 224}
ANSWER0=VQA(image=IMAGE,question='Is it indoors or outdoors?')
FINAL_RESULT=RESULT(var=ANSWER0)


 69%|██████▊   | 344/501 [59:33<4:29:58, 103.17s/it]

ground truth: no, predictions: ['no', 'yes']
total: 344, correct: {1: 203, 3: 224, 5: 225}
BOX0=LOC(image=IMAGE,object='child')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='ski')
IMAGE1=CROP_RIGHTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='people')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 69%|██████▉   | 345/501 [59:36<3:09:52, 73.03s/it] 

ground truth: no, predictions: ['no']
total: 345, correct: {1: 204, 3: 225, 5: 226}
BOX0=LOC(image=IMAGE,object='wood window')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 69%|██████▉   | 346/501 [59:39<2:14:37, 52.11s/it]

ground truth: no, predictions: ['no']
total: 346, correct: {1: 205, 3: 226, 5: 227}
BOX0=LOC(image=IMAGE,object='chair')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 69%|██████▉   | 347/501 [59:42<1:35:24, 37.17s/it]

ground truth: yes, predictions: ['yes', 'no']
total: 347, correct: {1: 206, 3: 227, 5: 228}
BOX0=LOC(image=IMAGE,object='cell phone')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 69%|██████▉   | 348/501 [59:44<1:08:07, 26.71s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 348, correct: {1: 206, 3: 228, 5: 229}
BOX0=LOC(image=IMAGE,object='window')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 70%|██████▉   | 349/501 [59:46<49:08, 19.40s/it]  

ground truth: yes, predictions: ['no']
total: 349, correct: {1: 206, 3: 228, 5: 229}
BOX0=LOC(image=IMAGE,object='BOTTOM')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='bag')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 70%|██████▉   | 350/501 [59:47<34:47, 13.82s/it]

ground truth: yes, predictions: ['yes']
total: 350, correct: {1: 207, 3: 229, 5: 230}
ANSWER0=VQA(image=IMAGE,question='Is the woman standing?')
FINAL_RESULT=RESULT(var=ANSWER0)


 70%|███████   | 351/501 [59:53<28:39, 11.46s/it]

ground truth: yes, predictions: ['yes', 'no']
total: 351, correct: {1: 208, 3: 230, 5: 231}
BOX0=LOC(image=IMAGE,object='car')
BOX1=LOC(image=IMAGE,object='window')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 70%|███████   | 352/501 [59:57<23:18,  9.39s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 352, correct: {1: 208, 3: 231, 5: 232}
BOX0=LOC(image=IMAGE,object='small chair')
BOX1=LOC(image=IMAGE,object='baby')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 70%|███████   | 353/501 [1:00:12<27:13, 11.03s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 353, correct: {1: 208, 3: 232, 5: 233}
BOX0=LOC(image=IMAGE,object='meat')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='silver knife')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 71%|███████   | 354/501 [1:00:15<20:37,  8.42s/it]

ground truth: white, predictions: ['no']
total: 354, correct: {1: 208, 3: 232, 5: 233}
BOX0=LOC(image=IMAGE,object='toilet')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the toilet?')
FINAL_RESULT=RESULT(var=ANSWER0)


 71%|███████   | 355/501 [1:00:21<18:42,  7.69s/it]

ground truth: yes, predictions: ['no']
total: 355, correct: {1: 208, 3: 232, 5: 233}
BOX0=LOC(image=IMAGE,object='cup')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='wicker chair')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 71%|███████   | 356/501 [1:00:25<16:18,  6.75s/it]

ground truth: no, predictions: ['no', 'yes']
total: 356, correct: {1: 209, 3: 233, 5: 234}
BOX0=LOC(image=IMAGE,object='skateboard')
BOX1=LOC(image=IMAGE,object='rope')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 71%|███████▏  | 357/501 [1:00:27<12:58,  5.41s/it]

ground truth: top, predictions: ['no']
total: 357, correct: {1: 209, 3: 233, 5: 234}
BOX0=LOC(image=IMAGE,object='TOP')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='bottle')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'top' if {ANSWER0} > 0 else 'bottom'")
FINAL_RESULT=RESULT(var=ANSWER1)


 71%|███████▏  | 358/501 [1:00:30<10:40,  4.48s/it]

ground truth: no, predictions: ['no']
total: 358, correct: {1: 210, 3: 234, 5: 235}
BOX0=LOC(image=IMAGE,object='fence')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 72%|███████▏  | 359/501 [1:00:33<09:46,  4.13s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 359, correct: {1: 210, 3: 235, 5: 236}
BOX0=LOC(image=IMAGE,object='plastic forks')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 72%|███████▏  | 360/501 [1:00:37<09:33,  4.07s/it]

ground truth: no, predictions: ['no']
total: 360, correct: {1: 211, 3: 236, 5: 237}
BOX0=LOC(image=IMAGE,object='surfboard')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the surfboard?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'blue' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 72%|███████▏  | 361/501 [1:00:39<08:18,  3.56s/it]

ground truth: yes, predictions: ['no']
total: 361, correct: {1: 211, 3: 236, 5: 237}
BOX0=LOC(image=IMAGE,object='RIGHT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='fence')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 72%|███████▏  | 362/501 [1:00:42<07:50,  3.38s/it]

ground truth: yes, predictions: ['no']
total: 362, correct: {1: 211, 3: 236, 5: 237}
BOX0=LOC(image=IMAGE,object='shorts')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color are the shorts?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'white' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 72%|███████▏  | 363/501 [1:00:45<07:13,  3.14s/it]

ground truth: no, predictions: ['no']
total: 363, correct: {1: 212, 3: 237, 5: 238}
BOX0=LOC(image=IMAGE,object='donkey')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 73%|███████▎  | 364/501 [1:00:49<07:28,  3.27s/it]

ground truth: no, predictions: ['no']
total: 364, correct: {1: 213, 3: 238, 5: 239}
BOX0=LOC(image=IMAGE,object='car')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 73%|███████▎  | 365/501 [1:00:49<05:45,  2.54s/it]

ground truth: street, predictions: ['bridge']
total: 365, correct: {1: 213, 3: 238, 5: 239}
ANSWER0=VQA(image=IMAGE,question='Where is the bus?')
FINAL_RESULT=RESULT(var=ANSWER0)


 73%|███████▎  | 366/501 [1:00:53<06:31,  2.90s/it]

ground truth: no, predictions: ['no']
total: 366, correct: {1: 214, 3: 239, 5: 240}
BOX0=LOC(image=IMAGE,object='cheese')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the cheese?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'blue' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 73%|███████▎  | 367/501 [1:00:54<05:05,  2.28s/it]

ground truth: picture frame, predictions: ['sign']
total: 367, correct: {1: 214, 3: 239, 5: 240}
ANSWER0=VQA(image=IMAGE,question='What is on the wall?')
FINAL_RESULT=RESULT(var=ANSWER0)


 73%|███████▎  | 368/501 [1:00:59<06:44,  3.04s/it]

ground truth: cell phone, predictions: ['phone', 'laptop', 'guitar']
total: 368, correct: {1: 215, 3: 240, 5: 241}
BOX0=LOC(image=IMAGE,object='chair')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What is the man playing with?')
FINAL_RESULT=RESULT(var=ANSWER0)


 74%|███████▎  | 369/501 [1:01:05<08:38,  3.93s/it]

ground truth: no, predictions: ['yes']
total: 369, correct: {1: 215, 3: 240, 5: 241}
BOX0=LOC(image=IMAGE,object='microwave')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='chandelier')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 74%|███████▍  | 370/501 [1:01:20<16:14,  7.44s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 370, correct: {1: 215, 3: 241, 5: 242}
BOX0=LOC(image=IMAGE,object='woman')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='glasses')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 74%|███████▍  | 371/501 [1:01:34<19:53,  9.18s/it]

ground truth: no, predictions: ['no']
total: 371, correct: {1: 216, 3: 242, 5: 243}
BOX0=LOC(image=IMAGE,object='shirt')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE,object='outfit')
IMAGE1=CROP(image=IMAGE,box=BOX1)
ANSWER0=VQA(image=IMAGE0,question='What color is the shirt?')
ANSWER1=VQA(image=IMAGE1,question='What color is the outfit?')
ANSWER2=EVAL(expr="'yes' if {ANSWER0} != {ANSWER1} else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 74%|███████▍  | 372/501 [1:01:37<16:14,  7.55s/it]

ground truth: no, predictions: ['no', 'yes']
total: 372, correct: {1: 217, 3: 243, 5: 244}
BOX0=LOC(image=IMAGE,object='dry-erase board')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 74%|███████▍  | 373/501 [1:02:52<59:12, 27.75s/it]

ground truth: yes, predictions: ['yes', 'no']
total: 373, correct: {1: 218, 3: 244, 5: 245}
BOX0=LOC(image=IMAGE,object='drinks')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='tray')
IMAGE1=CROP(image=IMAGE0,box=BOX1)
ANSWER0=VQA(image=IMAGE1,question='Does the tray look wooden?')
FINAL_RESULT=RESULT(var=ANSWER0)


 75%|███████▍  | 374/501 [1:02:59<45:29, 21.49s/it]

ground truth: skis, predictions: ['jacket', 'snowsuit']
total: 374, correct: {1: 218, 3: 244, 5: 245}
BOX0=LOC(image=IMAGE,object='person')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What is the guy wearing?')
FINAL_RESULT=RESULT(var=ANSWER0)


 75%|███████▍  | 375/501 [1:03:02<33:05, 15.76s/it]

ground truth: left, predictions: ['no']
total: 375, correct: {1: 218, 3: 244, 5: 245}
BOX0=LOC(image=IMAGE,object='brown')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='animal')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'right' if {ANSWER0} > 0 else 'left'")
FINAL_RESULT=RESULT(var=ANSWER1)


 75%|███████▌  | 376/501 [1:03:04<24:28, 11.75s/it]

ground truth: no, predictions: ['no']
total: 376, correct: {1: 219, 3: 245, 5: 246}
BOX0=LOC(image=IMAGE,object='door')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color are the doors?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'green' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 75%|███████▌  | 377/501 [1:03:10<20:45, 10.05s/it]

ground truth: no, predictions: ['no']
total: 377, correct: {1: 220, 3: 246, 5: 247}
BOX0=LOC(image=IMAGE,object='bag')
BOX1=LOC(image=IMAGE,object='helmet')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 75%|███████▌  | 378/501 [1:03:20<20:21,  9.93s/it]

ground truth: no, predictions: ['no']
total: 378, correct: {1: 221, 3: 247, 5: 248}
BOX0=LOC(image=IMAGE,object='chair')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='computer mouse')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 76%|███████▌  | 379/501 [1:03:25<17:32,  8.63s/it]

ground truth: no, predictions: ['no']
total: 379, correct: {1: 222, 3: 248, 5: 249}
BOX0=LOC(image=IMAGE,object='sandals')
BOX1=LOC(image=IMAGE,object='boots')
BOX2=MERGE(box=BOX0,box=BOX1)
IMAGE0=CROP(image=IMAGE,box=BOX2)
ANSWER0=VQA(image=IMAGE0,question='What color are the sandals and boots?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} != 'black' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 76%|███████▌  | 380/501 [1:03:30<15:10,  7.53s/it]

ground truth: yes, predictions: ['no']
total: 380, correct: {1: 222, 3: 248, 5: 249}
BOX0=LOC(image=IMAGE,object='bridge')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Is the bridge short?')
ANSWER1=EVAL(expr="'no' if {ANSWER0} == 'no' else 'yes'")
FINAL_RESULT=RESULT(var=ANSWER1)


 76%|███████▌  | 381/501 [1:03:33<12:00,  6.00s/it]

ground truth: yes, predictions: ['yes', 'no']
total: 381, correct: {1: 223, 3: 249, 5: 250}
BOX0=LOC(image=IMAGE,object='white gloves')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 76%|███████▌  | 382/501 [1:03:39<11:55,  6.01s/it]

ground truth: no, predictions: ['no']
total: 382, correct: {1: 224, 3: 250, 5: 251}
BOX0=LOC(image=IMAGE,object='table lamp')
BOX1=LOC(image=IMAGE,object='bowl')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 76%|███████▋  | 383/501 [1:03:42<10:28,  5.32s/it]

ground truth: eye glasses, predictions: ['shirt']
total: 383, correct: {1: 224, 3: 250, 5: 251}
BOX0=LOC(image=IMAGE,object='curtains')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What is the person wearing?')
FINAL_RESULT=RESULT(var=ANSWER0)


 77%|███████▋  | 384/501 [1:03:50<11:25,  5.86s/it]

ground truth: no, predictions: ['no']
total: 384, correct: {1: 225, 3: 251, 5: 252}
BOX0=LOC(image=IMAGE,object='man')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Does the man wear a helmet?')
FINAL_RESULT=RESULT(var=ANSWER0)


 77%|███████▋  | 385/501 [1:04:07<18:19,  9.47s/it]

ground truth: no, predictions: ['yes']
total: 385, correct: {1: 225, 3: 251, 5: 252}
BOX0=LOC(image=IMAGE,object='LEFT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='person')
IMAGE1=CROP_LEFTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='kite')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 77%|███████▋  | 386/501 [1:04:12<15:25,  8.05s/it]

ground truth: no, predictions: ['no']
total: 386, correct: {1: 226, 3: 252, 5: 253}
BOX0=LOC(image=IMAGE,object='microwave')
BOX1=LOC(image=IMAGE,object='refrigerator')
BOX2=MERGE(box=BOX0,box=BOX1)
IMAGE0=CROP(image=IMAGE,box=BOX2)
ANSWER0=VQA(image=IMAGE0,question='What color are the microwaves and refrigerators?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} != 'white' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 77%|███████▋  | 387/501 [1:04:16<12:48,  6.75s/it]

ground truth: yes, predictions: ['no']
total: 387, correct: {1: 226, 3: 252, 5: 253}
BOX0=LOC(image=IMAGE,object='bat')
ANSWER0=VQA(image=IMAGE,question='Are any of the bats small?')
FINAL_RESULT=RESULT(var=ANSWER0)


 77%|███████▋  | 388/501 [1:04:19<10:42,  5.68s/it]

ground truth: no, predictions: ['no']
total: 388, correct: {1: 227, 3: 253, 5: 254}
BOX0=LOC(image=IMAGE,object='window')
IMAGE0=CROP_BEHIND(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='person')
IMAGE1=CROP_RIGHTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='remote control')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 78%|███████▊  | 389/501 [1:04:22<08:48,  4.72s/it]

ground truth: no, predictions: ['no']
total: 389, correct: {1: 228, 3: 254, 5: 255}
BOX0=LOC(image=IMAGE,object='cow')
IMAGE0=CROP_BEHIND(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='fence')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 78%|███████▊  | 390/501 [1:04:27<09:13,  4.98s/it]

ground truth: no, predictions: ['no']
total: 390, correct: {1: 229, 3: 255, 5: 256}
BOX0=LOC(image=IMAGE,object='orange vegetable')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Does the orange vegetable look rotten?')
FINAL_RESULT=RESULT(var=ANSWER0)


 78%|███████▊  | 391/501 [1:04:33<09:50,  5.37s/it]

ground truth: yes, predictions: ['yes', 'no']
total: 391, correct: {1: 230, 3: 256, 5: 257}
BOX0=LOC(image=IMAGE,object='bacon')
BOX1=LOC(image=IMAGE,object='rice')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 78%|███████▊  | 392/501 [1:04:51<16:40,  9.17s/it]

ground truth: yes, predictions: ['yes']
total: 392, correct: {1: 231, 3: 257, 5: 258}
BOX0=LOC(image=IMAGE,object='RIGHT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='snowboard')
IMAGE1=CROP_LEFTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='men')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 78%|███████▊  | 393/501 [1:04:52<11:57,  6.65s/it]

ground truth: field, predictions: ['baseball field']
total: 393, correct: {1: 232, 3: 258, 5: 259}
ANSWER0=VQA(image=IMAGE,question='What place is this?')
FINAL_RESULT=RESULT(var=ANSWER0)


 79%|███████▊  | 394/501 [1:04:58<11:36,  6.51s/it]

ground truth: yes, predictions: ['no']
total: 394, correct: {1: 232, 3: 258, 5: 259}
BOX0=LOC(image=IMAGE,object='sky')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the sky?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'blue' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 79%|███████▉  | 395/501 [1:05:04<10:54,  6.17s/it]

ground truth: no, predictions: ['no']
total: 395, correct: {1: 233, 3: 259, 5: 260}
BOX0=LOC(image=IMAGE,object='spoon')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='chair')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 79%|███████▉  | 396/501 [1:05:21<16:48,  9.61s/it]

ground truth: yes, predictions: ['yes']
total: 396, correct: {1: 234, 3: 260, 5: 261}
BOX0=LOC(image=IMAGE,object='van')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='bus')
IMAGE1=CROP_RIGHTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='mirror')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 79%|███████▉  | 397/501 [1:05:24<12:52,  7.43s/it]

ground truth: no, predictions: ['no']
total: 397, correct: {1: 235, 3: 261, 5: 262}
BOX0=LOC(image=IMAGE,object='fire extinguisher')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 79%|███████▉  | 398/501 [1:05:28<11:12,  6.53s/it]

ground truth: brown, predictions: ['gray']
total: 398, correct: {1: 235, 3: 261, 5: 262}
BOX0=LOC(image=IMAGE,object='skirt')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the skirt?')
FINAL_RESULT=RESULT(var=ANSWER0)


 80%|███████▉  | 399/501 [1:05:31<08:58,  5.28s/it]

ground truth: no, predictions: ['no']
total: 399, correct: {1: 236, 3: 262, 5: 263}
BOX0=LOC(image=IMAGE,object='glasses')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 80%|███████▉  | 400/501 [1:05:36<08:56,  5.31s/it]

ground truth: yes, predictions: ['no']
total: 400, correct: {1: 236, 3: 262, 5: 263}
BOX0=LOC(image=IMAGE,object='pillow')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE,object='soccer ball')
IMAGE1=CROP(image=IMAGE,box=BOX1)
ANSWER0=VQA(image=IMAGE0,question='What color is the pillow?')
ANSWER1=VQA(image=IMAGE1,question='What color is the soccer ball?')
ANSWER2=EVAL(expr="'yes' if {ANSWER0} == 'white' or {ANSWER1} == 'white' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 80%|████████  | 401/501 [1:06:57<46:37, 27.98s/it]

ground truth: no, predictions: ['no', 'yes']
total: 401, correct: {1: 237, 3: 263, 5: 264}
BOX0=LOC(image=IMAGE,object='LEFT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='glasses')
IMAGE1=CROP_RIGHTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='plate')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 80%|████████  | 402/501 [1:06:59<33:30, 20.31s/it]

ground truth: yes, predictions: ['no']
total: 402, correct: {1: 237, 3: 263, 5: 264}
BOX0=LOC(image=IMAGE,object='table')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the tablecloth?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'purple' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 80%|████████  | 403/501 [1:07:02<24:25, 14.96s/it]

ground truth: no, predictions: ['no']
total: 403, correct: {1: 238, 3: 264, 5: 265}
BOX0=LOC(image=IMAGE,object='empty bottle')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='monitors')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 81%|████████  | 404/501 [1:07:15<23:10, 14.34s/it]

ground truth: no, predictions: ['no']
total: 404, correct: {1: 239, 3: 265, 5: 266}
BOX0=LOC(image=IMAGE,object='toilet')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE,object='bath tub')
IMAGE1=CROP(image=IMAGE,box=BOX1)
ANSWER0=VQA(image=IMAGE0,question='What color is the toilet?')
ANSWER1=VQA(image=IMAGE1,question='What color is the bath tub?')
ANSWER2=EVAL(expr="'yes' if {ANSWER0} != {ANSWER1} else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 81%|████████  | 405/501 [1:07:23<19:58, 12.48s/it]

ground truth: yes, predictions: ['no']
total: 405, correct: {1: 239, 3: 265, 5: 266}
BOX0=LOC(image=IMAGE,object='car')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the car?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'gray' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 81%|████████  | 406/501 [1:07:25<15:01,  9.49s/it]

ground truth: no, predictions: ['no']
total: 406, correct: {1: 240, 3: 266, 5: 267}
BOX0=LOC(image=IMAGE,object='LEFT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='trailer')
IMAGE1=CROP_LEFTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='men')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 81%|████████  | 407/501 [1:07:28<11:35,  7.39s/it]

ground truth: no, predictions: ['no', 'yes']
total: 407, correct: {1: 241, 3: 267, 5: 268}
BOX0=LOC(image=IMAGE,object='almonds')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 81%|████████▏ | 408/501 [1:07:30<09:08,  5.90s/it]

ground truth: white, predictions: ['no']
total: 408, correct: {1: 241, 3: 267, 5: 268}
BOX0=LOC(image=IMAGE,object='kitten')
IMAGE0=CROP_NEXTTO(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='remote control')
ANSWER0=VQA(image=IMAGE0,question='What color is the remote control?')
FINAL_RESULT=RESULT(var=ANSWER0)


 82%|████████▏ | 409/501 [1:07:34<08:17,  5.40s/it]

ground truth: yes, predictions: ['yes']
total: 409, correct: {1: 242, 3: 268, 5: 269}
BOX0=LOC(image=IMAGE,object='round pizza')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Does the pizza look fresh?')
FINAL_RESULT=RESULT(var=ANSWER0)


 82%|████████▏ | 410/501 [1:07:40<08:05,  5.34s/it]

ground truth: no, predictions: ['yes', 'no']
total: 410, correct: {1: 242, 3: 269, 5: 270}
BOX0=LOC(image=IMAGE,object='glasses')
BOX1=LOC(image=IMAGE,object='fence')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 82%|████████▏ | 411/501 [1:07:44<07:46,  5.18s/it]

ground truth: no, predictions: ['no']
total: 411, correct: {1: 243, 3: 270, 5: 271}
BOX0=LOC(image=IMAGE,object='shelves')
BOX1=LOC(image=IMAGE,object='mirrors')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 82%|████████▏ | 412/501 [1:07:45<05:47,  3.90s/it]

ground truth: boy, predictions: ['boy']
total: 412, correct: {1: 244, 3: 271, 5: 272}
ANSWER0=VQA(image=IMAGE,question='Who eats the pizza?')
FINAL_RESULT=RESULT(var=ANSWER0)


 82%|████████▏ | 413/501 [1:07:53<07:18,  4.98s/it]

ground truth: no, predictions: ['no']
total: 413, correct: {1: 245, 3: 272, 5: 273}
BOX0=LOC(image=IMAGE,object='man shorts')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color are the man shorts?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'black' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 83%|████████▎ | 414/501 [1:08:04<09:43,  6.71s/it]

ground truth: yes, predictions: ['no']
total: 414, correct: {1: 245, 3: 272, 5: 273}
BOX0=LOC(image=IMAGE,object='TOP')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='red car')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 83%|████████▎ | 415/501 [1:08:11<09:44,  6.80s/it]

ground truth: yes, predictions: ['yes', 'no']
total: 415, correct: {1: 246, 3: 273, 5: 274}
BOX0=LOC(image=IMAGE,object='brown animal')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='man')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 83%|████████▎ | 416/501 [1:08:13<07:56,  5.60s/it]

ground truth: no, predictions: ['no']
total: 416, correct: {1: 247, 3: 274, 5: 275}
BOX0=LOC(image=IMAGE,object='glasses')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Are the glasses made out of wire?')
FINAL_RESULT=RESULT(var=ANSWER0)


 83%|████████▎ | 417/501 [1:08:27<11:03,  7.90s/it]

ground truth: no, predictions: ['no']
total: 417, correct: {1: 248, 3: 275, 5: 276}
BOX0=LOC(image=IMAGE,object='man')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='tshirt')
IMAGE1=CROP_RIGHTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='chair')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 83%|████████▎ | 418/501 [1:08:31<09:35,  6.94s/it]

ground truth: no, predictions: ['no']
total: 418, correct: {1: 249, 3: 276, 5: 277}
BOX0=LOC(image=IMAGE,object='cat')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Is the cat large?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'no' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 84%|████████▎ | 419/501 [1:08:47<13:09,  9.63s/it]

ground truth: right, predictions: ['left']
total: 419, correct: {1: 249, 3: 276, 5: 277}
BOX0=LOC(image=IMAGE,object='woman wearing apron')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='spoon')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'right' if {ANSWER0} > 0 else 'left'")
FINAL_RESULT=RESULT(var=ANSWER1)


 84%|████████▍ | 420/501 [1:08:50<10:19,  7.65s/it]

ground truth: man, predictions: ['girl']
total: 420, correct: {1: 249, 3: 276, 5: 277}
BOX0=LOC(image=IMAGE,object='hat')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Who is wearing the hat?')
FINAL_RESULT=RESULT(var=ANSWER0)


 84%|████████▍ | 421/501 [1:08:53<08:08,  6.10s/it]

ground truth: no, predictions: ['no']
total: 421, correct: {1: 250, 3: 277, 5: 278}
BOX0=LOC(image=IMAGE,object='tools')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='man')
ANSWER0=VQA(image=IMAGE0,question='Does the man appear to be walking?')
FINAL_RESULT=RESULT(var=ANSWER0)


 84%|████████▍ | 422/501 [1:08:55<06:37,  5.03s/it]

ground truth: no, predictions: ['no']
total: 422, correct: {1: 251, 3: 278, 5: 279}
BOX0=LOC(image=IMAGE,object='teddy bear')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the teddy bear?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'white' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 84%|████████▍ | 423/501 [1:08:59<06:04,  4.67s/it]

ground truth: cabinet, predictions: ['no']
total: 423, correct: {1: 251, 3: 278, 5: 279}
BOX0=LOC(image=IMAGE,object='table')
IMAGE0=CROP_BEHIND(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='cheese cube')
IMAGE1=CROP_ABOVE(image=IMAGE0,box=BOX1)
ANSWER0=VQA(image=IMAGE1,question='What is the piece of furniture?')
FINAL_RESULT=RESULT(var=ANSWER0)


 85%|████████▍ | 424/501 [1:09:04<06:07,  4.77s/it]

ground truth: yes, predictions: ['no']
total: 424, correct: {1: 251, 3: 278, 5: 279}
BOX0=LOC(image=IMAGE,object='RIGHT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='utensils')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 85%|████████▍ | 425/501 [1:09:23<11:35,  9.15s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 425, correct: {1: 251, 3: 279, 5: 280}
BOX0=LOC(image=IMAGE,object='dog')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='boy')
ANSWER0=VQA(image=IMAGE0,question='Is the boy standing?')
FINAL_RESULT=RESULT(var=ANSWER0)


 85%|████████▌ | 426/501 [1:09:30<10:22,  8.30s/it]

ground truth: left, predictions: ['left']
total: 426, correct: {1: 252, 3: 280, 5: 281}
BOX0=LOC(image=IMAGE,object='RIGHT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='pot')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'right' if {ANSWER0} > 0 else 'left'")
FINAL_RESULT=RESULT(var=ANSWER1)


 85%|████████▌ | 427/501 [1:09:31<07:27,  6.05s/it]

ground truth: no, predictions: ['no']
total: 427, correct: {1: 253, 3: 281, 5: 282}
ANSWER0=VQA(image=IMAGE,question='Is it indoors?')
FINAL_RESULT=RESULT(var=ANSWER0)


 85%|████████▌ | 428/501 [1:09:35<06:54,  5.68s/it]

ground truth: no, predictions: ['no']
total: 428, correct: {1: 254, 3: 282, 5: 283}
BOX0=LOC(image=IMAGE,object='car')
BOX1=LOC(image=IMAGE,object='window')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} > 0 and {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 86%|████████▌ | 429/501 [1:09:38<05:43,  4.77s/it]

ground truth: yes, predictions: ['no']
total: 429, correct: {1: 254, 3: 282, 5: 283}
BOX0=LOC(image=IMAGE,object='pepper')
IMAGE0=CROP_NEXTTO(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='vegetable')
IMAGE1=CROP_LEFTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='carrots')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 86%|████████▌ | 430/501 [1:09:44<06:06,  5.16s/it]

ground truth: yes, predictions: ['yes', 'no']
total: 430, correct: {1: 255, 3: 283, 5: 284}
BOX0=LOC(image=IMAGE,object='dumpsters')
BOX1=LOC(image=IMAGE,object='umbrellas')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 86%|████████▌ | 431/501 [1:09:49<05:50,  5.01s/it]

ground truth: yes, predictions: ['no']
total: 431, correct: {1: 255, 3: 283, 5: 284}
BOX0=LOC(image=IMAGE,object='egg')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='green asparagus')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 86%|████████▌ | 432/501 [1:09:55<06:04,  5.28s/it]

ground truth: no, predictions: ['no']
total: 432, correct: {1: 256, 3: 284, 5: 285}
BOX0=LOC(image=IMAGE,object='umpire')
BOX1=LOC(image=IMAGE,object='catcher')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 86%|████████▋ | 433/501 [1:10:08<08:49,  7.79s/it]

ground truth: right, predictions: ['right', 'left']
total: 433, correct: {1: 257, 3: 285, 5: 286}
BOX0=LOC(image=IMAGE,object='man')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='tennis racket')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'right' if {ANSWER0} > 0 else 'left'")
FINAL_RESULT=RESULT(var=ANSWER1)


 87%|████████▋ | 434/501 [1:10:14<08:07,  7.28s/it]

ground truth: yes, predictions: ['no']
total: 434, correct: {1: 257, 3: 285, 5: 286}
BOX0=LOC(image=IMAGE,object='window')
BOX1=LOC(image=IMAGE,object='door')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 87%|████████▋ | 435/501 [1:10:30<10:41,  9.72s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 435, correct: {1: 257, 3: 286, 5: 287}
BOX0=LOC(image=IMAGE,object='street')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='car')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 87%|████████▋ | 436/501 [1:10:36<09:17,  8.57s/it]

ground truth: large, predictions: ['small']
total: 436, correct: {1: 257, 3: 286, 5: 287}
BOX0=LOC(image=IMAGE,object='plate')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Do you think the plate is large or small?')
FINAL_RESULT=RESULT(var=ANSWER0)


 87%|████████▋ | 437/501 [1:10:40<07:51,  7.37s/it]

ground truth: yes, predictions: ['no']
total: 437, correct: {1: 257, 3: 286, 5: 287}
BOX0=LOC(image=IMAGE,object='truck')
ANSWER0=VQA(image=IMAGE,question='What color are the trucks?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} != 'orange' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 87%|████████▋ | 438/501 [1:10:45<06:54,  6.59s/it]

ground truth: no, predictions: ['no']
total: 438, correct: {1: 258, 3: 287, 5: 288}
BOX0=LOC(image=IMAGE,object='chess pieces')
BOX1=LOC(image=IMAGE,object='placemats')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 88%|████████▊ | 439/501 [1:10:48<05:38,  5.46s/it]

ground truth: no, predictions: ['no']
total: 439, correct: {1: 259, 3: 288, 5: 289}
BOX0=LOC(image=IMAGE,object='fork')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Does this fork look clean?')
FINAL_RESULT=RESULT(var=ANSWER0)


 88%|████████▊ | 440/501 [1:10:51<04:53,  4.81s/it]

ground truth: yes, predictions: ['no']
total: 440, correct: {1: 259, 3: 288, 5: 289}
BOX0=LOC(image=IMAGE,object='surfboard')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the surfboard?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'white' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 88%|████████▊ | 441/501 [1:10:52<03:44,  3.73s/it]

ground truth: dresser, predictions: ['dresser']
total: 441, correct: {1: 260, 3: 289, 5: 290}
ANSWER0=VQA(image=IMAGE,question='Which kind of furniture is it?')
FINAL_RESULT=RESULT(var=ANSWER0)


 88%|████████▊ | 442/501 [1:10:58<04:06,  4.18s/it]

ground truth: no, predictions: ['yes', 'no']
total: 442, correct: {1: 260, 3: 290, 5: 291}
BOX0=LOC(image=IMAGE,object='pliers')
BOX1=LOC(image=IMAGE,object='car')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 88%|████████▊ | 443/501 [1:11:11<06:34,  6.80s/it]

ground truth: yes, predictions: ['no']
total: 443, correct: {1: 260, 3: 290, 5: 291}
BOX0=LOC(image=IMAGE,object='person')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='clothes')
IMAGE1=CROP_RIGHTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='surfboard')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 89%|████████▊ | 444/501 [1:11:13<05:13,  5.50s/it]

ground truth: no, predictions: ['no']
total: 444, correct: {1: 261, 3: 291, 5: 292}
BOX0=LOC(image=IMAGE,object='fence')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 89%|████████▉ | 445/501 [1:11:15<04:15,  4.57s/it]

ground truth: no, predictions: ['no']
total: 445, correct: {1: 262, 3: 292, 5: 293}
BOX0=LOC(image=IMAGE,object='DOWN')
IMAGE0=CROP_NOT(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='surfboard')
IMAGE1=CROP(image=IMAGE0,box=BOX1)
ANSWER0=VQA(image=IMAGE1,question='Is the surfboard short?')
ANSWER1=VQA(image=IMAGE1,question='What color is the surfboard?')
ANSWER2=EVAL(expr="'yes' if {ANSWER0} == 'yes' and {ANSWER1} == 'white' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 89%|████████▉ | 446/501 [1:11:18<03:37,  3.95s/it]

ground truth: no, predictions: ['no']
total: 446, correct: {1: 263, 3: 293, 5: 294}
BOX0=LOC(image=IMAGE,object='sand')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the sand?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} != 'tan' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 89%|████████▉ | 447/501 [1:11:22<03:41,  4.11s/it]

ground truth: yes, predictions: ['no']
total: 447, correct: {1: 263, 3: 293, 5: 294}
BOX0=LOC(image=IMAGE,object='pants')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color are the pants?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'beige' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 89%|████████▉ | 448/501 [1:11:25<03:11,  3.62s/it]

ground truth: no, predictions: ['no']
total: 448, correct: {1: 264, 3: 294, 5: 295}
BOX0=LOC(image=IMAGE,object='can')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the can?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'silver' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 90%|████████▉ | 449/501 [1:11:30<03:24,  3.94s/it]

ground truth: no, predictions: ['yes', 'no']
total: 449, correct: {1: 264, 3: 295, 5: 296}
BOX0=LOC(image=IMAGE,object='helmet')
BOX1=LOC(image=IMAGE,object='fence')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 90%|████████▉ | 450/501 [1:11:32<02:58,  3.49s/it]

ground truth: small, predictions: ['no']
total: 450, correct: {1: 264, 3: 295, 5: 296}
BOX0=LOC(image=IMAGE,object='apple')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='orange')
ANSWER0=VQA(image=IMAGE0,question='Is the orange small or large?')
FINAL_RESULT=RESULT(var=ANSWER0)


 90%|█████████ | 451/501 [1:14:23<44:52, 53.85s/it]

ground truth: no, predictions: ['yes']
total: 451, correct: {1: 264, 3: 295, 5: 296}
BOX0=LOC(image=IMAGE,object='RIGHT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='man')
IMAGE1=CROP_RIGHTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='car')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 90%|█████████ | 452/501 [1:14:25<31:07, 38.12s/it]

ground truth: no, predictions: ['no']
total: 452, correct: {1: 265, 3: 296, 5: 297}
ANSWER0=VQA(image=IMAGE,question='Is the door open?')
FINAL_RESULT=RESULT(var=ANSWER0)


 90%|█████████ | 453/501 [1:14:54<28:24, 35.52s/it]

ground truth: red, predictions: ['red white and blue', 'yellow', 'red', 'orange', 'blue']
total: 453, correct: {1: 266, 3: 297, 5: 298}
BOX0=LOC(image=IMAGE,object='American flag')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='American flag')
IMAGE1=CROP(image=IMAGE0,box=BOX1)
ANSWER0=VQA(image=IMAGE1,question='What color is the American flag?')
FINAL_RESULT=RESULT(var=ANSWER0)


 91%|█████████ | 454/501 [1:15:12<23:37, 30.16s/it]

ground truth: no, predictions: ['yes']
total: 454, correct: {1: 266, 3: 297, 5: 298}
BOX0=LOC(image=IMAGE,object='dugout')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='player')
IMAGE1=CROP_LEFTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='balls')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 91%|█████████ | 455/501 [1:15:14<16:44, 21.85s/it]

ground truth: no, predictions: ['no']
total: 455, correct: {1: 267, 3: 298, 5: 299}
BOX0=LOC(image=IMAGE,object='speaker')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='Wii')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 91%|█████████ | 456/501 [1:15:25<13:56, 18.59s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 456, correct: {1: 267, 3: 299, 5: 300}
BOX0=LOC(image=IMAGE,object='man')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='carriage')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 91%|█████████ | 457/501 [1:16:15<20:29, 27.95s/it]

ground truth: yes, predictions: ['yes', 'no']
total: 457, correct: {1: 268, 3: 300, 5: 301}
BOX0=LOC(image=IMAGE,object='ground')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='kite')
IMAGE1=CROP(image=IMAGE0,box=BOX1)
ANSWER0=VQA(image=IMAGE1,question='Does the kite look colorful?')
FINAL_RESULT=RESULT(var=ANSWER0)


 91%|█████████▏| 458/501 [1:16:20<15:00, 20.95s/it]

ground truth: no, predictions: ['no']
total: 458, correct: {1: 269, 3: 301, 5: 302}
BOX0=LOC(image=IMAGE,object='monkey')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the monkey?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} != 'purple' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 92%|█████████▏| 459/501 [1:16:36<13:38, 19.49s/it]

ground truth: no, predictions: ['no', 'yes']
total: 459, correct: {1: 270, 3: 302, 5: 303}
BOX0=LOC(image=IMAGE,object='doughnuts')
IMAGE0=CROP_ABOVE(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='can')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 92%|█████████▏| 460/501 [1:16:41<10:17, 15.05s/it]

ground truth: no, predictions: ['no', 'yes']
total: 460, correct: {1: 271, 3: 303, 5: 304}
BOX0=LOC(image=IMAGE,object='chair')
BOX1=LOC(image=IMAGE,object='shelf')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 92%|█████████▏| 461/501 [1:16:45<07:58, 11.96s/it]

ground truth: yes, predictions: ['no']
total: 461, correct: {1: 271, 3: 303, 5: 304}
BOX0=LOC(image=IMAGE,object='clock')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='woman')
IMAGE1=CROP_BEHIND(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='man')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 92%|█████████▏| 462/501 [1:16:51<06:37, 10.20s/it]

ground truth: no, predictions: ['no']
total: 462, correct: {1: 272, 3: 304, 5: 305}
BOX0=LOC(image=IMAGE,object='bowl')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='chair')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 92%|█████████▏| 463/501 [1:16:55<05:15,  8.31s/it]

ground truth: no, predictions: ['no']
total: 463, correct: {1: 273, 3: 305, 5: 306}
BOX0=LOC(image=IMAGE,object='bus')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the bus?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'green' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 93%|█████████▎| 464/501 [1:16:59<04:19,  7.01s/it]

ground truth: no, predictions: ['no']
total: 464, correct: {1: 274, 3: 306, 5: 307}
BOX0=LOC(image=IMAGE,object='field')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the field?')
ANSWER1=VQA(image=IMAGE0,question='Is the field lush?')
ANSWER2=EVAL(expr="'yes' if {ANSWER0} == 'brown' and {ANSWER1} == 'yes' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 93%|█████████▎| 465/501 [1:17:29<08:14, 13.75s/it]

ground truth: no, predictions: ['no', 'yes']
total: 465, correct: {1: 275, 3: 307, 5: 308}
BOX0=LOC(image=IMAGE,object='man')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='woman')
BOX2=LOC(image=IMAGE0,object='glasses')
ANSWER0=COUNT(box=BOX1)
ANSWER1=COUNT(box=BOX2)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} > 0 and {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 93%|█████████▎| 466/501 [1:17:32<06:11, 10.62s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 466, correct: {1: 275, 3: 308, 5: 309}
BOX0=LOC(image=IMAGE,object='bus')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 93%|█████████▎| 467/501 [1:17:37<05:01,  8.87s/it]

ground truth: no, predictions: ['yes', 'no']
total: 467, correct: {1: 275, 3: 309, 5: 310}
BOX0=LOC(image=IMAGE,object='surfboard')
BOX1=LOC(image=IMAGE,object='tray')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 93%|█████████▎| 468/501 [1:17:39<03:48,  6.93s/it]

ground truth: giraffe, predictions: ['no']
total: 468, correct: {1: 275, 3: 309, 5: 310}
BOX0=LOC(image=IMAGE,object='stroller')
IMAGE0=CROP_BEHIND(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What animal is that?')
FINAL_RESULT=RESULT(var=ANSWER0)


 94%|█████████▎| 469/501 [1:17:42<03:02,  5.71s/it]

ground truth: left, predictions: ['no']
total: 469, correct: {1: 275, 3: 309, 5: 310}
BOX0=LOC(image=IMAGE,object='woman')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='On which side is the woman?')
FINAL_RESULT=RESULT(var=ANSWER0)


 94%|█████████▍| 470/501 [1:17:59<04:37,  8.95s/it]

ground truth: no, predictions: ['no', 'yes']
total: 470, correct: {1: 276, 3: 310, 5: 311}
BOX0=LOC(image=IMAGE,object='young person')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='man')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 94%|█████████▍| 471/501 [1:18:01<03:31,  7.04s/it]

ground truth: no, predictions: ['no']
total: 471, correct: {1: 277, 3: 311, 5: 312}
BOX0=LOC(image=IMAGE,object='cat')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 94%|█████████▍| 472/501 [1:18:06<03:04,  6.35s/it]

ground truth: no, predictions: ['no']
total: 472, correct: {1: 278, 3: 312, 5: 313}
BOX0=LOC(image=IMAGE,object='hand soap')
BOX1=LOC(image=IMAGE,object='garland')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 94%|█████████▍| 473/501 [1:18:07<02:11,  4.70s/it]

ground truth: table, predictions: ['table']
total: 473, correct: {1: 279, 3: 313, 5: 314}
ANSWER0=VQA(image=IMAGE,question='Is this a cabinet or a table?')
FINAL_RESULT=RESULT(var=ANSWER0)


 95%|█████████▍| 474/501 [1:18:15<02:37,  5.84s/it]

ground truth: right, predictions: ['no']
total: 474, correct: {1: 279, 3: 313, 5: 314}
BOX0=LOC(image=IMAGE,object='horse')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='eating')
IMAGE1=CROP_LEFTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='tan animal')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'left' if {ANSWER0} > 0 else 'right'")
FINAL_RESULT=RESULT(var=ANSWER1)


 95%|█████████▍| 475/501 [1:18:20<02:24,  5.54s/it]

ground truth: no, predictions: ['no']
total: 475, correct: {1: 280, 3: 314, 5: 315}
BOX0=LOC(image=IMAGE,object='snowboard')
BOX1=LOC(image=IMAGE,object='mirror')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 95%|█████████▌| 476/501 [1:18:27<02:29,  5.96s/it]

ground truth: black, predictions: ['black']
total: 476, correct: {1: 281, 3: 315, 5: 316}
BOX0=LOC(image=IMAGE,object='pillow')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='device')
IMAGE1=CROP(image=IMAGE0,box=BOX1)
ANSWER0=VQA(image=IMAGE1,question='What color is the device?')
FINAL_RESULT=RESULT(var=ANSWER0)


 95%|█████████▌| 477/501 [1:18:40<03:15,  8.16s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 477, correct: {1: 281, 3: 316, 5: 317}
BOX0=LOC(image=IMAGE,object='food')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='man')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 95%|█████████▌| 478/501 [1:20:01<11:27, 29.89s/it]

ground truth: no, predictions: ['no']
total: 478, correct: {1: 282, 3: 317, 5: 318}
BOX0=LOC(image=IMAGE,object='RIGHT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='shelf')
IMAGE1=CROP(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='bottle')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 96%|█████████▌| 479/501 [1:20:10<08:37, 23.51s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 479, correct: {1: 282, 3: 318, 5: 319}
BOX0=LOC(image=IMAGE,object='cat')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='pillow')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 96%|█████████▌| 480/501 [1:20:14<06:16, 17.93s/it]

ground truth: yes, predictions: ['yes', 'no']
total: 480, correct: {1: 283, 3: 319, 5: 320}
BOX0=LOC(image=IMAGE,object='bottle')
BOX1=LOC(image=IMAGE,object='fork')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 96%|█████████▌| 481/501 [1:20:17<04:25, 13.29s/it]

ground truth: yes, predictions: ['no']
total: 481, correct: {1: 283, 3: 319, 5: 320}
BOX0=LOC(image=IMAGE,object='LEFT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='scooter')
IMAGE1=CROP_RIGHTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='skateboard')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 96%|█████████▌| 482/501 [1:20:20<03:15, 10.31s/it]

ground truth: yes, predictions: ['no']
total: 482, correct: {1: 283, 3: 319, 5: 320}
BOX0=LOC(image=IMAGE,object='pants')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color are the pants?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'light brown' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 96%|█████████▋| 483/501 [1:20:23<02:26,  8.12s/it]

ground truth: no, predictions: ['yes', 'no']
total: 483, correct: {1: 283, 3: 320, 5: 321}
BOX0=LOC(image=IMAGE,object='bike')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 97%|█████████▋| 484/501 [1:21:11<05:38, 19.89s/it]

ground truth: sign, predictions: ['bike', 'bag', 'sign', 'nothing']
total: 484, correct: {1: 283, 3: 321, 5: 322}
BOX0=LOC(image=IMAGE,object='trash bin')
IMAGE0=CROP_LEFTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='bicycle')
IMAGE1=CROP_LEFTOF(image=IMAGE0,box=BOX1)
ANSWER0=VQA(image=IMAGE1,question='What is leaning against the bicycle?')
FINAL_RESULT=RESULT(var=ANSWER0)


 97%|█████████▋| 485/501 [1:21:13<03:54, 14.66s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 485, correct: {1: 283, 3: 322, 5: 323}
BOX0=LOC(image=IMAGE,object='chair')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 97%|█████████▋| 486/501 [1:21:22<03:12, 12.85s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 486, correct: {1: 283, 3: 323, 5: 324}
BOX0=LOC(image=IMAGE,object='microwave')
IMAGE0=CROP_ABOVE(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='cabinet')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 97%|█████████▋| 487/501 [1:21:35<03:01, 12.95s/it]

ground truth: yes, predictions: ['no']
total: 487, correct: {1: 283, 3: 323, 5: 324}
BOX0=LOC(image=IMAGE,object='tree')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE,object='truck')
IMAGE1=CROP(image=IMAGE,box=BOX1)
ANSWER0=VQA(image=IMAGE0,question='What color is the tree?')
ANSWER1=VQA(image=IMAGE1,question='What color is the truck?')
ANSWER2=EVAL(expr="'yes' if {ANSWER0} != {ANSWER1} else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 97%|█████████▋| 488/501 [1:21:39<02:13, 10.27s/it]

ground truth: no, predictions: ['yes']
total: 488, correct: {1: 283, 3: 323, 5: 324}
BOX0=LOC(image=IMAGE,object='steak')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='Does the steak look large and brown?')
FINAL_RESULT=RESULT(var=ANSWER0)


 98%|█████████▊| 489/501 [1:21:55<02:23, 11.93s/it]

ground truth: no, predictions: ['no', 'yes']
total: 489, correct: {1: 284, 3: 324, 5: 325}
BOX0=LOC(image=IMAGE,object='man')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='microphone')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 98%|█████████▊| 490/501 [1:21:57<01:40,  9.12s/it]

ground truth: no, predictions: ['no', 'yes']
total: 490, correct: {1: 285, 3: 325, 5: 326}
BOX0=LOC(image=IMAGE,object='bag')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 98%|█████████▊| 491/501 [1:22:01<01:13,  7.37s/it]

ground truth: right, predictions: ['no']
total: 491, correct: {1: 285, 3: 325, 5: 326}
BOX0=LOC(image=IMAGE,object='LEFT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='screen')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'left' if {ANSWER0} > 0 else 'right'")
FINAL_RESULT=RESULT(var=ANSWER1)


 98%|█████████▊| 492/501 [1:22:07<01:03,  7.01s/it]

ground truth: yes, predictions: ['no']
total: 492, correct: {1: 285, 3: 325, 5: 326}
BOX0=LOC(image=IMAGE,object='house')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the house?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} == 'brown' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 98%|█████████▊| 493/501 [1:22:09<00:45,  5.65s/it]

ground truth: no, predictions: ['no']
total: 493, correct: {1: 286, 3: 326, 5: 327}
BOX0=LOC(image=IMAGE,object='bicycle')
ANSWER0=COUNT(box=BOX0)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 99%|█████████▊| 494/501 [1:22:12<00:34,  4.93s/it]

ground truth: yes, predictions: ['no']
total: 494, correct: {1: 286, 3: 326, 5: 327}
BOX0=LOC(image=IMAGE,object='dog')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What color is the dog?')
ANSWER1=EVAL(expr="'yes' if {ANSWER0} != 'brown' else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 99%|█████████▉| 495/501 [1:22:19<00:31,  5.33s/it]

ground truth: no, predictions: ['no', 'yes']
total: 495, correct: {1: 287, 3: 327, 5: 328}
BOX0=LOC(image=IMAGE,object='white desk')
BOX1=LOC(image=IMAGE,object='white chair')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="'yes' if {ANSWER0} + {ANSWER1} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER2)


 99%|█████████▉| 496/501 [1:22:29<00:34,  6.91s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 496, correct: {1: 287, 3: 328, 5: 329}
BOX0=LOC(image=IMAGE,object='table')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='bowls')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 99%|█████████▉| 497/501 [1:24:58<03:18, 49.53s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 497, correct: {1: 287, 3: 329, 5: 330}
BOX0=LOC(image=IMAGE,object='children')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='horse')
IMAGE1=CROP_RIGHTOF(image=IMAGE0,box=BOX1)
BOX2=LOC(image=IMAGE1,object='flag')
ANSWER0=COUNT(box=BOX2)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


 99%|█████████▉| 498/501 [1:25:14<01:57, 39.28s/it]

ground truth: no, predictions: ['no']
total: 498, correct: {1: 288, 3: 330, 5: 331}
BOX0=LOC(image=IMAGE,object='white vehicle')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='bags')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


100%|█████████▉| 499/501 [1:25:30<01:04, 32.35s/it]

ground truth: yes, predictions: ['yes']
total: 499, correct: {1: 289, 3: 331, 5: 332}
BOX0=LOC(image=IMAGE,object='hydrant')
IMAGE0=CROP_RIGHTOF(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='person')
IMAGE1=CROP(image=IMAGE0,box=BOX1)
ANSWER0=VQA(image=IMAGE1,question='Is the person wearing a jacket?')
FINAL_RESULT=RESULT(var=ANSWER0)


100%|█████████▉| 500/501 [1:25:45<00:27, 27.18s/it]

ground truth: yes, predictions: ['no', 'yes']
total: 500, correct: {1: 289, 3: 332, 5: 333}
BOX0=LOC(image=IMAGE,object='LEFT')
IMAGE0=CROP(image=IMAGE,box=BOX0)
BOX1=LOC(image=IMAGE0,object='picture')
ANSWER0=COUNT(box=BOX1)
ANSWER1=EVAL(expr="'yes' if {ANSWER0} > 0 else 'no'")
FINAL_RESULT=RESULT(var=ANSWER1)


100%|██████████| 501/501 [1:25:50<00:00, 10.28s/it]

ground truth: table, predictions: ['table']
total: 501, correct: {1: 290, 3: 333, 5: 334}
BOX0=LOC(image=IMAGE,object='furniture')
IMAGE0=CROP(image=IMAGE,box=BOX0)
ANSWER0=VQA(image=IMAGE0,question='What is this item of furniture called?')
FINAL_RESULT=RESULT(var=ANSWER0)
501
{1: 290, 3: 333, 5: 334}
{'19220961': ['chair', 'bench'], '07267508': ['table', 'bottle'], '15512494': ['no', 'no'], '00464293': ['no', 'yes', 'yes'], '10233633': ['no', 'yes', 'yes'], '1679555': ['no', 'yes'], '13852280': ['no', 'yes', 'no'], '151004785': ['modern', 'antique'], '01248326': ['yes', 'no', 'no'], '12976953': ['no', 'yes', 'no'], '17908046': ['no', 'yes', 'no'], '05712037': ['no', 'yes'], '05673874': ['brown', 'black', 'brown'], '06407231': ['yes', 'no', 'yes'], '12863527': ['no', 'no'], '04587867': ['short', 'long'], '1923285': ['no', 'yes'], '07794099': ['no', 'blue'], '01776009': ['no', 'yes'], '04711033': ['no', 'no'], '14596273': ['yes', 'no', 'yes'], '14892279': ['no', 'no'], '13389301': ['no'




# Performance

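Here we report the score under the Recall@$k$ metric for $k = 1, 3, 5$: a sample counts as correct at $k$ when its ground-truth answer appears among the first $k$ predictions. The scores are computed over the 501 samples evaluated in the run above.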

In [10]:
# Print Recall@k: the fraction of evaluated samples counted as correct at each k
print("Recall@1:", correct[1] / total)
print("Recall@3:", correct[3] / total)
print("Recall@5:", correct[5] / total)

Recall@1: 0.5788423153692615
Recall@3: 0.6646706586826348
Recall@5: 0.6666666666666666
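
The evaluation loop that fills `correct` and `total` is not shown in this section, so the following is only a minimal sketch of how such a tally could be computed, assuming each prediction list is ordered best-first. The helper `recall_at_k` and the `samples` list are hypothetical names introduced for illustration; the four sample pairs are copied from log lines printed above.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(
    samples: Sequence[Tuple[str, List[str]]],
    ks: Sequence[int] = (1, 3, 5),
) -> Tuple[int, Dict[int, int]]:
    """Count, for each k, the samples whose ground truth is in the first k predictions.

    Hypothetical helper: assumes predictions are ranked best-first.
    """
    correct = {k: 0 for k in ks}
    total = 0
    for ground_truth, predictions in samples:
        total += 1
        for k in ks:
            # Slicing is safe even when fewer than k predictions exist.
            if ground_truth in predictions[:k]:
                correct[k] += 1
    return total, correct


# (ground truth, predictions) pairs taken from the log output above.
samples = [
    ("no", ["yes"]),
    ("no", ["no"]),
    ("yes", ["no", "yes"]),
    ("sign", ["bike", "bag", "sign", "nothing"]),
]

total, correct = recall_at_k(samples)
for k in sorted(correct):
    print(f"Recall@{k}:", correct[k] / total)
```

With this counting rule, a sample such as `ground truth: yes, predictions: ['no', 'yes']` is missed at $k = 1$ but counted at $k = 3$ and $k = 5$, which matches how the running `correct` counters change in the log.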
