# Filtering questions from VQA dataset
This notebook filters the questions of the VQA dataset down to those that belong to a specific domain. Filtering is done by searching each question for domain-related keywords, which are read from a text file.

## Variables
This section defines the variables used in this notebook. Adjust them for each environment so that the notebook can run.

In [0]:
#path where Google Drive is mounted in the notebook; ignore this variable if you don't mount Drive
drive_home = '/content/gdrive'

#path representing directory of train questions from VQA dataset 
train_questions_dir = '/content/gdrive/My Drive/Colab'
#path to the file containing the keywords to filter the questions
key_words_path = '/content/gdrive/My Drive/Colab/treated_indoors.txt'
#path where to save the filtered tuples (question id, question, image id) of train questions
train_save_path = '/content/gdrive/My Drive/Colab/results/VQA/train_VQA_filteredImages.txt'
train_annotation_save_path = '/content/gdrive/My Drive/Colab/results/VQA/annotations.json'
#train_answers_dir is the directory where to put train answers
train_answers_dir = '/content/gdrive/My Drive/Colab'
#path representing the place of questions from VQA dataset (copied from val_download_path and pasted on this path)
val_questions_dir = '/content/gdrive/My Drive/Colab'
#path where to save the filtered tuples (question id, question, image id) of val questions
val_save_path = '/content/gdrive/My Drive/Colab/results/VQA/val_VQA_filteredImages.txt'
val_annotation_save_path = '/content/gdrive/My Drive/Colab/results/VQA/val_annotations.json'
val_answers_dir =  '/content/gdrive/My Drive/Colab'
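
Since all of the paths above live under the same Drive folder, one way to make the notebook easier to move between environments is to derive them from a single base directory. The following is only a sketch: `build_paths` and its dictionary keys are hypothetical and are not used elsewhere in this notebook.

```python
import os

def build_paths(base_dir):
    """Derive the notebook's paths from one base directory (hypothetical helper)."""
    results_dir = os.path.join(base_dir, 'results', 'VQA')
    return {
        'train_questions_dir': base_dir,
        'val_questions_dir': base_dir,
        'key_words_path': os.path.join(base_dir, 'treated_indoors.txt'),
        'train_save_path': os.path.join(results_dir, 'train_VQA_filteredImages.txt'),
        'val_save_path': os.path.join(results_dir, 'val_VQA_filteredImages.txt'),
        'train_annotation_save_path': os.path.join(results_dir, 'annotations.json'),
        'val_annotation_save_path': os.path.join(results_dir, 'val_annotations.json'),
    }

# Only this one string would need to change in another environment
paths = build_paths('/content/gdrive/My Drive/Colab')
print(paths['key_words_path'])
```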


## Imports 

In [0]:
import os
#remove if you don't use drive
from google.colab import drive
import shutil
import json


## Mount drive (remove the line if you don't use Drive) and download train questions

### mount drive and import train questions

In [0]:
#remove if you don't use drive
drive.mount(drive_home)
#! wget -P '$train_questions_dir' https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip
os.chdir(train_questions_dir)

Mounted at /content/gdrive


### unzip questions

In [0]:
#!unzip -P '$train_questions_dir' v2_Questions_Train_mscoco.zip

## Load train questions
The questions file is a JSON file whose `questions` list holds, for each question, the question id, the question text and the id of the related image in the VQA dataset.

In [0]:
train_questions_path = os.path.join(train_questions_dir, 'v2_OpenEnded_mscoco_train2014_questions.json')
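#the with block closes the file automatically
with open(train_questions_path) as f:
  q = json.load(f)
#keep (question_id, question, image_id) for every question
questionsIds = [(question['question_id'], question['question'], question['image_id']) for question in q['questions']]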
questionsIds[:10]


[(458752000, 'What is this photo taken looking through?', 458752),
 (458752001, 'What position is this man playing?', 458752),
 (458752002, 'What color is the players shirt?', 458752),
 (458752003, 'Is this man a professional baseball player?', 458752),
 (262146000, 'What color is the snow?', 262146),
 (262146001, 'What is the person doing?', 262146),
 (262146002, 'What color is the persons headwear?', 262146),
 (524291000, "What is in the person's hand?", 524291),
 (524291001, 'Is the dog waiting?', 524291),
 (524291002, 'Is the dog looking at a tennis ball or frisbee?', 524291)]
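
The train and val questions are loaded in the same way, so the loading step could be wrapped in one function. The sketch below assumes only the three fields this notebook reads (`question_id`, `question`, `image_id`); `load_question_tuples` is a hypothetical name, and the sample file is written to a temporary directory just for the demonstration.

```python
import json
import os
import tempfile

def load_question_tuples(path):
    """Return (question_id, question, image_id) tuples from a VQA-style questions file."""
    with open(path) as f:
        data = json.load(f)
    return [(q['question_id'], q['question'], q['image_id']) for q in data['questions']]

# Tiny stand-in for the real questions file, using two entries shown above
sample = {'questions': [
    {'question_id': 458752000, 'question': 'What is this photo taken looking through?', 'image_id': 458752},
    {'question_id': 262146000, 'question': 'What color is the snow?', 'image_id': 262146},
]}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'questions.json')
    with open(sample_path, 'w') as f:
        json.dump(sample, f)
    tuples = load_question_tuples(sample_path)

print(tuples)
```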

In [0]:
questions = [x[1] for x in questionsIds]
questions[:10]

['What is this photo taken looking through?',
 'What position is this man playing?',
 'What color is the players shirt?',
 'Is this man a professional baseball player?',
 'What color is the snow?',
 'What is the person doing?',
 'What color is the persons headwear?',
 "What is in the person's hand?",
 'Is the dog waiting?',
 'Is the dog looking at a tennis ball or frisbee?']

## filter questions
Filter the questions by keywords: this step keeps only the questions containing at least one of the keywords.

### filterQuestions function to filter questions by keywords

In [0]:
# filter questions and keep the ones containing at least one keyword from key_words
# (substring match; the question text is compared as-is, without lowercasing)
def filterQuestions(questions, key_words):
  def contains(question, key_words):
    # any() stops at the first matching keyword and returns False for an empty list
    return any(key_word in question for key_word in key_words)

  return [x for x in questions if contains(x[1], key_words)]
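
Because the match above is a plain substring test on the unmodified question, a keyword such as `rug` would also match inside `drug`, and a capitalized `Sofa` would not match the lowercase keyword `sofa`. A possible alternative, shown here only as a sketch and not as the notebook's method, is a case-insensitive whole-word match with a regular expression. The function names below are hypothetical.

```python
import re

def build_keyword_pattern(key_words):
    # Longest keywords first so multi-word keywords are tried before their parts
    escaped = [re.escape(k) for k in sorted(key_words, key=len, reverse=True)]
    return re.compile(r'\b(?:' + '|'.join(escaped) + r')\b', re.IGNORECASE)

def filter_questions_whole_word(questions, key_words):
    if not key_words:
        # An empty alternation would match everywhere, so return early
        return []
    pattern = build_keyword_pattern(key_words)
    return [x for x in questions if pattern.search(x[1])]

sample = [
    (1, 'What color is the Sofa?', 10),
    (2, 'Is the drug store open?', 11),
    (3, 'Is there a rug?', 12),
]
matched = filter_questions_whole_word(sample, ['sofa', 'rug'])
print([x[0] for x in matched])
```

The trade-off is that whole-word matching no longer catches plurals such as `chairs` for the keyword `chair`, which the substring test does, so the keyword list would need to include them.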



### get the questions of images that have at least one filtered question
The goal is to catch every question related to indoor scenes. filterQuestions returns the questions that contain a keyword, so the images they belong to are (mostly) indoor images, and the other questions on those images, even without a keyword, are likely indoor questions too. Note that get_filtered_images, as written, compares the first element of each tuple (the question id) against the ids it receives.

In [0]:
def get_filtered_images(questions_ids, questions_image):
  #build a set once so that each membership test is O(1)
  ids = set(questions_ids)
  return [x for x in questions_image if x[0] in ids]
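
The function above tests `x[0]`, which is the question id. If the intent is to keep every question of an image that has at least one matching question, the test would need to use the image id (`x[2]`) instead. A minimal sketch of that reading, with the hypothetical name `expand_to_images`:

```python
def expand_to_images(matched_questions, all_questions):
    """Keep every question whose image has at least one matched question."""
    image_ids = {x[2] for x in matched_questions}
    return [x for x in all_questions if x[2] in image_ids]

all_q = [
    (1, 'Is the curtain patterned?', 100),
    (2, 'What is the man doing?', 100),   # no keyword, same image as question 1
    (3, 'What color is the snow?', 200),  # different image, no keyword
]
matched = [all_q[0]]
expanded = expand_to_images(matched, all_q)
print([x[0] for x in expanded])
```

Using this variant in place of get_filtered_images would change the counts reported later in the notebook.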


### treat keywords
Load the keywords and clean them: strip surrounding whitespace, convert to lowercase and drop empty lines.





In [0]:
#the with block closes the file automatically
with open(key_words_path, 'r') as f:
  key_words = f.readlines()
#strip whitespace, lowercase, and skip empty lines
treated_key_words = [x.strip().lower() for x in key_words if x.strip() != '']
  
treated_key_words

['teapot',
 'stools',
 'sofa blankets',
 'freezer',
 'salts',
 'sofa',
 'grinder',
 'spring-clip tin',
 'floor pillows',
 'springform pan',
 'chaise',
 'medicine cabinet',
 'refrigerator',
 'window curtains',
 'ottoman',
 'microwave',
 'oven glove',
 'dvd',
 'dispenser',
 'painting',
 'curtain',
 'soap',
 'mirror',
 'end table',
 'kitchen scales',
 'book ends',
 'chandelier',
 'flat iron',
 'table clock',
 'clips',
 'couch',
 'holder',
 'drapes',
 ',recliner',
 'toilet tank',
 'desk',
 'tissue box cover',
 'extension cords',
 'cooker',
 'am & fm receiver',
 'speakers',
 'hamper',
 'deep fryer',
 'dog bed',
 'photos',
 'toilet bowl',
 'bubbles',
 'cellphone',
 'toothbrush holder',
 'bath canister',
 'meat fork',
 'scissors',
 'cosmetic bags',
 'valance',
 'carpet',
 'air conditioner',
 'coffee machine',
 'window shutter',
 'surge protectors',
 'chair',
 'napkin',
 'oven',
 'fruit',
 'ladle',
 'basket',
 'ceiling fan',
 'toothbrush',
 'rug',
 'sieve',
 'coat rack',
 'brushes',
 'room div
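
The list above contains at least one entry with stray punctuation (`',recliner'`), which can never match a question as a substring. A slightly stricter cleaning step could also strip leading and trailing punctuation and drop duplicates. This is a sketch under that assumption; `clean_keywords` is a hypothetical helper, not part of the notebook.

```python
import string

def clean_keywords(lines):
    """Strip whitespace and edge punctuation, lowercase, and drop empties and duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        word = line.strip().strip(string.punctuation).strip().lower()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned

print(clean_keywords([' Sofa\n', ',recliner\n', '\n', 'sofa\n', 'am & fm receiver\n']))
```

Punctuation inside a keyword, such as the `&` in `am & fm receiver`, is left untouched because only the edges are stripped.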

### filter questions 
Applying the filter to the questions yields a list of tuples (question_id, question, image_id) in which each question contains at least one of the keywords in key_words.

In [0]:
train_filtred_questions = filterQuestions(questionsIds, treated_key_words)
#unzip the tuples into question ids, question texts and image ids
(q_id, q_text, im) = zip(*train_filtred_questions)
train_filtred_questions = get_filtered_images(q_id, questionsIds)
train_filtred_questions

[(393223001, 'What color is the toothbrush?', 393223),
 (131074003, 'Is the curtain patterned?', 131074),
 (131074004, 'What is sitting on the bench?', 131074),
 (131075002, 'What room of the house is this?', 131075),
 (131075004, 'Is the room messy?', 131075),
 (131075005, 'Is this a TV screen?', 131075),
 (131075007,
  'What companion object to the TV can be seen in the bottom right of the picture?',
  131075),
 (131075011, 'Is there a laptop in the image?', 131075),
 (131075012, 'Is it a monitor or a screen projection?', 131075),
 (131075013, 'What is on the TV screen?', 131075),
 (262172001, 'What kind of room is this?', 262172),
 (524320002, "Do the gentleman's socks match his shoes and belt?", 524320),
 (393251001, 'Is the man wearing glasses?', 393251),
 (262180000, 'What is the fruit?', 262180),
 (262180002,
  'What is the name of the type of person that would make this food?',
  262180),
 (43697001, 'What color is the napkin?', 43697),
 (262184000, 'Is this a toy-sized truck?'

In [0]:
len(list(set([x[2] for x in train_filtred_questions])))

31563

### download annotations

In [0]:
!wget -P '$train_answers_dir' "https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip"
os.chdir(train_answers_dir)
!unzip v2_Annotations_Train_mscoco.zip

--2019-03-01 09:00:12--  https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.100.85
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.100.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21708861 (21M) [application/zip]
Saving to: ‘/content/gdrive/My Drive/Colab/v2_Annotations_Train_mscoco.zip’


2019-03-01 09:00:13 (17.7 MB/s) - ‘/content/gdrive/My Drive/Colab/v2_Annotations_Train_mscoco.zip’ saved [21708861/21708861]

Archive:  v2_Annotations_Train_mscoco.zip
  inflating: v2_mscoco_train2014_annotations.json  


In [0]:
train_answers_path = os.path.join(train_answers_dir, 'v2_mscoco_train2014_annotations.json')
with open(train_answers_path) as f :
  anno = json.load(f)





In [0]:
'''with open(train_save_path, 'r') as f :
  lines= [x.strip() for x in f.readlines() if x.strip() != '']
train_filtred_questions = []
for l in lines : 
  y = l.split(';')
  t = (y[0], y[1], y[2])
  train_filtred_questions.append(t)'''
  
annotations = [x for x in anno['annotations']]
fq_id, fq, f_im_id = zip(*train_filtred_questions)


In [0]:

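#compare ids as strings so the test works whether f_im_id holds ints or strings
f_im_ids = {str(i) for i in f_im_id}
filtered_annotaitons = [x for x in annotations if str(x['image_id']) in f_im_ids]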

{'question_type': 'is the', 'multiple_choice_answer': 'yes', 'answers': [{'answer': 'yes', 'answer_confidence': 'yes', 'answer_id': 1}, {'answer': 'no', 'answer_confidence': 'yes', 'answer_id': 2}, {'answer': 'yes', 'answer_confidence': 'yes', 'answer_id': 3}, {'answer': 'no', 'answer_confidence': 'maybe', 'answer_id': 4}, {'answer': 'yes', 'answer_confidence': 'yes', 'answer_id': 5}, {'answer': 'yes', 'answer_confidence': 'maybe', 'answer_id': 6}, {'answer': 'yes', 'answer_confidence': 'yes', 'answer_id': 7}, {'answer': 'yes', 'answer_confidence': 'maybe', 'answer_id': 8}, {'answer': 'yes', 'answer_confidence': 'yes', 'answer_id': 9}, {'answer': 'yes', 'answer_confidence': 'yes', 'answer_id': 10}], 'image_id': 393223, 'answer_type': 'yes/no', 'question_id': 393223000}
{'answer_type': 'other', 'multiple_choice_answer': 'white and purple', 'answers': [{'answer': 'white and purple', 'answer_confidence': 'yes', 'answer_id': 1}, {'answer': 'white', 'answer_confidence': 'yes', 'answer_id': 

In [0]:
with open(train_annotation_save_path, 'w') as f : 
  json.dump(filtered_annotaitons,f)
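
The image ids are integers when the tuples come straight from the questions file, but strings when they are reloaded from the saved text file, so the comparison has to normalize both sides. The sketch below wraps the annotation filter in a hypothetical `filter_annotations` helper that could serve both the train and val splits; the sample annotations carry only the fields needed for the demonstration.

```python
def filter_annotations(annotations, image_ids):
    """Keep annotations whose image_id appears in image_ids (ints or strings)."""
    wanted = {str(i) for i in image_ids}
    return [a for a in annotations if str(a['image_id']) in wanted]

sample_annotations = [
    {'question_id': 393223000, 'image_id': 393223, 'multiple_choice_answer': 'yes'},
    {'question_id': 458752000, 'image_id': 458752, 'multiple_choice_answer': 'net'},
]

from_ints = filter_annotations(sample_annotations, [393223])
from_strings = filter_annotations(sample_annotations, ['393223'])
print(from_ints == from_strings)
```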


## save tuples of filtered questions

Each line of the saved file has the format: question_id;question;image_id





In [0]:
def save(path, filtred_questions):
  #write one 'question_id;question;image_id' line per tuple, without a trailing newline
  with open(path, 'w') as f:
    f.write('\n'.join(str(x[0]) + ';' + str(x[1]) + ';' + str(x[2]) for x in filtred_questions))

save(train_save_path, train_filtred_questions)
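
A question that itself contains a `;` would produce a line with more than three fields in this format. One possible alternative, sketched here with Python's standard `csv` module and the hypothetical helpers `save_tuples` and `load_tuples`, quotes such fields on writing so they survive a round trip. It produces a slightly different file (fields containing the delimiter are wrapped in double quotes), so it is not a drop-in replacement for files already saved.

```python
import csv
import os
import tempfile

def save_tuples(path, tuples):
    # csv.writer quotes any field that contains the delimiter
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f, delimiter=';').writerows(tuples)

def load_tuples(path):
    # Every field is read back as a string
    with open(path, newline='', encoding='utf-8') as f:
        return [tuple(row) for row in csv.reader(f, delimiter=';') if row]

rows = [(1, 'Is it a chair; or a stool?', 10), (2, 'What color is the rug?', 11)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'filtered.txt')
    save_tuples(path, rows)
    loaded = load_tuples(path)

print(loaded)
```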

## Download val questions

### download the file

In [0]:
#!wget -P '$val_questions_dir' https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip 

### unzip the file and set the path to the val questions

In [0]:
os.chdir(val_questions_dir)
#!unzip v2_Questions_Val_mscoco.zip 
val_questions_path = os.path.join(val_questions_dir, 'v2_OpenEnded_mscoco_val2014_questions.json')

## Load val questions (tuples)

In [0]:
#the with block closes the file automatically
with open(val_questions_path) as f:
  q = json.load(f)
questionsIdsVal = [(question['question_id'], question['question'], question['image_id']) for question in q['questions']]

## Filter val questions

In [0]:
val_filtred_questions = filterQuestions(questionsIdsVal, treated_key_words)
(q_id, q_text, im) = zip(*val_filtred_questions)
val_filtred_questions = get_filtered_images(q_id, questionsIdsVal)
val_filtred_questions

[(262162000, 'Is that a folding chair?', 262162),
 (262162004, 'Are these twin mattresses?', 262162),
 (262162006, 'Is this room decorated for the 1970s?', 262162),
 (262162007, 'Are the lights on in this room?', 262162),
 (262162009, 'What is the chair made of?', 262162),
 (262162010, "Is this room in someone's home?", 262162),
 (262162011, 'Which room is this?', 262162),
 (262162013, 'Could this be a hotel room?', 262162),
 (262162016, 'Is there a mirror in the room?', 262162),
 (262162017, 'What kind of room is this?', 262162),
 (262162018, 'How many chairs are in the photo?', 262162),
 (262162019, 'Is the desk cluttered?', 262162),
 (262162022, 'How many frames are on the wall?', 262162),
 (262162024, 'Are there any boxes in the room?', 262162),
 (262162025, 'Could this be a multi-purpose room?', 262162),
 (262162026, 'What animal print does that chair resemble?', 262162),
 (262200002, 'Who is in front of the cake with candles?', 262200),
 (393277005, 'What time is it on the clock?

In [0]:
len(list(set([x[2] for x in train_filtred_questions])))

31563

In [0]:
len(val_filtred_questions)

34732

In [0]:
path = val_save_path
save(path, val_filtred_questions)

### Download annotations

In [0]:
#!wget -P '$val_answers_dir' "https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip"
os.chdir(val_answers_dir)
!unzip v2_Annotations_Val_mscoco.zip

Archive:  v2_Annotations_Val_mscoco.zip
  inflating: v2_mscoco_val2014_annotations.json  


In [0]:
#use the val directory here, not the train one
val_answers_path = os.path.join(val_answers_dir, 'v2_mscoco_val2014_annotations.json')
with open(val_answers_path) as f:
  anno = json.load(f)

In [0]:
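#reload the saved tuples; all three fields are read back as strings
with open(val_save_path, 'r') as f:
  lines = [x.strip() for x in f.readlines() if x.strip() != '']
val_filtred_questions = []
for line in lines:
  #split from both ends so that a ';' inside the question text is preserved
  question_id, rest = line.split(';', 1)
  question, image_id = rest.rsplit(';', 1)
  val_filtred_questions.append((question_id, question, image_id))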

In [0]:
annotations = [x for x in anno['annotations']]
fq_id, fq, f_im_id = zip(*val_filtred_questions)

In [0]:
#build the set once so that each membership test is O(1)
f_im_ids = {str(i) for i in f_im_id}
filtered_annotaitons = [x for x in annotations if str(x['image_id']) in f_im_ids]

In [0]:
with open(val_annotation_save_path, 'w') as g: 
  json.dump(filtered_annotaitons,g)