# VQA - MED

This notebook demonstrates the efforts made for Visual Q&A (VQA) based on the dataset from the VQA-Med 2018 contest.

## Abstract

The inputs for VQA are:
1. The question text 
2. The image

The question text is embedded into a feature vector using a pre-trained [GloVe file](https://nlp.stanford.edu/projects/glove/).

In a similar manner, the image is processed using a pre-trained deep NN (e.g. [VGG](http://qr.ae/TUTEKo) initialized with the weights of a pre-trained [ImageNet model](https://en.wikipedia.org/wiki/ImageNet)).


## The plan

0. [Preparations and helpers](#Preparations-and-helpers)
1. [Collecting pre processing item](#Collecting-pre-processing-item)
2. [Preprocessing and creating meta data](#Preprocessing-and-creating-meta-data)
3. [Creating the model](#Creating-the-model)
4. [Training the model](#Training-the-model)
5. [Testing the model](#Testing-the-model)


### Preparations and helpers

The following are just helpers & utils imports - feel free to skip...

In [4]:
from parsers.utils import VerboseTimer
from utils.os_utils import File, print_progress

### Collecting pre processing item

##### Download pre-trained items & store their location

In [None]:
# TODO: Add downloading of the GloVe file
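
The TODO above could be filled in along the following lines. This is a minimal sketch, not the notebook's own code: the helper name `ensure_glove` is hypothetical, and the archive URL is assumed to be the `glove.6B.zip` location listed on the GloVe project page. The function skips the download when the target file already exists.

```python
import os
import urllib.request
import zipfile

# Assumption: archive location as listed on the GloVe project page
GLOVE_ZIP_URL = "https://nlp.stanford.edu/data/glove.6B.zip"


def ensure_glove(dest_dir, dim=300, url=GLOVE_ZIP_URL):
    """Return the path to glove.6B.<dim>d.txt, downloading it only if missing."""
    target = os.path.join(dest_dir, "glove.6B.{0}d.txt".format(dim))
    if os.path.isfile(target):
        return target  # Already available, nothing to download

    os.makedirs(dest_dir, exist_ok=True)
    zip_path = os.path.join(dest_dir, "glove.6B.zip")
    if not os.path.isfile(zip_path):
        urllib.request.urlretrieve(url, zip_path)  # Large download

    # Extract only the member for the requested dimension
    with zipfile.ZipFile(zip_path) as archive:
        archive.extract(os.path.basename(target), dest_dir)
    return target
```

With this in place, `glove_path` could be set from `ensure_glove('data', embedding_dim)` instead of being assumed to exist.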

In [5]:
import os
seq_length =    26
embedding_dim = 300

glove_path =                    os.path.abspath('data/glove.6B.{0}d.txt'.format(embedding_dim))
embedding_matrix_filename =     os.path.abspath('data/ckpts/embeddings_{0}.h5'.format(embedding_dim))
ckpt_model_weights_filename =   os.path.abspath('data/ckpts/model_weights.h5')


In [6]:
import os
# Fail fast...
prefix = "Failing fast:\n"
assert os.path.isfile(glove_path), prefix + "GloVe file does not exist:\n{0}".format(glove_path)
# assert os.path.isfile(embedding_matrix_filename), prefix + "Embedding matrix file does not exist:\n{0}".format(embedding_matrix_filename)
assert os.path.isfile(ckpt_model_weights_filename), prefix + "Model weights file does not exist:\n{0}".format(ckpt_model_weights_filename)

print('Validated file locations')

Validated file locations
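
As an alternative to the `assert` statements (which are skipped when Python runs with `-O`), the same fail-fast check can be sketched as a small helper that reports every missing file at once. The name `require_files` is hypothetical and not part of the notebook's helper modules.

```python
import os


def require_files(**named_paths):
    """Raise FileNotFoundError listing every missing file, or return quietly."""
    missing = {name: path for name, path in named_paths.items()
               if not os.path.isfile(path)}
    if missing:
        details = "\n".join("{0}: {1}".format(name, path)
                            for name, path in sorted(missing.items()))
        raise FileNotFoundError("Failing fast, missing files:\n" + details)
```

In this notebook it would be called as `require_files(glove=glove_path, model_weights=ckpt_model_weights_filename)`.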


##### Set locations for pre-training items to be created

In [7]:
# Pre-processing result files
data_prepo_meta            = os.path.abspath('data/my_data_prepro.json')
data_prepo_meta_validation = os.path.abspath('data/my_data_prepro_validation.json')
# Location of the pre-trained embedding matrix
embedding_matrix_filename  = os.path.abspath('data/ckpts/embeddings_{0}.h5'.format(embedding_dim))

# The location to dump models to
vqa_models_folder          = "C:\\Users\\Public\\Documents\\Data\\2018\\vqa_models"



### Preprocessing and creating meta data

We will use this function for creating meta data:

In [9]:
from vqa_logger import logger 
import itertools
import string
from utils.os_utils import File  # A simple helper module of mine

def create_meta(meta_file_location, df):
    """Build the vocabulary and answer indices for df and dump them as JSON."""
    logger.debug("Creating meta data ('{0}')".format(meta_file_location))

    def get_unique_words(col):
        # Join the whole column, drop punctuation, then split on spaces
        single_string = " ".join(df[col])
        exclude = set(string.punctuation)
        s_no_punctuation = ''.join(ch for ch in single_string if ch not in exclude)
        unique_words = set(s_no_punctuation.split(" ")).difference({'', ' '})
        print("column {0} had {1} unique words".format(col, len(unique_words)))
        return unique_words

    cols = ['question', 'answer']
    unique_words = set(itertools.chain.from_iterable(get_unique_words(col) for col in cols))
    print("total unique words: {0}".format(len(unique_words)))

    metadata = {}
    # Note: despite its name, 'ix_to_word' maps each word to its index
    metadata['ix_to_word'] = {str(word): int(i) for i, word in enumerate(unique_words)}
    # enumerate yields (index, answer), so this maps index -> answer
    metadata['ix_to_ans'] = {i: ans for i, ans in enumerate(set(df['answer']))}

    File.dump_json(metadata, meta_file_location)
    return metadata
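
To make the vocabulary step concrete, here is a small self-contained sketch of the same idea on toy data. It uses plain lists instead of a dataframe, and it sorts the words before numbering them; the sorting is an assumption added here so that the indices are reproducible between runs, and is not what `create_meta` does.

```python
import string


def unique_words(texts):
    """Return the set of whitespace-separated words after stripping punctuation."""
    strip_punct = str.maketrans("", "", string.punctuation)
    return set(" ".join(texts).translate(strip_punct).split())


questions = ["what does mri show?", "where does ct show the lesion?"]
answers = ["lesion at tail of pancreas"]

vocab = unique_words(questions) | unique_words(answers)
# Assumption: sorted order, so the word -> index mapping is deterministic
word_to_ix = {word: i for i, word in enumerate(sorted(vocab))}
ix_to_ans = {i: ans for i, ans in enumerate(sorted(set(answers)))}
```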

And let's create the meta data for the training & validation sets:

In [10]:
from collections import namedtuple
dbg_file_csv_train = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Train\\VQAMed2018Train-QA.csv'
dbg_file_xls_train = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Train\\VQAMed2018Train-QA_post_pre_process_intermediate.xlsx'
dbg_file_xls_processed_train = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Train\\VQAMed2018Train-QA_post_pre_process.xlsx'
train_embedding_path = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Train\\VQAMed2018Train-images\\embbeded_images.hdf'
images_path_train = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Train\\VQAMed2018Train-images'


dbg_file_csv_validation = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Valid\\VQAMed2018Valid-QA.csv'
dbg_file_xls_validation = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Valid\\VQAMed2018Valid-QA_post_pre_process_intermediate.xlsx'
dbg_file_xls_processed_validation = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Valid\\VQAMed2018Valid-QA_post_pre_process.xlsx'
validation_embedding_path = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Valid\\VQAMed2018Valid-images\\embbeded_images.hdf'
images_path_validation = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Valid\\VQAMed2018Valid-images'


dbg_file_csv_test = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Test\\VQAMed2018Test-QA.csv'
dbg_file_xls_test = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Test\\VQAMed2018Test-QA_post_pre_process_intermediate.xlsx'
dbg_file_xls_processed_test = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Test\\VQAMed2018Test-QA_post_pre_process.xlsx'
test_embedding_path = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Test\\VQAMed2018Test-images\\embbeded_images.hdf'
images_path_test = 'C:\\Users\\Public\\Documents\\Data\\2018\\VQAMed2018Test\\VQAMed2018Test-images'

DataLocations = namedtuple('DataLocations', ['data_tag', 'raw_csv', 'raw_xls', 'processed_xls','images_path'])
train_data = DataLocations('train', dbg_file_csv_train,dbg_file_xls_train,dbg_file_xls_processed_train, images_path_train)
validation_data = DataLocations('validation', dbg_file_csv_validation, dbg_file_xls_validation, dbg_file_xls_processed_validation, images_path_validation)
test_data = DataLocations('test', dbg_file_csv_test, dbg_file_xls_test, dbg_file_xls_processed_test, images_path_test)
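
The paths above all follow one naming pattern, so they could also be derived from a single base folder. The following is a sketch under assumptions: the helper `build_locations` and the `VQA_DATA_DIR` environment variable are hypothetical and only illustrate how the hard-coded Windows paths might be made portable.

```python
import os
from collections import namedtuple

DataLocations = namedtuple(
    'DataLocations',
    ['data_tag', 'raw_csv', 'raw_xls', 'processed_xls', 'images_path'])


def build_locations(base_dir, data_tag, folder_tag):
    """Build the DataLocations for one split, e.g. ('train', 'Train')."""
    name = 'VQAMed2018{0}'.format(folder_tag)
    folder = os.path.join(base_dir, name)
    qa = os.path.join(folder, name + '-QA')
    return DataLocations(
        data_tag=data_tag,
        raw_csv=qa + '.csv',
        raw_xls=qa + '_post_pre_process_intermediate.xlsx',
        processed_xls=qa + '_post_pre_process.xlsx',
        images_path=os.path.join(folder, name + '-images'))


# Assumption: the base folder comes from a hypothetical environment variable
base_dir = os.environ.get('VQA_DATA_DIR', os.path.join('data', '2018'))
train_data = build_locations(base_dir, 'train', 'Train')
validation_data = build_locations(base_dir, 'validation', 'Valid')
test_data = build_locations(base_dir, 'test', 'Test')
```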

Get the data itself. Note that the only columns required in the dataframe are:
1. image_name
2. question
3. answer
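
The column requirement above can be checked before any processing starts. This is a minimal sketch with a hypothetical helper name; it accepts any iterable of column names, so `df_train.columns` can be passed directly.

```python
REQUIRED_COLUMNS = ('image_name', 'question', 'answer')


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]
```

An empty list from `missing_columns(df_train.columns)` means the dataframe has everything the later steps need.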


In [12]:
from parsers.VQA18 import Vqa18Base
df_train = Vqa18Base.get_instance(train_data.processed_xls).data            
df_val = Vqa18Base.get_instance(validation_data.processed_xls).data
df_train.head(2)



| Unnamed: 0 | image_name | answer | brain | row_id | neck | tokenized_question | ct | tokenized_answer | mri | tumor | abdomen | liver | question | hematoma |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rjv03401 | lesion at tail of pancreas | False | 1 | False | what doe MRI show ? | False | tumor at tail pancreas | True | True | False | False | what does mri show? | False |
| 1 | AIAN-14-313-g002 | in distal pancreas | False | 2 | False | where doe axial seCTion MRI abdomen show hypoe... | False | distal pancreas | True | False | True | False | where does axial section mri abdomen show hypo... | False |


In [16]:
print("----- Creating training meta -----")
meta_train = create_meta(data_prepo_meta, df_train)

print("\n----- Creating validation meta -----")
# Use the validation path so the training meta file is not overwritten
meta_validation = create_meta(data_prepo_meta_validation, df_val)

meta_train

20:59:51,971 pythonVQA DEBUG ## Creating meta data ('C:\Users\avitu\Documents\GitHub\VQA-MED\VQA-MED\Cognitive-LUIS-Windows-master\Sample\VQA.Python\data\my_data_prepro.json')
20:59:52,89 pythonVQA DEBUG ## Creating meta data ('C:\Users\avitu\Documents\GitHub\VQA-MED\VQA-MED\Cognitive-LUIS-Windows-master\Sample\VQA.Python\data\my_data_prepro.json')


----- Creating training meta -----
column question had 3317 unique words
column answer had 3255 unique words
total unique words: 3578

----- Creating validation meta -----
column question had 399 unique words
column answer had 669 unique words
total unique words: 881


{'ix_to_word': {'senile': 0,
  'peritoneum': 1,
  'opacities': 2,
  'calvarial': 3,
  'alar': 4,
  'facets': 5,
  'damage': 6,
  'subpleural': 7,
  'lad': 8,
  'ethmoidal': 9,
  'intrapulmonary': 10,
  'bladders': 11,
  'signals': 12,
  'aggregates': 13,
  'sellar': 14,
  'lymphadenopathy': 15,
  'renal': 16,
  'fluidfilled': 17,
  'atrophied': 18,
  'plate': 19,
  'form': 20,
  'sphenoidal': 21,
  'costo': 22,
  'tmuor': 23,
  'regimen': 24,
  'contents': 25,
  'perihepatic': 26,
  'where': 27,
  'pathological': 28,
  'underneath': 29,
  'eccentric': 30,
  'sarcoma': 31,
  'ovary': 32,
  '6': 33,
  'arch': 34,
  'myelitis': 35,
  'clinically': 36,
  '35': 37,
  'periventricular': 38,
  'shape': 39,
  'marginal': 40,
  'detect': 41,
  'significant': 42,
  'tip': 43,
  'initial': 44,
  'margin': 45,
  'maxilla': 46,
  's5': 47,
  'sclerosing': 48,
  'pericardial': 49,
  'membrane': 50,
  'outgrowth': 51,
  'wing': 52,
  'comminution': 53,
  'spectroscopy': 54,
  'remote': 55,
  'anorect

### Creating the model

#### The functions that get the model:

##### Get Embedding:

In [22]:
import numpy as np
import random
import h5py
def prepare_embeddings(metadata):
    """Build the embedding matrix for the vocabulary in metadata and cache it on disk."""
    embedding_filename = embedding_matrix_filename
    num_words = len(metadata['ix_to_word'])
    dim_embedding = embedding_dim

    logger.debug("Embedding Data...")

    embeddings_index = {}
    glove_line_count = File.file_len(glove_path, encoding="utf8")

    def process_line(i, line):
        # Each GloVe line is a word followed by its vector components
        try:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
            print_progress(i + 1, glove_line_count)
        except Exception as ex:
            logger.error(
                "An error occurred while working on glove file [line {0}]:\n"
                "Line text:\t{1}\nGlove path:\t{2}\n"
                "{3}".format(
                    i, line, glove_path, ex))
            raise

    with VerboseTimer("Embedding"):
        with open(glove_path, 'r', encoding="utf8") as glove_file:
            for i, line in enumerate(glove_file):
                process_line(i=i, line=line)

    # Rows of words without a GloVe vector stay all-zero
    embedding_matrix = np.zeros((num_words, dim_embedding))
    word_index = metadata['ix_to_word']  # Maps word -> row index

    with VerboseTimer("Creating matrix"):
        embedded_words = set()
        for word, i in word_index.items():
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector
                embedded_words.add(word)

    words_with_no_embedding = set(word_index.keys()) - embedded_words
    # random.sample needs a sequence and a sample size no larger than the population
    rnd = random.sample(sorted(words_with_no_embedding), min(5, len(words_with_no_embedding)))
    logger.debug("{0} words did not have embedding. e.g.:\n{1}".format(len(words_with_no_embedding), rnd))

    with VerboseTimer("Dumping matrix"):
        with h5py.File(embedding_filename, 'w') as f:
            f.create_dataset('embedding_matrix', data=embedding_matrix)

    return embedding_matrix

if os.path.exists(embedding_matrix_filename):
    logger.debug("Embedding Data already exists. Loading...")
    with h5py.File(embedding_matrix_filename, 'r') as f:
        embedding_train = np.array(f['embedding_matrix'])    
else:
    embedding_train = prepare_embeddings(meta_train)
    
embedding_matrix = embedding_train
embedding_matrix

21:21:08,87 pythonVQA DEBUG ## Embedding Data already exists. Loading...


array([[-0.44398999,  0.12817   , -0.25246999, ..., -0.20043001,
        -0.082191  , -0.06255   ],
       [ 0.08561   ,  0.077471  , -1.01680005, ..., -0.30044001,
         0.012508  ,  0.24875   ],
       [-0.16277   ,  0.033858  , -0.39416999, ...,  0.20255999,
        -0.17546999, -0.30397999],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.21359   ,  0.85279   ,  0.48688999, ..., -0.19047   ,
        -0.058526  , -0.49094   ],
       [ 0.72328001, -0.1178    , -0.022166  , ...,  0.49592999,
        -0.16937999, -0.58451003]])
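
The all-zero rows in the matrix above are words for which GloVe has no vector. To illustrate how such a matrix turns a question into a fixed-size feature array, here is a toy sketch with a 3-dimensional embedding and a sequence length of 4 (the notebook uses 300 and 26). The helper `embed_question` and its padding and truncation scheme are assumptions for illustration only, not the notebook's actual model input pipeline.

```python
import io
import numpy as np

seq_length = 4      # 26 in the notebook, shortened for the example
embedding_dim = 3   # 300 in the notebook

# Toy stand-in for a GloVe file: one word per line followed by its vector
glove_text = io.StringIO(
    "what 0.1 0.2 0.3\n"
    "mri 0.4 0.5 0.6\n"
    "show 0.7 0.8 0.9\n")

embeddings_index = {}
for line in glove_text:
    word, *values = line.split()
    embeddings_index[word] = np.asarray(values, dtype='float32')

word_to_ix = {'what': 0, 'does': 1, 'mri': 2, 'show': 3}
embedding_matrix = np.zeros((len(word_to_ix), embedding_dim))
for word, i in word_to_ix.items():
    vector = embeddings_index.get(word)
    if vector is not None:        # Words missing from GloVe keep a zero row
        embedding_matrix[i] = vector


def embed_question(question, matrix, word_to_ix, seq_length):
    """Return a (seq_length, dim) array; unknown words and padding are zero rows."""
    features = np.zeros((seq_length, matrix.shape[1]))
    for position, word in enumerate(question.split()[:seq_length]):
        ix = word_to_ix.get(word)
        if ix is not None:
            features[position] = matrix[ix]
    return features


features = embed_question("what does mri show", embedding_matrix, word_to_ix, seq_length)
print(features.shape)
```

Here `does` is in the vocabulary but not in the toy GloVe data, so its row in both `embedding_matrix` and `features` stays zero.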