# Creating Meta data

In this notebook we create the meta data.  
The meta data records which unique words and answers exist in the training and validation datasets, and in which question categories they appear.  
Later in the process, this information allows us to build a dedicated model for each category.
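
To make the idea concrete, here is a minimal sketch of how such per-category meta data could be assembled with pandas. It is not the project's `create_meta` implementation. The helper name `build_meta_sketch` and the toy rows are invented for illustration, and drawing the word list from both the question and the answer text is an assumption. The column names `processed_question`, `processed_answer` and `question_category` follow the data shown in this notebook.

```python
import pandas as pd


def build_meta_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical helper: collect unique answers and words per category."""
    # One row per distinct (answer, category) pair.
    answers = (
        df[["processed_answer", "question_category"]]
        .drop_duplicates()
        .reset_index(drop=True)
    )
    # Assumption: words come from the question and the answer text together.
    text = df["processed_question"] + " " + df["processed_answer"]
    # Split into words, one row per word, then keep distinct (word, category) pairs.
    words = (
        df.assign(word=text.str.split())
        .explode("word")[["word", "question_category"]]
        .drop_duplicates()
        .reset_index(drop=True)
    )
    return {"answers": answers, "words": words}


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "processed_question": ["what plane is this", "what organ is this"],
        "processed_answer": ["axial", "brain"],
        "question_category": ["Plane", "Organ"],
    }
)
meta_sketch = build_meta_sketch(toy)
print(meta_sketch["answers"])
print(meta_sketch["words"])
```

Keeping the category next to every word and answer is what would later make it possible to select, for example, only the answers seen in `Plane` questions.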

### Main functions used:

In [1]:
import IPython
from common.functions import get_highlighted_function_code

In [2]:
from pre_processing.meta_data import create_meta
code = get_highlighted_function_code(create_meta, remove_comments=False)
IPython.display.display(code)

---
## The code:

In [3]:
import os
import pandas as pd
# None disables column-width truncation (passing -1 is deprecated in recent pandas versions)
pd.set_option('display.max_colwidth', None)

In [4]:
from common.settings import data_access
import vqa_logger 
from pre_processing.meta_data import create_meta

Create the meta data. Note that the only columns required in the input DataFrame are:
1. image_name
2. processed question
3. processed answer
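
As a hedged illustration of the requirement above, the sketch below checks a DataFrame for the required columns before it is passed on. The helper `missing_columns` is hypothetical and not part of the project. Spelling the column names as `image_name`, `processed_question` and `processed_answer` is an assumption based on the list above and on the data loaded in this notebook.

```python
import pandas as pd

# Assumed spellings of the three required columns listed above.
REQUIRED_COLUMNS = ["image_name", "processed_question", "processed_answer"]


def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Hypothetical helper: return the required columns that df lacks."""
    return [col for col in required if col not in df.columns]


# Toy frame, invented for illustration; it only has one of the three columns.
toy = pd.DataFrame({"processed_question": ["what plane is this"]})
missing = missing_columns(toy)
if missing:
    print(f"Missing required columns: {missing}")
```

Running such a check early gives a clearer message than a `KeyError` raised deep inside the meta data code.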


In [5]:
# index	image_name	question	answer	group	path	original_question	original_answer	tumor	hematoma	brain	abdomen	neck	liver	imaging_device	answer_embedding	question_embedding	is_imaging_device_question
df_data = data_access.load_processed_data(columns=['path','question','answer', 'processed_question','processed_answer', 'group','question_category'])        
df_data = df_data[df_data.group.isin(['train','validation', 'test'])]
print(f'Data length: {len(df_data)}')        

[2021-09-20 11:14:30][data_access.api][DEBUG] loading processed data from:
C:\Users\avitu\Documents\GitHub\VQA-MED\VQA-MED\VQA.Python\data\model_input.parquet
[2021-09-20 11:14:30][data_access.api][DEBUG] loading parquet from:
C:\Users\avitu\Documents\GitHub\VQA-MED\VQA-MED\VQA.Python\data\model_input.parquet
[2021-09-20 11:14:30][common.utils][DEBUG] Starting 'Loading parquet'
[2021-09-20 11:14:30][common.utils][DEBUG] Loading parquet: 0:00:00.025920
[2021-09-20 11:14:30][common.utils][DEBUG] Starting 'Converting to pandas'
[2021-09-20 11:14:30][common.utils][DEBUG] Converting to pandas: 0:00:00.019710
Data length: 15292


The input data:

In [6]:
df_data.sample(2)

| | path | question | answer | processed_question | processed_answer | question_category | group |
|---|---|---|---|---|---|---|---|
| 8789 | C:\Users\Public\Documents\Data\2019\train\Train_images\synpic40625.jpg | what organ system is pictured here? | skull and contents | what organ system is pictured here | skull and contents | Organ | train |
| 5834 | C:\Users\Public\Documents\Data\2019\train\Train_images\synpic16441.jpg | in which plane is the ct scan displayed? | axial | in which plane is the ct scan displayed | axial | Plane | train |


In [7]:
print("----- Creating meta -----")
meta_data_dict = create_meta(df_data)

----- Creating meta -----
[2021-09-20 11:14:30][pre_processing.meta_data][DEBUG] Data frame had 14792 rows


#### Saving the meta data so it does not need to be recomputed later

In [8]:
print("----- Saving meta -----")
data_access.save_meta(meta_data_dict)

----- Saving meta -----
[2021-09-20 11:14:31][data_access.api][DEBUG] Meta number of unique answers: 1675
[2021-09-20 11:14:31][data_access.api][DEBUG] Meta number of unique words: 2073


##### Test Loading:

In [9]:
loaded_meta = data_access.load_meta()
answers_meta = loaded_meta['answers']
words_meta = loaded_meta['words']


answers_meta.question_category.describe()
# words_meta.question_category.describe()

# answers_meta.sample(5)
# words_meta.sample(5)

# words_meta.question_category.drop_duplicates()

count     1675       
unique    5          
top       Abnormality
freq      1604       
Name: question_category, dtype: object
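
The summary above shows that 1604 of the 1675 unique answers (about 96%) belong to the `Abnormality` category, but it says nothing about the other four categories. A minimal sketch of how the full distribution could be inspected with `value_counts` follows. The toy rows are invented for illustration; in the notebook the same calls would be applied to `answers_meta.question_category`.

```python
import pandas as pd

# Toy rows, invented for illustration; the real frame is `answers_meta`.
toy_answers = pd.DataFrame(
    {
        "processed_answer": [
            "pulmonary sequestration",
            "rickets due to renal failure",
            "axial",
            "skull and contents",
        ],
        "question_category": ["Abnormality", "Abnormality", "Plane", "Organ"],
    }
)

# Number of unique answers recorded for each category, largest first.
category_counts = toy_answers["question_category"].value_counts()
# The same counts expressed as a share of all rows.
category_share = toy_answers["question_category"].value_counts(normalize=True)
print(category_counts)
print(category_share.round(2))
```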

View the data:

In [10]:
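from IPython.display import display_html

def display_side_by_side(*data_frames):
    """Render several DataFrames next to each other in the notebook output."""
    html_str = ''.join(df.to_html() for df in data_frames)
    # Patch only the opening <table> tags; replacing the bare word 'table'
    # would also rewrite closing tags and any cell text containing 'table'.
    display_html(html_str.replace('<table', '<table style="display:inline"'), raw=True)

display_side_by_side(answers_meta.sample(5), words_meta.sample(5))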

| | processed_answer | question_category |
|---|---|---|
| 1355 | right sided aortic arch | Abnormality |
| 609 | full thickness rotator cuff tear | Abnormality |
| 432 | cystic hygroma turner syndrome | Abnormality |
| 1304 | pulmonary sequestration | Abnormality |
| 1348 | rickets due to renal failure | Abnormality |

| | word | question_category |
|---|---|---|
| 269 | cavernous | Abnormality |
| 1039 | lefort | Abnormality |
| 124 | arteriovenous | Abnormality |
| 910 | ileal | Abnormality |
| 2006 | variants | Abnormality |
