# Creating Metadata

In this notebook we create the metadata.  
The metadata records which unique words and answers exist in the training and validation datasets, and in which question categories they appear.  
Later in the process, this information lets us build dedicated models for each category.
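
To make the idea concrete, here is a minimal sketch of how such metadata could be assembled with pandas. It is not the project's `create_meta`; the helper `build_meta_sketch` is hypothetical, the toy rows are invented for illustration, and it assumes that words are collected from both the processed questions and the processed answers.

```python
import pandas as pd

# Toy stand-in for the processed data (invented rows; column names mirror this notebook).
df = pd.DataFrame({
    "processed_question": ["what is the plane", "which organ is shown", "what is the plane"],
    "processed_answer": ["axial", "brain", "sagittal"],
    "question_category": ["Plane", "Organ", "Plane"],
})


def build_meta_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical illustration of the idea; NOT the project's create_meta."""
    # Unique (answer, category) pairs.
    answers = (
        df[["processed_answer", "question_category"]]
        .drop_duplicates()
        .reset_index(drop=True)
    )

    # Assumption: words come from both questions and answers.
    text = df["processed_question"] + " " + df["processed_answer"]
    words = (
        df.assign(word=text.str.split())
        .explode("word")[["word", "question_category"]]
        .drop_duplicates()
        .groupby("word")["question_category"]
        # Collapse all categories of a word into one space-separated string.
        .agg(lambda cats: " ".join(sorted(cats)))
        .reset_index()
    )
    return {"answers": answers, "words": words}


meta = build_meta_sketch(df)
print(meta["answers"])
print(meta["words"])
```

Storing one row per word with a space-separated category string is only one possible layout; a list column or a long (word, category) table would work equally well.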

### The main function we use:

In [1]:
import IPython
from common.functions import get_highlighted_function_code

In [2]:
from pre_processing.meta_data import create_meta
code = get_highlighted_function_code(create_meta, remove_comments=False)
IPython.display.display(code)

---
## The code:

In [3]:
import os
import pandas as pd
# None disables column-width truncation; passing -1 is deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)

In [4]:
from common.settings import data_access
import vqa_logger 
from pre_processing.meta_data import create_meta

Creating the metadata. Note that the input DataFrame only needs the following:
1. image_name
2. processed question
3. processed answer
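
As a small illustration of the requirement above, the following sketch checks a DataFrame for the required columns before it is passed on. The column names `image_name`, `processed_question` and `processed_answer` are assumed spellings of the three items in the list, and `missing_columns` is a hypothetical helper, not part of the project.

```python
import pandas as pd

# Assumed column names for the three required items listed above.
REQUIRED_COLUMNS = ("image_name", "processed_question", "processed_answer")


def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Toy frame that lacks the processed answer column.
toy = pd.DataFrame({
    "image_name": ["synpic22755.jpg"],
    "processed_question": ["what is the plane"],
})

missing = missing_columns(toy)
if missing:
    print(f"Missing required columns: {missing}")
```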


In [5]:
# index	image_name	question	answer	group	path	original_question	original_answer	tumor	hematoma	brain	abdomen	neck	liver	imaging_device	answer_embedding	question_embedding	is_imaging_device_question
df_data = data_access.load_processed_data(
    columns=['path', 'question', 'answer', 'processed_question', 'processed_answer', 'group', 'question_category'])
df_data = df_data[df_data.group.isin(['train', 'validation', 'test'])]
print(f'Data length: {len(df_data)}')

[2021-09-20 14:03:30][data_access.api][DEBUG] loading processed data from:
C:\Users\avitu\Documents\GitHub\VQA-MED\VQA-MED\VQA.Python\data\model_input.parquet
[2021-09-20 14:03:30][data_access.api][DEBUG] loading parquet from:
C:\Users\avitu\Documents\GitHub\VQA-MED\VQA-MED\VQA.Python\data\model_input.parquet
[2021-09-20 14:03:30][common.utils][DEBUG] Starting 'Loading parquet'
[2021-09-20 14:03:30][common.utils][DEBUG] Loading parquet: 0:00:00.027331
[2021-09-20 14:03:30][common.utils][DEBUG] Starting 'Converting to pandas'
[2021-09-20 14:03:30][common.utils][DEBUG] Converting to pandas: 0:00:00.023024
Data length: 15292


The input data:

In [6]:
df_data.sample(2)

| | path | question | answer | processed_question | processed_answer | question_category | group |
|---|---|---|---|---|---|---|---|
| 14163 | C:\Users\Public\Documents\Data\2019\validation\Val_images\synpic22755.jpg | what is the plane? | axial | what is the plane | axial | Plane | validation |
| 5272 | C:\Users\Public\Documents\Data\2019\train\Train_images\synpic25128.jpg | which plane is the image taken? | sagittal | which plane is the image taken | sagittal | Plane | train |


In [7]:
print("----- Creating meta -----")
meta_data_dict = create_meta(df_data)

----- Creating meta -----
[2021-09-20 14:03:30][pre_processing.meta_data][DEBUG] Data frame had 14792 rows


#### Saving the metadata so we don't need to recompute it later

In [8]:
print("----- Saving meta -----")
data_access.save_meta(meta_data_dict)

----- Saving meta -----
[2021-09-20 14:03:31][data_access.api][DEBUG] Meta number of unique answers: 1675
[2021-09-20 14:03:31][data_access.api][DEBUG] Meta number of unique words: 2073


##### Test Loading:

In [9]:
loaded_meta = data_access.load_meta()
answers_meta = loaded_meta['answers']
words_meta = loaded_meta['words']


answers_meta.question_category.describe()
# words_meta.question_category.describe()

# answers_meta.sample(5)
# words_meta.sample(5)

# words_meta.question_category.drop_duplicates()

count     1675                          
unique    56                            
top       Abnormality_skull_and_contents
freq      422                           
Name: question_category, dtype: object

View the data:

In [10]:
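from IPython.display import display_html

def display_side_by_side(*data_frames):
    """Render several DataFrames next to each other in one output cell."""
    html_str = ''.join(df.to_html() for df in data_frames)
    # Style only the opening <table> tags; a bare 'table' would also match closing tags and cell text
    display_html(html_str.replace('<table', '<table style="display:inline"'), raw=True)

display_side_by_side(answers_meta.sample(5), words_meta.sample(5))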

| | processed_answer | question_category |
|---|---|---|
| 796 | intrathoracic kidney | Abnormality_lung_mediastinum_pleura |
| 572 | femoral neck insufficiency fracture | Abnormality_musculoskeletal |
| 303 | chiari with cervical spine syrinx | Abnormality_spine_and_contents |
| 1469 | spondylolisthesis bilateral l3 pars defects | Abnormality_spine_and_contents |
| 501 | dystrophic breast calcifications | Abnormality_breast |

| | word | question_category |
|---|---|---|
| 1749 | skull | Abnormality_skull_and_contents Organ Abnormality_face_sinuses_and_neck |
| 1575 | pyelonephritis | Abnormality_genitourinary |
| 957 | interatrial | Abnormality_lung_mediastinum_pleura Abnormality_heart_and_great_vessels |
| 1535 | pregnancy | Abnormality_genitourinary |
| 685 | fissure | Abnormality_lung_mediastinum_pleura Abnormality_skull_and_contents |
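
Since the metadata is meant to support one model per category, it can help to see how a per-category vocabulary might be pulled out of frames shaped like the samples above. This is a sketch under assumptions: the toy frames copy a few of the sampled rows, `values_for_category` is a hypothetical helper, and `question_category` is assumed to hold one or more category names separated by spaces.

```python
import pandas as pd

# Toy frames built from a few of the sampled rows shown above.
answers_meta = pd.DataFrame({
    "processed_answer": ["intrathoracic kidney", "dystrophic breast calcifications"],
    "question_category": ["Abnormality_lung_mediastinum_pleura", "Abnormality_breast"],
})
words_meta = pd.DataFrame({
    "word": ["skull", "pyelonephritis", "fissure"],
    "question_category": [
        "Abnormality_skull_and_contents Organ Abnormality_face_sinuses_and_neck",
        "Abnormality_genitourinary",
        "Abnormality_lung_mediastinum_pleura Abnormality_skull_and_contents",
    ],
})


def values_for_category(meta: pd.DataFrame, value_col: str, category: str) -> list:
    """Hypothetical helper: values whose category string contains `category`."""
    # Split on whitespace so a category never matches as a substring of another.
    mask = meta["question_category"].str.split().apply(lambda cats: category in cats)
    return sorted(meta.loc[mask, value_col])


skull_words = values_for_category(words_meta, "word", "Abnormality_skull_and_contents")
breast_answers = values_for_category(answers_meta, "processed_answer", "Abnormality_breast")
print(skull_words)
print(breast_answers)
```

Matching on split tokens rather than with a substring search avoids false hits when one category name is a prefix of another.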
