# Preprocessing Data

In this notebook we produce the model input data.  
To do that, we **clean** the data, **enrich** it, and then **extract** features from the raw data.  
This includes:  
1. Cleaning the questions / answers (removing stop words, tokenizing)
2. Enrichment: Marking the diagnosis using rules of thumb (Note: in the end, we did not use this data)  
3. Enrichment: Adding a question category to the data (given in the train / validation sets; assigned by rules of thumb plus prediction for the test set)
4. Preprocessing: Getting embeddings for the questions and answers (get_text_features)  
    For this, we used the spaCy NLP package
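
As a rough illustration of step 1, here is a minimal sketch of stop-word removal and tokenizing. The function name `clean_text` and the tiny `STOP_WORDS` set are assumptions made for this example; the project's real cleaning code and stop-word list are not shown in this notebook.

```python
import re

# Hypothetical stop-word list; the project's real list is not shown in this notebook.
STOP_WORDS = {"a", "an"}


def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, tokenize, and drop stop words."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)


print(clean_text("is this a t2 weighted image?"))  # is this t2 weighted image
print(clean_text("bone tumor/ chordoma"))          # bone tumor chordoma
```

A small stop-word list keeps question words such as "what" and "which", which may matter for telling question types apart.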

### Some main functions we used:

In [1]:
import IPython
from common.functions import get_highlighted_function_code

#### get_text_features: getting the embedding of a text

In [2]:
from pre_processing.prepare_data import get_text_features
code = get_highlighted_function_code(get_text_features,remove_comments=True)
IPython.display.display(code)
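
One common way to turn a text into a single vector is mean pooling: average the vectors of its tokens. spaCy's `Doc.vector` defaults to an average of the token vectors, but whether `get_text_features` relies on that is an assumption here. The sketch below uses a toy, hypothetical 3-dimensional vocabulary (`TOY_VECTORS`) and a hypothetical `embed_text` helper instead of a real model; skipping unknown tokens is a choice of this sketch.

```python
from typing import Dict, List

# Toy 3-dimensional word vectors; hypothetical values for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "which": [1.0, 0.0, 0.0],
    "plane": [0.0, 2.0, 0.0],
    "axial": [0.0, 0.0, 3.0],
}
DIM = 3


def embed_text(text: str) -> List[float]:
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [TOY_VECTORS[tok] for tok in text.lower().split() if tok in TOY_VECTORS]
    if not vectors:
        return [0.0] * DIM
    return [sum(component) / len(vectors) for component in zip(*vectors)]


print(embed_text("which plane"))  # [0.5, 1.0, 0.0]
```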

#### pre_process_raw_data: preprocessing the data

In [3]:
from pre_processing.prepare_data import  pre_process_raw_data
code = get_highlighted_function_code(pre_process_raw_data,remove_comments=True)
IPython.display.display(code)

#### Cleaning the data:

In [4]:
# from pre_processing.data_cleaning import clean_data
# code = get_highlighted_function_code(clean_data,remove_comments=True)
# IPython.display.display(code)

#### Enriching the data

In [5]:
# from pre_processing.data_enrichment import enrich_data
# code = get_highlighted_function_code(enrich_data,remove_comments=True)
# IPython.display.display(code)

---
## The code:

In [6]:
# %%capture
from common.settings import get_nlp, data_access
from common.functions import get_image,  get_size
from pre_processing.prepare_data import get_text_features, pre_process_raw_data
from pre_processing.data_enrichment import enrich_data
from pre_processing.data_cleaning import clean_data
from common.utils import VerboseTimer
from collections import Counter
import os
from pandas import HDFStore
import pyarrow as pa
import pyarrow.parquet as pq
import logging
from pathlib import Path

In [7]:
logger = logging.getLogger(__name__)

##### Getting the NLP engine
(done once; the engine is a singleton)

In [8]:
nlp = get_nlp()

[2021-09-20 11:09:30][common.settings][DEBUG] using embedding vector: en_core_web_lg
[2021-09-20 11:09:32][common.settings][DEBUG] Got NLP engine (en_core_web_lg)
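
How `get_nlp` implements the singleton is not shown in this notebook. A minimal sketch of one way to get that behavior is to cache the loader with `functools.lru_cache`; `ToyEngine` and `get_engine` are hypothetical stand-ins, not the project's code.

```python
from functools import lru_cache


class ToyEngine:
    """Stand-in for an expensive-to-load NLP pipeline (hypothetical)."""

    load_count = 0  # counts how many times an engine was actually built

    def __init__(self, model_name: str) -> None:
        ToyEngine.load_count += 1
        self.model_name = model_name


@lru_cache(maxsize=None)
def get_engine(model_name: str = "en_core_web_lg") -> ToyEngine:
    """Build the engine on the first call; later calls return the cached object."""
    return ToyEngine(model_name)


first = get_engine()
second = get_engine()
print(first is second, ToyEngine.load_count)  # True 1
```

Caching the loader means a large model is read from disk once per process, however many modules ask for it.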


Getting the raw input

In [9]:
image_name_question = data_access.load_raw_input()

[2021-09-20 11:09:32][data_access.api][DEBUG] Loading data from: C:\Users\avitu\Documents\GitHub\VQA-MED\VQA-MED\VQA.Python\data\raw_data.h5
[2021-09-20 11:09:32][common.utils][DEBUG] Starting 'Loading raw data'
[2021-09-20 11:09:32][common.utils][DEBUG] Loading raw data: 0:00:00.127408


In [10]:
image_name_question.head()

Unnamed: 0,image_name,question,answer,group,path
0,synpic41148,what kind of image is this?,cta - ct angiography,train,C:\Users\Public\Documents\Data\2019\train\Trai...
1,synpic43984,is this a t1 weighted image?,no,train,C:\Users\Public\Documents\Data\2019\train\Trai...
2,synpic38930,what type of imaging modality is used to acqui...,us - ultrasound,train,C:\Users\Public\Documents\Data\2019\train\Trai...
3,synpic52143,is this a noncontrast mri?,no,train,C:\Users\Public\Documents\Data\2019\train\Trai...
4,synpic20934,what type of image modality is this?,xr - plain film,train,C:\Users\Public\Documents\Data\2019\train\Trai...


## Clean and enrich the data

In [11]:
orig_image_name_question = image_name_question.copy()
image_name_question = clean_data(image_name_question)
image_name_question = enrich_data(image_name_question)

Looking for word: arch: 100%|██████████| 80/80 [00:08<00:00,  9.92it/s]             
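
The enrichment rules themselves are not shown here. As a rough sketch of what a rule-of-thumb question categorizer could look like, the example below matches keywords in order and returns `None` when no rule fires, so that a predictive model could fill the gap. The keyword lists and the `guess_category` name are assumptions made for this example.

```python
from typing import Optional

# Hypothetical keyword rules, checked in order; the project's actual rules are not shown here.
CATEGORY_KEYWORDS = [
    ("Plane", ("plane",)),
    ("Organ", ("organ", "part of the body")),
    ("Abnormality", ("abnormal", "wrong", "alarming")),
    ("Modality", ("modality", "weighted", "contrast", "kind of image", "type of image")),
]


def guess_category(question: str) -> Optional[str]:
    """Return the first category whose keyword appears in the question, else None."""
    text = question.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    return None


print(guess_category("in what plane is this mri?"))        # Plane
print(guess_category("is this a noncontrast mri?"))        # Modality
print(guess_category("the mri shows what organ system?"))  # Organ
```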


In [12]:
groups = image_name_question.groupby('group')
groups.describe()

Unnamed: 0_level_0,answer,answer,answer,answer,diagnosis,diagnosis,diagnosis,diagnosis,image_name,image_name,...,processed_question,processed_question,question,question,question,question,question_category,question_category,question_category,question_category
Unnamed: 0_level_1,count,unique,top,freq,count,unique,top,freq,count,unique,...,top,freq,count,unique,top,freq,count,unique,top,freq
group,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
test,0,0,,,500,1,,500,500,500,...,what is the primary abnormality in this image?,24,500,138,what is the primary abnormality in this image?,24,0,0,,
train,12792,1552,axial,1558.0,12792,274,,10569,12792,3200,...,what abnormality is seen in the image?,776,12792,247,what abnormality is seen in the image?,776,12792,4,Plane,3200.0
validation,2000,470,axial,213.0,2000,133,,1669,2000,500,...,what abnormality is seen in the image?,133,2000,186,what abnormality is seen in the image?,133,2000,4,Abnormality,500.0


In [13]:
image_name_question.sample(n=4)

Unnamed: 0,image_name,question,answer,group,path,processed_question,processed_answer,diagnosis,question_category
9674,synpic49060,what is the primary abnormality in this image?,craniopharyngioma,train,C:\Users\Public\Documents\Data\2019\train\Trai...,what is the primary abnormality in this image?,craniopharyngioma,,Abnormality
5974,synpic27816,in what plane is this ultrasound taken?,sagittal,train,C:\Users\Public\Documents\Data\2019\train\Trai...,in what plane is this ultrasound taken?,sagittal,,Plane
5134,synpic28093,which plane is this image taken?,lateral,train,C:\Users\Public\Documents\Data\2019\train\Trai...,which plane is this image taken?,lateral,,Plane
5739,synpic35235,in what plane is this mri?,axial,train,C:\Users\Public\Documents\Data\2019\train\Trai...,in what plane is this mr?,axial,,Plane


## Do the actual preprocessing

#### If running as an exported notebook (a plain script), use the following:
(indent everything under the main guard, to avoid recursively spawning processes)

In [14]:
from multiprocessing import freeze_support
if __name__ == '__main__':
    print('in main')
    freeze_support()

in main


Note:  
This might take a while (about three minutes in the run logged below, most of it spent computing embeddings).

In [15]:
logger.debug('----===== Preprocessing train data =====----')
image_name_question_processed = pre_process_raw_data(image_name_question)

[2021-09-20 11:09:58][__main__][DEBUG] ----===== Preprocessing train data =====----
[2021-09-20 11:09:58][common.utils][DEBUG] Starting 'Pre processing'
[2021-09-20 11:09:59][pre_processing.prepare_data][INFO] Answer: removing stop words and tokenizing
[2021-09-20 11:09:59][common.utils][DEBUG] Starting 'Answer Tokenizing'
[2021-09-20 11:09:59][common.utils][DEBUG] Answer Tokenizing: 0:00:00.229231
[2021-09-20 11:09:59][pre_processing.prepare_data][INFO] Question: removing stop words and tokenizing
[2021-09-20 11:09:59][common.utils][DEBUG] Starting 'Question Tokenizing'
[2021-09-20 11:09:59][common.utils][DEBUG] Question Tokenizing: 0:00:00.431440
[2021-09-20 11:09:59][pre_processing.prepare_data][INFO] Getting answers embedding
[2021-09-20 11:09:59][common.utils][DEBUG] Starting 'Answer Embedding'
[2021-09-20 11:11:22][common.utils][DEBUG] Answer Embedding: 0:01:22.636008
[2021-09-20 11:11:22][pre_processing.prepare_data][INFO] Getting questions embedding
[2021-09-20 11:11:22][common

Using TensorFlow backend.


[2021-09-20 11:13:06][pre_processing.prepare_data][DEBUG] Done


In [16]:
image_name_question_processed.sample(2)

Unnamed: 0,image_name,question,answer,group,path,processed_question,processed_answer,diagnosis,question_category,answer_embedding,question_embedding
11819,synpic43978.jpg,what is most alarming about this ultrasound?,"hematoma, testicular",train,C:\Users\Public\Documents\Data\2019\train\Trai...,what is most alarming about this ultrasound,hematoma testicular,hematoma,Abnormality,"[3.7429308891296387, -1.7185598611831665, -0.0...","[-3.023021697998047, 0.5958864688873291, -0.79..."
389,synpic57293.jpg,is this a t2 weighted image?,no,train,C:\Users\Public\Documents\Data\2019\train\Trai...,is this t2 weighted image,no,,Modality,"[0.029011979699134827, 1.9719411134719849, 1.5...","[1.4450395107269287, 0.4569704532623291, -3.04..."


Take a look at the data for a single image:

In [17]:
image_name_question_processed[image_name_question_processed.image_name == 'synpic52143.jpg'].head()

Unnamed: 0,image_name,question,answer,group,path,processed_question,processed_answer,diagnosis,question_category,answer_embedding,question_embedding
3,synpic52143.jpg,is this a noncontrast mri?,no,train,C:\Users\Public\Documents\Data\2019\train\Trai...,is this noncontrast mri,no,,Modality,"[0.029011979699134827, 1.9719411134719849, 1.5...","[1.2045111656188965, 0.6815400123596191, -3.26..."
3203,synpic52143.jpg,which plane is the image shown in?,coronal,train,C:\Users\Public\Documents\Data\2019\train\Trai...,which plane is the image shown in,coronal,,Plane,"[-2.5162551403045654, -0.6533107757568359, 0.8...","[-2.4232277870178223, 4.579081058502197, 0.132..."
6403,synpic52143.jpg,the mri shows what organ system?,spine and contents,train,C:\Users\Public\Documents\Data\2019\train\Trai...,the mri shows what organ system,spine and contents,,Organ,"[3.57601261138916, 2.5560226440429688, 1.97663...","[2.0428223609924316, 0.2528434097766876, -1.45..."
9603,synpic52143.jpg,what is the primary abnormality in this image?,bone tumor/ chordoma,train,C:\Users\Public\Documents\Data\2019\train\Trai...,what is the primary abnormality in this image,bone tumor chordoma,tumor bone,Abnormality,"[1.3663643598556519, 0.21053718030452728, -2.3...","[-2.6657872200012207, 1.1844078302383423, 0.02..."


In [18]:
from collections import Counter

How many categories did we get for questions?

In [19]:
print('--Test--')
print(Counter(image_name_question_processed[image_name_question_processed.group=='test'].question_category.values))
print('--All--')
print(Counter(image_name_question_processed.question_category.values))

--Test--
Counter({'Organ': 126, 'Modality': 125, 'Plane': 125, 'Abnormality': 114, 'Abnormality_yes_no': 10})
--All--
Counter({'Organ': 3826, 'Modality': 3825, 'Plane': 3825, 'Abnormality': 3673, 'Abnormality_yes_no': 143})
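
The same kind of count can be collected for every group in a single pass. This is a minimal sketch over hypothetical `(group, question_category)` pairs standing in for the DataFrame rows; the `rows` values are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (group, question_category) pairs standing in for DataFrame rows.
rows = [
    ("train", "Plane"),
    ("train", "Organ"),
    ("test", "Plane"),
    ("train", "Plane"),
    ("validation", "Modality"),
]

# One Counter per group, filled in a single pass over the rows.
counts_by_group = defaultdict(Counter)
for group, category in rows:
    counts_by_group[group][category] += 1

for group in sorted(counts_by_group):
    print(group, dict(counts_by_group[group]))
```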


#### Saving the data so we don't need to recompute it later

In [20]:
saved_path = data_access.save_processed_data(image_name_question_processed)

[2021-09-20 11:13:07][data_access.api][DEBUG] Saving the processed data to:
C:\Users\avitu\Documents\GitHub\VQA-MED\VQA-MED\VQA.Python\data\model_input.parquet
[2021-09-20 11:13:07][common.utils][DEBUG] Starting 'Saving processed data'
[2021-09-20 11:13:36][common.utils][DEBUG] Saving processed data: 0:00:29.574232


In [21]:
print(f'Data saved at:\n{saved_path}')

Data saved at:
C:\Users\avitu\Documents\GitHub\VQA-MED\VQA-MED\VQA.Python\data\model_input.parquet
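
The point of saving is that later runs can load the file instead of recomputing it. A minimal sketch of that load-or-compute pattern follows. It uses the standard library's `pickle` in place of parquet to stay dependency-free, and `load_or_compute`, `expensive_step`, and the `.pkl` file name are hypothetical names for this example, not part of `data_access`.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_compute(cache_path: Path, compute: Callable[[], Any]) -> Any:
    """Return the cached result if the file exists; otherwise compute and save it."""
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    result = compute()
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []  # records each time the expensive step actually runs


def expensive_step() -> dict:
    """Stand-in for the slow preprocessing step (hypothetical)."""
    calls.append(1)
    return {"question": "is this t2 weighted image", "category": "Modality"}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "model_input.pkl"
    first = load_or_compute(cache_file, expensive_step)   # computes and saves
    second = load_or_compute(cache_file, expensive_step)  # loads from disk

print(first == second, len(calls))  # True 1
```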
