# Automatic Diagnosis Generation Given Chest X-rays With Bahdanau Attention

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Compute-Word-Embeddings-using-Gensim" data-toc-modified-id="Compute-Word-Embeddings-using-Gensim-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Compute Word Embeddings using Gensim</a></span></li><li><span><a href="#Split-data-into-train-and-test" data-toc-modified-id="Split-data-into-train-and-test-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Split data into train and test</a></span></li><li><span><a href="#Data-Tokenization-and-Making-Batches" data-toc-modified-id="Data-Tokenization-and-Making-Batches-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Tokenization and Making Batches</a></span></li><li><span><a href="#Encoder-Decoder-Model" data-toc-modified-id="Encoder-Decoder-Model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Encoder Decoder Model</a></span><ul class="toc-item"><li><span><a href="#X-Ray-Encoder" data-toc-modified-id="X-Ray-Encoder-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>X-Ray Encoder</a></span></li><li><span><a href="#X-Ray-Attention" data-toc-modified-id="X-Ray-Attention-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>X-Ray Attention</a></span></li><li><span><a href="#Decoder" data-toc-modified-id="Decoder-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Decoder</a></span></li><li><span><a href="#Optimizer-and-Loss-Function" data-toc-modified-id="Optimizer-and-Loss-Function-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Optimizer and Loss Function</a></span></li></ul></li><li><span><a href="#Model-Training" data-toc-modified-id="Model-Training-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Model Training</a></span></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Model Evaluation</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#References" data-toc-modified-id="References-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>References</a></span></li></ul></div>

In [1]:
# mounting drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
drive_path = '/content/gdrive/My Drive/Assignments_Drive/Case_Study_2/Medical_Data'
# specifying paths
txt_path = drive_path + '/ecgen'
img_path = drive_path + '/images'

Mounted at /content/gdrive


In [2]:
from IPython.display import Javascript
display(Javascript('IPython.notebook.execute_cells_below()'))
import warnings
warnings.filterwarnings('ignore')

<IPython.core.display.Javascript object>

In [3]:
# imports
#!pip install tensorflow-gpu==2.3
#!pip install scikit-learn==0.20.4
# !wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
# !wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.txt.gz"

# from tensorflow.python.framework import ops
# ops.disable_eager_execution()

import os
from os import listdir
import io
import time
import re
import random
import pandas as pd
import numpy as np
from numpy import zeros
from numpy import array
from numpy import asarray
from numpy import save
from bs4 import BeautifulSoup
from tqdm import tqdm
import unicodedata
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from PIL import Image
from pickle import dump
from pickle import load 
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu
from gensim.models import word2vec
from gensim.models import Word2Vec

import sklearn
print(sklearn.__version__)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
import pickle

import tensorflow 
print(tensorflow.__version__)
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.applications.densenet import DenseNet121
from tensorflow.keras.preprocessing.image import load_img
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense 
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dropout
from tensorflow.keras import optimizers
from tensorflow.keras.layers import RepeatVector
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import concatenate
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import LayerNormalization
from tensorflow.keras.utils import plot_model
from tensorflow.keras.regularizers import l2
from tensorflow.keras.regularizers import l1
from tensorflow.keras.backend import categorical_crossentropy
from tensorflow.keras.layers import TimeDistributed
from tensorflow.keras.layers import Reshape
from tensorflow.keras.layers import Concatenate
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.layers import Flatten

from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
# setting the random seeds
SEED = 4
os.environ['PYTHONHASHSEED']=str(SEED)
#os.environ['TF_CUDNN_DETERMINISTIC'] = '4'  # new flag present in tf 2.0+
np.random.seed(SEED)
tensorflow.random.set_seed(SEED)

0.22.2.post1
2.3.0


## Load Data

In [4]:
# read data from the pickle file
data = pd.read_pickle(drive_path + '/data_final.pkl')
print(data.shape)
data.head()

(2610, 6)


Unnamed: 0,UID,FINDINGS,IMPRESSION,IMAGES,IMAGE_FEATURE_1,IMAGE_FEATURE_2
0,CXR3691,<start> the heart is normal in size the medias...,<start> no acute disease <end>,"[CXR3691_IM-1842-1001, CXR3691_IM-1842-3003]","(tf.Tensor(0.00026685063, shape=(), dtype=floa...","(tf.Tensor(4.9858256e-05, shape=(), dtype=floa..."
1,CXR3682,<start> the lungs are hypoventilated there is ...,<start> no acute cardiopulmonary abnormality <...,"[CXR3682_IM-1834-1001, CXR3682_IM-1834-2001]","(tf.Tensor(0.00033830438, shape=(), dtype=floa...","(tf.Tensor(6.356468e-05, shape=(), dtype=float..."
2,CXR3685,<start> calcified thoracic aorta mild rightwar...,<start> no acute cardiopulmonary findings <end>,"[CXR3685_IM-1836-1001, CXR3685_IM-1836-1002]","(tf.Tensor(0.00016475626, shape=(), dtype=floa...","(tf.Tensor(0.0002226108, shape=(), dtype=float..."
3,CXR37,<start> the heart is normal in size the medias...,<start> no acute disease <end>,"[CXR37_IM-1847-0001-0001, CXR37_IM-1847-0001-0...","(tf.Tensor(2.0698715e-05, shape=(), dtype=floa...","(tf.Tensor(0.00041303012, shape=(), dtype=floa..."
4,CXR3703,<start> the eamination consists of frontal and...,<start> no evidence of acute cardiopulmonary p...,"[CXR3703_IM-1850-1001, CXR3703_IM-1850-2001]","(tf.Tensor(0.0003913842, shape=(), dtype=float...","(tf.Tensor(5.5506534e-06, shape=(), dtype=floa..."


In [5]:
# see the unique impressions
print(str(len(data.IMPRESSION.unique())) + ' unique impressions')

print(data.IMPRESSION.value_counts()[:170])

661 unique impressions
<start> no acute cardiopulmonary abnormality <end>                                                                                         388
<start> no acute cardiopulmonary findings <end>                                                                                            177
<start> no acute cardiopulmonary disease <end>                                                                                             145
<start> no acute cardiopulmonary abnormalities <end>                                                                                       127
<start> no active disease <end>                                                                                                            105
                                                                                                                                          ... 
<start> mild stable cardiomegaly with mild central pulmonary vascular congestion and interstitial accentuation, edema <

## Compute Word Embeddings using Gensim

In [6]:
# get all the sentences of the Findings and Impression

FIND = data.FINDINGS.values.tolist()
IMPR = data.IMPRESSION.values.tolist()

# get all the sentences in an array
TOTAL = FIND + IMPR

# check lengths of the array
print(len(FIND))
print(len(IMPR))
print(len(TOTAL))

2610
2610
5220


In [7]:
# create a tokenizer for the total text
total_tokenizer = tensorflow.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
total_tokenizer.fit_on_texts(TOTAL)

# get the dict for the total texts
total_dict = total_tokenizer.word_index

# get the tokens
total_tokens = total_tokenizer.texts_to_sequences(TOTAL)

# check lengths of the tokens
print(len(total_tokens))
print(len(total_dict))

5220
1314


In [8]:
# convert the sentences into array of words

def convert_to_sentences(total_tokens):
    sentences = list()
    # get tokens for a sentence
    for tokens in total_tokens:
        sent = list()
        # add each word to a sent
        for token in tokens:
            sent.append(total_tokenizer.index_word[token])
        sentences.append(sent)
    # return the sentences
    return sentences

total_sentences = convert_to_sentences(total_tokens)

In [9]:
embedding_dim = 100
# define training data
sentences = total_sentences
# train model
w2v_model = Word2Vec(sentences, size=embedding_dim, min_count=1)
# print vocab length
print(len(w2v_model.wv.vocab))
# get all the words
total_words = total_tokenizer.word_index.keys()
print(len(total_words))

1314
1314


In [10]:
# create a dict for the word and embeddings
vectors_dict = dict()
# total_words is the list of words 
for key in total_words:
    # vectors_dict is the dict of word and embeddings
    vectors_dict[key] = w2v_model[key]

#print(vectors_dict)

## Split data into train and test

In [11]:
# get the y col and drop the col
Y_Data = data.IMPRESSION
data.drop('IMPRESSION',axis = 1, inplace = True)

X_Train, X_Test, Y_Train, Y_Test = train_test_split(data, Y_Data, test_size=0.2, random_state = 42)

In [12]:
data['IMPRESSION'] = Y_Data

In [13]:
data.head()

Unnamed: 0,UID,FINDINGS,IMAGES,IMAGE_FEATURE_1,IMAGE_FEATURE_2,IMPRESSION
0,CXR3691,<start> the heart is normal in size the medias...,"[CXR3691_IM-1842-1001, CXR3691_IM-1842-3003]","(tf.Tensor(0.00026685063, shape=(), dtype=floa...","(tf.Tensor(4.9858256e-05, shape=(), dtype=floa...",<start> no acute disease <end>
1,CXR3682,<start> the lungs are hypoventilated there is ...,"[CXR3682_IM-1834-1001, CXR3682_IM-1834-2001]","(tf.Tensor(0.00033830438, shape=(), dtype=floa...","(tf.Tensor(6.356468e-05, shape=(), dtype=float...",<start> no acute cardiopulmonary abnormality <...
2,CXR3685,<start> calcified thoracic aorta mild rightwar...,"[CXR3685_IM-1836-1001, CXR3685_IM-1836-1002]","(tf.Tensor(0.00016475626, shape=(), dtype=floa...","(tf.Tensor(0.0002226108, shape=(), dtype=float...",<start> no acute cardiopulmonary findings <end>
3,CXR37,<start> the heart is normal in size the medias...,"[CXR37_IM-1847-0001-0001, CXR37_IM-1847-0001-0...","(tf.Tensor(2.0698715e-05, shape=(), dtype=floa...","(tf.Tensor(0.00041303012, shape=(), dtype=floa...",<start> no acute disease <end>
4,CXR3703,<start> the eamination consists of frontal and...,"[CXR3703_IM-1850-1001, CXR3703_IM-1850-2001]","(tf.Tensor(0.0003913842, shape=(), dtype=float...","(tf.Tensor(5.5506534e-06, shape=(), dtype=floa...",<start> no evidence of acute cardiopulmonary p...


In [14]:
print(Y_Train.shape)
print(Y_Test.shape)

(2088,)
(522,)


In [15]:
# get the shapes of train cv and test data
print(type(X_Train),X_Train.shape,type(Y_Train),Y_Train.shape)
# lets convert the Y_Train to a dataframe 
Y_DTrain = pd.DataFrame(data=Y_Train.to_list(), columns=['IMPRESSION'])
Y_Train = Y_DTrain

print(type(X_Test),X_Test.shape,type(Y_Test),Y_Test.shape)
# lets convert the Y_Test to a dataframe
Y_DTest = pd.DataFrame(data=Y_Test.to_list(), columns=['IMPRESSION'])
Y_Test = Y_DTest

print('checking shapes after converting y to dataframe')
# lets again check the shapes
print(type(X_Train),X_Train.shape,type(Y_Train),Y_Train.shape)
print(type(X_Test),X_Test.shape,type(Y_Test),Y_Test.shape)

<class 'pandas.core.frame.DataFrame'> (2088, 5) <class 'pandas.core.series.Series'> (2088,)
<class 'pandas.core.frame.DataFrame'> (522, 5) <class 'pandas.core.series.Series'> (522,)
checking shapes after converting y to dataframe
<class 'pandas.core.frame.DataFrame'> (2088, 5) <class 'pandas.core.frame.DataFrame'> (2088, 1)
<class 'pandas.core.frame.DataFrame'> (522, 5) <class 'pandas.core.frame.DataFrame'> (522, 1)


In [16]:
# create the train dataframe
data_train = pd.DataFrame()
data_train['UID'] = X_Train.UID.values.tolist()
data_train['IMAGES'] = X_Train.IMAGES.values.tolist()
data_train['IMAGE_FEATURE_1'] = X_Train.IMAGE_FEATURE_1.values.tolist()
data_train['IMAGE_FEATURE_2'] = X_Train.IMAGE_FEATURE_2.values.tolist()
data_train['FINDINGS'] = X_Train.FINDINGS.values.tolist()
data_train['IMPRESSION'] = Y_Train.IMPRESSION.values
print(data_train.shape)
data_train.head()

(2088, 6)


Unnamed: 0,UID,IMAGES,IMAGE_FEATURE_1,IMAGE_FEATURE_2,FINDINGS,IMPRESSION
0,CXR3755,"[CXR3755_IM-1879-1001, CXR3755_IM-1879-3001]","(tf.Tensor(6.0409708e-05, shape=(), dtype=floa...","(tf.Tensor(0.0, shape=(), dtype=float32), tf.T...",<start> heart size upper limits of normal pulm...,<start> no acute changes from prior imaging <end>
1,CXR969,"[CXR969_IM-2459-1001, CXR969_IM-2459-2001]","(tf.Tensor(8.086289e-05, shape=(), dtype=float...","(tf.Tensor(0.0, shape=(), dtype=float32), tf.T...",<start> heart size upper limits of normal but ...,<start> no acute radiographic cardiopulmonary ...
2,CXR771,"[CXR771_IM-2316-2001, CXR771_IM-2316-1001]","(tf.Tensor(0.0003665378, shape=(), dtype=float...","(tf.Tensor(0.0, shape=(), dtype=float32), tf.T...",<start> heart size and pulmonary vascularity w...,<start> no acute cardiopulmonary disease <end>
3,CXR371,"[CXR371_IM-1852-1001, CXR371_IM-1852-2001]","(tf.Tensor(0.0005545218, shape=(), dtype=float...","(tf.Tensor(6.5811844e-05, shape=(), dtype=floa...",<start> the lungs are clear there is hyperinfl...,<start> copd and old granulomatous disease <end>
4,CXR2645,"[CXR2645_IM-1131-2001, CXR2645_IM-1131-1001]","(tf.Tensor(0.00028274013, shape=(), dtype=floa...","(tf.Tensor(1.4752676e-05, shape=(), dtype=floa...",<start> surgical clips within the right upper ...,<start> no acute cardiopulmonary abnormality <...


In [17]:
# create the test dataframe
data_test = pd.DataFrame()
data_test['UID'] = X_Test.UID.values.tolist()
data_test['IMAGES'] = X_Test.IMAGES.values.tolist()
data_test['IMAGE_FEATURE_1'] = X_Test.IMAGE_FEATURE_1.values.tolist()
data_test['IMAGE_FEATURE_2'] = X_Test.IMAGE_FEATURE_2.values.tolist()
data_test['FINDINGS'] = X_Test.FINDINGS.values.tolist()
data_test['IMPRESSION'] = Y_Test.IMPRESSION.values
print(data_test.shape)
data_test.head()

(522, 6)


Unnamed: 0,UID,IMAGES,IMAGE_FEATURE_1,IMAGE_FEATURE_2,FINDINGS,IMPRESSION
0,CXR2327,"[CXR2327_IM-0898-1001, CXR2327_IM-0898-2001]","(tf.Tensor(0.0009207953, shape=(), dtype=float...","(tf.Tensor(8.276213e-05, shape=(), dtype=float...",<start> there has been interval development of...,<start> interval development of large rightsid...
1,CXR1121,"[CXR1121_IM-0080-1001, CXR1121_IM-0080-2001]","(tf.Tensor(0.00041279235, shape=(), dtype=floa...","(tf.Tensor(8.3250285e-05, shape=(), dtype=floa...",<start> the lungs are clear there is no pleura...,<start> no acute pulmonary disease <end>
2,CXR40,"[CXR40_IM-2050-1001, CXR40_IM-2050-1002]","(tf.Tensor(0.00033369806, shape=(), dtype=floa...","(tf.Tensor(0.00030659474, shape=(), dtype=floa...",<start> mild hyperepansion of the lungs numero...,<start> emphysema with no acute cardiopulmonar...
3,CXR3263,"[CXR3263_IM-1549-1001, CXR3263_IM-1549-2001]","(tf.Tensor(0.00041694037, shape=(), dtype=floa...","(tf.Tensor(7.052128e-05, shape=(), dtype=float...",<start> chest the heart size and cardiomediast...,<start> chest no acute cardiopulmonary finding...
4,CXR751,"[CXR751_IM-2305-1001, CXR751_IM-2305-2001]","(tf.Tensor(0.000753897, shape=(), dtype=float3...","(tf.Tensor(0.00013644715, shape=(), dtype=floa...",<start> heart size within normal limits no foc...,<start> no acute cardiopulmonary findings <end>


## Data Tokenization and Making Batches

In [18]:
# get the vocab for impressions
# impression_tokenizer : tokenizer for impression
impression_tokenizer = tensorflow.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
impression_tokenizer.fit_on_texts(data_train['IMPRESSION'])

# saving tokenizer to file
with open(drive_path + '/impression_tokenizer.pickle', 'wb') as handle:
    pickle.dump(impression_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open(drive_path + '/impression_tokenizer.pickle', 'rb') as handle:
    impression_tokenizer = pickle.load(handle)

# get the dict and save it
impression_dict = impression_tokenizer.word_index

# get the tokens
temp_train_impression = impression_tokenizer.texts_to_sequences(data_train['IMPRESSION'])


# Now let us get the max length of the findings
s = []
for d in temp_train_impression:
    s.append(len(d))

pad_length_impression = max(s)

# get the vocab size
vocab_size_impression = len(impression_tokenizer.word_index) + 1  

# print max lengths
print('The vocab size of the impression is',vocab_size_impression)
print('The maximum length of the impression is',pad_length_impression)


The vocab size of the impression is 573
The maximum length of the impression is 40


In [19]:
# get the embedding matrix for the impression
# impression_tokenizer is the tokenizer for impression

# create a weight matrix for words in training docs

embedding_matrix_impression = zeros((vocab_size_impression, embedding_dim))
for word, i in impression_tokenizer.word_index.items():
	embedding_vector = vectors_dict.get(word)
	if embedding_vector is not None:
		embedding_matrix_impression[i] = embedding_vector

# saving tokenizer to file
with open(drive_path + '/embedding_matrix_impression.pickle', 'wb') as handle:
    pickle.dump(embedding_matrix_impression, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open(drive_path + '/embedding_matrix_impression.pickle', 'rb') as handle:
    embedding_matrix_impression = pickle.load(handle)


print(embedding_matrix_impression[1].shape)
print(embedding_matrix_impression.shape)

impression_matrix = embedding_matrix_impression

(100,)
(573, 100)


In [20]:
# this function will take the dataframe and return the tokenized and padded findings and impression
def tokenize(dataset):
    
    impression_tensor = impression_tokenizer.texts_to_sequences(dataset.IMPRESSION)

    impression_tensor = tensorflow.keras.preprocessing.sequence.pad_sequences(impression_tensor, maxlen = pad_length_impression,
                                                         padding='post')

    return impression_tensor

In [21]:
# this function will convert the image array into numpy
def load_imgs(dataset):
    img_feature_1 = dataset.IMAGE_FEATURE_1.values
    tmp_arr_1 = np.zeros((len(img_feature_1), 1024))
    img_feature_2 = dataset.IMAGE_FEATURE_2.values
    tmp_arr_2 = np.zeros((len(img_feature_2), 1024))
    #print(tmp_arr_train.shape)
    i = 0
    for r in img_feature_1:
        # print(r)
        tmp_arr_1[i] = r
        i += 1

    img_feature_1 = tmp_arr_1

    i = 0
    for r in img_feature_2:
        # print(r)
        tmp_arr_2[i] = r
        i += 1

    img_feature_2 = tmp_arr_2
    
    return img_feature_1, img_feature_2

In [22]:
# this function will load the cleaned images, findings and impression
def load_dataset(dataset, purpose = 'testing'):
    # creating cleaned input, output pairs
    impression_tensor = tokenize(dataset)
    img_feature_1, img_feature_2 = load_imgs(dataset)

    return img_feature_1, img_feature_2, impression_tensor 

In [23]:
img_feature_1_train, img_feature_2_train, impression_tensor_train = load_dataset(data_train)

# print('img_features_train', img_features_train.shape, type(img_features_train), type(img_features_train[0]))
# print('findings_tensor_train', findings_tensor_train.shape, type(findings_tensor_train), type(findings_tensor_train[0]))
# print('impression_tensor_train', impression_tensor_train.shape, type(impression_tensor_train), type(impression_tensor_train[0]))

In [24]:
img_feature_1_test, img_feature_2_test, impression_tensor_test = load_dataset(data_test)

# print('img_features_test', img_features_test.shape, type(img_features_test), type(img_features_test[0]))
# print('findings_tensor_test', findings_tensor_test.shape, type(findings_tensor_test), type(findings_tensor_test[0]))
# print('impression_tensor_test', impression_tensor_test.shape, type(impression_tensor_test), type(impression_tensor_test[0]))

In [25]:
# add the tokens to the dataframe
data_impression_tensor = impression_tokenizer.texts_to_sequences(data.IMPRESSION)

data['IMPRESSION_TOKENS'] = data_impression_tensor

data.head()

Unnamed: 0,UID,FINDINGS,IMAGES,IMAGE_FEATURE_1,IMAGE_FEATURE_2,IMPRESSION,IMPRESSION_TOKENS
0,CXR3691,<start> the heart is normal in size the medias...,"[CXR3691_IM-1842-1001, CXR3691_IM-1842-3003]","(tf.Tensor(0.00026685063, shape=(), dtype=floa...","(tf.Tensor(4.9858256e-05, shape=(), dtype=floa...",<start> no acute disease <end>,"[1, 3, 4, 6, 2]"
1,CXR3682,<start> the lungs are hypoventilated there is ...,"[CXR3682_IM-1834-1001, CXR3682_IM-1834-2001]","(tf.Tensor(0.00033830438, shape=(), dtype=floa...","(tf.Tensor(6.356468e-05, shape=(), dtype=float...",<start> no acute cardiopulmonary abnormality <...,"[1, 3, 4, 5, 7, 2]"
2,CXR3685,<start> calcified thoracic aorta mild rightwar...,"[CXR3685_IM-1836-1001, CXR3685_IM-1836-1002]","(tf.Tensor(0.00016475626, shape=(), dtype=floa...","(tf.Tensor(0.0002226108, shape=(), dtype=float...",<start> no acute cardiopulmonary findings <end>,"[1, 3, 4, 5, 8, 2]"
3,CXR37,<start> the heart is normal in size the medias...,"[CXR37_IM-1847-0001-0001, CXR37_IM-1847-0001-0...","(tf.Tensor(2.0698715e-05, shape=(), dtype=floa...","(tf.Tensor(0.00041303012, shape=(), dtype=floa...",<start> no acute disease <end>,"[1, 3, 4, 6, 2]"
4,CXR3703,<start> the eamination consists of frontal and...,"[CXR3703_IM-1850-1001, CXR3703_IM-1850-2001]","(tf.Tensor(0.0003913842, shape=(), dtype=float...","(tf.Tensor(5.5506534e-06, shape=(), dtype=floa...",<start> no evidence of acute cardiopulmonary p...,"[1, 3, 13, 9, 4, 5, 11, 2]"


In [26]:
# add the tokens to the dataframe
data_train_impression_tensor = impression_tokenizer.texts_to_sequences(data_train.IMPRESSION)

data_train['IMPRESSION_TOKENS'] = data_train_impression_tensor

data_train.head()

Unnamed: 0,UID,IMAGES,IMAGE_FEATURE_1,IMAGE_FEATURE_2,FINDINGS,IMPRESSION,IMPRESSION_TOKENS
0,CXR3755,"[CXR3755_IM-1879-1001, CXR3755_IM-1879-3001]","(tf.Tensor(6.0409708e-05, shape=(), dtype=floa...","(tf.Tensor(0.0, shape=(), dtype=float32), tf.T...",<start> heart size upper limits of normal pulm...,<start> no acute changes from prior imaging <end>,"[1, 3, 4, 38, 138, 139, 156, 2]"
1,CXR969,"[CXR969_IM-2459-1001, CXR969_IM-2459-2001]","(tf.Tensor(8.086289e-05, shape=(), dtype=float...","(tf.Tensor(0.0, shape=(), dtype=float32), tf.T...",<start> heart size upper limits of normal but ...,<start> no acute radiographic cardiopulmonary ...,"[1, 3, 4, 19, 5, 11, 2]"
2,CXR771,"[CXR771_IM-2316-2001, CXR771_IM-2316-1001]","(tf.Tensor(0.0003665378, shape=(), dtype=float...","(tf.Tensor(0.0, shape=(), dtype=float32), tf.T...",<start> heart size and pulmonary vascularity w...,<start> no acute cardiopulmonary disease <end>,"[1, 3, 4, 5, 6, 2]"
3,CXR371,"[CXR371_IM-1852-1001, CXR371_IM-1852-2001]","(tf.Tensor(0.0005545218, shape=(), dtype=float...","(tf.Tensor(6.5811844e-05, shape=(), dtype=floa...",<start> the lungs are clear there is hyperinfl...,<start> copd and old granulomatous disease <end>,"[1, 63, 23, 169, 122, 6, 2]"
4,CXR2645,"[CXR2645_IM-1131-2001, CXR2645_IM-1131-1001]","(tf.Tensor(0.00028274013, shape=(), dtype=floa...","(tf.Tensor(1.4752676e-05, shape=(), dtype=floa...",<start> surgical clips within the right upper ...,<start> no acute cardiopulmonary abnormality <...,"[1, 3, 4, 5, 7, 2]"


In [27]:
# add the tokens to the dataframe
data_test_impression_tensor = impression_tokenizer.texts_to_sequences(data_test.IMPRESSION)

data_test['IMPRESSION_TOKENS'] = data_test_impression_tensor

data_test.head()

Unnamed: 0,UID,IMAGES,IMAGE_FEATURE_1,IMAGE_FEATURE_2,FINDINGS,IMPRESSION,IMPRESSION_TOKENS
0,CXR2327,"[CXR2327_IM-0898-1001, CXR2327_IM-0898-2001]","(tf.Tensor(0.0009207953, shape=(), dtype=float...","(tf.Tensor(8.276213e-05, shape=(), dtype=float...",<start> there has been interval development of...,<start> interval development of large rightsid...,"[1, 84, 563, 9, 135, 285, 32, 55, 518, 118, 21..."
1,CXR1121,"[CXR1121_IM-0080-1001, CXR1121_IM-0080-2001]","(tf.Tensor(0.00041279235, shape=(), dtype=floa...","(tf.Tensor(8.3250285e-05, shape=(), dtype=floa...",<start> the lungs are clear there is no pleura...,<start> no acute pulmonary disease <end>,"[1, 3, 4, 12, 6, 2]"
2,CXR40,"[CXR40_IM-2050-1001, CXR40_IM-2050-1002]","(tf.Tensor(0.00033369806, shape=(), dtype=floa...","(tf.Tensor(0.00030659474, shape=(), dtype=floa...",<start> mild hyperepansion of the lungs numero...,<start> emphysema with no acute cardiopulmonar...,"[1, 71, 14, 3, 4, 5, 8, 2]"
3,CXR3263,"[CXR3263_IM-1549-1001, CXR3263_IM-1549-2001]","(tf.Tensor(0.00041694037, shape=(), dtype=floa...","(tf.Tensor(7.052128e-05, shape=(), dtype=float...",<start> chest the heart size and cardiomediast...,<start> chest no acute cardiopulmonary finding...,"[1, 15, 3, 4, 5, 61, 25, 559, 72, 560, 38, 274..."
4,CXR751,"[CXR751_IM-2305-1001, CXR751_IM-2305-2001]","(tf.Tensor(0.000753897, shape=(), dtype=float3...","(tf.Tensor(0.00013644715, shape=(), dtype=floa...",<start> heart size within normal limits no foc...,<start> no acute cardiopulmonary findings <end>,"[1, 3, 4, 5, 8, 2]"


In [28]:
print('train length ', impression_tensor_train.shape[0])
print('test length', impression_tensor_test.shape[0])

train_len = impression_tensor_train.shape[0]
test_len = impression_tensor_test.shape[0]

train length  2088
test length 522


## Encoder Decoder Model

In [29]:
# setting some vartables
BUFFER_SIZE = len(impression_tensor_train)
BATCH_SIZE = 128
steps_per_epoch = len(impression_tensor_train)//BATCH_SIZE
units = 256
vocab_tar_size = len(impression_tokenizer.word_index)+1

# creating the tensorflow datasets
dataset_train = tensorflow.data.Dataset.from_tensor_slices((img_feature_1_train, img_feature_2_train, impression_tensor_train)).shuffle(BUFFER_SIZE)
dataset_train = dataset_train.batch(BATCH_SIZE, drop_remainder=True)

dataset_test = tensorflow.data.Dataset.from_tensor_slices((img_feature_1_test, img_feature_2_test, impression_tensor_test)).shuffle(BUFFER_SIZE)
dataset_test = dataset_test.batch(BATCH_SIZE, drop_remainder=True)

In [30]:
# looking at the example batch shapes
example_img1_batch, example_img2_batch, example_target_batch = next(iter(dataset_train))
example_img1_batch.shape ,example_img2_batch.shape, example_target_batch.shape

(TensorShape([128, 1024]), TensorShape([128, 1024]), TensorShape([128, 40]))

### X-Ray Encoder 

In [31]:
# this class is for the x-ray features encoder 
class Encoder_Xray(tensorflow.keras.Model):
    # Since you have already extracted the features and dumped it using pickle
    # This encoder passes those features through a Fully connected layer
    def __init__(self, embedding_dim):
        super(Encoder_Xray, self).__init__()
        # shape after fc == (batch_size, 64, embedding_dim)
        self.fc = tensorflow.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tensorflow.nn.relu(x)
        return x

### X-Ray Attention

In [32]:
# this class is for the xray features Attention
class BahdanauAttention_Xray(tensorflow.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention_Xray, self).__init__()
    self.W1 = tensorflow.keras.layers.Dense(units)
    self.W2 = tensorflow.keras.layers.Dense(units)
    self.W3 = tensorflow.keras.layers.Dense(units)
    self.W4 = tensorflow.keras.layers.Dense(units)
    self.V = tensorflow.keras.layers.Dense(1)
    self.add = tensorflow.keras.layers.Add()

  def call(self, features, hidden):
    # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)
    # hidden shape == (batch_size, hidden_size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
    hidden_with_time_axis = tensorflow.expand_dims(hidden, 1)
    # score shape == (batch_size, 64, hidden_size)
    score = tensorflow.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))
    # attention_weights shape == (batch_size, 64, 1)
    # you get 1 at the last axis because you are applying score to self.V
    attention_weights = tensorflow.nn.softmax(self.V(score), axis=1)
    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * features
    context_vector = tensorflow.reduce_sum(context_vector, axis=1)
    return context_vector, attention_weights
# this class is for the decoder

### Decoder

In [33]:
class Decoder(tensorflow.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(Decoder, self).__init__()
        self.units = units

        self.embedding = tensorflow.keras.layers.Embedding(vocab_size, embedding_dim, weights=[impression_matrix], mask_zero=True)
        self.gru = tensorflow.keras.layers.GRU(self.units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
        self.fc1 = tensorflow.keras.layers.Dense(self.units, activation='relu')
        self.fc2 = tensorflow.keras.layers.Dense(vocab_size)

        self.attention1 = BahdanauAttention_Xray(self.units)
        self.attention2 = BahdanauAttention_Xray(self.units)


    def call(self, x = np.zeros((1,1)), features1 = np.zeros((1,100)), features2 = np.zeros((1,100)), hidden = np.zeros((1,256))):
        # defining attention as a separate model
        context_vector1, attention_weights1 = self.attention1(features1, hidden)
        context_vector2, attention_weights2 = self.attention2(features2, hidden)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        x = tensorflow.concat([tensorflow.expand_dims(context_vector1, 1), tensorflow.expand_dims(context_vector2, 1), x], axis=-1)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        # shape == (batch_size, max_length, hidden_size)
        x = self.fc1(output)
        # x_shape == (batch_size * max_length, hidden_size)
        x = tensorflow.reshape(x, (-1, x.shape[2]))
        # output_shape == (batch_size * max_length, vocab)
        x = self.fc2(x)

        return x, state

    def reset_state(self, batch_size):
        return tensorflow.zeros((batch_size, self.units))

### Optimizer and Loss Function

In [34]:
# initializing the optimizer and the loss function
optimizer = tensorflow.keras.optimizers.Adam(0.01)
loss_object = tensorflow.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    # get the mask
    mask = tensorflow.math.logical_not(tensorflow.math.equal(real, 0))
    # calculate loss
    loss_ = loss_object(real, pred)
    # cast mask
    mask = tensorflow.cast(mask, dtype=loss_.dtype)
    # loss = loss * mask
    loss_ *= mask
    # normalize loss
    loss_ /= pad_length_impression
    # calculate mean and return loss
    return tensorflow.reduce_mean(loss_)

## Model Training

In [35]:
# initialize model
encoder1 = Encoder_Xray(embedding_dim)
encoder2 = Encoder_Xray(embedding_dim)
decoder = Decoder(embedding_dim, units, vocab_tar_size)

In [36]:
@tensorflow.function
def train_step(img_tensor1, img_tensor2, target):
    # initialize loss
    loss = 0

    # initializing the hidden state for each batch
    # because the impressions are not related from image to image
    hidden = decoder.reset_state(batch_size=target.shape[0])
    # put the first input
    dec_input = tensorflow.expand_dims([impression_tokenizer.word_index['<start>']] * target.shape[0], 1)
    # using gradient taping
    with tensorflow.GradientTape() as tape:
        # get image features
        features1 = encoder1(img_tensor1)
        features2 = encoder2(img_tensor2)
        # loop over all the words in the impression
        for i in range(1, target.shape[1]):
            # get the predections
            predictions, hidden = decoder(dec_input, features1, features2, hidden)
            # calculate loss
            loss += loss_function(target[:, i], predictions)
            # using teacher forcing
            dec_input = tensorflow.expand_dims(target[:, i], 1)
            #print(dec_input.shape)
    total_loss = loss 
    # get trainable variables
    trainable_variables = encoder1.trainable_variables + encoder2.trainable_variables + decoder.trainable_variables
    # get gradients
    gradients = tape.gradient(loss, trainable_variables)
    # apply gradients
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    # return loss
    return total_loss



# function to calculate blue score
def calc_blue(img1, img2, target):
    # initializing the hidden state for each batch
    # because the impressions are not related from image to image
    hidden = decoder.reset_state(batch_size=1)

    # reshape image features
    img1 = tensorflow.keras.backend.reshape(img1, shape=(1, -1))
    img2 = tensorflow.keras.backend.reshape(img2, shape=(1, -1))
    # get the target sentence
    target_sent = list()
    for t in target:
        target_sent.append(impression_tokenizer.index_word[t])
    # get image features
    features1 = encoder1(img1)
    features2 = encoder2(img2)
    # initial decoder input
    dec_input = tensorflow.expand_dims([impression_tokenizer.word_index['<start>']], 0)
    # initialize the result array
    result = []
    result.append('<start>')
    # loop for the entire pad lenght
    for i in range(pad_length_impression):
        # predict
        predictions, hidden = decoder(dec_input, features1, features2, hidden)

        # calculate max
        predicted_id = predictions.numpy().argmax()
        # append the predicted word to result array
        result.append(impression_tokenizer.index_word[predicted_id])
        # if '<end>' is reached
        if impression_tokenizer.index_word[predicted_id] == '<end>':
            # calculate bleu score and return
            score = sentence_bleu([target_sent], result)
            return score
        # the next input to the model is predected at this step
        dec_input = tensorflow.expand_dims([predicted_id], 0)
        
    # calculate score at the end and return it
    score = sentence_bleu([target_sent], result)
    return score


In [37]:
# epochs
EPOCHS = 250

# define loss
training_loss = tensorflow.keras.metrics.Mean(name='training_loss')

for epoch in range(EPOCHS):
    start = time.time()
    # get the batch
    for (batch, (img_tensor1_train, img_tensor2_train, target_train)) in enumerate(dataset_train):
        # go to training step
        t_loss = train_step(img_tensor1_train, img_tensor2_train, target_train)
        # normalize loss over the data
        training_loss(t_loss)
    
    print ('Epoch {} Training Loss {:.6f}'.format(epoch + 1, training_loss.result()))
    
    print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Epoch 1 Training Loss 0.764666
Time taken for 1 epoch 43.1100537776947 sec

Epoch 2 Training Loss 0.678342
Time taken for 1 epoch 1.7557966709136963 sec

Epoch 3 Training Loss 0.619599
Time taken for 1 epoch 1.751091480255127 sec

Epoch 4 Training Loss 0.571592
Time taken for 1 epoch 1.75447416305542 sec

Epoch 5 Training Loss 0.535569
Time taken for 1 epoch 1.7518906593322754 sec

Epoch 6 Train

In [38]:
# calculate blue scores
# train bleu score
train_bleu = 0
# get train df values
data_train_vals = data_train.values
for val in data_train_vals:
    bleu = calc_blue(val[2], val[3], val[-1])
    train_bleu += bleu
train_bleu /= train_len

print ('Train BLEU score {:.6f}'.format(train_bleu))

# test bleu score
test_bleu = 0
# get test df values
data_test_vals = data_test.values
for val in data_test_vals:
    bleu = calc_blue(val[2], val[3], val[-1])
    test_bleu += bleu
test_bleu /= test_len

print ('Test BLEU score {:.6f}'.format(test_bleu))

Train BLEU score 0.442821
Test BLEU score 0.442904


## Model Evaluation

In [39]:
# this function will do greedy search
def greedy_search(img1, img2, target):
    # reset hidden states
    hidden = decoder.reset_state(batch_size=1)
    # reshape img vectors
    img1 = tensorflow.keras.backend.reshape(img1, shape=(1, -1))
    img2 = tensorflow.keras.backend.reshape(img2, shape=(1, -1))
    # get image features
    features1 = encoder1(img1)
    features2 = encoder2(img2)
    # decoder input = start
    dec_input = tensorflow.expand_dims([impression_tokenizer.word_index['<start>']], 0)
    result = []
    result.append('<start>')
    # loop for pad length
    for i in range(pad_length_impression):
        
        # get predections
        predictions, hidden = decoder(dec_input, features1, features2, hidden)
        # get argmax of predicted id
        predicted_id = predictions.numpy().argmax()
        result.append(impression_tokenizer.index_word[predicted_id])
        # if end is reached return
        if impression_tokenizer.index_word[predicted_id] == '<end>':
            return result
        # predicted output = next input
        dec_input = tensorflow.expand_dims([predicted_id], 0)
    # return
    return result


In [40]:
# get 5 random samples
data_eval = data.sample(5)
data_eval_vals = data_eval.values

# print greedy search outputs
for val in data_eval_vals:
    print('___________________________________________________NEW__DATA__POINT___________________________________________________')
    result = greedy_search(val[3], val[4], val[5])
    print('Actual Impression ', val[5])
    print('Generated Impression ', result)

___________________________________________________NEW__DATA__POINT___________________________________________________
Actual Impression  <start> no acute cardiopulmonary disease <end>
Generated Impression  ['<start>', 'no', 'acute', 'disease', '<end>']
___________________________________________________NEW__DATA__POINT___________________________________________________
Actual Impression  <start> no acute cardiopulmonary abnormality <end>
Generated Impression  ['<start>', 'no', 'acute', 'cardiopulmonary', 'disease', '<end>']
___________________________________________________NEW__DATA__POINT___________________________________________________
Actual Impression  <start> no acute cardiopulmonary disease <end>
Generated Impression  ['<start>', 'no', 'acute', 'disease', 'negative', 'for', 'hemoptysis', 'to', 'previous', 'acute', 'disease', 'negative', 'for', 'hemoptysis', 'to', 'previous', 'acute', 'disease', 'negative', 'for', 'hemoptysis', 'to', 'previous', 'acute', 'disease', 'negative',

In [41]:
# this function will do beam search
def beam_search(img1, img2, target, beam_width = 5):

    # reset hidden states
    hidden = decoder.reset_state(batch_size=1)
    # reshape img vectors
    img1 = tensorflow.keras.backend.reshape(img1, shape=(1, -1))
    img2 = tensorflow.keras.backend.reshape(img2, shape=(1, -1))
    # get image features
    features1 = encoder1(img1)
    features2 = encoder2(img2)
    # decoder input = start
    start = [impression_tokenizer.word_index['<start>']]
    start_word = [[start, 0.0]]
    
    while len(start_word[0][0]) < pad_length_impression:
        temp = []
        for s in start_word:
            
            dec_input = pad_sequences([[s[0][-1]]])
            # get the predections 
            preds, hidden = decoder(dec_input, features1, features2, hidden)
            
            # Getting the top <beam_width>(n) predictions
            top_words = np.argsort(preds).flatten()
            word_preds = top_words[-beam_width:]
            
            # creating a new list so as to put them via the model again
            for w in word_preds:
                next_cap, prob = s[0][:], s[1]
                next_cap.append(w)
                prob += preds[0][w]
                temp.append([next_cap, prob])
                    
        start_word = temp
        # Sorting according to the probabilities
        start_word = sorted(start_word, reverse=False, key=lambda l: l[1])
        # Getting the top words
        start_word = start_word[-beam_width:]
    
    # update start word
    start_word = start_word[-1][0]
    # intermediate caption
    intermediate_caption = [impression_tokenizer.index_word[i] for i in start_word]
    # generate final captions
    final_caption = []
    for i in intermediate_caption:
        if i != '<end>':
            final_caption.append(i)
        else:
            break
    # return final captions
    final_caption = final_caption[1:]
    final_caption.insert(0, '<start>')
    if len(final_caption) <= 39:
        final_caption.append('<end>')
    return final_caption

In [42]:
# get 5 random samples
data_eval = data.sample(5)
data_eval_vals = data_eval.values

# print beam search outputs
for val in data_eval_vals:
    print('___________________________________________________NEW__DATA__POINT___________________________________________________')
    result = beam_search(val[3], val[4], val[5], 5)
    print('Actual Impression ', val[5])
    print('Generated Impression ', result)

___________________________________________________NEW__DATA__POINT___________________________________________________
Actual Impression  <start> no acute disease <end>
Generated Impression  ['<start>', 'no', 'focal', 'areas', 'of', 'cystic', 'fibrosis', '<end>']
___________________________________________________NEW__DATA__POINT___________________________________________________
Actual Impression  <start> no acute findings <end>
Generated Impression  ['<start>', 'normal', 'clear', 'lungs', 'otherwise', 'clear', 'lungs', 'otherwise', 'clear', 'lungs', 'otherwise', 'clear', 'lungs', 'otherwise', 'clear', 'lungs', 'otherwise', 'clear', 'lungs', 'otherwise', 'clear', 'lungs', 'otherwise', 'clear', 'lungs', 'otherwise', 'clear', 'lungs', 'otherwise', 'clear', 'lungs', 'otherwise', 'clear', 'lungs', 'otherwise', 'clear', 'lungs', 'otherwise', 'clear', 'lungs']
___________________________________________________NEW__DATA__POINT___________________________________________________
Actual Impres

In [44]:
encoder1.save_weights(drive_path + '/encoder1_weights', save_format='tf')
encoder2.save_weights(drive_path + '/encoder2_weights', save_format='tf')
decoder.save_weights(drive_path + '/decoder_weights', save_format='tf')

encoder1.save_weights(drive_path + '/encoder1.h5')
encoder2.save_weights(drive_path + '/encoder2.h5')
decoder.save_weights(drive_path + '/decoder.h5')

## Conclusions

- In this Notebook we used the Bahdanau Attention and Teacher Forcing Technique to train the Models
- We used the GRU's as 
- We used Greedy Search and Beam Search to generate the Impression
- We also used Masked Loss in this notebook
- The Bleu Score that we got from our model is 0.44

## References

- https://www.tensorflow.org/tutorials/text/image_captioning#model
- https://www.tensorflow.org/tutorials/text/nmt_with_attention#write_the_encoder_and_decoder_model
- https://stackoverflow.com/questions/50786987/multiclass-classification-to-balance-in-python-over-sampling
- https://radimrehurek.com/gensim/models/word2vec.html
- https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
- https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
- https://github.com/nagapavan525/radiology-report-generation/blob/master/radiology_report_generation_final/Automated_Radiology_Report_Generation_attention%20(1).ipynb
- https://github.com/nagapavan525/radiology-report-generation/blob/master/NewIntegrationWithIndication/1_Capstone-Radiology-PreProcessing.ipynb
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- https://stackoverflow.com/questions/2612548/extracting-an-attribute-value-with-beautifulsoup
- https://stackoverflow.com/questions/24962673/beautiful-soup-getting-tag-id
- https://stackoverflow.com/a/47091490/4084039
- https://www.appservgrid.com/psam/Python_Samplifier--python1compute--Python_Program_to_Find_the_Size_(Resolution)_of_a_Image.html
- https://www.geeksforgeeks.org/working-images-python/
- https://gist.github.com/sebleier/554280
- https://stackoverflow.com/questions/27488446/how-do-i-get-word-frequency-in-a-corpus-using-scikit-learn-countvectorizer
- https://www.geeksforgeeks.org/python-remove-all-digits-from-a-list-of-strings/
- https://stackoverflow.com/questions/12851791/removing-numbers-from-string
- https://github.com/nagapavan525/radiology-report-generation/blob/master/radiology_report_generation_final/AutomatedRadiologyReportGenerationWithSentenceEmbeddings.ipynb
- https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/