## Utilizing Untagged Medical Literature for Diagnoses using Word Embeddings
### Submitted By:
#### Ben Muller
#### Sukumar Hakhoo

<br>

### Abstract

This project aims to establish and evaluate a methodology for computationally processing medical literature and drawing conclusions from it. We build the project around a symptom-disease paradigm, using NLP techniques to traverse large quantities of text. As a reference for evaluating our findings, we use COVID-19, together with a dataset of literature on COVID-19 and related diseases.

<br>

### Data Load

In [1]:
import json
import gensim

from pathlib import Path
from scipy.spatial.distance import cosine
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec

from pprint import pprint

import pandas as pd

In [2]:
def create_paper_dict(paper):
    """
    Reads in a research paper and returns a dictionary containing the paper ID, abstract, and main text.
    Input: research paper --> JSON file
    Output: {paper_id: , abstract: , body_text: } --> dictionary
    """
    paper_dict = {}
    abstract = ''
    text = ''
    
    # Many papers don't have abstracts, so fall back to an empty list.
    for i in paper.get('abstract') or []:
        abstract += i['text']
    for i in paper['body_text']:
        text += i['text']
    
    paper_dict['paper_id'] = paper['paper_id']
    paper_dict['abstract'] = abstract
    paper_dict['body_text'] = text
    
    return paper_dict


# data_path = 'C://Users//Binyamin//PythonProjects//NLP//final_project//data//'
data_path = 'data'
lit = []

# Search the data directory recursively for .json files and build a list of paper dictionaries.
pathlist = Path(data_path).glob('**/*.json')
for path in pathlist:
    path_in_str = str(path)  # because path is object not string
    with open(path_in_str) as f:
        data = json.load(f)
    paper_dict = create_paper_dict(data)
    lit.append(paper_dict)

In [3]:
len(lit)

788

#### Literature - Text Sample

In [4]:
lit[0]['body_text'][: 963]

'It is highly contagious, and severe cases can lead to acute respiratory distress or multiple organ failure [3] . On 11 March 2020, the WHO has made the assessment that COVID-19 can be characterised as a pandemic. As of , in total, 1,391,890 cases of COVID-19 have been recorded, and the death toll has reached 81,478 with a rapid increase of cases in Europe and NorthAmerica.8th April 2020The disease can be confirmed by using the reverse-transcription polymerase chain reaction (RT-PCR) test [4] . While being the gold standard for diagnosis, confirming COVID-19 patients using RT-PCR is time-consuming, and both high false-negative rates and low sensitivities may put hurdles for the presumptive patients to be identified and treated early [3] [5] [6] .As a non-invasive imaging technique, computed tomography (CT) can detect those characteristics, e.g., bilateral patchy shadows or ground glass opacity (GGO), manifested in the COVID-19 infected lung [7] [8] .'
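
The sample shows paragraph boundaries fused together (for example `[5] [6] .As a non-invasive`), because `create_paper_dict` concatenates sections with no separator. Fused boundaries may keep `sent_tokenize` from splitting sentences at those points. Below is a minimal sketch of one alternative that joins sections with a single space. The helper `join_sections` and the toy `paper` record are hypothetical and not part of the notebook.

```python
def join_sections(paper, key):
    """Join the 'text' of every section under `key`, separated by one space.

    Hypothetical helper: returns '' when the key is missing or empty.
    """
    sections = paper.get(key) or []
    return ' '.join(section['text'].strip() for section in sections)


# Toy record shaped like the JSON files loaded above (contents invented).
paper = {
    'paper_id': 'toy-0001',
    'body_text': [
        {'text': 'Severe cases can lead to respiratory distress.'},
        {'text': 'CT can detect ground glass opacity.'},
    ],
}

body = join_sections(paper, 'body_text')
abstract = join_sections(paper, 'abstract')  # this record has no abstract
print(body)
print(repr(abstract))
```

With a separator in place, each sentence boundary keeps the whitespace that sentence tokenizers usually rely on.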

<br>

### Collating all the papers

In [5]:
def collate_papers(lit):
    """
    Joins the body text of the papers into a single string.
    Note: only the first 400 papers of `lit` are used, whatever its length.
    """
    papers = [paper['body_text'] for paper in lit[: 400]]
    return ' '.join(papers)

<br>

### Cleaning, formatting and tokenizing the data

In [6]:
import re
from spacy.lang.en.stop_words import STOP_WORDS

In [7]:
def clean(sentence):
    sentence = sentence.lower().strip()
    sentence = re.sub(r'[^a-z0-9\s]', '', sentence)
    return re.sub(r'\s{2,}', ' ', sentence)

def format_data(data):
    # Lowercase the text and map the common spellings of the disease name to one token.
    data = data.lower()
    for variant in ("covid 19", "coronavirus", "corona virus", "covid-19"):
        data = data.replace(variant, "covid19")
    return data

def get_tokens(data):
    # Returns a list of sentences, each a list of lowercase word tokens.
    data_formatted = format_data(data)
    return [word_tokenize(text) for text in sent_tokenize(data_formatted)]

def tokenize_and_exclude_stop(data):
    # Returns a flat list of whitespace-separated tokens with stop words removed.
    data_formatted = format_data(data)
    return [token for token in data_formatted.split() if token not in STOP_WORDS]
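
The normalization helpers can be sanity-checked on a single sentence. The sketch below repeats `clean` and `format_data` from the cell above so that it runs on its own; the sample sentence is invented for illustration.

```python
import re


def clean(sentence):
    sentence = sentence.lower().strip()
    sentence = re.sub(r'[^a-z0-9\s]', '', sentence)  # drop punctuation
    return re.sub(r'\s{2,}', ' ', sentence)          # collapse repeated whitespace


def format_data(data):
    data = data.lower()
    return (data.replace("covid 19", "covid19")
                .replace("coronavirus", "covid19")
                .replace("corona virus", "covid19")
                .replace("covid-19", "covid19"))


sample = "COVID-19 (Corona virus) causes  dry cough."  # invented sentence
normalized = clean(format_data(sample))
print(normalized)
```

Both spellings of the disease name end up as the single token `covid19`, which is what lets the embedding model treat them as one word.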

### Single Word Embeddings

#### Hyper-Parameter Tuning

In [8]:
tuning_set = collate_papers(lit[: 500])

In [9]:
tuning_set_tokens = get_tokens(tuning_set)

In [10]:
## size (int, optional) – Dimensionality of the word vectors.
## window (int, optional) – Maximum distance between the current and predicted word within a sentence.
## min_count (int, optional) – Ignores all words with total frequency lower than this.
## workers (int, optional) – Use these many worker threads to train the model
## sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.

params = {
    "size": [100, 200, 300],
    "window": [4, 5, 6],
    "min_count": [1, 2, 4],
    "sg": [0, 1]
}

avg_similarity = 0

config = {
    "size": 300,
    "window": 5,
    "min_count": 1,
    "sg": 1
}

for s in params["size"]:
    for w in params["window"]:
        for m in params["min_count"]:
            for s_g in params["sg"]:
                model = gensim.models.Word2Vec(tuning_set_tokens, min_count = m, size = s, window = w, sg = s_g, workers=4)
                # Score each configuration by the similarity between 'covid19' and 'contagious'.
                av = model.wv.similarity('covid19', 'contagious')
                if av > avg_similarity:
                    # Remember the best score and the values that produced it.
                    avg_similarity = av
                    config["size"], config["window"], config["min_count"], config["sg"] = s, w, m, s_g
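
The grid search pattern used here can be sketched without training any model. In the sketch below, `toy_score` is an invented stand-in for "train a Word2Vec model and measure a similarity"; only the bookkeeping (track the best score and the values that produced it) is the point.

```python
from itertools import product

params = {
    "size": [100, 200, 300],
    "window": [4, 5, 6],
    "min_count": [1, 2, 4],
    "sg": [0, 1],
}


def toy_score(size, window, min_count, sg):
    # Invented scoring function, used only so the loop has something to maximize.
    return 1.0 - abs(size - 200) / 1000 - abs(window - 5) / 10 - min_count / 100 + 0.05 * sg


best_score = float("-inf")  # cosine similarities can be negative, so do not start at 0
best_config = None

# product() walks every combination in the same order as the nested loops.
for size, window, min_count, sg in product(*params.values()):
    score = toy_score(size, window, min_count, sg)
    if score > best_score:
        best_score = score
        best_config = {"size": size, "window": window, "min_count": min_count, "sg": sg}

print(best_config)
```

Storing the loop values themselves (not the candidate lists) keeps `best_config` directly usable as model arguments.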

<br>

#### Computing the Embeddings

In [11]:
all_papers = collate_papers(lit)

In [12]:
all_papers_tokenized = get_tokens(all_papers)

In [13]:
single_word_model = Word2Vec(all_papers_tokenized, min_count = config["min_count"], size = config["size"], window = config["window"], sg = config["sg"], workers=4)

In [14]:
# Checking the vectors
print(single_word_model.wv['covid19'])

[ 3.98642004e-01 -4.90289629e-02 -2.72744328e-01 -9.34871286e-02
  5.56738675e-02 -2.27702707e-01  7.60516524e-02 -1.16866022e-01
  4.56629833e-03 -1.47687525e-01 -9.63129103e-02 -3.59593391e-01
 -6.10619523e-02  2.42732882e-01 -4.35110554e-03  1.06496245e-01
 -2.32995719e-01  2.94045597e-01 -2.15488151e-02 -4.90947247e-01
  3.42011005e-01  3.64054143e-02  1.75463215e-01  1.54116884e-01
 -1.23976292e-02 -3.69291544e-01  1.05059762e-02  4.86418992e-01
 -6.19644374e-02  1.13475747e-01  5.77975400e-02  2.17263058e-01
  1.76578552e-01 -1.47122458e-01 -8.13105553e-02  9.96737778e-02
  1.54389232e-01 -4.26848941e-02  4.48325843e-01 -4.14113790e-01
 -4.98222589e-01  3.66290301e-01 -2.27412611e-01  1.52295634e-01
 -4.21023458e-01 -2.37446729e-04 -3.13821554e-01  1.33714676e-01
  2.67412901e-01  1.61777735e-01  3.68710041e-01  1.98089093e-01
 -2.07741916e-01 -3.72700810e-01 -2.68490583e-01 -6.04679734e-02
  1.07121244e-02  5.00592217e-02 -9.68020782e-02  7.58925900e-02
 -2.36507490e-01  2.50098

In [15]:
# Checking the vector dimension
len(single_word_model.wv['covid19'])

300

<br>

### Phrase embeddings

#### Method 1

In [18]:
from gensim.models import Phrases
from gensim.models.phrases import Phraser

In [19]:
bigram_transformer = Phrases(all_papers_tokenized)

In [20]:
phrase_model = Word2Vec(bigram_transformer[all_papers_tokenized], min_count = config["min_count"], size = config["size"], window = config["window"], sg = config["sg"], workers=4)

In [23]:
print(phrase_model.wv['dry_cough'])

[-2.21277289e-02  2.18219738e-02 -5.30181266e-02 -5.76215759e-02
 -4.02793959e-02 -4.40687537e-02 -6.62262831e-03  1.16354063e-01
 -4.05478776e-02  5.01777157e-02 -7.51697272e-02 -1.25397578e-01
 -7.42818266e-02 -3.62891555e-02  6.66416734e-02  1.17392182e-01
 -1.30534021e-03  1.66229114e-01 -4.96773496e-02  1.51526630e-01
  7.44620711e-02 -7.52144903e-02  8.69593471e-02 -3.19533772e-03
 -2.56789420e-02 -7.11770728e-03 -1.13122193e-02  1.88548505e-01
 -1.72108188e-02  1.63151041e-01  4.41005342e-02 -5.68961278e-02
 -4.76506948e-02 -9.97022763e-02  3.69683281e-02 -1.51246712e-02
 -2.18057598e-04  3.48617956e-02 -1.11959251e-02 -3.19869258e-02
 -7.58144110e-02  1.67135298e-01 -1.19034596e-01  6.01153858e-02
 -1.90544784e-01  4.50925417e-02 -1.35599613e-01 -8.24983139e-03
 -2.46521104e-02  1.10970333e-01  1.16444314e-02  1.66229948e-01
  1.32694338e-02 -5.62148839e-02 -1.30581826e-01  7.22940192e-02
 -1.59410983e-02  3.75487027e-03 -6.92248940e-02 -9.35636908e-02
 -1.01436831e-01  1.67719

#### Find the phrases in the Embeddings

In [24]:
def get_phrases(model):
    # Phrases joins detected bigrams with '_', so keys containing '_' are treated as phrases.
    phrases = [k for k in model.wv.vocab.keys() if '_' in k]
    print("No. of phrases = " + str(len(phrases)))
    return phrases

In [26]:
phrases = get_phrases(phrase_model)

No. of phrases = 5506
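
To give a feel for how a bigram such as `dry_cough` gets selected, the sketch below applies the default scoring rule that gensim's `Phrases` documents, `(count(a, b) - min_count) * vocabulary size / (count(a) * count(b))`, to a toy corpus. It is a simplification under stated assumptions: the corpus is invented, the vocabulary size is taken as the number of distinct unigrams, and `bigram_score` is a hypothetical helper, not gensim code.

```python
from collections import Counter

# Invented toy corpus: a list of tokenized sentences.
sentences = [
    ["patient", "has", "dry", "cough"],
    ["dry", "cough", "and", "fever"],
    ["dry", "cough", "persisted"],
    ["the", "patient", "recovered"],
]

unigrams = Counter(word for sentence in sentences for word in sentence)
bigrams = Counter(pair for sentence in sentences for pair in zip(sentence, sentence[1:]))


def bigram_score(a, b, min_count):
    # Higher when the pair co-occurs more often than its parts would suggest.
    return (bigrams[(a, b)] - min_count) * len(unigrams) / (unigrams[a] * unigrams[b])


# Toy values; the Phrases defaults are min_count=5 and threshold=10.0.
min_count, threshold = 1, 1.0
phrases = [pair for pair in bigrams if bigram_score(*pair, min_count) > threshold]
print(phrases)
```

Only pairs whose score exceeds the threshold are merged, which is why lowering `threshold` (as `build_phrases` does with `threshold=7`) yields more phrases.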


#### Method 2

In [27]:
def sentence_to_bi_grams(phrases_model, sentence):
    return ' '.join(phrases_model[sentence])

def sentences_to_bi_grams(n_grams, document):
    output = []
    for sentence in document:
        clean_text = clean(sentence)
        tokenized_text = tokenize_and_exclude_stop(clean_text)
        parsed_text = sentence_to_bi_grams(n_grams, tokenized_text)
        output.append(parsed_text)
    return output

def build_phrases(sentences):
    phrases = Phrases(sentences,
                      min_count=5,
                      threshold=7,
                      progress_per=1000)
    return Phraser(phrases)

In [28]:
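# Phrases and Word2Vec expect a list of token lists, so tokenize sentence by sentence.
all_papers_tokenized_2 = [tokenize_and_exclude_stop(clean(sentence)) for sentence in sent_tokenize(all_papers)]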

In [31]:
bigram_transformer_2 = Phrases(all_papers_tokenized_2)

In [32]:
phrase_model_2 = Word2Vec(bigram_transformer_2[all_papers_tokenized_2], min_count = config["min_count"], size = config["size"], window = config["window"], sg = config["sg"], workers=4)

<br>

### Saving and loading the models

In [34]:
single_word_model.save('single_word_model.model')
phrase_model.save('phrases_model.model')
phrase_model_2.save('phrases_model_2.model')

In [35]:
# phrases_model = Phraser.load('phrases_model_3')

<br>

### Evaluation

In [37]:
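def diagnose(symptoms, diseases, we_dict=None):
    """
    Takes a list of symptoms and a list of diseases and computes, for each disease,
    the average cosine distance to the symptoms (lower means closer).
    
    Param: symptoms --> list
    Param: diseases --> list
    Param: we_dict --> mapping from word to vector
    Output: sims --> dict{avg cosine distance: disease}
    """
    # we_dict is not defined elsewhere in this notebook; defaulting to the
    # single-word vectors is an assumption.
    if we_dict is None:
        we_dict = single_word_model.wv
    sims = {}
    for i in diseases:
        cos_list = []
        for j in symptoms:
            cos_list.append(cosine(we_dict[i], we_dict[j]))
        avg_cos = sum(cos_list)/len(cos_list)
        sims[avg_cos] = i
        
    return sims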
    
# sims = diagnose(symptoms, diseases)
# top_diagnosis = sims[min(sims.keys())]
# top_5 = [sims[x] for x in sorted(sims.keys())[:5]]
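
Note that `scipy.spatial.distance.cosine` returns a distance (1 minus the cosine similarity), so the best diagnosis is the one with the smallest value. A dictionary keyed by that distance also overwrites entries when two diseases tie. The sketch below shows one way to rank diseases as a sorted list of pairs instead. The function `rank_diseases` and the two-dimensional toy vectors are hypothetical and only illustrate the idea.

```python
import math


def cosine_distance(u, v):
    # 1 - cosine similarity, the same quantity scipy's `cosine` returns.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_diseases(symptoms, diseases, vectors):
    """Return (disease, average distance) pairs, closest disease first."""
    ranking = []
    for disease in diseases:
        distances = [cosine_distance(vectors[disease], vectors[s]) for s in symptoms]
        ranking.append((disease, sum(distances) / len(distances)))
    return sorted(ranking, key=lambda item: item[1])


# Invented 2-d vectors standing in for real word embeddings.
toy_vectors = {
    "flu": [1.0, 0.0],
    "cold": [0.0, 1.0],
    "fever": [0.9, 0.1],
    "cough": [0.8, 0.2],
}

ranking = rank_diseases(["fever", "cough"], ["cold", "flu"], toy_vectors)
print(ranking[0][0])
```

A list of pairs keeps every disease even when two averages are equal, and slicing it gives the top-k candidates directly.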

#### Single Word Embedding Evaluations

In [38]:
single_word_model.wv.similarity('covid19', 'contagious')

0.60248464

In [39]:
single_word_model.wv.similarity('covid19', 'cough')

0.5687578

In [40]:
single_word_model.wv.n_similarity(['covid19', 'temperature'], ['positive', 'high'])

0.6645982

In [41]:
single_word_model.wv.n_similarity(['covid19', 'temperature'], ['positive', 'low'])

0.6454363

In [42]:
single_word_model.wv.n_similarity(['covid19', 'cough'], ['positive', 'dry'])

0.68116194
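
For reading these numbers: `similarity` is the cosine similarity between two word vectors, and, to our understanding of gensim's behaviour, `n_similarity` is the cosine similarity between the mean vectors of the two word lists. The sketch below reproduces both computations on invented two-dimensional vectors; `set_similarity` is a hypothetical name, not a gensim function.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    # Component-wise average of a list of equal-length vectors.
    return [sum(components) / len(vectors) for components in zip(*vectors)]


def set_similarity(vectors_1, vectors_2):
    # Cosine similarity between the averages of two sets of vectors.
    return cosine_similarity(mean_vector(vectors_1), mean_vector(vectors_2))


# Invented toy vectors.
a, b, c = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

pair_score = cosine_similarity(a, b)     # orthogonal vectors
set_score = set_similarity([a, b], [c])  # the mean of a and b points the same way as c
print(pair_score, set_score)
```

Because averaging blends the individual words, a set-level score can be high even when no single pair of words is especially similar, which is worth keeping in mind when comparing the `n_similarity` values with the single-pair scores.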

#### Phrase Embeddings Evaluations

In [43]:
phrase_model.wv.n_similarity(['covid19', 'temperature'], ['positive', 'high'])

0.66259634

In [44]:
phrase_model.wv.n_similarity(['covid19', 'temperature'], ['positive', 'low'])

0.62701315

In [45]:
phrase_model.wv.similarity('covid19', 'high_temperature')

0.4462875

In [46]:
phrase_model.wv.similarity('covid19', 'dry_cough')

0.45954078