# Symptom Identifier
This document explores building an NLP model that estimates how strongly a given text indicates each symptom, expressed as a probability or similarity score.

In [412]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import joblib

from gensim.models import Word2Vec, KeyedVectors
import gensim.downloader
from scipy.spatial.distance import cosine

In [51]:
import wikipedia as wiki
import re
from bs4 import BeautifulSoup

In [9]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import bigrams
from nltk.stem import WordNetLemmatizer

In [259]:
from textblob import TextBlob

# Extracting Symptoms
Next, I will write some simple NLP steps to extract symptoms from descriptions. Until we collect enough text data, we will rely on a pre-trained vector space (GloVe, loaded through gensim below) to determine how closely a description is related to each symptom. We'll also see whether public-domain medical transcription data produces a better vector space for identifying symptoms from descriptions.

In [576]:
# importing symptoms
symptoms = pd.read_json('data/symptoms.json', orient = 'table')

### Preprocessing
First, the preprocessing steps for text input.

In [382]:
def remove_punctuations(text, punctuations = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~�0123456789®'):
    ''' replace punctuation (and, by default, digits) with spaces '''
    table_ = str.maketrans(punctuations, ' '*len(punctuations))
    return text.translate(table_)

def ascii_only(text):
    ''' drop non-ASCII characters (the rest of the word is kept) '''
    return text.encode("ascii", "ignore").decode()

def lemmatize(word):
    ''' lemmatize text'''
    wnl = WordNetLemmatizer()
    return wnl.lemmatize(word)
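
As a quick illustration of what these cleaning steps do to a raw string, here is a minimal sketch. The names `strip_chars`, `to_ascii`, and `PUNCT` are hypothetical stand-ins for the helpers above, written so the snippet runs on its own without NLTK.

```python
# Hypothetical stand-ins for remove_punctuations / ascii_only, same idea:
# translate unwanted characters to spaces, then drop non-ASCII characters.
PUNCT = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~0123456789'

def strip_chars(text, chars=PUNCT):
    """Replace every character listed in `chars` with a space."""
    return text.translate(str.maketrans(chars, ' ' * len(chars)))

def to_ascii(text):
    """Drop any character that cannot be encoded as ASCII."""
    return text.encode('ascii', 'ignore').decode()

sample = "Café-au-lait spots (6+), headache!"
demo_tokens = strip_chars(to_ascii(sample.lower())).split()
print(demo_tokens)  # ['caf-au-lait', 'spots', 'headache']
```

Note that the accented character is removed rather than the whole word, and that hyphens survive because they are not in the punctuation list.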

#### Building Corpus
Initially, we will build the corpus from relevant Wikipedia pages.

In [239]:
def get_words(pages, corpus = set()):
    print(f"Reading {len(pages)} pages...")
    for pg in pages: 
        print(pg, end = '')
        try:
            print(' | ', end = '')
            soup = BeautifulSoup(wiki.page(pg).html(), 'html.parser')
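        except Exception:
            # the page could not be loaded (e.g. disambiguation or missing page): skip it
            print(' (error) | ', end = '')
            continue
        text = re.sub(r'\n', ' ', soup.get_text())
        # bracketed spans such as citation markers [1]
        text = re.sub(r"\[[^\[\]]*\]", " ", text)
        # innermost brace-delimited blocks
        text = re.sub(r"{[^{}]*}", "", text)
        # a dot followed by lowercase letters, digits or hyphens (e.g. CSS class names)
        text = re.sub(r"\.[a-z0-9-]+", "", text)
        # everything from the "References" heading onward
        text = re.sub(r"References  .+", "", text)
        # insert a space before a capital letter unless preceded by a capital, "-" or "("
        text = re.sub(r"([^A-Z-(])([A-Z])", r"\1 \2", text)
        # non-breaking space and any digits that follow it
        text = re.sub(r"\xa0[0-9]*", " ", text)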
        text = remove_punctuations(text, punctuations = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~�®')
        words = set(x.lower() for x in text.split() if len(x) > 1)
        corpus = corpus | words
    return corpus

In [247]:
def add_corpus(search_term, max_pages, corpus = set()):
    pages = wiki.search(search_term, results=max_pages)
    current_length = len(corpus)
    new_corpus = get_words(pages, corpus)
    print(f"Added {len(new_corpus) - current_length} words: Total {len(new_corpus)} words")
    return new_corpus
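
The accumulation logic can be tried offline, without calling Wikipedia. The sketch below is an assumption-laden simplification: `words_from_text` and `grow_corpus` are hypothetical names, and the tokenization is a single regex rather than the full cleaning chain above.

```python
import re

def words_from_text(text):
    """Lower-cased tokens longer than one character (letters, digits, hyphens)."""
    return {w.lower() for w in re.findall(r"[A-Za-z0-9-]+", text) if len(w) > 1}

def grow_corpus(texts, corpus=None):
    """Return a new set: the given corpus plus the words found in `texts`."""
    merged = set() if corpus is None else set(corpus)
    for text in texts:
        merged |= words_from_text(text)
    return merged

corpus_demo = grow_corpus(["Pain in the knee"])
before = len(corpus_demo)
corpus_demo = grow_corpus(["Knee pain, a symptom"], corpus_demo)
print(f"Added {len(corpus_demo) - before} words: Total {len(corpus_demo)} words")
```

Because the corpus is a set, words already seen on earlier pages are not counted again, which is why the "Added" figure is usually smaller than the number of words on the new pages.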

In [246]:
corpus = add_corpus("Neurofibromatosis", 100)

Reading 100 pages...
Neurofibromatosis | Neurofibromatosis type I | Neurofibromatosis type II | Café au lait spot | Legius syndrome | Gillian Anderson |  (error) | Neurofibroma | Genetic disorder | Neurofibromatosis type 3 | Watson syndrome | Malignant peripheral nerve sheath tumor | Adam Pearson (actor) | Neurofibromin 1 | Noonan syndrome | Lisch nodule | Neurofibromin | Glioblastoma | Scoliosis | Phakomatosis | Crowe sign | Neurofibromatosis type 4 | Dan Gilbert | Daniel Craig | Notching of the ribs | NF2 |  (error) | Merlin (protein) | Alcino J. Silva | Artaxerxes I | Katie Piper | Dural ectasia | Poliosis |  (error) | Children's Tumor Foundation | Neuroendocrine tumor | Microdeletion syndrome | Cherubism | Joseph Merrick | Hypertelorism | Ulisse Aldrovandi | Acute lymphoblastic leukemia | Facial weakness | Meningioma | Expressivity (genetics) | Brain tumor | Ependymoma | Friedrich Daniel von Recklinghausen | Frank W. Crowe | Point mutation | Intracranial aneurysm | Imatinib | Condi

Adding generic symptom and medical vocabulary

In [248]:
corpus = add_corpus("Symptom", 20, corpus)

Reading 20 pages...
Signs and symptoms | Extrapyramidal symptoms | Aura (symptom) | Anorexia (symptom) | Symptom of the Universe | Somatic symptom disorder | COVID Symptom Study | Symptom Checklist 90 | B symptoms | Palliative care | Limited symptom attack | Prodrome | Vegetative symptoms | Parkinson's disease | Drug withdrawal | Functional symptom | Dissociation (psychology) | Schizophrenia | Constitutional symptoms | Depression (mood) | Added 2062 words: Total 22561 words


In [249]:
corpus = add_corpus("Health", 20, corpus)

Reading 20 pages...
Health | Health (film) | Mental health | Health care | Public health | World Health Organization | Syneos Health | Health Secretary | United States Department of Health and Human Services | Health (Health album) | Health assessment | What the Health | Health Net | Community health | United States Secretary of Health and Human Services | Reproductive health | Teladoc Health | Bausch Health | Health indicator | CVS Health |  (error) | Added 3751 words: Total 26312 words


In [250]:
corpus = add_corpus("Pain", 20, corpus)

Reading 20 pages...
Pain | Pain (disambiguation) |  (error) | Pain & Gain | To the Pain | T-Pain | The Pain | House of Pain |  (error) | Psychological pain | Knee pain | No Pain for Cakes | Pelvic pain | Abdominal pain | Chest pain | Analgesic | Low back pain | Brain Pain | Ear pain | Threshold of pain | No pain, no gain | War and Pain | Added 2597 words: Total 28909 words


In [254]:
corpus = add_corpus("Emotion", 10, corpus)

Reading 10 pages...
Emotion | Emotion classification | The Emotions | Emotion (disambiguation) |  (error) | Appeal to emotion | Fisker EMotion | Guilt (emotion) | Sweet Emotion | Passion (emotion) | Music and emotion | Added 1969 words: Total 28803 words


Drop tokens that contain no letters; for the rest, keep only the first run that starts with a letter.

In [255]:
corpus = {z[0] for z in [re.findall("[a-z]+[a-z0-9-]*", x) for x in corpus] if z}
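
To see what this filter keeps, here is the same expression applied to a handful of made-up tokens (the `raw_tokens` set is purely illustrative):

```python
import re

# Illustrative tokens, not taken from the real corpus.
raw_tokens = {"nf1", "2019", "-type", "123abc", "pain"}

# Same expression as above: keep the first letter-led run of each token,
# and drop tokens that have no such run at all.
cleaned_tokens = {
    m[0]
    for m in (re.findall(r"[a-z]+[a-z0-9-]*", t) for t in raw_tokens)
    if m
}
print(sorted(cleaned_tokens))  # ['abc', 'nf1', 'pain', 'type']
```

Purely numeric tokens disappear, while tokens with leading digits or hyphens are trimmed to the part that starts with a letter.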

In [256]:
len(corpus)

28464

Lemmatize to reduce the vocabulary size

In [400]:
corpus = {lemmatize(x) for x in corpus}

#### Spelling
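To increase accuracy, we will apply an autocorrect model: words that are not found in the corpus are passed through TextBlob's spelling correction, and words already in the corpus are left untouched.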

In [288]:
def spell_correct(text, corpus):
    ''' 
    INPUT: a string object 
    RETURN: new corrected string
    '''
    
    # for each word in text
    corrected = []
    for word in text.split():
        if word not in corpus:
            corrected.append(str(TextBlob(word).correct()))
        else:
            corrected.append(word)
    return ' '.join(corrected)
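
The "only correct words outside the corpus" idea can be sketched without TextBlob. The version below swaps in `difflib.get_close_matches` from the standard library as the corrector, which is a different technique from TextBlob's; `correct_with_vocab` and `vocab_demo` are hypothetical names.

```python
from difflib import get_close_matches

def correct_with_vocab(text, vocab, cutoff=0.8):
    """Replace out-of-vocabulary words with their closest vocabulary entry, if any."""
    corrected = []
    for word in text.split():
        if word in vocab:
            corrected.append(word)          # known word: leave untouched
            continue
        match = get_close_matches(word, vocab, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return ' '.join(corrected)

vocab_demo = {"headache", "fever", "numbness"}
print(correct_with_vocab("hedache and fevr", vocab_demo))  # headache and fever
```

A design difference worth noting: this variant can only correct toward words in the vocabulary, whereas a general-purpose corrector may map a rare medical term to a common English word.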

#### Sentence treatment
We identify symptoms per sentence. But not everyone writes proper sentences or describes only one symptom per sentence. To account for this, we will create extra breaks:
1. Break a sentence at `and`, `or`, `,`, `&`, and `;`.

In [510]:
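def break_sentence(text):
    '''
    INPUT: a string object
    RETURN: the text with a sentence break (". ") in place of each connector
    '''
    new_text = re.sub(r"\n", " ", text)
    # \b keeps "and"/"or" from matching inside words such as "for" or "hand"
    return re.sub(r"\W*(?:\b(?:and|or)\b|[,&;])\W+", ". ", new_text)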
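
One thing to watch with this kind of pattern: without word boundaries, `or` and `and` can also match at the end of ordinary words. The sketch below compares a pattern without boundaries against one with them; both pattern names and the sample sentence are hypothetical.

```python
import re

# Without word boundaries: "or" also matches the tail of "for".
UNBOUNDED = re.compile(r"(\W*)(and|or|,|&|;)(\W)+")
# With word boundaries around the connector words.
BOUNDED = re.compile(r"\W*(?:\b(?:and|or)\b|[,&;])\W+")

clause_sample = "pain for days and nausea, dizziness"
unbounded_out = UNBOUNDED.sub(r". \3", clause_sample)
bounded_out = BOUNDED.sub(". ", clause_sample)

print(unbounded_out)  # "for" is cut down to "f."
print(bounded_out)    # pain for days. nausea. dizziness
```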

In [587]:
def preprocess_full_text(text, corpus, sw = ['i', 'me', 'my', 'myself', 'we', 'our',
                         'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your',
                         'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
                         "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
                         'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll",
                         'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have',
                         'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but',
                         'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about',
                         'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
                         'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again',
                         'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
                         'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such',
                         'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
                         's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',
                         'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't",
                         'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven',
                         "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't",
                         'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
                         "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't", 'felt', 'feel', 'feels'], 
                        sentence_break = True):
    '''
    Takes a text as input
    Preprocesses it (removes punctuation, lower-cases, corrects spelling, lemmatizes, removes stop words)
    Returns a dictionary of sentences ({index: sentence}) and a nested list of tokens [[tokens], ...]
    '''
    if isinstance(text, str):
        text = ascii_only(text.lower())
        
        if sentence_break:
            text = break_sentence(text)
        
        text_tokens = []
        sent_tokens = {}
        for i, sentence in enumerate(sent_tokenize(text)): 
            cl_sentence = remove_punctuations(sentence)
            cl_sentence = spell_correct(cl_sentence, corpus)
            tokens = word_tokenize(cl_sentence)
            token_set = set(lemmatize(word) for word in tokens if word not in sw)
            
            # only keep words in the corpus
            token_set = corpus & token_set
            token_set = list(token_set)
            text_tokens.append(token_set)
            sent_tokens[i] = sentence
        return sent_tokens, text_tokens
    else: 
        return 'no input'

### Word2Vec training
Below is the method to create a vector space. For the current version we will use a pre-trained model (GloVe). Once we obtain enough text data, we can train our own word2vec vector space on more targeted language.

In [485]:
# for the initial run we use pre-trained GloVe vectors
# (300 dimensions, trained on Wikipedia + Gigaword)
model = gensim.downloader.load('glove-wiki-gigaword-300')

In [486]:
model.save("model/word2vec.model")
model = KeyedVectors.load("model/word2vec.model")

In [487]:
model.init_sims(replace=True) # L2-normalize the vectors in place; drop replace=True to keep the raw vectors if we need to retrain

In [488]:
model.save("model/word2vec_norm.model")
model = KeyedVectors.load("model/word2vec_norm.model")

We now have a vector space. 
To represent a sentence, we'll use the average of the vectors of its words. 

In [612]:
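def get_avg_vectors(text, model, corpus, sentence_break = True):
    '''
    INPUT
    =====
    text: str
    model: word embedding model (gensim KeyedVectors)
    corpus: set of known words used by the preprocessing step
    sentence_break: whether to split sentences further with break_sentence
    RETURN
    =====
    sent_tokens: dict mapping sentence index to the sentence text
    avg_vec: list with one average word vector per sentence
    '''
    if isinstance(text, str):
        sent_tokens, text_input = preprocess_full_text(text, corpus, sentence_break = sentence_break)
    else:
        print("Input text must be a string")
        return
        
    avg_vec = []
    
    for sentence in text_input:

        vectors = []
        
        for word in sentence:

            try:
                vectors.append(model[word])
                
            except KeyError:
                # the word is missing from the embedding vocabulary
                print(f'{word} does not exist.')
                continue
                
        # a sentence with no known words gets an explicit NaN vector
        avg = np.average(vectors, axis = 0) if vectors else np.full(model.vector_size, np.nan)
        avg_vec.append(avg)
        
    return sent_tokens, avg_vec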
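
The averaging step itself is small enough to show on toy data. This is a minimal sketch with made-up 3-dimensional vectors and plain Python lists in place of NumPy and the GloVe model; `toy_vectors` and `average_vector` are hypothetical names.

```python
# Made-up 3-d embeddings standing in for the real 300-d GloVe vectors.
toy_vectors = {
    "high": [1.0, 0.0, 0.0],
    "fever": [0.0, 1.0, 0.0],
}

def average_vector(tokens, vectors):
    """Mean of the vectors of known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# "moyamoya" is out of vocabulary, so it is skipped rather than averaged in.
print(average_vector(["high", "fever", "moyamoya"], toy_vectors))  # [0.5, 0.5, 0.0]
print(average_vector(["moyamoya"], toy_vectors))                   # None
```

Skipping unknown words means a rare term contributes nothing to the sentence vector, which matters for symptom names built mostly from such terms.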


In [613]:
symptom_names = symptoms.name.values

To-Do
Before computing the average vectors, add a few more descriptive words to each symptom name so that its vector is biased toward the intended meaning.

In [614]:
_, symptom_vectors = get_avg_vectors('. '.join(symptom_names), model, corpus, 
                                  sentence_break = False)

neurofibroma does not exist.
neurofibroma does not exist.
neurofibroma does not exist.
neurofibroma does not exist.
mpnst does not exist.
moyamoya does not exist.


In [616]:
symptom_vectors = dict(zip(symptom_names, symptom_vectors))
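
The `zip` above pairs names and vectors by position, so it assumes the tokenizer returned exactly one sentence per symptom name. A more defensive variant, sketched here under that concern, embeds each name on its own. `vectors_by_name` and the toy `embed_demo` function are hypothetical; in the notebook the embedding step would be `get_avg_vectors`.

```python
def vectors_by_name(names, embed):
    """Map each name to embed(name), skipping names that cannot be embedded."""
    out = {}
    for name in names:
        vec = embed(name)
        if vec is not None:
            out[name] = vec
    return out

# Toy embedding lookup standing in for the real model.
toy_lookup = {"Pain": [1.0, 0.0], "Numbness": [0.0, 1.0]}
embed_demo = toy_lookup.get

named_vectors = vectors_by_name(["Pain", "Numbness", "MPNST"], embed_demo)
print(list(named_vectors))  # ['Pain', 'Numbness']
```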

In [617]:
joblib.dump(symptom_vectors, 'symptom_vectors')

['symptom_vectors']

In [618]:
# for each sentence, see how close they are to target symptoms

def identify_symptom(text, symptom_vectors, model, corpus, threshold = 0.5):
    '''
    Find the closest symptom for each sentence
    '''
    sent_tokens, avg_vec = get_avg_vectors(text, model, corpus)
    reference = {} # {1: {sentence: identified symptom}}
    pred_symptoms = {}

    for i, sent_vec in enumerate(avg_vec): 
        
        # for each sentence
        max_ = threshold
        max_symptom = None  # reset so a previous sentence's match is not reused

        for symptom, sym_vec in symptom_vectors.items():
            # get cosine similarity
            similarity =  1 - cosine(sent_vec, sym_vec)
            # find the highest similarity
            if similarity > max_:

                max_ = similarity
                max_symptom = symptom
        
        
        if max_ > threshold:
            if max_symptom in pred_symptoms: 
                
                # if symptom already exists, update if similarity is higher
                if max_ > pred_symptoms[max_symptom] : 
                    pred_symptoms[max_symptom] = max_

            else: 
                # add symptom if it does not exist
                pred_symptoms[max_symptom] = max_
                
        reference[i] = {sent_tokens[i] : max_symptom}

    if pred_symptoms: 
        return [k for k, v in sorted(pred_symptoms.items(), key = lambda item: item[1])], reference        
    else: 
        # Flag if no symptom was detected.
        return ['NA'], reference
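
The core of the matching step is "take the most similar symptom, but only if it beats the threshold". Here is a minimal sketch of that rule on toy 2-dimensional vectors, in plain Python instead of SciPy; `cosine_similarity`, `best_match`, and `symptoms_toy` are hypothetical names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_match(vec, candidates, threshold=0.5):
    """Return (name, score) of the most similar candidate above threshold, else (None, threshold)."""
    best_name, best_score = None, threshold
    for name, cand in candidates.items():
        score = cosine_similarity(vec, cand)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

symptoms_toy = {"Pain": [1.0, 0.0], "Numbness": [0.0, 1.0]}
print(best_match([0.9, 0.1], symptoms_toy))    # closest to "Pain"
print(best_match([-1.0, -1.0], symptoms_toy))  # nothing above the threshold
```

Returning `None` when nothing clears the threshold keeps an unmatched sentence from inheriting the label of an earlier one.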

In [620]:
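# sanity check of the averaged vectors (`text` must already be defined; it is assigned in the next cell)
a, b = get_avg_vectors(text, model, corpus)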

In [623]:
text = "I have a high fever and headache. Things look a little strange. Pain on the left side of the body. Also felt a bit of numbness. It seems like I had hard time moving my limbs"
result, reference = identify_symptom(text, symptom_vectors, model, corpus)
result

['Vision changes',
 'Numbness',
 'Problem with movement',
 'High blood pressure',
 'Pain',
 'Headaches or migraines']

In [624]:
reference

{0: {'i have a high fever.': 'High blood pressure'},
 1: {'headache.': 'Headaches or migraines'},
 2: {'things look a little strange.': 'Vision changes'},
 3: {'pain on the left side of the body.': 'Pain'},
 4: {'also felt a bit of numbness.': 'Numbness'},
 5: {'it seems like i had hard time moving my limbs': 'Problem with movement'}}

In [926]:
def return_symptom_id(symptom_list, keys):
    return [keys[keys.symptom == x].index[0] for x in symptom_list]

In [927]:
# return symptom_id for this
result_symptom_id = return_symptom_id(result, keys)
result_symptom_id

[12, 25, 24, 21, 58]

For the next iteration, we can present users with the selected list of symptoms so they can give feedback on how accurate the model is, and then retrain based on their answers.

# Testing
Testing using medical transcription data

In [5]:
mt = pd.read_csv('data/medical_transcription_samples.csv', index_col = 0)

In [537]:
mt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4999 entries, 0 to 4998
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        4999 non-null   object
 1   medical_specialty  4999 non-null   object
 2   sample_name        4999 non-null   object
 3   transcription      4966 non-null   object
 4   keywords           3931 non-null   object
dtypes: object(5)
memory usage: 234.3+ KB


In [None]:
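# preprocessing all transcriptions
# (assuming `preprocess` refers to preprocess_full_text above; we keep the nested token lists
# and skip rows with a missing transcription)
text_input = [preprocess_full_text(x, corpus)[1] for x in mt.transcription.dropna()]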

In [590]:
from itertools import chain
# unnesting once
text_input = list(chain(*text_input))

In [591]:
len(text_input)

140476

In [None]:
# this is where we would train our own vector space using the text_input preprocessed above
#model = Word2Vec(sentences = text_input)