# Symptom Identifier
This notebook explores building an NLP model that estimates how strongly a given text indicates each symptom, expressed as a similarity score.

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import joblib

from gensim.models import Word2Vec, KeyedVectors
import gensim.downloader
from scipy.spatial.distance import cosine

In [18]:
import wikipedia as wiki
import re
from bs4 import BeautifulSoup

In [19]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import bigrams
from nltk.stem import WordNetLemmatizer

In [20]:
from textblob import TextBlob

# Extracting Symptoms
Next, we write some simple NLP steps to extract symptoms from free-text descriptions. Until we collect enough text data, we rely on a pre-trained vector space (GloVe, loaded through gensim in the Word2Vec section below) to measure how closely a description relates to each symptom. We will also check whether public-domain medical transcription data produces a better vector space for identifying symptoms from descriptions.

In [21]:
# importing symptoms
symptoms = pd.read_json('data/symptoms.json', orient = 'table')

### Preprocessing
First, define the preprocessing steps for text input.

In [22]:
def remove_punctuations(text, punctuations = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~�0123456789®'):
    ''' remove punctuations '''
    table_ = str.maketrans(punctuations, ' '*len(punctuations))
    return text.translate(table_)

def ascii_only(text):
    ''' remove non-ascii words '''
    return text.encode("ascii", "ignore").decode()

def lemmatize(word):
    ''' lemmatize text'''
    wnl = WordNetLemmatizer()
    return wnl.lemmatize(word)
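
As a quick check of the helpers above, the following sketch applies `ascii_only` and `remove_punctuations` to a short string. The two functions are redefined here so the snippet runs on its own. The punctuation set is rebuilt from `string.punctuation` (minus the hyphen) plus digits, which is an assumption that approximates the default above rather than copying it exactly, and the sample string is made up.

```python
import string

# Assumption: approximate the default punctuation set (hyphen kept, digits removed)
PUNCTUATIONS = string.punctuation.replace('-', '') + string.digits

def remove_punctuations(text, punctuations=PUNCTUATIONS):
    ''' replace each punctuation character with a space '''
    table_ = str.maketrans(punctuations, ' ' * len(punctuations))
    return text.translate(table_)

def ascii_only(text):
    ''' drop non-ASCII characters '''
    return text.encode("ascii", "ignore").decode()

sample = "Café-au-lait spots (6+), since 2019!"
cleaned = remove_punctuations(ascii_only(sample.lower()))
tokens = cleaned.split()
print(tokens)
```

Note that `ascii_only` drops individual non-ASCII characters rather than whole words, so "café" becomes "caf" instead of disappearing.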

#### Building Corpus
Initially, we build the corpus from the relevant Wikipedia pages.

In [23]:
def get_words(pages, corpus = set()):
    print(f"Reading {len(pages)} pages...")
    for pg in pages: 
        print(pg, end = '')
        try:
            print(' | ', end = '')
            soup = BeautifulSoup(wiki.page(pg).html(), 'html.parser')
        except Exception:  # skip pages that fail to load or parse
            print(' (error) | ', end = '')
            continue
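        text = re.sub('\n', ' ', soup.get_text())
        text = re.sub(r"\[[^\[\]]*\]", " ", text)  # drop bracketed spans such as [1] or [edit]
        text = re.sub(r"{[^{}]*}", "", text)  # drop brace-delimited blocks
        text = re.sub(r"\.[a-z0-9-]+", "", text)  # drop a dot followed by lowercase letters, digits, or hyphens
        text = re.sub("References  .+", "", text)  # cut everything from "References  " onward
        text = re.sub(r"([^A-Z-(])([A-Z])", r"\1 \2", text)  # add a space before a capital that follows a non-capital
        text = re.sub("\xa0[0-9]*", " ", text)  # replace non-breaking spaces and any digits right after them
        text = remove_punctuations(text, punctuations = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~�®')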
        words = set(x.lower() for x in text.split() if len(x) > 1)
        corpus = corpus | words
    return corpus
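
To make the cleaning steps in `get_words` concrete, here is a minimal sketch that runs a subset of them on a made-up string instead of a live Wikipedia page. The helper name `clean_wiki_text` is hypothetical, and the punctuation set is an assumption (ASCII punctuation without the hyphen).

```python
import re
import string

# Assumption: ASCII punctuation without the hyphen, so hyphenated words survive
PUNCTUATIONS = string.punctuation.replace('-', '')

def clean_wiki_text(text):
    ''' hypothetical helper mirroring some of the cleaning steps in get_words '''
    text = re.sub(r"\[[^\[\]]*\]", " ", text)            # bracketed spans: [1], [citation needed]
    text = re.sub(r"([^A-Z-(])([A-Z])", r"\1 \2", text)  # space before a capital that follows a non-capital
    text = text.translate(str.maketrans(PUNCTUATIONS, ' ' * len(PUNCTUATIONS)))
    return set(x.lower() for x in text.split() if len(x) > 1)

sample = "NF1[1] causes spots.[citation needed] Signs and symptomsPain is common"
words = clean_wiki_text(sample)
print(sorted(words))
```

The capital-letter rule is what separates run-together text such as "symptomsPain", which appears when headings and body text are concatenated by `get_text()`.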

In [24]:
def add_corpus(search_term, max_pages, corpus = set()):
    pages = wiki.search(search_term, results=max_pages)
    current_length = len(corpus)
    new_corpus = get_words(pages, corpus)
    print(f"Added {len(new_corpus) - current_length} words: Total {len(new_corpus)} words")
    return new_corpus

In [None]:
stop here. skip to load corpus

In [12]:
corpus = add_corpus("Neurofibromatosis", 100)

Reading 100 pages...
Neurofibromatosis | Neurofibromatosis type I | Neurofibromatosis type II |  (error) | Café au lait spot | Gillian Anderson |  (error) | Legius syndrome | Neurofibroma | Noonan syndrome | Neurofibromatosis type 3 | Genetic disorder | Adam Pearson (actor) | Malignant peripheral nerve sheath tumor | Lisch nodule | Neurofibromin 1 | Phakomatosis | Watson syndrome | Neurofibromatosis type 4 | Neurofibromin | Glioblastoma | Crowe sign | Dan Gilbert | Daniel Craig | NF2 | 



 (error) | Notching of the ribs | Katie Piper | Microdeletion syndrome | Scoliosis | Poliosis |  (error) | Dural ectasia | Alcino J. Silva | Hypertelorism | Artaxerxes I | Cherubism | Neuroendocrine tumor | Facial weakness | Joseph Merrick | Merlin (protein) | Children's Tumor Foundation | Acute lymphoblastic leukemia | Brain tumor | Ulisse Aldrovandi | Expressivity (genetics) | List of skin conditions | Point mutation | Optic nerve glioma | Friedrich Daniel von Recklinghausen | Conditions comorbid to autism spectrum disorders | Imatinib | Noonan syndrome with multiple lentigines | Spinal tumor | Heterochromia iridum | RASopathy | Ependymoma | Chiasmal syndrome | Frank W. Crowe | Vestibular schwannoma | NF |  (error) | Otology | Proteus syndrome | Achaemenid Empire | Jar City (film) | Auditory brainstem implant | Sarcoma | Cro-Magnon rock shelter | Meningioma | Huang Chuncai | Birthmark | Penetrance |  (error) | Schwannoma | Hypotonia | Type 2 | Crowe | Schwannomatosis | NF1 | Pilocyti

Adding generic symptom and medical vocabulary.

In [10]:
corpus = add_corpus("Symptom", 20, corpus)

Reading 20 pages...
Signs and symptoms | Extrapyramidal symptoms | Aura (symptom) | Symptom of the Universe | Somatic symptom disorder | Anorexia (symptom) | Palliative care | Symptom Checklist 90 | B symptoms | COVID Symptom Study | Functional symptom | Drug withdrawal | Vegetative symptoms | Dissociation (psychology) | Limited symptom attack | Schizophrenia | Prodrome | Constitutional symptoms | Parkinson's disease | Depression (mood) | Added 2026 words: Total 22267 words


In [11]:
corpus = add_corpus("Health", 20, corpus)

Reading 20 pages...
Health | Health (film) | Mental health | Health care | Public health | World Health Organization | Syneos Health | Health insurance | Environmental health | United States Department of Health and Human Services | Health indicator | Reproductive health | Health assessment | Health Secretary | Health professional | Health (Health album) | Bausch Health | CVS Health |  (error) | Health benefit |  (error) | Health board |  (error) | Added 3434 words: Total 25701 words


In [12]:
corpus = add_corpus("Pain", 20, corpus)

Reading 20 pages...
Pain | Pain (disambiguation) |  (error) | Pain & Gain | T-Pain | To the Pain | Pain management | Power and Pain | Psychological pain | Pain stimulus | Knee pain | Neuropathic pain | Low back pain | Pelvic pain | Radicular pain | Analgesic | Bedabrata Pain | No Pain for Cakes | Abdominal pain | The Pain | Chest pain | Added 2730 words: Total 28431 words


In [13]:
corpus = add_corpus("Emotion", 10, corpus)

Reading 10 pages...
Emotion | Emotion classification | The Emotions | Emotion (disambiguation) |  (error) | Appeal to emotion | Emotion recognition | Art and emotion | Guilt (emotion) | Fisker EMotion | Passion (emotion) | Added 1457 words: Total 29888 words


Drop entries that contain no letters; for the rest, keep only the first run that starts with a letter.

In [14]:
corpus = {z[0] for z in [re.findall("[a-z]+[a-z0-9-]*", x) for x in corpus] if z}

In [15]:
len(corpus)

27644
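
The effect of that filter is easiest to see on a handful of entries. The sketch below applies the same expression to a made-up sample set (the entries are illustrative, not taken from the real corpus).

```python
import re

# Made-up sample entries; the real corpus comes from the Wikipedia pages above
sample_corpus = {"nf1", "2020", "1st", "x-ray", "--", "pain"}

matches = [re.findall("[a-z]+[a-z0-9-]*", x) for x in sample_corpus]
filtered = {z[0] for z in matches if z}
print(sorted(filtered))
```

Entries without letters ("2020", "--") are dropped, and leading digits are trimmed, so "1st" is kept as "st".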

Lemmatize the words to reduce the vocabulary size.

In [16]:
corpus = {lemmatize(x) for x in corpus}

In [17]:
#joblib.dump(corpus, 'data/corpus')

['data/corpus']

In [25]:
corpus = joblib.load('data/corpus')

#### Spelling
To improve accuracy, we apply a spelling-correction step to words that are not already in the corpus.

In [29]:
def spell_correct(text, corpus):
    ''' 
    INPUT: a string object 
    RETURN: new corrected string
    '''
    
    # for each word in text
    corrected = []
    for word in text.split():
        if word not in corpus:
            corrected.append(str(TextBlob(word).correct()))
        else:
            corrected.append(word)
    return ' '.join(corrected)

#### Sentence treatment
We identify symptoms per sentence. However, not everyone writes complete sentences or describes only one symptom per sentence. To account for this, we insert extra sentence breaks:
1. Break a sentence at a line break and at `and`, `or`, `,`, `&`, `;`.

In [30]:
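def break_sentence(text):
    '''
    INPUT: a string object
    RETURN: the text with extra sentence breaks inserted
    '''
    # treat each line break as a sentence boundary
    new_text = re.sub(r"\n", ". ", text)
    # break at standalone "and"/"or" and at , & ; (\b keeps words such as "for" or "hand" intact)
    return re.sub(r"\W*(?:\b(?:and|or)\b|[,&;])\W+", ". ", new_text)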
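
To illustrate the splitting rule, here is a standalone sketch. The name `split_clauses` is hypothetical, and the pattern wraps `and`/`or` in `\b` word boundaries so they only match as whole words, which is an assumption about the intended behavior.

```python
import re

def split_clauses(text):
    ''' hypothetical standalone copy of the splitting rule, with word boundaries '''
    text = re.sub(r"\n", ". ", text)
    # \b keeps "and"/"or" from matching inside words such as "hand" or "for"
    return re.sub(r"\W*(?:\b(?:and|or)\b|[,&;])\W+", ". ", text)

first_example = split_clauses("I have a high fever and headache")
second_example = split_clauses("pain for hours, numb hand")
print(first_example)
print(second_example)
```

Without the word boundaries, the "or" in "for" and the "and" in "hand" would also be candidates for a break whenever a non-word character follows them.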

In [31]:
def preprocess_full_text(text, corpus, sw = ['i', 'me', 'my', 'myself', 'we', 'our',
                         'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your',
                         'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
                         "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
                         'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll",
                         'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have',
                         'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but',
                         'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about',
                         'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
                         'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again',
                         'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
                         'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such',
                         'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
                         's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',
                         'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't",
                         'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven',
                         "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't",
                         'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
                         "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't", 'felt', 'feel', 'feels'], 
                        sentence_break = True):
    '''
    Take a text string as input.
    Preprocess it (lower-case, keep ASCII only, remove punctuation, correct spelling, remove stop words, lemmatize).
    Return a dictionary of sentences ({0: sentence, ...}) and a nested list of tokens [[tokens], ...].
    '''
    if isinstance(text, str):
        text = ascii_only(text.lower())
        
        if sentence_break:
            text = break_sentence(text)
        
        text_tokens = []
        sent_tokens = {}
        for i, sentence in enumerate(sent_tokenize(text)): 
            cl_sentence = remove_punctuations(sentence)
            cl_sentence = spell_correct(cl_sentence, corpus)
            tokens = word_tokenize(cl_sentence)
            token_set = set(lemmatize(word) for word in tokens if word not in sw)
            
            # only keep words in the corpus
            token_set = corpus & token_set
            token_set = list(token_set)
            text_tokens.append(token_set)
            sent_tokens[i] = sentence
        return sent_tokens, text_tokens
    else: 
        return 'no input'

### Word2Vec training
Below is the setup for the vector space. For the current version we use a pre-trained model; once we obtain enough text data, we can retrain the vector space with Word2Vec on more targeted language.

In [32]:
# for now we will use GloVe 
# for initial run 
# model = gensim.downloader.load('glove-wiki-gigaword-300')

In [33]:
# model.save("model/word2vec.model")
# model = KeyedVectors.load("model/word2vec.model")

In [34]:
# model.init_sims(replace=True) # normalize if we need to retrain, remove replace

In [35]:
# model.save("model/word2vec_norm.model")
model = KeyedVectors.load("model/word2vec_norm.model")

We now have a vector space. 
We represent each sentence by the average of its word vectors and use that average to estimate its similarity to each symptom. 

In [36]:
def get_avg_vectors(text, model, corpus, sentence_break = True):
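    '''
    INPUT
    =====
    text: str
    model: word embedding model (gensim KeyedVectors)
    corpus: set of known words
    sentence_break: bool, whether to insert extra sentence breaks first
    RETURN
    =====
    sent_tokens: dict of {index: sentence}
    avg_vec: list with one average word vector per sentence
    '''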
    if isinstance(text, str):
        sent_tokens, text_input = preprocess_full_text(text, corpus, sentence_break = sentence_break)
    else:
        print("Input text must be a string")
        return
        
    avg_vec = []
    
    for sentence in text_input:

        vectors = []
        
        for word in sentence:

            try:
                vectors.append(model[word])
                
            except KeyError:
                print(f'{word} does not exist.')
                continue
                
        avg = np.average(vectors, axis = 0)
        avg_vec.append(avg)
        
    return sent_tokens, avg_vec
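
The averaging step can be sketched without the pre-trained model. The example below uses made-up 3-dimensional embeddings and plain Python lists. The names `toy_model` and `average_vector` are hypothetical, and returning `None` for a sentence with no known words is a choice made for this sketch only.

```python
from statistics import fmean

# Made-up 3-dimensional embeddings standing in for the pre-trained model
toy_model = {
    "head": [1.0, 0.0, 0.0],
    "hurts": [0.0, 1.0, 0.0],
    "pain": [0.0, 1.0, 1.0],
}

def average_vector(tokens, embeddings):
    ''' hypothetical helper: mean of the vectors of the known tokens '''
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known word, so there is nothing to average
    # zip(*vectors) groups the values by dimension
    return [fmean(column) for column in zip(*vectors)]

known = average_vector(["head", "hurts", "unknownword"], toy_model)
unknown = average_vector(["unknownword"], toy_model)
print(known)
print(unknown)
```

Because the average ignores word order and weights every word equally, a phrase such as "my head hurts" is represented only by the mean of "head" and "hurts".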


In [37]:
# Selecting only symptoms (adding a bit more details to help with vector averaging)
symptoms = {'Spots': 'Spots, specks on skin', 
            'Itching': 'Itching',
            'Cognitive Difficulties': 'Thinking Difficulties', 
            'Vision Changes': 'Vision changes', 
            'Fractures' : 'Broken bone, fractures', 
            'Pain': 'pain',
            'Bowel or bladder control problems': 'bowel or bladder control problem', 
            'Breathing problems': 'breathing problems',
            'Problem with movement': 'problem with movement, balance', 
            'Numbness': 'loss of sensation, numbness, tingling, pins-and-needles, numb', 
            'Learning difficulties': 'learning disabilities',
            'Attention issues': 'attention issues, adhd, focusing', 
            'Nosebleed': 'nosebleed, bleeding nose', 
            'Heart Problem': 'heart problem, pounding in the chest',
            'High blood pressure': 'high blood pressure',
            'Chewing or swallowing problems': 'difficulty chewing or swallowing, dysphagia', 
            'Constipation': 'constipation, pooping', 
            'Poor weight gain': 'poor weight gain',
            'Gastroesophageal reflux': 'digestive disorder, gastroesophageal reflux disease, gerd, acidic stomach juices, acid reflux', 
            'Anxiety': 'anxiety, fear, worry, panic, attack, agoraphobia, phobias', 
            'Arthritis': 'arthritis, swelling and tenderness of joints, stiff limbs, pain', 
            'Depression': 'depression',
            'Difficulties with social interactions': 'difficulties with social interactions', 
            'Fatigue': 'fatigue',
            'Headaches or migraines': 'headache or migraines, face pain', 
            'Joint pain': 'joint pain, hip, shoulders, elbows, knees, wrists, ankles', 
            'Loose joints': 'loose joints, hypermobility, range of motion, hip, shoulders, elbows, knees, wrists, ankles', 
            'Muscle coordination issues': 'muscle coordination issues, ataxia',
            'Other mental health problems': 'other mental health problems, ptsd, psychosis', 
            'Seizures or epilepsy': 'seizure or epilepsy', 
            'Sleep disturbances': 'sleep disturbances, insomnia', 
            'Fever': 'fever'}     

##### Problem to solve
The model needs a way to handle symptoms that are not relevant to NF.  
(e.g. "high fever" should not be categorized as "high blood pressure")  
It also needs to give more weight to word combinations.  
(e.g. "my head hurts" should be closer to 'headache' than to 'large head size')

In [None]:
stop here. skip to load the symptom_vectors

In [38]:
_, symptom_vectors = get_avg_vectors('. '.join(symptoms.values()), model, corpus, 
                                  sentence_break = False)

In [39]:
len(symptom_vectors) == len(symptoms.keys())

True

In [40]:
symptom_vectors = dict(zip(symptoms.keys(), symptom_vectors))

In [41]:
joblib.dump(symptom_vectors, 'symptom_vectors')

['symptom_vectors']

In [44]:
symptom_vectors = joblib.load('symptom_vectors')

In [69]:
# for each sentence, see how close they are to target symptoms
### To-Do: update to include all symptoms with high similarity, order them
### Must return - output list, second likely batch, and the reference (sentence to symptom match)

def identify_symptom(text, symptom_vectors, model, corpus, threshold = 0.5):
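    '''
    For each sentence, find the symptoms whose similarity exceeds the threshold.
    Return the top-ranked symptoms, the second-ranked symptoms not already
    in the first set (or None), and a per-sentence reference.
    '''
    sent_tokens, avg_vec = get_avg_vectors(text, model, corpus)
    reference = {} # {0: {sentence: [(symptom, similarity), ...]}, ...}
    #pred_symptoms = {} 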

    for i, sent_vec in enumerate(avg_vec): 
        
        similarity_list = []
        
        for symp, sym_vec in symptom_vectors.items():
            # get similarity between symptom and sentence
            similarity = 1 - cosine(sent_vec, sym_vec)
            if similarity > threshold: 
                similarity_list.append((symp, round(similarity, 3)))
            
        reference[i] = {sent_tokens[i] : [x for x in sorted(similarity_list, 
                                                            key = lambda v: v[1], 
                                                            reverse = True)]}
    # get the highest similarity list
    vals = [x[0] for x in [list(x.values()) for x in reference.values()] if x[0]]

    # check if it predicted ANY symptom
    if vals: 
        first_options = set(x[0][0] for x in vals)
        second_options = set(x[1][0] for x in vals if len(x) > 1)
        second = second_options - first_options
        return first_options, second or None, reference
        
    else: 
        print('no symptom was detected')
        return ['No symptom detected'], None, reference
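
The scoring inside `identify_symptom` comes down to cosine similarity plus a threshold. The sketch below reproduces that logic with made-up 2-dimensional vectors and plain Python in place of `scipy`. The names `cosine_similarity`, `rank_symptoms`, and `toy_symptom_vectors` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    ''' dot(a, b) / (|a| * |b|), i.e. 1 minus the cosine distance '''
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up 2-dimensional vectors, for illustration only
toy_symptom_vectors = {
    "Fever": [0.9, 0.1],
    "Pain": [0.0, 1.0],
    "High blood pressure": [0.6, 0.8],
}

def rank_symptoms(sent_vec, symptom_vectors, threshold=0.5):
    ''' hypothetical helper: symptoms above the threshold, best match first '''
    scored = []
    for name, vec in symptom_vectors.items():
        similarity = cosine_similarity(sent_vec, vec)
        if similarity > threshold:
            scored.append((name, round(similarity, 3)))
    return sorted(scored, key=lambda v: v[1], reverse=True)

ranked = rank_symptoms([1.0, 0.0], toy_symptom_vectors)
print(ranked)
```

In this toy setup a second symptom still clears the 0.5 threshold alongside the best match, which is the same pattern as the "high fever" versus "high blood pressure" case noted earlier.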

In [70]:
# testing

In [82]:
text = '''
I feel fine
I have a high fever and headache
My head hurts
Things look a little strange
Breathing was harder than usual
Pain on the left side of the body. Also felt a bit of numbness
It seems like I had hard time moving my limbs
'''
first, second, reference = identify_symptom(text, symptom_vectors, model, corpus)

In [83]:
first, second

({'Breathing problems',
  'Fever',
  'Headaches or migraines',
  'Numbness',
  'Pain',
  'Problem with movement',
  'Vision Changes'},
 {'Arthritis',
  'Cognitive Difficulties',
  'Heart Problem',
  'High blood pressure'})

In [84]:
reference

{0: {'.': []},
 1: {'i feel fine.': []},
 2: {'i have a high fever.': [('Fever', 0.783),
   ('High blood pressure', 0.656)]},
 3: {'headache.': [('Headaches or migraines', 0.814),
   ('Arthritis', 0.545),
   ('Pain', 0.511)]},
 4: {'my head hurts.': []},
 5: {'things look a little strange.': [('Vision Changes', 0.54),
   ('Cognitive Difficulties', 0.531)]},
 6: {'breathing was harder than usual.': [('Breathing problems', 0.709),
   ('Cognitive Difficulties', 0.584),
   ('Poor weight gain', 0.505)]},
 7: {'pain on the left side of the body.': [('Pain', 0.659),
   ('Heart Problem', 0.656),
   ('Joint pain', 0.606),
   ('Arthritis', 0.584),
   ('High blood pressure', 0.576),
   ('Headaches or migraines', 0.565),
   ('Loose joints', 0.552),
   ('Spots', 0.539),
   ('Problem with movement', 0.537),
   ('Fractures', 0.508),
   ('Poor weight gain', 0.508),
   ('Nosebleed', 0.507)]},
 8: {'also felt a bit of numbness.': [('Numbness', 0.585),
   ('Pain', 0.561),
   ('Arthritis', 0.558),
   ('Br

### Return results as index

Pending: need to decide whether (and how) to store all similarities.

In [85]:
def return_symptom_id(symptom_list, keys):
    return [keys[keys.symptom == x].index[0] for x in symptom_list]

In [927]:
# return symptom_id for this
result_symptom_id = return_symptom_id(result, keys)
result_symptom_id

[12, 25, 24, 21, 58]