## Leveraging a Knowledge Base of Covid-19 Terms to Construct a Bayesian Risk Model

In this notebook, we will rely on topic modeling to extract all clinically relevant articles. Afterwards, we’ll utilize the articles to build a knowledge base of clinical terms. The knowledge base will be constructed in a semi-unsupervised manner. It will categorize our medical terms into classes, such as _symptoms_ and _comorbidities_. We’ll subsequently use the knowledge base to power an intelligent search tool. Our tool will leverage known medical concepts to return highly relevant results for risk-factor queries. Based on these results, we will construct a baseline Bayesian model for identifying high-risk patients. Using our knowledge base, we’ll build a wrapper function around our model, so that it can be applied to unstructured text. Our final result will be a baseline model for predicting patient risk, based on a written description of the patient.

Here is an outline of the notebook:

1. **Data Exploration**
2. **Topic Modeling**
    1. Topic Visualization
    2. Ranking Topics by Task Relevance
3. **Exploring Clinically Relevant Articles**
4. **Extracting Medical Terms from Relevant Sentences**
5. **Exploring the Extracted Medical Terms**
6. **Building a Knowledge Base of Medical Terms**
    1. Uncovering the Symptoms of Covid-19
    2. Uncovering Covid-19 Comorbidities
7. **Building a KB-Powered Search Tool to Probe for Risks**
8. **Building a Classifier to Predict Covid-19 Progression**

Let’s get started!

# 1. Data Exploration

We’ll start by loading the metadata into a Pandas DataFrame. Afterwards, we’ll count the number of records, while sampling the first few rows.

In [None]:
import pandas as pd

df = pd.read_csv('../input/CORD-19-research-challenge/metadata.csv', low_memory=False)
record_count = df.shape[0]
print(f"Our metadata includes {record_count} total records.")
df.head(3)

We have over 47,000 records. Some of the records lack an abstract. This is a problem. Our initial plan is to cluster our records based on abstract similarity. Let’s filter out the records lacking an abstract. 

In [None]:
df = df.dropna(subset=['abstract'])
percent_abstracts = 100 * df.shape[0] / (record_count)
print(f"{percent_abstracts:.2f}% of our records contain abstracts.")

18% of our records lack an abstract. We’ve filtered these out. Most of the initial records remain. Are any of these records duplicates? Let’s find out.

In [None]:
df[['title', 'abstract']].describe(include=['O'])

Approximately 38.6K of our 39K abstracts are unique. The rest are duplicates. We’ll delete these duplicates below.

In [None]:
df.drop_duplicates(subset='abstract', keep="first", inplace=True)
percent_remaining = 100 * df.shape[0] / (record_count)
print(f"{percent_remaining:.2f}% of our records remain after duplicate deletion.")

After the deletion, 81.75% of our initial records remain. Now, we’ll filter these records even further. Our goal is to explore SARS / Coronavirus risks. Thus, we want to ensure that every record mentions a term related to SARS or a coronavirus. We check for relevant mentions by applying a regular expression to the title and the abstract of each record. All records that lack a relevant match will be deleted.
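One subtlety is worth keeping in mind when OR-ing several patterns into one regular expression: `\b` binds more tightly than `|`. The sketch below uses a made-up pattern list and two illustrative variable names (`loose` and `strict`, neither of which appears elsewhere in the notebook) to show the difference between wrapping the whole alternation in boundaries and wrapping a non-capturing group. Which behavior is preferable depends on whether plurals such as _coronaviruses_ should count as matches.

```python
import re

# Hypothetical patterns, used only for this illustration.
patterns = ['SARS', 'coronavirus', 'COVID-19']
escaped = [re.escape(p) for p in patterns]

# Boundaries attach only to the first and last alternatives here.
loose = re.compile(r'\b%s\b' % '|'.join(escaped), flags=re.I)
# A non-capturing group applies the boundaries to every alternative.
strict = re.compile(r'\b(?:%s)\b' % '|'.join(escaped), flags=re.I)

text = 'Bat coronaviruses are diverse.'
print(bool(loose.search(text)))   # True: 'coronavirus' matches inside 'coronaviruses'
print(bool(strict.search(text)))  # False: 'coronaviruses' is not the whole word 'coronavirus'
```

The looser form is more forgiving of plurals and suffixes, while the stricter form avoids accidental matches inside longer words.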

In [None]:
import re
from functools import partial


def compile_pattern_list(pattern_list, add_bounderies=True, escape=True):
    """This function compiles an aggregated regular expression from a list of string
    patterns. It is frequently used elsewhere within this notebook.
    
    Parameters
    ----------
    pattern_list: [str]:
        A list of strings to be OR-ed together into a single regular expression.
    add_bounderies (optional): Bool
        If True, then word boundaries are added to the regular expression.
    escape (optional): Bool
        If True, then we run `re.escape` on each pattern string, to ensure proper compilation.
        
    Returns
    -------
    regex: re.Pattern
        A compiled regular expression object.
    """
    if escape:
        pattern_list = [re.escape(p) for p in pattern_list]
    
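    regex_string = r'|'.join(pattern_list)
    if add_bounderies:
        # Note: `\b` binds more tightly than `|`, so these boundaries attach only to the
        # first and last alternatives. Interior patterns can still match inside longer words.
        regex_string = r'\b%s\b' % regex_string
    
    return compile_(regex_string)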

# Compiles a case-insensitive regular expression. Re-used frequently elsewhere in the Notebook.
compile_ = partial(re.compile, flags=re.I)

pattern_list = ['SARS', 'Severe Acute Respiratory Syndrome', 'COVID-19', 'SARS-COV', 
                'coronavirus', 'corona virus', '2019-nCoV', 'SARS-CoV-2']

regex = compile_pattern_list(pattern_list)
is_match = [any([regex.search(e) for e in [title, abstract]])
            for title, abstract in df[['title', 'abstract']].values]
df = df[is_match]

print(f"{df.shape[0]} of our records reference either SARS or some coronavirus.")

After filtering, approximately 10,000 records remain. We’ll now run topic modeling on the records. Afterwards, we’ll focus our attention on those topics that relate to risk-factor analysis.

# 2. Topic Modeling

There are [multiple techniques](https://towardsdatascience.com/2-latent-methods-for-dimension-reduction-and-topic-modeling-20ff6d7d547) for topic modeling. [LDA](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158) is one very popular approach. However, the quality of the LDA outputs varies depending on the algorithm’s implementation. There are two ways to compute LDA’s posterior distribution: [MCMC and Variational Inference](https://towardsdatascience.com/bayesian-inference-problem-mcmc-and-variational-inference-25a8aa9bce29). MCMC yields accurate results but runs slowly on thousands of abstracts. Meanwhile, Variational Inference (VI) is much faster. Unfortunately, based on my experience, the VI topic outputs tend to be of substandard quality. Hence, when I run topic modeling, I prefer the following approach:

1. Extract initial topics using simple [LSA](https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python).

2. Cluster the LSA outputs for improved topic results.

Let’s prepare to execute LSA. First, we’ll vectorize our filtered abstracts, yielding a matrix of TFIDF vectors.
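For intuition about what a TFIDF vector holds, here is a from-scratch sketch on three toy documents. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to my understanding mirrors Scikit-Learn’s default settings. The `tfidf_vectors` helper and the toy documents are illustrative only, and the sketch skips stop-word removal.

```python
import math
from collections import Counter

# Toy documents, made up for this illustration.
toy_docs = ['coronavirus infects bats',
            'coronavirus infects patients',
            'patients need hospital care']

def tfidf_vectors(docs):
    """Returns (vocabulary, L2-normalized TFIDF vectors) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency (assumed default, see lead-in).
    idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = [counts[word] * idf[word] for word in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec])
    return vocab, vectors

vocab, vectors = tfidf_vectors(toy_docs)
for word, weight in zip(vocab, vectors[0]):
    print(f"{word}: {weight:.3f}")
```

In the first document, the rare word _bats_ receives a larger weight than _coronavirus_, which appears in two of the three documents.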

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
# This matrix has been normalized under default settings
tfidf_matrix = vectorizer.fit_transform(df.abstract)

Now, we’ll finish LSA by dimensionally reducing the matrix using Scikit-Learn’s `TruncatedSVD` class. We’ll set the number of reduced dimensions to 100, per Scikit-Learn’s recommended [LSA parameter settings](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html).

In [None]:
import numpy as np
from sklearn.decomposition import TruncatedSVD
# Truncated SVD is a stochastic algorithm. We set the random seed to ensure a consistent output.
np.random.seed(0)
lsa_matrix = TruncatedSVD(n_components=100).fit_transform(tfidf_matrix)


Essentially, LSA assigns a dimensionally-reduced vector to each abstract. We can leverage these vectors to further cluster our abstracts by topic. Personally, my preference is to cluster our LSA matrix in a completely unsupervised way, using [Markov clustering](https://medium.com/analytics-vidhya/demystifying-markov-clustering-aeb6cdabbfc7). However, good Markov clustering requires a time-consuming [matrix filtration step](https://academic.oup.com/bioinformatics/article/27/3/326/319615) to be effective. Therefore, in this exercise, we’ll carry out a simpler text clustering approach:

1. First, we’ll normalize our vectors. 
2. Next, we’ll carry out K-means. Since our normalized vectors are now points along a hypersphere, our K-means clustering output should be reliable (see [here](https://livebook.manning.com/book/data-science-bookcamp/chapter-15/v-3/206) for additional theoretical justifications).

Below, we’ll execute these steps, using an arbitrary K of 20. If necessary, we’ll adjust that K during our topic-exploration phase.
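The reason normalization helps can be checked numerically. For unit-length vectors, the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so K-means (which minimizes Euclidean distances) ends up grouping vectors by cosine similarity. The sketch below verifies the identity on two arbitrary vectors; `l2_normalize` is a hypothetical helper written for this illustration, not the Scikit-Learn function.

```python
import math

def l2_normalize(vec):
    # Scales a vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Two arbitrary example vectors.
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

cosine = sum(x * y for x, y in zip(a, b))
squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))

# On the unit hypersphere, the two quantities are tied together.
assert math.isclose(squared_distance, 2 - 2 * cosine)
print(f"cosine = {cosine:.4f}, squared distance = {squared_distance:.4f}")
```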

In [None]:
np.random.seed(0)
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

clusters = KMeans(n_clusters=20).fit_predict(normalize(lsa_matrix))
df['Index'] = range(clusters.size)
df['Cluster'] = clusters
# Clusters are stored as DataFrames for easier analysis.
cluster_groups = [df_cluster for  _, df_cluster in df.groupby('Cluster')]

### 2.1 Topic Visualization

Now, we are ready to visualize our clusters. I prefer to visualize NLP-based clusters using word clouds. Some members of the Data Science community look down upon word clouds. However, in my opinion, they can be an amazing visualization tool if their input is tweaked properly. Here is my preferred technique for word-cloud-based text-cluster visualization.

1. Select the indices of the documents that are present in a single cluster.
2. Sum up the TFIDF vectors associated with that index list. These sums allow us to rank the words in our vocabulary based on their importance to the cluster. In my experience, these TFIDF rankings are much more useful than the standard frequency rankings. See [here](https://livebook.manning.com/book/data-science-bookcamp/chapter-15/v-3/249) for more details.
3. Visualize a word cloud containing the top 10 ranked words within the cluster. The size of each word in the cloud is determined by its summed TFIDF value.

Let’s implement this strategy by defining a `cluster_to_image` function. The function will convert a single text cluster into a word-cloud image. We’ll then apply that function to visualize the first of our 20 clusters.

In [None]:
np.random.seed(0)
import matplotlib.pyplot as plt
from wordcloud import WordCloud


def cluster_to_image(df_cluster, max_words=10, tfidf_matrix=tfidf_matrix,
                     vectorizer=vectorizer):
    """This function converts a text-cluster into a word-cloud image.

    Parameters
    ----------
    df_cluster: DataFrame
        This DataFrame contains the document ids associated with a single cluster.
    max_words (optional): int
        The number of top-ranked words to include in the word-cloud image.
    tfidf_matrix (optional): csr_matrix
        A matrix of TFIDF values. Each row i corresponds to the ith abstract.
    vectorizer (optional): TfidfVectorizer
        A vectorizer object that tracks our vocabulary of words.
    
    Returns
    -------
    word_cloud_image: WordCloud
        A word-cloud image containing the top words within the cluster.

    """
    indices = df_cluster.Index.values
    summed_tfidf = np.asarray(tfidf_matrix[indices].sum(axis=0))[0]
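    # `get_feature_names_out` replaces `get_feature_names`, which was removed in scikit-learn 1.2.
    data = {'Word': vectorizer.get_feature_names_out(), 'Summed TFIDF': summed_tfidf}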
    # Words are ranked by their summed TFIDF values.
    df_ranked_words = pd.DataFrame(data).sort_values('Summed TFIDF', ascending=False)
    words_to_score = {word: score
                     for word, score in df_ranked_words[:max_words].values
                     if score != 0}
    
    # The word cloud's color parameters are modified to maximize readability.
    cloud_generator = WordCloud(background_color='white',
                                color_func=_color_func,
                                random_state=1)
    wordcloud_image = cloud_generator.fit_words(words_to_score)
    return wordcloud_image

def _color_func(*args, **kwargs):
    # This helper function will randomly assign one of 5 easy-to-read colors to each word.
    return np.random.choice(['black', 'blue', 'teal', 'purple', 'brown'])

wordcloud_image = cluster_to_image(cluster_groups[0])
plt.imshow(wordcloud_image, interpolation="bilinear")
plt.show()

Our first visualized cluster describes [bat-borne](https://en.wikipedia.org/wiki/Bat-borne_virus#Coronaviruses) coronaviruses. What sort of information is found in the other 19 clusters? Let's find out. We'll proceed to visualize all clusters in a 5-by-4 subplot grid.

In [None]:
def plot_wordcloud_grid(cluster_groups, num_rows=5, num_columns=4):
    # This function plots all clusters as word-clouds in 5x4 subplot grid.
    figure, axes = plt.subplots(num_rows, num_columns, figsize=(20, 15))
    cluster_groups_copy = cluster_groups[:]
    for r in range(num_rows):
        for c in range(num_columns):
            if not cluster_groups_copy:
                break
                
            df_cluster = cluster_groups_copy.pop(0)
            wordcloud_image = cluster_to_image(df_cluster)
            ax = axes[r][c]
            ax.imshow(wordcloud_image, interpolation="bilinear")   
            # The title of each subplot contains the cluster id, as well as the cluster size.
            ax.set_title(f"Cluster {df_cluster.Cluster.iloc[0]}: {df_cluster.shape[0]}")
            ax.set_xticks([])
            ax.set_yticks([])

plot_wordcloud_grid(cluster_groups)
plt.show()

Our 20 clusters cover a diverse collection of topics. These include:

* The [Feline Coronavirus](https://en.wikipedia.org/wiki/Feline_coronavirus), which thus far is only found in cats (Cluster 2).
* Efforts to develop a SARS vaccine (Cluster 5).
* The deadly [MERS](https://en.wikipedia.org/wiki/Middle_East_respiratory_syndrome) Coronavirus, originating in the Middle East, after a Camel-to-Person transmission (Cluster 6). 
* [RNA virus](https://en.wikipedia.org/wiki/RNA_virus) genetics (Cluster 7).
* Respiratory infections by the [Avian Coronavirus (IBV)](https://en.wikipedia.org/wiki/Avian_coronavirus), which originates in chickens (Cluster 9).
* The [protease cleavage](https://en.wikipedia.org/wiki/Protease#Viruses) of Coronavirus genomes into functional units, and how it could be inhibited (Cluster 12).
* The [fusion membrane proteins](https://en.wikipedia.org/wiki/Viral_protein#Viral_membrane_fusion_proteins)  used by the coronavirus to get inside our cells (Cluster 18).

Each of these clusters is interesting and worth exploring. However, the goal of this exercise is to analyze clinical risk factors, not membrane proteins or protease cleavage. Hence, many of the clusters are not relevant to the task at hand. Can we somehow rank the clusters based on their relevance to the task? Yes we can! We just need to use text similarity.

### 2.2 Ranking Topics by Task Relevance
We want to compare our clusters to our assigned Kaggle task. Let’s start by storing the description of the task within a string.

In [None]:
task_string = """Task Details
What do we know about COVID-19 risk factors? What have we learned from epidemiological studies?

Specifically, we want to know what the literature reports about:

Data on potential risks factors
Smoking, pre-existing pulmonary disease
Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities
Neonates and pregnant women
Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.
Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
Susceptibility of populations
Public health mitigation measures that could be effective for control"""

Next, we’ll compute a vector of cosine similarities between each abstract and the task string.

In [None]:
vectorizer = TfidfVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(list(df.abstract) + [task_string])
similarities_to_task = matrix[:-1] @ matrix[-1].T

Our `similarities_to_task` array contains the task similarities across all clustered abstracts. For any given cluster, we can combine these cosine similarities into a relevance score, by taking their mean. This allows us to sort all clusters by relevance. Let’s sort the 20 clusters and re-plot the results.
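To make the scoring rule concrete, here is a toy version with made-up similarity values and cluster labels for six documents. The names `toy_similarities`, `toy_clusters`, and `ranked_clusters` are illustrative only.

```python
from collections import defaultdict
from statistics import mean

# Made-up similarity scores and cluster labels for six documents.
toy_similarities = [0.10, 0.42, 0.38, 0.05, 0.20, 0.25]
toy_clusters = [0, 1, 1, 0, 2, 2]

# Group the similarity scores by cluster.
scores_by_cluster = defaultdict(list)
for cluster_id, score in zip(toy_clusters, toy_similarities):
    scores_by_cluster[cluster_id].append(score)

# The relevance of a cluster is the mean similarity of its members.
relevance = {cid: mean(scores) for cid, scores in scores_by_cluster.items()}
ranked_clusters = sorted(relevance, key=relevance.get, reverse=True)
print(ranked_clusters)  # [1, 2, 0]
```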

In [None]:
def compute_mean_similarity(df_cluster):
    # Computes the mean cosine similarity of a cluster to the task string.
    indices = df_cluster.Index.values
    return similarities_to_task[indices].mean()

mean_similarities = [compute_mean_similarity(df_cluster) 
                     for df_cluster in cluster_groups]
sorted_indices = sorted(range(len(cluster_groups)),
                        key=lambda i: mean_similarities[i], reverse=True)
sorted_cluster_groups = [cluster_groups[i] for i in sorted_indices]
plot_wordcloud_grid(sorted_cluster_groups)
plt.show()

The four most relevant clusters appear in the top row of our sorted results. These clusters focus on the clinical consequences of the coronavirus outbreak. The relevant clusters are dominated by terms such as _patients_, _health_, _disease_, _covid-19_, _hospital_, and _clinical_. Beyond the top row, our relevancy rapidly decreases. The subsequent two clusters in the second row focus on flu infections, and on MERS. However, they don’t reference Covid-19. Hence, in our analysis, we’ll focus on just the clusters in Row One. All other clustered papers will be deleted.

In [None]:
df = pd.concat(sorted_cluster_groups[:4])
print(f'Our 4 most relevant clusters cover {df.shape[0]} articles.')

The top four clusters contain 2,786 clinically relevant articles. Our intermediate goal is to build a smart search engine that’s powered by these articles. However, we will first need to explore these relevant articles in more detail.

# 3. Exploring Clinically Relevant Articles



In [None]:
df_relevant = pd.concat(sorted_cluster_groups[:4])
print(f'Our 4 most relevant clusters cover {df.shape[0]} relevant articles.')

All remaining articles are clinically relevant. However, not all articles are necessarily relevant to the 2020 coronavirus Pandemic. Some articles may refer to the Pandemics of the past, such as the SARS outbreak of 2003. Additionally, some articles may focus on flu outbreaks, while mentioning coronaviruses only in passing. Therefore, to maintain relevance, we need to filter the articles even further. Let’s only keep those articles that reference the Pandemic in their titles.

In [None]:
# These phrases are explicitly associated with the SARS-CoV-2 Pandemic.
pattern_list = ['COVID-19', 'novel coronavirus', 'novel corona virus', '2019-nCoV', 'SARS-CoV-2']
regex = compile_pattern_list(pattern_list)

df = df[[regex.search(t) is not None for t in df.title.values]]
print(f"{df.shape[0]} of our clinically relevant records reference the Pandemic in their title.")

Over 1000 articles remain. We’ll explore these articles on a sentence-by-sentence level. To start, we’ll tokenize all abstracts into sentences with the spaCy NLP library. Some of our tokenized sentences might be duplicates, containing slight variations of spacing or punctuation. As a precautionary measure, we will eliminate such duplicates from our results.
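The duplicate check boils down to comparing sentences by a normalized key rather than by their raw text. Here is a minimal sketch of that idea on three made-up sentences. The `dedup_key` helper is hypothetical, and unlike the cell below, it also drops tabs and newlines.

```python
import string

# Translation table that deletes every punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def dedup_key(text):
    # Drops punctuation, casing, and all whitespace, so near-identical sentences collide.
    return ''.join(text.translate(_PUNCT_TABLE).lower().split())

toy_sentences = ['Fever was common (98%).',
                 'fever was common  ( 98% )',
                 'Cough was less common.']

seen_keys, unique_sentences = set(), []
for sentence in toy_sentences:
    key = dedup_key(sentence)
    if key not in seen_keys:
        seen_keys.add(key)
        unique_sentences.append(sentence)

print(unique_sentences)
```

Only the first variant of the fever sentence survives, since both variants reduce to the same key.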


In [None]:
import spacy  
import string
nlp = spacy.load("en_core_web_sm")  

abstract_sentences = set() # Contains all unique sentences.
seen_sentences = set() # Used to track duplicates.
duplicate_count = 0 # Used to count duplicates.

trans_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
def simplify_text(text):
    # Removes all spacing, casing and punctuation, for better duplicate detection.
    return text.translate(trans_table).lower().replace(' ', '')

for abstract in df.abstract.values:
    for sentence in nlp(abstract).sents:
        simplified_text = simplify_text(sentence.text)
        if simplified_text not in seen_sentences:
            # Sentence is not a duplicate.
            seen_sentences.add(simplified_text)
            abstract_sentences.add(sentence.text)
            
        else:
            duplicate_count += 1
            
num_unique = len(abstract_sentences)
percent_unique = 100 * num_unique / (num_unique + duplicate_count)
print(f"Our {df.shape[0]} abstracts contain {num_unique} unique sentences.")
print(f"{int(percent_unique)}% of all sentences are unique.")

89% of the abstract sentences are unique. This leaves us with 13.2K unique sentences total. We can increase our sentence count by also including all full articles that are associated with `df`. Let’s now load the file paths to all relevant article JSONs at our disposal.

In [None]:
import glob
df_full = df[df.has_pdf_parse == True]
relevant_fnames  = {sha + '.json' for sha in df_full.sha.values}
relevant_fpaths = [p for p in glob.glob('../input/CORD-19-research-challenge/**/*.json', recursive=True)
                   if p.split('/')[-1] in relevant_fnames]

Next, we’ll parse all relevant JSONs and tokenize the full-article texts. The resulting body sentences will be combined with our abstract sentences into a single `all_sentences` set.

In [None]:
import json

body_sentences = set()
for file_path in relevant_fpaths:
    with open(file_path) as f:
        data = json.load(f)
        full_text = ' '.join(t['text'] 
                             for t in data['body_text'])
        
        for sentence in nlp(full_text).sents:
            simplified_text = simplify_text(sentence.text)
            if simplified_text not in seen_sentences:
                # Sentence is not a duplicate.
                seen_sentences.add(simplified_text)
                body_sentences.add(sentence.text)
                
all_sentences = abstract_sentences | body_sentences
print(f"Article body inclusion gives us {len(all_sentences)} unique sentences.")

We’ve obtained nearly 100K unique sentences. However, not all our sentences are equally relevant. Consider, for instance, those sentences that mention _Diabetes_. They are quite different from sentences that mention _Ebola_. Why? Well, Diabetes is a well-known comorbidity of Covid-19. Meanwhile, Ebola carries very little relevance to the current Pandemic. It’s only mentioned in passing, in reference to previous types of outbreaks. We can confirm this by printing ten sentences that mention Diabetes. Afterwards, we’ll print ten Ebola sentences for comparison.

In [None]:
def search(regex_string):
    # Returns sentences that match the inputted regex.
    regex = compile_(regex_string)
    return sorted([s for s in all_sentences if regex.search(s)])

diabetes_sentences = search('Diabetes')
ebola_sentences = search('Ebola')

print('DIABETES:')
for sentence in diabetes_sentences[-10:]:
    print(sentence)
    
print('\nEBOLA')
for sentence in ebola_sentences[-10:]:
    print(sentence)

The Diabetes sentences are obviously more relevant. They frequently mention other types of Covid-19 comorbidities and complications. Such mentions are occasionally accompanied by percentages corresponding to measurements. The mentioned terms are important enough to be measured across patients. Meanwhile, the Ebola sentences are much less important. Hence, they lack any sort of measured percentages. Of course, we’ve sampled 20 sentences total. Perhaps the observed percentage trend will not hold. Let’s count the number of Diabetes matches that contain percentages. We’ll then compare this value to the Ebola-percentage total.

In [None]:
num_percent_diabetes = len([s for s in diabetes_sentences if '%' in s])
num_percent_ebola = len([s for s in ebola_sentences if '%' in s])

print(f"{num_percent_diabetes} of the {len(diabetes_sentences)} Diabetes sentences contain a percentage.")
print(f"{num_percent_ebola} of the {len(ebola_sentences)} Ebola sentences contain a percentage")

Nearly one-third of the Diabetes sentences contain a percentage. Meanwhile, less than 2% of our less-relevant Ebola sentences contain a measured percentage. This provides some evidence for the following hypothesis: **Sentences with percentages are more likely to be relevant than sentences that lack percentages**. How many of our sentences contain percentages? We’ll now find out.

In [None]:
percent_sentences = {s for s in all_sentences if '%' in s}
print(f"{len(percent_sentences)} of our {len(all_sentences)} sentences contain a percentage.")

Over 7% of our sentences contain a percentage. We hypothesize that these sentences are clinically relevant. Hence, the medical terms within the sentences should be clinically relevant. We’ll now proceed to extract these medical terms from the texts in `percent_sentences`. Afterwards, we’ll leverage these terms in order to populate our knowledge base.
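The `'%' in s` test treats any percent sign as a measurement. If that ever proves too permissive, one possible refinement is to require a number directly before the sign. The sketch below is an assumption on my part rather than something used elsewhere in this notebook; `percent_regex` and the sample sentences are made up for illustration.

```python
import re

# Matches an integer or decimal number followed by an optional space and a percent sign.
percent_regex = re.compile(r'\d+(?:\.\d+)?\s*%')

toy_sentences = ['Diabetes was present in 20% of patients.',
                 'Mortality reached 2.3 % in this cohort.',
                 'The % sign alone carries no measurement.']

for sentence in toy_sentences:
    has_sign = '%' in sentence
    has_number = percent_regex.search(sentence) is not None
    print(f"sign={has_sign}, numeric={has_number}: {sentence}")
```

The third sentence passes the simple substring test but fails the numeric one.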

# 4. Extracting Medical Terms from Relevant Sentences

Clinical terms such as _Diabetes_ and _Hypertension_ appear as noun chunks within our relevant sentences. We can extract all noun chunks from a sentence with the help of the spaCy library. Below, we will define an `extract_noun_chunks` function that we’ll then apply to an example sentence.

In [None]:
from collections import defaultdict
def extract_noun_chunks(text):
    # Returns a set of unique noun chunks present in a text.
    text = remove_parans(text) # Strips out parenthesized text for better noun chunk parsing.
    return {noun_chunk.text.lower()
            for noun_chunk in nlp(text).noun_chunks}

def remove_parans(text):
    # Removes short stretches of parenthesized text for more accurate parsing.
    return re.sub(r'\((.{,20})\)', '', text).replace('  ', ' ')

text = ("Most of the infected patients were men (30 [73%] of 41); less than half "
        "had underlying diseases (13 [32%]), including diabetes (eight [20%]), "
        "hypertension (six [15%]), and cardiovascular disease (six [15%]).")
extract_noun_chunks(text)

All the relevant medical terms in our example sentence are included within the extracted noun chunks. However, not all these noun chunks are medical terms. Noun chunks such as _less than half_ and _men_ are a source of noise in our results. We need a way of extracting just the medical terms while ignoring all other results. One very simple way to extract medical phrases is with the [WordNet](https://en.wikipedia.org/wiki/WordNet) lexical database, which can be accessed using [NLTK](https://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html). WordNet allows us to access the hypernyms of any given word. A word’s hypernym is a broad ontological category to which that word belongs. For instance, both _Diabetes_ and _Hypertension_ fall under the hypernym of _Disease_. By leveraging a set of common medical hypernyms, we can define an `is_medical` function. The function extracts all hypernyms for each word within a text, and returns `True` if any hypernym is medical.

In [None]:
from nltk.corpus import wordnet as wn 
# Common medical hypernyms
medical_hypernyms = {'symptom', 'ill_health', 'negative_stimulus',
                    'bodily_process', 'disease', 'illness', 'pain',
                    'body_substance', 'medicine', 'drug', 'pathology',
                    'breathing', 'therapy', 'medical_care', 'treatment',
                    'disorder'}

def is_medical(text):
    # Returns True if any hypernym of a word in `text` overlaps with `medical_hypernyms`.
    for word in text.lower().split():
        hypernyms = set(get_hypernyms(word))
        if hypernyms & medical_hypernyms:
            return True
        
    return False 

def get_hypernyms(word):
    # Returns a list of hypernyms associated with the word.
    hypernym_list = []
    # Accesses all synsets (synonyms and usage-categories) of word
    synsets = wn.synsets(word)
    count = 0
    while synsets and count < 4:
        # Extracts all hypernyms of the most common synset. We limit ourselves to just the 
        # first 4 layers of the hypernym hierarchy.
        hypernyms = synsets[0].hypernyms()
        hypernym_list.extend([h.name().split('.')[0]
                              for h in hypernyms])
        synsets = hypernyms
        count += 1
    return hypernym_list


assert is_medical('Diabetes')
assert is_medical('Breathing problems')
assert not is_medical('problems')
assert not is_medical('men')

The main downside of hypernym analysis is lack of precision. The hypernyms of each word are analyzed individually, without regard to context. This leads to unexpected errors. For instance, consider the previously discussed phrase _less than half_. It just so happens WordNet records the word _less_ as a synonym of Lupus. Thus, _disease_ is erroneously recorded as a hypernym of _less_. Consequently, our `is_medical` function erroneously treats the phrase as a medical term.
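The failure mode is easy to reproduce without WordNet. The sketch below walks a tiny hand-written hierarchy that stands in for the real database. Every entry in `toy_parent` is invented for illustration, including the unexpected sense of _less_. Because each word is checked on its own, one odd sense is enough to flag the whole phrase.

```python
# A hand-written stand-in for a hypernym hierarchy (child -> parent). Not real WordNet data.
toy_parent = {
    'diabetes': 'polygenic_disorder',
    'polygenic_disorder': 'genetic_disease',
    'genetic_disease': 'disease',
    'less': 'lupus',          # Stand-in for the unexpected sense described above.
    'lupus': 'skin_disease',
    'skin_disease': 'disease',
    'half': 'part',
    'part': 'relation',
}
toy_medical_hypernyms = {'disease', 'illness', 'symptom'}

def toy_hypernyms(word, max_depth=4):
    # Climbs at most `max_depth` levels up the toy hierarchy.
    chain = []
    while word in toy_parent and len(chain) < max_depth:
        word = toy_parent[word]
        chain.append(word)
    return chain

def toy_is_medical(text):
    # Each word is judged independently, with no regard to context.
    return any(set(toy_hypernyms(word)) & toy_medical_hypernyms
               for word in text.lower().split())

print(toy_is_medical('diabetes'))        # True
print(toy_is_medical('half'))            # False
print(toy_is_medical('less than half'))  # True, which is a false positive
```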


In [None]:
print(f"First synset of 'less':\n{wn.synsets('less')[0]}")
print(f"\nHypernyms of 'less':\n{get_hypernyms('less')}")
assert is_medical('less than half')

We need a more precise way of validating medical terms. There’s one approach that guarantees high precision: scraping Wikipedia! Medical Wikipedia pages are very easy to spot. They almost always contain a _Medical Specialty_ field within their info box. With this in mind, we’ll now define a `search_wikipedia` function. The function will take as input a noun chunk. Afterwards, it will scrape the Wikipedia page associated with that noun chunk. If the page is not found, the function will return `None`. Otherwise, the function will look for a specialty field in the info box. If no specialty is found, we’ll return `None`. Otherwise, we will return the medical specialty, along with the page title. That title will allow us to gauge differences between the inputted noun chunk and the final redirected results.
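The info-box lookup can be tried offline on a static HTML snippet before any page is scraped. The sketch below swaps in Python’s built-in `html.parser` in place of Beautiful Soup, purely so that it runs without extra dependencies. The `InfoboxRowParser` class, the `find_specialty` helper, and the sample HTML are all hypothetical, and real Wikipedia markup is considerably messier.

```python
from html.parser import HTMLParser

class InfoboxRowParser(HTMLParser):
    """Collects the text of each <tr> found inside a <table class="infobox">."""
    def __init__(self):
        super().__init__()
        self.in_infobox = False
        self.table_depth = 0
        self.rows = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            classes = (dict(attrs).get('class') or '').split()
            if self.in_infobox:
                self.table_depth += 1  # A table nested inside the info box.
            elif 'infobox' in classes:
                self.in_infobox = True
                self.table_depth = 1
        elif tag == 'tr' and self.in_infobox:
            self._current = []

    def handle_endtag(self, tag):
        if tag == 'table' and self.in_infobox:
            self.table_depth -= 1
            if self.table_depth == 0:
                self.in_infobox = False
        elif tag == 'tr' and self._current is not None:
            self.rows.append(''.join(self._current).strip())
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

def find_specialty(html):
    # Returns the text that follows 'Specialty' in an info-box row, or None.
    parser = InfoboxRowParser()
    parser.feed(html)
    for row in parser.rows:
        if row.startswith('Specialty'):
            return row[len('Specialty'):].strip()
    return None

# A made-up snippet that imitates the shape of an info box.
sample_html = """
<table class="infobox">
<tr><th>Specialty</th><td>Endocrinology</td></tr>
<tr><th>Symptoms</th><td>Frequent urination</td></tr>
</table>
"""
print(find_specialty(sample_html))
```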

In [None]:
import requests
from bs4 import BeautifulSoup as bs

def search_wikipedia(noun_chunk):
    """Searches Wikipedia for a noun chunk. If a noun chunk is associated with a medical 
    specialty, then  the function returns both the specialty and the page title. Otherwise,
    it returns a None"""
    
    # Heuristic cleaning is carried out on the noun chunk.
    name = _clean_string(noun_chunk)
    # The Wikipedia page that we'll try to load.
    url = f'https://en.wikipedia.org/wiki/{name.replace(" ", "_")}'
    response = _scrape_url(url)
    if response is None:
        # No page loaded.
        return None
    
    # We parse the page using Beautiful Soup.
    soup = bs(response.text, 'html.parser')
    table = soup.find('table', class_='infobox')
    if not table:
        # No Info Box has been found.
        return None
    
    specialty = None
    for table_row in table.find_all('tr'):
        text = table_row.text
        if text.startswith('Specialty'):
            # We've uncovered a medical specialty.
            specialty = text[len('Specialty'):].strip()
            break
    
    if not specialty:
        # No specialty has been found.
        return None
    
    # We clean the title, prior to returning the medical specialty and the title.
    title = _clean_title(soup.title.text, name)
            
    return specialty, title

def _clean_string(text):
    # Filters out common patterns at the start of medical strings that interfere with Wiki crawling.
    deletion_patterns = [r'^(a|an|the)\b',  r'^(mild|low|moderate|severe|high|old)\b']
    for regex_string in deletion_patterns:
        text = re.sub(regex_string, '', text, flags=re.I).strip()
        
    return text


def _scrape_url(url):
    # Scrapes the url.
    for _ in range(3):
        # Repeat 3 times in case of timeout.
        try:
            response = requests.get(url, timeout=10)
            # Return scraped response if page has loaded.
            return response if response.status_code == 200 else None
        
        except requests.exceptions.Timeout:
            continue
    
    return None

def _clean_title(title, name):
    # Removes noise from the end of the string.
    title = title.replace('- Wikipedia', '').strip()
    # Sometimes symptom-terms redirect to disease pages. This could lead to confusion later in our
    # analysis, when we execute symptom extraction. Hence, we use a regex to deal with this edge-case.
    if title.endswith('disease') and re.search(r'symptoms?$', name):
        return name.capitalize()
        
    return title

for term in ['a moderate fever', 'dyspnea', 'diabetes', 'hypertension']:
    print(f"Searching for '{term}'.")
    specialty, alias = search_wikipedia(term)
    print(f"The term is also known as '{alias}.'")
    print(f"Its associated clinical specialties are: {specialty}.\n")
    
assert search_wikipedia('less than half') is None
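
To make the effect of the prefix filter concrete, here is a minimal standalone sketch of the two patterns used in `_clean_string`. The name `strip_prefixes` is hypothetical and introduced only for this illustration; its behavior is meant to mirror the helper above.

```python
import re

# The same two patterns that `_clean_string` applies, in order.
DELETION_PATTERNS = [r'^(a|an|the)\b', r'^(mild|low|moderate|severe|high|old)\b']

def strip_prefixes(text):
    # Hypothetical standalone copy of `_clean_string`, for illustration only.
    for pattern in DELETION_PATTERNS:
        text = re.sub(pattern, '', text, flags=re.I).strip()
    return text

for phrase in ['a moderate fever', 'The severe pneumonia', 'anemia', 'highly contagious']:
    print(f"{phrase!r} -> {strip_prefixes(phrase)!r}")
```

The word boundary `\b` is what keeps _anemia_ and _highly_ intact: the patterns only fire on whole leading words.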

Using Wikipedia to uncover medical terms carries multiple benefits over the hypernym strategy.

1. Term identification is more precise. 
2. Wikipedia redirects allow us to identify multiple aliases for the same medical terms. We thus can leverage these aliases to group synonymous medical terms together.
3. Accessing the clinical specialty allows us to group the terms by specialty type (lung disease, heart disease, etc.) for a more nuanced investigation.

Unfortunately, the strategy also has its downsides. Crawling individual Wikipedia pages is slow. Ideally, we’d have access to a local Wiki data-dump in order to speed up the query time. However, storing that data would require many gigabytes of storage. Currently, a 20 GB partial Wikipedia dump is [available on Kaggle](https://www.kaggle.com/jkkphys/english-wikipedia-articles-20170820-sqlite). Regrettably, that data does not include redirects or info-box information. Hence, we’ll need to scrape Wikipedia ourselves. It takes about 0.5 seconds to scrape an individual page. Hence, it takes on average 1-3 seconds to crawl all the noun-chunks in a sentence. Of course, we could speed things up using concurrency. However, given the resource limitations of our Kaggle notebook, concurrency should be avoided. With this in mind, we’ll devise and execute the following strategy.

* We’ll extract all unique noun-chunks from the percentage sentences found in abstracts.
* We’ll query the noun-chunks against Wikipedia, thus obtaining a subset of medical terms (while also tracking their associated clinical specialties).
* For every medical term, we’ll obtain the list of all percentage sentences that mention that term. Both abstract and body sentences will be included in the list.

Of course, we will miss clinical terms that are mentioned in the body but not in the abstract. However, the importance of the missed terms should be minimal, since important observations are usually referenced in the abstract.
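
Since each page request costs roughly half a second, it pays to look up every unique noun-chunk only once. The sketch below shows that idea with `functools.lru_cache`. Note that `fake_wiki_lookup` and its tiny lookup table are hypothetical stand-ins for a real Wikipedia query, included so the example runs offline.

```python
from functools import lru_cache

lookup_calls = []  # Records every "network" request that is actually made.

def fake_wiki_lookup(term):
    # Hypothetical stand-in for a slow Wikipedia request.
    lookup_calls.append(term)
    known_pages = {'fever': ('Infectious disease', 'Fever')}
    return known_pages.get(term)

@lru_cache(maxsize=None)
def cached_lookup(term):
    # Both hits and misses (None) are cached, so no term is requested twice.
    return fake_wiki_lookup(term)

for chunk in ['fever', 'patients', 'fever', 'patients', 'fever']:
    cached_lookup(chunk)

print(f"{len(lookup_calls)} requests were made for 5 noun-chunks.")
```

Caching the misses matters as much as caching the hits, since most noun-chunks are not medical terms.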

Let’s now use Wikipedia to extract medical terms from our relevant sentences.

In [None]:
from collections import defaultdict
# A mapping between medical terms and sentence mentions.
medical_terms = defaultdict(list)
# A mapping between medical terms and the associated page-titles / specialties 
# that were scraped from Wikipedia
term_to_wiki_data = {}
# A cache of all encountered noun chunks that are not medical terms.
bad_noun_chunks = set()

# We filter out certain noun-chunks to speed-up search results. This includes numeric nouns
# associated with percentages as well as various variations of the term SARS.
bad_patterns = [r'respiratory(\s+distress)? syndrome', r'[0-9]']
bad_regex = compile_pattern_list(bad_patterns, escape=False, add_bounderies=False)

for sent in percent_sentences & abstract_sentences:
    # We start by iterating over the relevant sentences from the abstracts.
    for noun_chunk in extract_noun_chunks(sent):
        if noun_chunk in bad_noun_chunks or bad_regex.search(noun_chunk):
            continue
                
        if noun_chunk in medical_terms:
            # This noun chunk is a known medical term. We append this sentence to that term's 
            # associated sentence list.
            medical_terms[noun_chunk].append(sent)
            continue

        # We do a wiki-search for a previously unseen noun-chunk.
        wiki_result = search_wikipedia(noun_chunk)
        if wiki_result is not None:
            # New medical term discovered using Wikipedia.
            medical_terms[noun_chunk].append(sent)
            term_to_wiki_data[noun_chunk] = wiki_result      
        else:
            # We cache the non-medical term.
            bad_noun_chunks.add(noun_chunk)
            
for sent in percent_sentences & body_sentences:
    for noun_chunk in extract_noun_chunks(sent):
        # We iterate over the relevant sentences in the article bodies and subsequently
        # update our sentence mentions.
        if noun_chunk in medical_terms:
            medical_terms[noun_chunk].append(sent)
        

# 5. Exploring the Extracted Medical Terms
To start, we’ll count the total number of medical terms. We’ll also output the top 10 most frequently mentioned terms.



In [None]:
print(f"We've obtained {len(medical_terms)} medical terms.")
print("The top 10 terms / mention-counts are:\n")
for term, sentences in sorted(medical_terms.items(), 
                              key=lambda x: len(x[1]), reverse=True)[:10]:
    print(term, len(sentences))

We’ve extracted 168 medical terms. Not surprisingly, the most frequently mentioned term happens to be _fever_. Let’s output the first five fever mentions.

In [None]:
for sentence in medical_terms['fever'][:5]:
    print(sentence)

Our mentions contain fluctuating percentages that detail the observed fraction of feverish patients. Of course, not all percentages are associated with fever. Some percentages pertain to other symptoms and observations. Is there a way to extract these percentages automatically, such that each percentage maps to its associated medical term? Yes, if we leverage the BERT-SQuAD model for question-answering.

## 5.1 Using BERT-SQuAD to Extract Measured Percentages

Suppose we have a sentence that mentions a fever. We want to know the percentage that’s associated with fever. How do we obtain that percentage? By asking! Using the [BERT-SQuAD](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15848021.pdf) model, we can run question-answering on [any](https://towardsdatascience.com/testing-bert-based-question-answering-on-coronavirus-articles-13623637a4ff) text of our choosing. So we can ask “What percentage is associated with Fever?” and obtain an answer. Or, better yet, we can simply ask “% Fever?” and still obtain a result. Below, we load a [Hugging Face](https://huggingface.co/transformers/) SQuAD-trained transformer model in order to define an `answer_question` function. Afterwards, we query an example fever mention in order to obtain the fever percentage.

In [None]:
!pip install --upgrade transformers

In [None]:
import torch
from transformers import DistilBertTokenizer                    
from transformers import DistilBertForQuestionAnswering
                                           
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-distilled-squad') 
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')

def answer_question(question, text):
    # Uses a DistilBERT model fine-tuned on SQuAD for question-answering.
    input_text = f"{question} [SEP] {text}" 
    input_ids = tokenizer.encode(input_text) 
    # Indexing the output works whether the model returns a plain tuple (older
    # releases) or a ModelOutput object (newer releases).
    with torch.no_grad():
        outputs = model(torch.tensor([input_ids]))
    start_scores, end_scores = outputs[0], outputs[1]
    all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
    # Merges word-piece fragments (marked with '##') back into whole words.
    answer = answer.replace(' ##', '').strip()
    return answer

text = ("The most common symptoms at the onset of illness were fever (82.1%), cough (45.8%), "
       "fatigue (26.3%), dyspnea (6.9%) and headache (6.5%).")
answer_question("Percent fever?", text)
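
The span-selection step inside `answer_question` can be illustrated without downloading a model. In the sketch below, the token list and the two score lists are made-up values chosen for illustration (a real model produces one start score and one end score per token). Only the selection and clean-up logic mirrors the function above, and `select_answer` is a hypothetical name.

```python
def select_answer(tokens, start_scores, end_scores):
    # The answer spans the highest-scoring start token through the
    # highest-scoring end token, inclusive.
    start = max(range(len(tokens)), key=lambda i: start_scores[i])
    end = max(range(len(tokens)), key=lambda i: end_scores[i])
    answer = ' '.join(tokens[start: end + 1])
    # Word-piece fragments are prefixed with '##'; we glue them back together.
    return answer.replace(' ##', '').strip()

# Hypothetical tokens and scores, for illustration only.
tokens = ['[CLS]', 'percent', 'fever', '?', '[SEP]',
          'fever', '(', '82', '.', '1', '%', ')', 'and',
          'dy', '##sp', '##nea', '[SEP]']
start_scores = [0, 0, 0, 0, 0, 1, 2, 9, 1, 1, 1, 0, 0, 0, 0, 0, 0]
end_scores = [0, 0, 0, 0, 0, 0, 0, 2, 1, 3, 9, 1, 0, 0, 0, 0, 0]
print(select_answer(tokens, start_scores, end_scores))
```

The selected span comes back with spaces around the decimal point and the percent sign, which is the kind of token spacing that any downstream number parsing has to remove.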

Our strategy worked! Let’s now define a task-specific `ask_percentage` function. The function will query an input text for the percentage that’s associated with an input term. If found, the extracted percentage will be converted to a numeric quantity.

In [None]:
def ask_percentage(term, text):
    # Asks for the percentage that's associated with an input medical term.
    question = f'Percent {term}?'
    answer = answer_question(question, text)
    # Strips out extra spaces returned by the DistilBERT model.
    answer = answer.replace(' %', '%').replace(' . ', '.')
    # Extracts the percentage from the answer, if any.
    percentage = _extract_percentage(answer)
    # Converts the percentage into a float.
    return float(percentage[:-1]) if percentage else None
    
regex_percent = compile_( r'\b[0-9]{1,2}(\.[0-9]{1,2})?%')
def _extract_percentage(text):
    # Uses a regex to extract a percentage from the text.
    match = regex_percent.search(text)
    return text[match.start(): match.end()] if match else None 

ask_percentage('fever', text)
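
To see what the percentage pattern accepts, the sketch below applies the same regular expression through plain `re.compile`. That is an assumption, since `compile_` is a helper defined elsewhere in the notebook. The function name `first_percentage` is hypothetical.

```python
import re

# Same pattern as `regex_percent`, compiled here with plain `re.compile`.
percent_pattern = re.compile(r'\b[0-9]{1,2}(\.[0-9]{1,2})?%')

def first_percentage(text):
    # Returns the first percentage in the text as a float, or None.
    match = percent_pattern.search(text)
    return float(match.group(0)[:-1]) if match else None

for snippet in ['fever (82.1%)', 'dyspnea 6.9%', 'no numbers here', '100% of cases']:
    print(f"{snippet!r} -> {first_percentage(snippet)}")
```

Because the pattern allows at most two digits before the decimal point and requires a word boundary in front, a value such as `100%` is not matched.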

Given our list of fever mentions, we can attempt to extract a fever percentage for each mention. Afterwards, we can compute a single representative percentage using either the median or mean of our results. Subsequently, we can choose a representative mention whose percentage matches our representative value. We’ll now define a `match_percentages` function that will execute these steps.

In [None]:
def match_percentages(term, text_list):
    """Returns a representative percentage of a term across a list of mentions, along
    with its representative mention.
    
    Parameters
    ----------
    term: str
        A clinical term whose percentage we wish to obtain (Example: Fever).
    text_list: [str]
        A list of text mentions. Each mention potentially contains a percentage
        that matches `term`.
    
    Returns
    -------
    representative_percentage: float
        The extracted percentage that is closest to the median.
    representative_mention: str
        The element of `text_list` in which `representative_percentage` has been found.
    """
    matches = []
    for text in text_list:
        percentage = ask_percentage(term, text)
        if percentage is not None and percentage != 95:
            # We filter all references to 95%. These mostly correspond with statistical significance,
            # and not with observed frequencies. Note that `ask_percentage` returns 95.0 (not 0.95)
            # for the string '95%'.
            matches.append((percentage, text))
    
    if not matches:
        # No percentages have been found.
        return None, None
    
    percentages = np.array([m[0] for m in matches])
    # We use the median instead of the mean, due to risk of overweighed extreme values.
    dist_to_median = np.abs(percentages - np.median(percentages))
    return matches[dist_to_median.argmin()]
    
rep_percentage, rep_mention = match_percentages('fever', medical_terms['fever'])
print(f"\nThe representative percentage of fever is {rep_percentage}%.")
print(f"It occurs in the following sentence:\n{rep_mention}")
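
The selection rule in `match_percentages` is easy to check on toy data. The sketch below uses the standard library's `statistics.median` in place of NumPy, and the percentages and mention labels are invented for illustration.

```python
from statistics import mean, median

# Hypothetical (percentage, mention) pairs; the 12.0 is a deliberate outlier.
matches = [(82.1, 'mention A'), (87.9, 'mention B'), (12.0, 'mention C'),
           (88.7, 'mention D'), (98.6, 'mention E')]

percentages = [m[0] for m in matches]
mid = median(percentages)
# We keep the pair whose percentage lies closest to the median.
representative = min(matches, key=lambda m: abs(m[0] - mid))

print(f"Mean: {mean(percentages):.2f}, median: {mid}")
print(f"Representative pair: {representative}")
```

The single outlier drags the mean below four of the five observations, while the median still lands on an observed value. That is the motivation for anchoring the representative mention to the median.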

We’ve obtained a single fever percentage, along with its associated sentence. However, our analysis of the _Fever_ term is incomplete. Currently, there exist multiple terms within our `medical_terms` dictionary that redirect to the _Fever_ Wiki-page. All these terms are essentially aliases of _Fever_. Hence, their mentions should be combined into a single `all_fever_sentences` list. Afterwards, we’ll want to re-run `match_percentages` on that list.

In [None]:
all_fever_sentences = []
fever_aliases = []

_, fever_title =  term_to_wiki_data['fever']
# The term 'fever' redirects to a Wiki-page titled 'Fever'.
assert fever_title == 'Fever'

for term, sentences in medical_terms.items():
    _, wiki_title = term_to_wiki_data[term]
    if wiki_title == fever_title:
        fever_aliases.append(term)
        all_fever_sentences.extend(sentences)


print(f"The following {len(fever_aliases)} terms are aliases of Fever:\n {fever_aliases}")

rep_percentage, rep_mention = match_percentages('fever', all_fever_sentences)
print(f"\nThe representative percentage of Fever is {rep_percentage}%.")
print(f"It occurs in the following sentence:\n{rep_mention}")

We’ve obtained a representative percentage for all our fever mentions. In the process, we have also aggregated all the distinct aliases of the symptom under a single _Fever_ name. Now, we will apply a similar strategy to all our remaining medical terms. For every given term we will:

1. Uncover all the aliases that redirect to the same Wiki-page as the term.
2. Aggregate together all the aliases and assign them their representative Wiki-page title.
3. Compute a representative percentage and a representative mention.
4. Store the results in a Pandas DataFrame.

The DataFrame will include aliases, and the representative percentages. It will also include mention counts. Additionally, we’ll add the Wikipedia-extracted medical specialty to the DataFrame, for further analysis.
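
Steps 1 and 2 amount to grouping terms by their Wikipedia page title. The sketch below shows that grouping on a few made-up entries. The dictionary contents are hypothetical, though they follow the `(specialty, title)` layout of `term_to_wiki_data`.

```python
from collections import defaultdict

# Hypothetical lookups in the (specialty, title) layout used above.
toy_wiki_data = {'fever': ('Infectious disease', 'Fever'),
                 'pyrexia': ('Infectious disease', 'Fever'),
                 'dyspnea': ('Pulmonology', 'Shortness of breath'),
                 'shortness of breath': ('Pulmonology', 'Shortness of breath')}
# Hypothetical sentence IDs that mention each term.
toy_mentions = {'fever': ['s1', 's2'], 'pyrexia': ['s3'],
                'dyspnea': ['s4'], 'shortness of breath': ['s5', 's6']}

toy_aliases = defaultdict(list)
toy_sentences = defaultdict(list)
for term, (_, title) in toy_wiki_data.items():
    # Terms that share a page title are treated as aliases of one concept.
    toy_aliases[title].append(term)
    toy_sentences[title].extend(toy_mentions[term])

for title in sorted(toy_aliases):
    print(title, toy_aliases[title], len(toy_sentences[title]))
```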


In [None]:
# Maps a medical word to its aliases, based on wikipedia redirects.
aliases = defaultdict(list)
# An aggregation of all mentions that are associated with a term.
aggregated_sentences = defaultdict(list)
for term, sentences in medical_terms.items():
    _, name = term_to_wiki_data[term]
    aliases[name].append(term)
    aggregated_sentences[name].extend(sentences)
    
table = {'Name': [], 'Count': [], 'Specialties': [], 'Percentage': [],
         'Percentage Text': [], 'Aliases': []}

# Sorting ensures consistency of indices in the DataFrame.
for name, alias_list in sorted(aliases.items()):
    # Iterate over each aggregated medical concept.
    sentences = aggregated_sentences[name]
    count = len(sentences)
    percentage, percent_text = match_percentages(name, sentences)
    specialties = term_to_wiki_data[alias_list[0]][0]
    table['Name'].append(name)
    # The count of total aggregated sentences.
    table['Count'].append(count)
    # The wiki-determined medical specialties associated with the concept.
    table['Specialties'].append(specialties)
    # The computed representative percentage.
    table['Percentage'].append(percentage)
    table['Percentage Text'].append(percent_text)
    # The aggregated aliases of the medical concept
    table['Aliases'].append(alias_list)
    
df_medical = pd.DataFrame(table)
df_medical.head(3)

We now have a table of aggregated medical terms, along with their measured percentages. This table will serve as the basis for our medical knowledge base. We’ll proceed to develop that medical knowledge base, by differentiating between symptoms and comorbidities.

# 6. Building a Knowledge Base of Medical Terms

We’ll start by probing our table in more detail. First, we’ll sort the table by mention count, using the specialty name to break ties. Afterwards, we’ll output the top 10 sorted medical terms.


In [None]:
df_medical.sort_values(['Count', 'Specialties'], ascending=False, inplace=True)
df_medical.head(10)

Our top 10 terms fall into several broad categories. Some terms, like _Fever_ and _Cough_, are symptoms of Covid-19. Other terms, like _Diabetes_ and _Hypertension_, are its comorbidities. Can we separate the comorbidities from the symptoms? We can try! Let’s turn our attention to extracting the symptoms.

## 6.1 Uncovering the Symptoms of Covid-19

Given a line of text, how do we extract all the medical symptoms mentioned in that text? Well, we can just ask for the symptoms! After all, we still have our transformer-powered `answer_question` function at our disposal. All we need to do is pass a question such as `'What were the symptoms?'` into `answer_question`, along with the text. Afterwards, we’ll obtain our result.

In [None]:
text = "Fever ( 89.8% ) and Cough ( 67.2% ) were common clinical symptoms, while diabetes and hypertension were a common comorbidity." 
# Please note that question-answering works better if we strip out the parentheses from the text.
answer_question('What were the symptoms?', remove_parans(text))

With this in mind, let’s define an `is_symptom` function. The function will take as input some medical term, such as _Fever_, _Cough_, or _Diabetes_. Afterwards, the function will iterate over all sentences that mention that term. For each sentence, the function will ask for symptoms. It will then check whether the term is found among the symptoms. If at least 10% of the sentences reference the term amongst the symptoms, then `is_symptom` will return `True`.

In [None]:
# Constructs a symptom regular expression. Symptoms are sometimes referenced as "manifestations."
symptom_synonyms = ['symptom', 'manifestation']
# A word-boundary is not added to the right-side of the regex, to allow for plurality.
symptom_regex = compile_pattern_list([r'\b' + s for s in symptom_synonyms],
                                     escape=False, add_bounderies=False,)

def is_symptom(term, min_fraction=0.1): 
    """Returns True if a medical term is referred to as a symptom among its mentions.
    
    Parameters
    ----------
    term: str
        A medical term within our `df_medical` DataFrame.
    min_fraction (optional): float
        The minimum fraction of `term` sentences that must refer to the term 
        as a symptom for the function to return True. Preset to 10%.
    """
    aliases = df_medical.loc[df_medical.Name == term].Aliases.values[0]
    # Compiles a regular expression containing all the aliases of `term`.
    alias_regex = aliases_to_regex(aliases)
    count = 0
    # The minimum number of symptom matches required to return True.
    min_count = int(min_fraction  * len(aggregated_sentences[term]))
    for sentence in aggregated_sentences[term]:
        # We ask for the symptoms of the sentence.
        answer = ask_for_symptoms(sentence)
        if answer and alias_regex.search(answer):
            # The term is referred to as a symptom in the sentence.
            count += 1
            if count >= min_count:
                return True
            
    return False        
    

def ask_for_symptoms(text):
    """Using question-answering to ask for the symptoms in the text"""
    text = remove_parans(text)
    for question in [f'What are the {s}s?' for s in symptom_synonyms]:
        # We ask for symptoms, as well as clinical manifestations.
        answer = answer_question(question, text)

        if not answer:
            continue
        
        # To improve precision, we check for the mention of symptoms 
        # in the answer, or right before the answer.
        if re.search(r'\b(symptoms|manifestations)\s+(were|are)$',
                     text.split(answer)[0].strip()):
            return f"symptoms are {answer}"
        
        if symptom_regex.search(answer):
            return answer
    
    return None
                
def aliases_to_regex(aliases):
    # Transforms the aliases into a regular expression. The longer aliases 
    # are matched first.
    aliases = sorted(aliases, key=len, reverse=True)
    return compile_pattern_list(aliases)

for term in ['Fever', 'Cough', 'Diabetes', 'Hypertension']:
    if is_symptom(term):
        print(f"{term} is a symptom.")
    else:
        print(f"{term} is not a symptom.")
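
The counting logic in `is_symptom` is worth isolating, since the threshold behaves slightly differently for rarely mentioned terms. In the sketch below, each boolean stands for one sentence and marks whether the question-answering step named the term as a symptom. The name `meets_threshold` is hypothetical.

```python
def meets_threshold(symptom_flags, min_fraction=0.1):
    # Mirrors the early-exit counting in `is_symptom`, with one boolean
    # per sentence in place of a question-answering call.
    min_count = int(min_fraction * len(symptom_flags))
    count = 0
    for flag in symptom_flags:
        if flag:
            count += 1
            if count >= min_count:
                return True
    return False

# 20 sentences require at least 2 matches at the default 10% threshold.
print(meets_threshold([True] + [False] * 19))
print(meets_threshold([True, True] + [False] * 18))
# With only 5 sentences, the minimum count rounds down to zero.
print(meets_threshold([False, False, True, False, False]))
```

For a term with fewer than ten sentences, `int(min_fraction * n)` rounds down to zero, so a single matching sentence is enough. A term with no matching sentences still returns `False`, because the threshold is only checked after a match.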

We’ll now proceed to isolate all symptoms in a separate `df_symptom` DataFrame. All other medical terms will be added to `df_not_symptom`.


In [None]:
are_symptoms = np.array([is_symptom(name) for name in df_medical.Name.values])
df_symptom = df_medical[are_symptoms]
df_not_symptom = df_medical[~are_symptoms]
print(f"{df_symptom.shape[0]} of our medical terms are symptoms.")
print(f"{df_not_symptom.shape[0]} of our medical terms are not symptoms.")

20 of our medical terms are symptoms. Let’s output these symptoms for review.

In [None]:
df_symptom

Our symptoms all make sense. Now let’s sample from our `df_not_symptom` DataFrame.

In [None]:
df_not_symptom.head(10)

Some of our printed terms are comorbidities (_Diabetes_, _Hypertension_). Other terms, like _Pneumonia_, represent medical consequences of Covid-19 onset. We’ll now turn our efforts to extracting the comorbidities.


## 6.2 Uncovering Covid-19 Comorbidities

Uncovering comorbidities is trickier than uncovering the symptoms. The language around comorbidities tends to be more varied. Hence, it’s harder to extract these comorbidities without some supervision. We’ll need to manually sample from `df_not_symptom` in order to extract our initial comorbidity data. Purely random sampling will not be productive. We need to sample in an intelligent manner, to ensure that the most common comorbidities appear in our sampled results. One way to sample intelligently is to leverage our existing `Specialties` column. Medical specialties associated with common comorbidities are expected to occur more frequently. Hence, sampling from the most frequent specialties should yield some insightful results. Therefore, let’s print out the most frequently mentioned specialties. We’ll only print those specialties that are mentioned more than twice.

In [None]:
from collections import Counter
# We count specialties by the number of terms that fall within each specialty. Each
# unique term is counted only once. Alternatively, we could weigh each specialty
# by the total number of term mentions.
specialties = Counter(df_not_symptom.Specialties.values)
print(f"We have {len(specialties)} medical specialties in total.")
print("The top-ranking specialties are:\n")
for specialty, count in specialties.most_common():
    if count <= 2:
        break
        
    print(f"{specialty}:  {count}")

We’ve outputted seven top-ranking specialties. Let’s now examine these specialties one-by-one. For each specialty, we will gather up all the associated terms. Afterwards, we will print out representative sentences for each term. Within each sentence, we’ll highlight the term in red, for easier readability. We’ll also highlight each term’s measured percentage in blue. Furthermore, we’ll include the term’s DataFrame index. This will allow us to isolate those indices that correspond to comorbidities.

In order to highlight medical term matches, we’ll convert all our outputs to HTML. Subsequently, we’ll render that HTML. For this purpose, we’ll define a `display_specialty` function. Given an input specialty, the function  will display a rendered summary, in the manner described above.


In [None]:
from IPython.display import display, HTML

def display_specialty(specialty):
    # Visualizes key information about a specialty using HTML.
    html = specialty_to_html(specialty)
    display(HTML(html))
    
def specialty_to_html(specialty):
    # Extracts key information about a specialty, highlighting features using HTML.
    df = df_not_symptom[df_not_symptom.Specialties.isin([specialty])]
    #html = f'<h3>{specialty}</h3></br>'
    html = ''
    for index, row in df.iterrows():
        # We iterate over all non-symptomatic medical terms within the specialty.
        tup = row[['Name', 'Count', 'Aliases', 'Percentage', 'Percentage Text']]
        name, count, aliases, percentage, text = tup
        # For each medical term, we output its index as well as count.
        html += f'<h4>{index} {name.upper()}: Count {count}</h4>' 
        
        if text:
            # A representative percentage is associated with the medical term.
            # We'll highlight that percentage within the representative text.
            percentage = str(int(percentage)) if int(percentage) == percentage else str(percentage)
            # The percentage is escaped, since its decimal point is a regex wildcard.
            regex = re.compile(r'\b%s\b' % re.escape(percentage))
            # The percentage is boldened and colored blue.
            text = add_markup(regex, text, multi_matches=False,
                              color='blue', bold=True)     
        else:
            # If no percentage is found, then we fall back on the first sentence that's
            # associated with the medical term.
            text = aggregated_sentences[name][0]
        
        regex = aliases_to_regex(aliases)
        # We color the medical term red within the text, for a better display.
        text = add_markup(regex, text, color='red')
        html += text + '</br></br>'
    
    return f'<html>{html}</html>'
    
        
def add_markup(regex, text, multi_matches=True, **kwargs):
    """ Adds markup to all matches of a regular expression within the text.
    
    Parameters
    ----------
    regex: re.Pattern
        A compiled regular expression that we match against the text.
    text: str
        Our inputted text string.
    multi_matches (optional): bool
        If True, then multiple regex matches will be marked up within the
        text
        
    Returns
    -------
    marked_text: str
        Text with HTML markup added to all regex matches.
    """
    offset = 0
    for match in regex.finditer(text):
        old_length = len(text)
        span = (match.start() + offset, match.end() + offset)
        text = _add_span_markup(span, text, **kwargs)
        # Offset tracks length-shift in the text due to markup addition.
        offset += len(text) - old_length
        if not multi_matches:
            break
        
    return text
    
def _add_span_markup(span, text, color='black', bold=False):
    """Adds markup across a single span of text. Colors and optionally
    bolds that span.
    
    Parameters
    ----------
    span: (int, int)
        The start and end span of text that we'll markup.
    text: str
        The complete text
    color (optional): str
        The color to assign the marked-up span.
    bold (optional): bool
        If True, then bolds the marked-up span.
        
    Returns
    -------
    marked_text: str
        Text with HTML markup added across the specified span.
    """
    start, end = span
    snippet = text[start: end]
    html_snippet = f'<font color="{color}">{snippet}</font>'
    if bold:
        html_snippet = f'<b>{html_snippet}</b>'
    
    return text[:start] + html_snippet + text[end:]
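
For comparison, the same markup can be sketched with `re.Pattern.sub`, which handles the offset bookkeeping internally. This is an alternative formulation under the same keyword arguments, not the implementation used above, and the name `add_markup_sub` is hypothetical.

```python
import re

def add_markup_sub(regex, text, multi_matches=True, color='black', bold=False):
    # Alternative sketch of `add_markup` built on `re.Pattern.sub`.
    def wrap(match):
        snippet = f'<font color="{color}">{match.group(0)}</font>'
        return f'<b>{snippet}</b>' if bold else snippet

    # count=0 replaces every match; count=1 replaces only the first one.
    return regex.sub(wrap, text, count=0 if multi_matches else 1)

fever_regex = re.compile(r'fever', flags=re.I)
print(add_markup_sub(fever_regex, 'Fever and fever', color='red'))
print(add_markup_sub(fever_regex, 'Fever and fever', multi_matches=False,
                     color='blue', bold=True))
```

Both versions should produce identical strings. The explicit offset loop makes the span arithmetic visible, while `sub` leaves it to the regex engine.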


We are now ready to investigate the specialties. Let’s start by viewing the infectious diseases.

In [None]:
display_specialty('Infectious disease')

The infectious disease output does not appear that useful. No comorbidities are present. Most of the output refers to other infectious epidemics that are analogous to SARS-CoV-2. Perhaps we’ll have more luck with Pulmonology, which is next on our specialty list.

In [None]:
display_specialty('Pulmonology')

In this output, we see several comorbidities, including COPD (Index 22) and Asthma (Index 9). Let’s store these comorbidity indices in a set for later use. We also observe several terms that are actually Covid-19 symptoms and belong in our `df_symptom` DataFrame: Sputum (Index 96), Wheeze (Index 104), and Crackles (Index 33). Let’s save the symptom indices in a separate set. Later, we will transfer these indices to `df_symptom` for maximum recall.

In [None]:
comorbidity_indices = {22, 9}
symptom_indices = {96, 104, 33}

Next, we'll examine Hematology.

In [None]:
display_specialty('Hematology')

No comorbidities have been observed. Let's move on to Ophthalmology.

In [None]:
display_specialty('Ophthalmology')

Conjunctivitis or pink-eye (29) is a known ocular symptom of Covid-19. So is Dry Eye Syndrome (42). We’ll add these to our `symptom_indices` set. Meanwhile, diabetic retinopathy (37) is a comorbidity. We'll add it to `comorbidity_indices`.

In [None]:
comorbidity_indices.add(37)
# Conjunctivitis (29) and Dry Eye Syndrome (42), as noted above.
symptom_indices.update([29, 42])

Now, we'll take a look at Cardiology.

In [None]:
display_specialty('Cardiology')

Hypertension (56) and Cardiovascular Disease (15) are comorbidities.

In [None]:
comorbidity_indices.update([56, 15])

Let's move on to Endocrinology.

In [None]:
display_specialty('Endocrinology')

Diabetes (36), Obesity (80) and Hypothyroidism (60) are comorbidities.

In [None]:
comorbidity_indices.update([36, 80, 60])

Finally, we'll examine Nephrology.

In [None]:
display_specialty('Nephrology')

Chronic Kidney Disease (21) is a comorbidity.

In [None]:
comorbidity_indices.add(21)

We’ve examined the top 7 of our 47 specialties by hand. Do we really need to examine the remaining 40 specialties? Not necessarily. All the remaining specialties contain no more than 2 medical terms. There’s a reasonable likelihood that these terms aren’t actual comorbidities. However, some of them are. We can heuristically assume that such a term will be mentioned alongside some other comorbidity. Furthermore, many top comorbidities have already been recorded in our `comorbidity_indices` set. Hence, let’s execute the following heuristic. We’ll output only those remaining specialties that contain a term which is co-mentioned with a recorded comorbidity. Afterwards, we’ll examine these specialties by hand, and add additional comorbidities to our set.
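
The co-mention check boils down to a set intersection between a term's sentences and the sentences of already-recorded comorbidities. Here is a minimal sketch, with invented sentence IDs and term names standing in for the real data:

```python
# Hypothetical sentence IDs that mention already-recorded comorbidities.
known_comorbid_sentences = {'s1', 's2', 's3'}

# Hypothetical sentences for three not-yet-reviewed terms.
candidate_sentences = {'Cancer': ['s2', 's9'],
                       'Influenza': ['s7'],
                       'Cirrhosis': ['s3']}

# A term is flagged for manual review if it shares at least one sentence
# with a recorded comorbidity.
flagged = [name for name, sentences in candidate_sentences.items()
           if known_comorbid_sentences & set(sentences)]
print(flagged)
```

The intersection only narrows down what we review by hand. A flagged term still needs a manual check before it is recorded as a comorbidity.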

In [None]:
from itertools import chain
comorbid_names = df_not_symptom[df_not_symptom.index.isin(comorbidity_indices)].Name.values
# A set of sentences that mention a comorbidity.
comorbid_sentences = set((chain.from_iterable([aggregated_sentences[name] 
                                               for name in comorbid_names])))
for specialty, count in specialties.most_common()[7:]:
    # We iterate over the remaining 40 specialties.
    df_tmp = df_not_symptom[df_not_symptom.Specialties.isin([specialty])]
    for name in df_tmp.Name.values:
        if comorbid_sentences & set(aggregated_sentences[name]):
            # The term in the specialty is mentioned in a sentence that also
            # mentions a comorbidity.
            display_specialty(specialty)
            break


Cancer (14), Lung Cancer (72), Cerebrovascular disease (18),  Cerebral infarction (17), Cirrhosis (23), Acute Pancreatitis (2), Coronary Artery Disease (30), Tuberculosis (99), Immunodeficiency (61), and Kidney Disease (65) are all additional comorbidities that we have missed. Let’s add them to `comorbidity_indices`.

In [None]:
comorbidity_indices.update([14, 72, 18, 17, 23, 2, 30, 99, 61, 65])
print(f"We uncovered {len(comorbidity_indices)} total comorbidities")

We’ve uncovered 19 total comorbidities. Let’s transfer these comorbidities to a separate `df_comorbidity` DataFrame. We'll then print and review the comorbidities.

In [None]:
is_comorbid = df_not_symptom.index.isin(comorbidity_indices)
df_comorbidity = df_not_symptom[is_comorbid]
# `sort_values` returns a sorted copy, so we assign the result back.
df_comorbidity = df_comorbidity.sort_values(['Count', 'Specialties'], ascending=False)
df_comorbidity

Our comorbidities cover diabetes, heart issues, lung issues, and kidney / liver issues. They also include cancer, obesity and immunodeficiencies. Co-infection by tuberculosis also makes our list. With comorbidities in place, our knowledge base is nearly complete. We just need to update our `df_symptom` DataFrame with the newly-discovered symptoms in `symptom_indices`.

In [None]:
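# `DataFrame.append` was removed in pandas 2.0, so we concatenate instead. Label-based
# selection also expects a list rather than a set.
new_symptom_indices = sorted(symptom_indices)
df_symptom = pd.concat([df_symptom, df_not_symptom.loc[new_symptom_indices]])
df_not_symptom = df_not_symptom.drop(index=new_symptom_indices)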
print(f"We uncovered {df_symptom.shape[0]} total symptoms")

We’ve uncovered 25 total symptoms, as well as 19 total comorbidities. We will assume that all remaining medical terms are neither symptoms nor comorbidities. Let’s shift these terms to a `df_other` DataFrame.

In [None]:

not_comorbid = ~df_not_symptom.index.isin(comorbidity_indices)
df_other = df_not_symptom[not_comorbid]

total = df_medical.shape[0]
percent_other = int(100 * df_other.shape[0] / total)
print(f"{100 - percent_other}% of our medical terms are symptoms or comorbidities.")
print(f"The remaining {percent_other}% of terms fall into the 'Other' category.")

Nearly 40% of medical terms have been effectively categorized as either symptoms or comorbidities. Our knowledge base is ready! Now, we’ll leverage that knowledge base to build an intelligent search tool. The search will allow us to probe for critical risk factors within our data. It will yield insights that will allow us to construct a risk prediction model.

# 7. Building a KB-Powered Search Tool to Probe for Risks


We want to construct a search tool that will rank matched texts by matches to comorbidities, symptoms, and other medical terms. Furthermore, we want the tool to color all such matches within the text by category type. With this in mind, we’ll construct three large-scale regular expressions to match against our three DataFrames, `df_comorbidity`, `df_symptom`, and `df_other`. Additionally, we’ll include a fourth regular expression that detects valuable percentage matches. The regular expressions will be leveraged by a `mark_matches` function, which will color-code all matched terms within a text. The function will return a marked HTML string. It will also return a count of the matches found by each of our four regular expressions. Later, we will use these counts as our ranking criteria within the search tool.


In [None]:
def aliases_to_regex(df):
    # Converts all aliases within a medical DataFrame into a regular expression.
    patterns = set(np.hstack(df.Aliases.values))
    patterns.update([name.lower() for name in df.Name.values])
    patterns = sorted(patterns, key=len, reverse=True)
    return compile_pattern_list(patterns)

# Our regular expressions match comorbidities, symptoms, other medical terms, and also percentages.
regex_list = [aliases_to_regex(df)
              for df in [df_comorbidity, df_symptom, df_other]] + [regex_percent]

# Markup colors are assigned to each match type.
colors = ['blue', 'green', 'brown', 'black']
format_kwargs = [{'color': c} for c in colors]
# Percentages will be rendered in bold in the marked-up text.
format_kwargs[-1]['bold'] = True

def mark_matches(text):
    """An input text is converted into a marked-up HTML string, based on
    terminology matches. All match counts are also returned.
    
    Parameters
    ----------
    text: str
        The text we match against.
        
    Returns
    -------
    marked_text: str
        A marked-up version of the text based on regex matches.
        
    match_counts: [int, int, int, int]
        A list of four match counts, one for each regex in `regex_list`.
    """
    match_counts = []
    for regex, kwargs in zip(regex_list, format_kwargs):
       
        match_counts.append(len(regex.findall(text)))
        text = add_markup(regex, text, **kwargs)
    
    return text, match_counts
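
The helpers `compile_pattern_list` and `add_markup` are defined outside this section. As a rough illustration of the idea only, the sketch below uses hypothetical stand-ins (`sketch_compile_patterns` and `sketch_add_markup`, neither of which is part of the notebook). They join escaped terms into a single case-insensitive alternation, longest term first, and wrap every match in a colored `<span>`. The notebook's actual helpers may behave differently.

```python
import re

def sketch_compile_patterns(terms):
    # Hypothetical stand-in for `compile_pattern_list`.
    # Longest terms come first, so "diabetes mellitus" is preferred over "diabetes".
    ordered = sorted(set(terms), key=len, reverse=True)
    alternation = '|'.join(re.escape(term) for term in ordered)
    return re.compile(rf'\b(?:{alternation})\b', flags=re.IGNORECASE)

def sketch_add_markup(regex, text, color='black', bold=False):
    # Hypothetical stand-in for `add_markup`: wraps each match in a styled span.
    weight = 'bold' if bold else 'normal'

    def wrap(match):
        return (f'<span style="color:{color};font-weight:{weight}">'
                f'{match.group(0)}</span>')

    return regex.sub(wrap, text)

toy_regex = sketch_compile_patterns(['diabetes', 'diabetes mellitus', 'hypertension'])
sentence = 'Diabetes mellitus and hypertension were common.'
matches = toy_regex.findall(sentence)
marked = sketch_add_markup(toy_regex, sentence, color='blue')
print(matches)
print(marked)
```

Sorting by length matters whenever one alias is a prefix of another. Without it, the shorter alias could match first and leave the remainder of the longer term unmarked.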


Let’s test out `mark_matches` by applying it to a sample _Diabetes_ sentence. We expect the function to match at least one comorbidity (Diabetes) and at least one percentage.

In [None]:
text = aggregated_sentences['Diabetes'][0]
marked_text, match_counts = mark_matches(text)
for count, name in zip(match_counts, ['comorbidities', 'symptoms', 'other terms', 'percentages']):
    print(f"The text matches {count} {name}.")

print('\nDisplaying the marked-up text:')
display(HTML(marked_text))

Our function works! We’ll now utilize it to create a ranked-search tool. Basically, we’ll define a `ranked_search` function. The function will take as input a regular expression that will be matched against all 97.6K clinically relevant sentences. All matching results will subsequently be ranked in the following manner:

1. Abstract sentences will take precedence over non-abstract sentences, since clinically critical sentences are more likely to wind up in the abstract.
2. Sentences with a diversity of matched categories (symptoms, comorbidities, percentages, etc.) will take precedence over sentences with just a single type of matched category (symptoms only, etc.). Knowledge diversity is more interesting, since it allows us to examine the interplay between comorbidities, measurements, etc.
3. Sentences with a higher total match count will take precedence over sentences with a lower total match count.

Our top-ranking sentences will be displayed using HTML. All medical-term matches will be marked appropriately, by color category. Additionally, our query term will be marked in red within the output.


In [None]:
def ranked_search(query_string, results_per_page=7, page=1):
    """A ranked search tool for extracting relevant sentences.
    The top-ranked matches are displayed as HTML.
    
    Parameters
    ----------
    query_string: str
        Our query string that's used to match the sentences. This string
        can be a regular expression.
    results_per_page (optional): int
        The number of results to display within a page of output.
    page (optional): int
        The page-number. It allows us to flip through multiple 
        pages of results.
    """
    regex = compile_(query_string)
    matches = [s for s in all_sentences if regex.search(s)]
    # This dictionary maps matched sentences to a tuple that is used for ranking purposes.
    matches_to_ranking = {}
    for match in matches:
        # All matches to the query are marked in red.
        marked_match = add_markup(regex, match, color='red', bold=True)
        marked_match, match_counts = mark_matches(marked_match)
        # The number of diverse categorical matches.
        num_matches = len([count for count in match_counts if count])
        # The total number of matches.
        num_total_matches = sum(match_counts)
        # Assigns each match a tuple, for ranking purposes.
        matches_to_ranking[marked_match] = (match in abstract_sentences, num_matches,
                                            num_total_matches)
        
    # Ranks matches by importance
    sorted_matches = [m[0] for m in sorted(matches_to_ranking.items(),
                                           key=lambda x: x[1], reverse=True)]
    start = (page - 1) * results_per_page
    end = start + results_per_page
    # Displays the top results using HTML.
    html = '<br>'.join(sorted_matches[start: end])
    display(HTML(html))
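
The ranking inside `ranked_search` relies on Python's element-by-element tuple comparison. The short sketch below illustrates that behavior with made-up sentences and counts (the `toy_rankings` values are hypothetical, not taken from the data).

```python
# Hypothetical ranking tuples: (is_abstract_sentence, matched_categories, total_matches).
toy_rankings = {
    'sentence A': (False, 4, 9),
    'sentence B': (True, 1, 1),
    'sentence C': (True, 3, 3),
    'sentence D': (True, 3, 5),
}

# Tuples compare element by element. With reverse=True, the abstract flag
# dominates, then the category diversity, then the total match count.
ranked = [sentence for sentence, _ in
          sorted(toy_rankings.items(), key=lambda item: item[1], reverse=True)]
print(ranked)
```

Note that sentence A ranks last despite having the most matches, because it does not come from an abstract.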

Let’s take our `ranked_search` function for a spin! We’ll start by searching for _risk factors_.

In [None]:
ranked_search('risk factors')

The query yields some valuable insights. Symptoms such as _dyspnea_, _chest pain_, and _cough_ are risk factors for COVID-19 pneumonia. So is age, as well as underlying comorbidities (such as _hypertension_ and _diabetes_). Additionally, we’ve learned that _asthma_ and _COPD_ are not risk factors for SARS-CoV-2 infection. That’s good to know!

Next, we’ll search for pregnancy and prenatal-related sentences.

In [None]:
ranked_search(r'(pregnan(t|cy)|pre-?natal)')

Well, we’ve gained some insight into how pregnant women react to other types of respiratory diseases. For instance, in 2009, pregnant women accounted for 5% of all H1N1-related deaths. Also, we learned that previous coronavirus outbreaks (SARS-CoV / MERS) were known to be responsible for severe complications during pregnancy. However, we learned little about how pregnant women respond to SARS-CoV-2. Perhaps the next page of results will yield greater insights.

In [None]:
ranked_search(r'(pregnan(t|cy)|pre-?natal)', page=2)

It seems that pregnancy complications (including fetal distress) have been observed in multiple pregnant women after the onset of COVID-19. Fetal distress and premature membrane ruptures can happen if the infection occurs during the third trimester of pregnancy. However, on a more positive note, there is no evidence of intrauterine infection in pregnant women. Thus, based on these results, it appears unlikely that pregnant women can pass on the virus to the fetus.

Now, let’s search for _coinfections_.

In [None]:
ranked_search(r'co-?infection')

We learned that MTB (Mycobacterium tuberculosis) co-infection is linked with disease severity. Also, coinfection is common in pediatric patients. Thus, coronavirus-infected children are potentially at risk for other infections (even though generally, the effects of the virus are less severe in children).

We’ll proceed to search for comorbidities.

In [None]:
ranked_search(r'co-?morbidities')

72.2% of patients treated in the ICU had underlying comorbidities, compared with 37.3% of patients who did not require intensive care. Such recorded percentages are quite valuable. Essentially, they serve as conditional probabilities of disease features given latent outcomes. The probability of a comorbidity given a severe (ICU-worthy) outcome is 0.722. Meanwhile, the probability of a comorbidity given a milder outcome is only 0.373. These conditional values can allow us to build a Bayesian model. Hence, the values are worth recording. Also, there are other conditional percentages within our output. Patients with severe progression are more likely to have dyspnea (63.9% vs 19.6%) and anorexia (66.7% vs 30.4%). Let’s store all these percentages within a `conditionals` dictionary. Later, we will utilize that dictionary for modeling purposes.


In [None]:
conditionals = {'comorbidities': (72.2, 37.3), 'dyspnea': (63.9, 19.6), 'anorexia': (66.7, 30.4)}

Let’s extract more progression stats. We’ll now search for _progression group_.

In [None]:
ranked_search(r'progression group')

27.3% of patients in the progression group have a history of smoking, compared with 3% of patients in the stable group. We’ve just obtained another conditional percentage: **Smoking: 27.3% vs 3%**. Let’s add it to our `conditionals` dictionary.

_NOTE: Not all our conditional percentages are equal. Different measurements correspond to different levels of disease progression / severity. Thus, any model built around these values serves as a baseline effort, not as a rigorous statistical analysis_.


In [None]:
conditionals['smoking'] = (27.3, 3.0)

Now, let’s examine patient mortality stats. We’ll search for _non-survivor_.

In [None]:
ranked_search(r'non-?survivor')

We’ve obtained two additional conditional variables relative to non-survivors / survivors. These variables include _disseminated intravascular coagulation_ (71.4% vs 0.6%) and _BMI > 25 kg / m2_ (88.24% vs 18.95%). Should we add these conditionals to our dictionary? Well, it depends. Our previous conditionals all reflect disease progression. These conditionals reflect death, the ultimate end-point of disease progression. It’s difficult for us to compare death and disease progression directly. Nonetheless, if we treat these numbers as estimates, we can cautiously construct a baseline model. The model won’t be based on rigorous statistics. However, it will provide some baseline predictive power that otherwise would not be available. Hence, we’ll add these two conditionals to our dictionary.

In [None]:
conditionals.update({'disseminated intravascular coagulation': (71.4, 0.6),
                     'BMI>25': (88.24, 18.95)})

Now, we’ll run a final search to get some stats on patients who have stabilized.

In [None]:
ranked_search(r'\bstabilized\b')

According to one article, 85.9% of observed patients improved or stabilized, while the remaining 14.1% deteriorated. We’ll treat 14.1% and 85.9% as our prior probabilities of deterioration and stabilization, respectively.

In [None]:
prior_probs = [0.141, 0.859]

We now have prior probabilities, in addition to conditional probabilities of observed variables (given deterioration / stabilization). Let’s quickly output these conditional probabilities for review.

In [None]:
for condition, tup in conditionals.items():
    prob_a, prob_b = np.array(tup) / 100
    print(f'P({condition} | Deterioration) = {prob_a:.3f}')
    print(f'P({condition} | Stabilization) = {prob_b:.3f}')
    print()
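
To see how a prior and a pair of conditionals combine, here is a minimal worked example of Bayes' rule for a single feature. It uses only the comorbidity percentages and the priors quoted above; the variable names are introduced here purely for illustration.

```python
# Values quoted in the text above.
p_deteriorate = 0.141
p_stabilize = 0.859
p_comorbid_given_deteriorate = 0.722
p_comorbid_given_stabilize = 0.373

# Bayes' rule:
# P(D | C) = P(C | D) * P(D) / [P(C | D) * P(D) + P(C | S) * P(S)]
numerator = p_comorbid_given_deteriorate * p_deteriorate
evidence = numerator + p_comorbid_given_stabilize * p_stabilize
posterior = numerator / evidence
print(f"P(Deterioration | comorbidities) = {posterior:.3f}")
```

Under these estimates, observing a comorbidity raises the probability of deterioration from the 14.1% prior to roughly 24%. Stabilization remains the more likely outcome for that single feature.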

Given these probabilities, we can now construct a simple Bayes classifier for predicting deterioration / stabilization.

# 8. Building a Classifier to Predict Covid-19 Progression

Our classifier will be a simple Naive Bayes model. The model will assume that all input features are independent of each other. Clearly, that is not the case. For instance, anorexia and BMI are two features that are closely related. Nonetheless, it has been shown that Naive Bayes can work [surprisingly well](http://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf), even when some features are interdependent. Hence, despite our vast oversimplifications, the model should provide some useful baseline insights.

We’ll now define a `naive_bayes` function in a few lines of code, following the standard [Naive Bayes math](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Constructing_a_classifier_from_the_probability_model). Our function will take a vector of 6 binary features, corresponding to the keys within `conditionals`. It will then return a binary value demarcating whether a patient is at risk for COVID-19 progression.

In [None]:
# A matrix of conditional probabilities. Each column corresponds to a feature in `conditionals`.
# The first row corresponds to the prob(feature | deterioration). The second row corresponds
# to prob(feature | stabilization)
conditional_matrix = np.vstack([np.array(tup) / 100 
                                for tup in conditionals.values()]).T
# We'll use Maximum a posteriori estimation, so taking the Log of the
# conditional probabilities will be sufficient.
log_conditional_matrix = np.log(conditional_matrix)

def naive_bayes(feature_vector):
    results = log_conditional_matrix @ feature_vector + np.log(prior_probs)
    # Returns 1 if the posterior score of deterioration (index 0) is the larger one, and 0 otherwise.
    return 1 - results.argmax()

assert naive_bayes([0] * 6) == 0
assert naive_bayes([1] * 6) == 1
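
Note that `naive_bayes` above sums log-likelihoods only over the features that are present. A Bernoulli-style formulation would also score the absence of a feature through `log(1 - p)`. The sketch below shows that variant for comparison. It is not the notebook's method; the function name `bernoulli_naive_bayes` and the arrays `cond` and `priors` are introduced here for illustration, with values copied from `conditionals` and `prior_probs`.

```python
import numpy as np

# Same percentages as `conditionals`, in the same key order:
# comorbidities, dyspnea, anorexia, smoking, DIC, BMI>25.
# Row 0 holds P(feature | deterioration); row 1 holds P(feature | stabilization).
cond = np.array([[72.2, 63.9, 66.7, 27.3, 71.4, 88.24],
                 [37.3, 19.6, 30.4, 3.0, 0.6, 18.95]]) / 100
priors = np.array([0.141, 0.859])

def bernoulli_naive_bayes(feature_vector):
    # Hypothetical variant: present features add log(p), absent ones add log(1 - p).
    x = np.asarray(feature_vector)
    log_scores = (np.log(cond) @ x
                  + np.log(1 - cond) @ (1 - x)
                  + np.log(priors))
    # Returns 1 if deterioration (index 0) has the larger score, and 0 otherwise.
    return 1 - int(log_scores.argmax())

print(bernoulli_naive_bayes([1, 0, 0, 0, 0, 0]))
print(bernoulli_naive_bayes([1] * 6))
```

Because absent features now count as evidence, this variant can disagree with `naive_bayes` on borderline inputs. It also treats every unmentioned feature as a confirmed negative, which is a strong assumption for free-text clinical notes.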

Suppose a coronavirus patient has diabetes, which is a comorbidity. Is that patient more likely to progress or to stabilize? Let’s find out!

In [None]:
features = np.array([1] + 5 * [0])
phrase = 'deteriorate' if naive_bayes(features) else 'stabilize'
print(f"The patient is more likely to {phrase}.")

The patient is more likely to stabilize. However, now suppose the patient begins to show dyspnea (shortness of breath). How will that affect the patient’s prognosis? We’ll check below.

In [None]:
features[1] = 1
phrase = 'deteriorate' if naive_bayes(features) else 'stabilize'
print(f"The patient is more likely to {phrase}.")

The patient is now more likely to deteriorate. Our model has some rudimentary predictive power. However, extracting binary features for the model is a bit inconvenient. We have to construct the binary features by hand, while remembering each feature’s vector index. It would be more convenient if we could extract the features automatically, from a textual description of the patient. For instance, suppose we encounter the following patient description in our records:

_The patient suffers from diabetes and is a chronic smoker. He has a BMI of 28_.

How do we extract features from that text? Well, extracting the comorbidity feature is easy! We simply need to match our comorbidity regular expression to the text!

In [None]:
text = "The patient suffers from diabetes and is a chronic smoker. He has a BMI of 28."
comorbidity_regex = aliases_to_regex(df_comorbidity)
assert comorbidity_regex.search(text) is not None

In that same manner, we can construct text-matching regular expressions for the subsequent four features in our model. However, as a caveat, we should note that these regexes will not cover negation (e.g., “no history of smoking”). Still, as a baseline, this solution ought to be sufficient.

In [None]:
smoking_regex = compile_(r'\bsmok(er|es|ed|ing)\b')
# We include all aliases of the various terms, such as "Shortness of Breath", an alias of dyspnea.
tup = [aliases_to_regex(df_medical[[term in aliases 
                                    for aliases in df_medical.Aliases.values]])
       for term in ['dyspnea', 'anorexia', 'disseminated intravascular coagulation']]

dyspnea_regex, anorexia_regex, coagulation_regex = tup
# The regexes follow the key order of `conditionals`, since `naive_bayes` relies on positional indices.
feature_regex_list = [comorbidity_regex, dyspnea_regex, anorexia_regex, smoking_regex, coagulation_regex]

def features_from_regex(text):
    return [int(regex.search(text) is not None) for regex in feature_regex_list]

print(features_from_regex(text))
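
The negation caveat mentioned above could be softened with a simple window check. The sketch below is one possible approach, not part of the notebook: the cue list `NEGATION_CUES`, the function `matches_without_negation`, and the four-word window are all assumptions chosen for illustration.

```python
import re

# Hypothetical list of negation cues; clinical text would need a fuller one.
NEGATION_CUES = re.compile(r'\b(?:no|not|never|denies|denied|without)\b',
                           flags=re.IGNORECASE)

def matches_without_negation(regex, text, window=4):
    # True if `regex` matches somewhere with no negation cue among the
    # `window` words that precede the match within the same sentence.
    for match in regex.finditer(text):
        # Crude sentence boundary: the last period before the match.
        sentence_start = text.rfind('.', 0, match.start()) + 1
        preceding = text[sentence_start:match.start()].split()[-window:]
        if not NEGATION_CUES.search(' '.join(preceding)):
            return True
    return False

smoking = re.compile(r'\bsmok(?:er|es|ed|ing)\b', flags=re.IGNORECASE)
print(matches_without_negation(smoking, 'He is a chronic smoker.'))
print(matches_without_negation(smoking, 'She has no history of smoking.'))
```

This remains a heuristic. It will miss negations that follow the term or sit outside the window, and it treats every period as a sentence boundary.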

Obtaining the final feature is trickier. We need a way to extract the BMI. One solution is to just ask for it! Inputting “What is the patient’s BMI?” into `answer_question` will yield the BMI information.

In [None]:
answer_question('What is the patient\'s BMI?', text)

Hence, we can define an `is_large_bmi` function, which extracts the final binary feature from the text.

In [None]:
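def is_large_bmi(text):
    # Returns 1 if a BMI greater than 25 is mentioned in the text, and 0 otherwise.
    answer = answer_question('What is the patient\'s BMI?', text)
    # The number is parsed from the extracted answer rather than from the full text,
    # so that unrelated numbers (such as an age) are not mistaken for the BMI.
    # We assume here that `answer_question` returns the answer span.
    match = re.search(r'[0-9]+(\.[0-9]+)?', str(answer))
    if match:
        return int(float(match.group()) > 25)

    return 0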

assert is_large_bmi(text) == 1
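
If the question-answering step is unavailable, a plain regular expression can serve as a rough fallback. The sketch below is an assumption-laden alternative, not the notebook's approach: the pattern `BMI_PATTERN` and the helper `bmi_from_text` are hypothetical, and the pattern simply looks for a number shortly after the word "BMI".

```python
import re

# Hypothetical pattern: the word "BMI", up to 15 non-digit characters, then a number.
BMI_PATTERN = re.compile(r'\bBMI\b\D{0,15}?(\d+(?:\.\d+)?)', flags=re.IGNORECASE)

def bmi_from_text(text):
    # Returns the first BMI value mentioned in the text as a float, or None.
    match = BMI_PATTERN.search(text)
    return float(match.group(1)) if match else None

print(bmi_from_text("He has a BMI of 28."))
print(bmi_from_text("Aged 64, BMI: 25.5 kg/m2."))
print(bmi_from_text("No measurements were recorded."))
```

Unlike the original integer-only match, this keeps decimals, so a BMI of 25.5 is correctly treated as greater than 25.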

Now, we have all the pieces required to define a `predict_risk` function. Given a textual description of a patient, the function will extract all features from the text. Subsequently, it will process the feature vector with our predefined `naive_bayes` classifier.

In [None]:
def predict_risk(text):
    # Predicts a patient's risk of deterioration from a textual description.
    features = features_from_regex(text) 
    # Please note, we are assuming that a BMI description is always included in the text.
    features.append(is_large_bmi(text))
    return naive_bayes(features)

phrase = 'deteriorate' if predict_risk(text) else 'stabilize'
print(f"The patient is more likely to {phrase}.")

Based on the textual description of the patient, the model flags them as at risk of deterioration. Thus, our baseline model allows us to prioritize high-risk patients, using just recorded clinical notes.
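
Because `naive_bayes` relies on positional indices, the extracted features must follow the key order of `conditionals`. One way to guard against ordering mistakes is to key each extractor by its feature name and build the vector from a single list of names. The sketch below illustrates this with heavily simplified, hypothetical patterns (the `extractors` dictionary and `features_by_name` function are not part of the notebook, and the real alias-based regexes are far broader).

```python
import re

# Feature names in the same order as the keys of `conditionals`.
feature_names = ['comorbidities', 'dyspnea', 'anorexia', 'smoking',
                 'disseminated intravascular coagulation', 'BMI>25']

def has(pattern):
    # Builds a case-insensitive "does the text mention this?" check.
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return lambda text: regex.search(text) is not None

def bmi_above_25(text):
    # Hypothetical check: any number shortly after "BMI" that exceeds 25.
    values = re.findall(r'\bBMI\D{0,15}?(\d+(?:\.\d+)?)', text, re.IGNORECASE)
    return any(float(value) > 25 for value in values)

# Hypothetical, simplified extractors. The dictionary order is irrelevant.
extractors = {
    'smoking': has(r'\bsmok(?:er|es|ed|ing)\b'),
    'comorbidities': has(r'\b(?:diabetes|hypertension)\b'),
    'BMI>25': bmi_above_25,
    'dyspnea': has(r'\b(?:dyspnea|shortness of breath)\b'),
    'anorexia': has(r'\banorexia\b'),
    'disseminated intravascular coagulation':
        has(r'\bdisseminated intravascular coagulation\b'),
}

def features_by_name(text):
    # The vector order is driven by `feature_names`, not by the dictionary.
    return [int(extractors[name](text)) for name in feature_names]

note = "The patient suffers from diabetes and is a chronic smoker. He has a BMI of 28."
print(features_by_name(note))
```

With this layout, adding or reordering a feature only requires updating `feature_names` to mirror `conditionals`.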