# COVID-19 Risk Factors: Diseases, viruses and bacteria mentioned in relevant literature

![osthus.png](attachment:osthus.png)

## OSTHUS Team Members


|Name|Email|Kaggle User ID| Kaggle User Name|
|---|---|---|---|
|Arne Balzer| arne.balzer(a)osthus.com| 4786361|arnebalzer|
|Nikhil Damle |nikhil.damle(a)osthus.com | 4786390|nikhildamle|
|Jing Guo |jing.guo(a)osthus.com| 4896719 | osthusjingguo|
|Ning Meng | ning.meng(a)osthus.com| 4893125| ningmengosthus|
|Sujit Kumar  |sujit.kumar(a)osthus.com| 4783844|de00215|
|Karen Schomburg| karen.schomburg(a)osthus.com| 4786990|drkarenschomburg|
|Chuan-Lu Yu| chuan-lu.yu(a)osthus.com| 4787352|alexchuanluyu| 

## Problem Statement: What do we know about COVID-19 risk factors?

We investigated what the literature reports about potential risk factors for COVID-19. Specifically, we asked whether the literature indicates that co-infections with other pathogens and co-existing health conditions increase the risk of coronavirus infection. We also developed statistical approaches to assess the extent of that risk.
    

## Abstract

COVID-19 is one of the deadliest infectious diseases of recent times, with ~2 million positive cases and >100,000 reported deaths worldwide. In the absence of a vaccine, social distancing is the only way to contain its spread. Vaccine development is therefore the need of the hour, and understanding the risk factors leading to the viral infection is a key step in that direction. We focused on two potential risk factors: co-infection with other viruses and bacteria, and co-existing health conditions or diseases. Using ontologies for viruses, bacteria and diseases, we mined the abstracts of the literature records for simultaneous mentions of ontology terms and COVID-19, and assessed their risk potential using two statistical metrics. In the co-occurrence analysis, HIV and Escherichia coli were ranked as the top co-infecting viral and bacterial pathogens, respectively, whereas the disease "Feline infectious peritonitis and pleuritis" was ranked as the co-existing health condition that poses the highest risk of coronavirus infection. An assessment of statistical probability reveals that, among others, the viruses “mastadenovirus” and “cardiovirus”, the bacteria “bordetella” and “gemella”, and the diseases “pancreatitis” and “myocarditis” are enriched within the COVID-19 related literature.
These analyses may be extended or improved by including the semantic context and/or by adopting more sophisticated machine-learning and statistical approaches to prioritize these risk factors and to integrate them in a weighted manner with other potential risk factors.

## Introduction


### Utilizing semantic links for risk factor identification

Relatively early in the COVID-19 epidemic it was observed that patients show drastically different disease outcomes. One of the first identified risk factors was patient age, but the specific risk factors remain an active field of study. In our submission, we investigate the impact of co-infections with other viruses and bacteria, as well as the risk posed by other diseases. For the analysis we take a semantic approach and draw on ontologies: to identify viruses and bacteria, we use the SNMI ontology (Cote, Roger A., editor. Systematized Nomenclature of Human and Veterinary Medicine: SNOMED International. Northfield (IL): College of American Pathologists; Schaumburg (IL): American Veterinary Medical Association, Version 3.5, 1998.), available here (https://www.kaggle.com/arnebalzer/snmi-disease-vocabularyjson under UMLS License). Using the ontology, we can search not only for specific terms but also for their listed synonyms. The ontology also provides a hierarchy that allows clustering, e.g. grouping all respiratory malfunctions in the class “disease of respiratory system”. Furthermore, our analysis can go beyond counting term occurrences in the literature records: the ontology contains links between viruses, bacteria and diseases. For example, if we find the term “adenovirus” (CUI: C000148), the vocabulary links us to the diseases “adenoviral meningitis” (CUI: C0276160), “adenoviral myocarditis” (CUI: C0276163), “adenoviral respiratory disease” (CUI: C0276150) and others. These semantic links can be exploited for analysis beyond pure text parsing.
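
The semantic links described above can be followed programmatically. The sketch below assumes vocabulary entries shaped like the JSON structure shown in the section "Create Id Map for All Vocabularies" (`ID`, `properties.label`, `properties.associated`, `properties.parents`). The two toy entries, their placeholder IDs and the helper name `linked_labels` are made up for illustration and are not taken from the SNMI files.

```python
# Toy vocabulary entries in the same shape as the vocabulary JSON files.
# The IDs and labels are placeholders, not real ontology entries.
virus_vocab = {
    "virus:1": {
        "ID": "virus:1",
        "properties": {
            "label": "Example virus",
            "synonyms": ["Example virus type 1"],
            "associated": "disease:1",
            "parents": "virus:0",
        },
    },
}
disease_vocab = {
    "disease:1": {
        "ID": "disease:1",
        "properties": {"label": "Example viral disease", "synonyms": [], "parents": "disease:0"},
    },
}


def linked_labels(term_id, source_vocab, target_vocab, link="associated"):
    """Return the labels of target-vocabulary entries that a term links to."""
    links = source_vocab[term_id]["properties"].get(link, [])
    # The link may be stored as a single ID or as a list of IDs.
    if isinstance(links, str):
        links = [links]
    return [target_vocab[i]["properties"]["label"] for i in links if i in target_vocab]


print(linked_labels("virus:1", virus_vocab, disease_vocab))
```

Following the `parents` link in the same way would give the hierarchy needed for clustering terms into broader classes.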

### Our Approach

We took the following steps (see the code sections for details):
*	Filtering all literature records for occurrences of Coronavirus and its synonyms (from here on called the “corona-subset”)
*	Counting virus, bacteria and disease terms, including their synonyms, in the abstracts of the literature records
*	Assessing the statistical relevance of the term occurrences by comparing the counts in the “corona-subset” with the counts in the complete literature, and assessing the probability of occurrence with a Bayesian analysis
* Reporting the resulting possible risk factors: co-infections with other viruses and bacteria, as well as diseases

### Results



#### Ontologies of viruses, bacteria and diseases reveal co-occurrence with coronavirus
* The detailed results of the occurrence analysis for viruses, bacteria and diseases in the corona-subset of the literature can be found in the section "Word Counts Based on Virus, Bacteria and Diseases Vocabularies".
* Of the complete literature provided, only the biorxiv_medrxiv and noncomm_use_subset datasets are used for the analysis (referred to as all-data).
* While the virus vocabulary contains 1379 terms with synonyms, only 112 are found in the corona-subset of the literature. The word cloud shows the terms found (main label, not synonyms), with size indicating the occurrence count. The bar chart shows the 20 terms found in the most literature records (the number of papers that mention the term or one of its synonyms is plotted on the x-axis). HIV is the most prominent term in this list, suggesting that an HIV co-infection might be a risk factor. Other frequently mentioned viruses, such as Ebola or Zika, are also interesting results.


![corona-viruses.png](https://i.imgur.com/eKNMf6U.png)



![bar-chart-viruses.png](https://i.imgur.com/Nx31PnQ.png)


* The bacteria vocabulary contains 4068 terms, of which 142 are found in the corona-subset of the literature. The word cloud shows the terms found (main label, not synonyms), with size indicating the occurrence count. The bar chart shows the 20 terms found in the most literature records (the number of papers that mention the term or one of its synonyms is plotted on the x-axis). Among the top terms are E. coli, mycoplasma and salmonella, suggesting that these could pose a risk factor.



![corona-bacteria.png](https://i.imgur.com/AE50cEC.png)

![bar-chart-bacteria.png](https://i.imgur.com/d5zffTy.png)


* The disease vocabulary contains 13310 terms with synonyms, of which 209 are found in the corona-subset of the literature. The word cloud shows the terms found (main label, not synonyms), with size indicating the occurrence count. The bar chart shows the 20 terms found in the most literature records (the number of papers that mention the term or one of its synonyms is plotted on the x-axis). It lists well-known symptoms of COVID-19 such as pneumonia, but interestingly also others such as hepatitis, which might indicate a risk factor.

![](https://i.imgur.com/eKNMf6U.png)

![](https://i.imgur.com/tdoJvhU.png)












#### Statistical analyses identify potential risk factors over-represented in the context of COVID-19 
    
For each vocabulary term, we computed the ratio of the probability with which it is mentioned in literature records that also mention COVID-19 (the “corona-subset” described above) to the probability with which it is mentioned in all literature records, irrespective of whether they mention COVID-19 (the expected probability). We call this ratio the “enrichment factor” (EF). EF > 1 means that the term is mentioned in COVID-19 related literature more often than expected (over-represented). EF is 0 for all terms that are not mentioned even once in COVID-19 related literature (under-represented). Of the 112 virus terms, 142 bacterial terms and 209 disease terms found in COVID-19 related literature, multiple entries in each vocabulary had EF > 1. The bar chart below shows the top 50 terms from all vocabularies combined, arranged in descending order of EF. It thus presents an overview of the risk factors most likely associated with COVID-19.
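
Written out, and assuming (as the code in the section "Statistical Evaluation" does) that each probability is estimated as a term count divided by the total number of abstract tokens, the enrichment factor of a term $t$ can be expressed as follows. The symbols $n_C(t)$, $N_C$, $n_L(t)$ and $N_L$ are notation introduced here for illustration.

$$
\mathrm{EF}(t) = \frac{P(t \mid C)}{P(t)} = \frac{n_C(t) / N_C}{n_L(t) / N_L}
$$

Here $n_C(t)$ and $n_L(t)$ are the counts of $t$ in the corona-subset and in all records, and $N_C$ and $N_L$ are the corresponding total token counts. As a made-up example, a term counted 5 times among 1,000 corona-subset tokens and 10 times among 10,000 tokens overall has $\mathrm{EF} = 0.005 / 0.001 = 5$.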

![barplot](https://i.imgur.com/EM4tM2l.png)



#### Bayes' theorem estimates the risk potential

For each vocabulary term, we also computed the posterior probability that a literature record mentions COVID-19 given that it already mentions the term. As expected, the term with the highest posterior probability of mentioning COVID-19 also had the highest enrichment factor (EF). These posterior probabilities follow a Gaussian distribution and may be used to extract the most relevant risk factors as those with posterior probabilities close to 1. For details, see the table in the section "Statistical Evaluation".
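
In the notation of Bayes' theorem, with $C$ denoting a COVID-19 mention and $t$ a vocabulary term (symbols chosen here for illustration), the posterior computed by the code in the section "Statistical Evaluation" corresponds to:

$$
P(C \mid t) = \frac{P(t \mid C)\, P(C)}{P(t)} = \mathrm{EF}(t) \cdot P(C)
$$

Since $P(C)$ is the same constant for every term, ranking terms by posterior is equivalent to ranking them by EF, which is why the top-ranked term coincides in both analyses.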

### Discussion

Our results and analysis give first hints as to where important risk factors might be found. However, the approach only scratches the surface of the topic, and researchers need to evaluate closely what it means when a term is enriched in the corona-literature. We see our analysis as an entry point into the topic and propose various ideas in the Outlook section on how to continue and deliver more value.

### Outlook

Our approach is groundwork that can lead to highly interesting analyses. Next steps include analyzing the term context, exploiting the semantic links and extending the NLP analyses:

While counting the co-occurrence of coronaviruses with other viruses, bacteria and diseases already hints at potential risk factors, analyzing the context of the identified terms would add further value, e.g. by finding preceding qualifiers like "pre-existing", "chronic" or "vaccine". We could also expand the analysis of the literature records from the abstracts alone to the body text or to specific sections of the publications, such as the results. Furthermore, the approach can easily be extended to other risk factors, such as socioeconomic factors or health status, simply by feeding it other vocabularies. The semantic links of the ontologies can also be explored further to allow conclusions that go beyond pure text analysis. Finally, more advanced NLP algorithms could be used to extract and annotate the context of the risk factors in more detail.

            

# Vocabulary Word Cloud Generation

This section generates word clouds of the vocabulary terms found in the corpus, one per vocabulary.

In [None]:
import spacy
import gzip, pickle
from tqdm import tqdm
import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import json

### Initialize vocabularies

In [None]:
vocabularies=[
    'viruses',
    'diseases',
    'bacteria'
]

### Define Function to Read Data
The data files are pickled and gzipped to save disk space; the function returns a pandas DataFrame.


#### Pre-Built Data Frame Structure
![df-example](df-example.png)


Example of reading the pre-built data for the virus vocabulary:
```python
df = read_df('./vocabulary-mentions/corona_with_viruses_mentioned_and_label_only.df.pklz')
print(json.dumps(df.viruses_mentions[69], indent=2))
```

The virus mention result is a list of matches, each with an id, text, and context.

* id is the term's unique URI from the ontology
* text is the matched word
* context is a list of the tokens before and after the matched text.

```json
[
  {
    "id": "http://purl.bioontology.org/ontology/SNMI/L-33500",
    "text": "Coronavirus",
    "context": [
      "East",
      "Respiratory",
      "Syndrome",
      "Coronavirus",
      "Antibodies",
      "Bactrian",
      "Hybrid"
    ]
  }
]
```

In [None]:
def read_df(file_name):
    with tqdm.wrapattr(gzip.open(file_name, 'rb'), "read", desc='read from ' +file_name) as file:
        return pickle.load(file)
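
For completeness, the inverse operation can be sketched as follows. This is a minimal round-trip example under the assumption that the pre-built files were produced with plain `gzip` and `pickle`; the helper names `write_pklz` and `read_pklz`, the temporary path and the toy data are hypothetical. A pandas DataFrame can be written the same way as the list used here.

```python
import gzip
import os
import pickle
import tempfile


def write_pklz(obj, file_name):
    """Pickle an object and gzip it (the counterpart of read_df)."""
    with gzip.open(file_name, "wb") as file:
        pickle.dump(obj, file, protocol=pickle.HIGHEST_PROTOCOL)


def read_pklz(file_name):
    """Plain reader without a progress bar, used for the round-trip check."""
    with gzip.open(file_name, "rb") as file:
        return pickle.load(file)


# Toy stand-in for one row of mentions (placeholder id, not a real ontology URI).
mentions = [{"id": "term:1", "text": "Coronavirus", "context": ["Syndrome", "Coronavirus"]}]
path = os.path.join(tempfile.mkdtemp(), "example.df.pklz")
write_pklz(mentions, path)
restored = read_pklz(path)
print(restored == mentions)
```

Note that unpickling executes code embedded in the file, so pickled data should only be loaded from trusted sources.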

### Create Id Map for All Vocabularies
Create an ID map so that synonyms can later be combined under their term's label.

The vocabulary data structure is shown below; the label and synonyms are used for matching in the corpus.
```json
{
  "ID": "http://purl.bioontology.org/ontology/SNMI/L-30605",
  "properties": {
    "label": "Human enterovirus 72",
    "synonyms": [
      "Hepatitis A virus",
      "Infectious hepatitis virus"
    ],
    "associated": "http://purl.bioontology.org/ontology/SNMI/DE-35101",
    "parents": "http://purl.bioontology.org/ontology/SNMI/L-30600"
  }
}
```

In [None]:
def get_vocabulary_id_map(v):
    id_map={}
    with open(f'../input/snmi-disease-vocabularyjson/{v}.json') as file:
        v_json = json.load(file)
        for key, value in v_json.items():
            id_map[value['ID']]=value
    return id_map

In [None]:
all_vocabulary_id_map={}
for v in vocabularies:
    all_vocabulary_id_map[v]=get_vocabulary_id_map(v)
    print('load vocabulary for', v, len(all_vocabulary_id_map[v].keys()))
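
To illustrate how such an ID map supports combining synonyms, the following sketch builds a lookup from every label and synonym to the term's preferred label. The function name `build_name_to_label` is hypothetical, and the toy entry is copied from the vocabulary structure shown above; the real maps would come from `all_vocabulary_id_map`.

```python
def build_name_to_label(id_map):
    """Map every lower-cased label and synonym to the term's preferred label."""
    name_to_label = {}
    for entry in id_map.values():
        props = entry["properties"]
        label = props["label"]
        for name in [label] + list(props.get("synonyms") or []):
            # setdefault keeps the first label seen if two terms share a name
            name_to_label.setdefault(name.lower(), label)
    return name_to_label


# Toy entry taken from the vocabulary structure shown above.
toy_id_map = {
    "http://purl.bioontology.org/ontology/SNMI/L-30605": {
        "ID": "http://purl.bioontology.org/ontology/SNMI/L-30605",
        "properties": {
            "label": "Human enterovirus 72",
            "synonyms": ["Hepatitis A virus", "Infectious hepatitis virus"],
        },
    }
}
lookup = build_name_to_label(toy_id_map)
print(lookup["hepatitis a virus"])
```

Lower-casing the keys is a design choice of this sketch; it makes the lookup case-insensitive at the cost of merging names that differ only in capitalization.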


### Define Function to Display Word Cloud

In [None]:
def wordcloud(df, name, title = None):
    # Set random seed to have reproducible results
    np.random.seed(64)
    
    wc = WordCloud(
        background_color="white",
        max_words=200,
        max_font_size=40,
        scale=4,
        random_state=0
    ).generate_from_frequencies(df)

    wc.recolor(color_func=wcolors)
    
    fig = plt.figure(1, figsize=(15,15))
    plt.axis('off')

    if title:
        fig.suptitle(title, fontsize=14)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wc)
    wc.to_file(name)
    plt.show()

def wcolors(word=None, font_size=None, position=None,  orientation=None, font_path=None, random_state=None):
    colors = ["#7e57c2", "#03a9f4", "#011ffd", "#ff9800", "#ff2079"]
    return np.random.choice(colors)

### Display Word Cloud for Vocabularies Mentioned

* Matched mentions are combined under the term's label (preferred label)
* Common terms such as Coronavirus, Virus, Disease and Bacterium are removed

* Result Saved As PNG Image
    - [Word Cloud For Viruses Mentions](./word-cloud/corona-viruses.png)
    - [Word Cloud For Diseases Mentions](./word-cloud/corona-diseases.png)
    - [Word Cloud For Bacteria Mentions](./word-cloud/corona-bacteria.png)


In [None]:
for vocabulary in vocabularies: 
    # load pre-built DataFrame
    df = read_df(f'../input/cord19precomputeddata/corona_mentioned_with_{vocabulary}_mentioned.df.pklz')
    # get id map for combining text matched
    id_map=all_vocabulary_id_map[f'{vocabulary}']
    # combine text matched to the label of the term.
    df[f'{vocabulary}_label_combined']=df[f'{vocabulary}_mentions'].apply(lambda mentions: [id_map[mention['id']]['properties']['label'] for mention in mentions])
    
    # concatenate all labels
    res = np.concatenate([labels for labels in df[f'{vocabulary}_label_combined']])

    # remove common terms for viruses, diseases and bacteria
    if vocabulary == 'viruses':
        res = [r for r in res if r != 'Coronavirus' and r != 'Virus' ]
    if vocabulary == 'diseases':
        res = [r for r in res if r != 'Disease']
    if vocabulary == 'bacteria':
        res = [r for r in res if r != 'Bacterium']
        
    freqs = pd.Series(res).value_counts()
    wordcloud(freqs, f'word-cloud/corona-{vocabulary}.png', f'Most frequent words for matched {vocabulary}')
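
The bar charts in the Results section plot, for each term, the number of papers that mention it. The code for those charts is not part of this notebook; a minimal sketch of the counting step, assuming per-paper label lists like the `{vocabulary}_label_combined` column built above, might look like this. The function name `papers_per_label` and the toy data are illustrative only.

```python
from collections import Counter


def papers_per_label(labels_per_paper, top_n=20):
    """Count in how many papers each label occurs (not how often it occurs)."""
    counts = Counter()
    for labels in labels_per_paper:
        # set() ensures a label mentioned several times in one paper counts once
        counts.update(set(labels))
    return counts.most_common(top_n)


# Toy data: one list of matched labels per paper.
toy_labels = [
    ["Influenza virus", "Influenza virus", "Rotavirus"],
    ["Influenza virus"],
    ["Rotavirus", "Norovirus"],
    ["Influenza virus"],
]
top = papers_per_label(toy_labels, top_n=2)
print(top)
```

The resulting list of `(label, paper_count)` pairs could then be passed to a horizontal bar plot, for example with `plt.barh`.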

# Word Counts Based on Virus, Bacteria and Diseases Vocabularies

In [None]:
# import required libraries
import json
import spacy
from unidecode import unidecode
import glob
import os
import sys
import pickle, gzip
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
nlp = spacy.load("en_core_web_sm")

In [None]:
# read all data files
all_data_files=glob.glob(f'data/biorxiv_medrxiv/biorxiv_medrxiv/pdf_json/*.json', recursive=True)
len(all_data_files)

In [None]:
# read all vocab files
all_vocab_files=glob.glob(f'Owncloud/CORD-19-Hackathon/V-team/vocabulary/*.json', recursive=True)
len(all_vocab_files)
print(all_vocab_files)

In [None]:
# get all labels from all vocab json
all_labels = []
for file in all_vocab_files:
    #print(file)
    with open(file) as json_file:
        vocab_data = json.load(json_file)
    # iterate over the entries directly: the keys are strings, so max(keys)
    # compares them lexicographically ("999" > "1378") and would truncate the list
    for entry in vocab_data.values():
        all_labels.append(entry["properties"]["label"])
print(len(all_labels), "labels loaded")

In [None]:
# remove unwanted labels from all_labels
unwanted_labels = {'coronavirus','Coronavirus','disease','Disease','bacteria','Bacteria','virus','Virus'}
removed_labels = [label for label in all_labels if label in unwanted_labels]
print("labels removed are - ", removed_labels)
new_all_labels = list(set(all_labels) - unwanted_labels)
print(new_all_labels)

In [None]:
# get all vocab labels with synonyms
synonyms_dict = {}
for file in all_vocab_files:
    print(file)
    with open(file) as json_file:
        vocab_data = json.load(json_file)
    # iterate over the entries directly instead of relying on max() of string keys,
    # which compares lexicographically and would skip entries
    for entry in vocab_data.values():
        synonyms_dict[entry["properties"]["label"]] = entry["properties"]["synonyms"]
print(len(synonyms_dict), "labels with synonyms loaded")

In [None]:
# pickles file generated locally
with gzip.open('NLP/mentions/all_data_corona_mentioned.pklz') as file:
    df=pickle.load(file)
df.head()

In [None]:
corona_mentioned_paper_ids = df['paper_id']
#print(corona_mentioned_paper_ids)
corona_mentioned_paper_ids_list = df['paper_id'].tolist()
print(corona_mentioned_paper_ids_list)

In [None]:
# list of all corona_mentioned_files
corona_mentioned_file_ids = []
corona_mentioned_files = []
corona_ids = set(corona_mentioned_paper_ids_list)  # set gives fast membership tests
for file in all_data_files:
    # the paper id is the file name without directory and extension;
    # the file itself does not need to be opened here
    file_name = os.path.basename(file).split(".")[0]
    if file_name in corona_ids:
        corona_mentioned_file_ids.append(file_name)
        corona_mentioned_files.append(file)
#print(corona_mentioned_files)
print("no. of corona_mentioned_files = ", len(corona_mentioned_files))
print("no. of all_data_files = ", len(all_data_files))

In [None]:
# occurrence of all labels in corona_mentioned_paper_ids_list

final_dict = {}
id_twc_dict = {}

for file in corona_mentioned_files:
    label_dict = {}
    with open(file) as json_file :
        file_name = file.split("/")[-1].split(".")[0]
        #print(file_name)
        data = json.load(json_file)
        # use a name other than `str` so the builtin is not shadowed
        text = ""
        try:
            for k in data['abstract']:
                text = text + k['text']
            doc = nlp(unidecode(text))
            #print (doc,"\n")
            
            new_all_labels = set(new_all_labels)
            #print(new_all_labels)
            for label in new_all_labels:
                counter=0
                for word in doc:
                    # count the token if it matches the label (case-insensitive)
                    # or one of the label's synonyms (exact match)
                    if word.text.lower() == label.lower() or any(word.text == lbl for lbl in synonyms_dict[label]):
                        counter = counter + 1
                if len(doc) > 0:
                    freq = (counter/len(doc))*100
                else:
                    freq = 0
                if counter > 0:
                    label_dict.update({label:counter})
                    print(label,"appears in",file_name,counter,"times",", i.e.(",freq,"%, as this text has",len(doc),"\"tokens\")")
                    id_twc_dict.update({file_name:len(doc)})
                    final_dict[file_name] = label_dict
        except Exception:
            # sys.exc_info is a function and must be called before indexing
            print(sys.exc_info()[0])

print(id_twc_dict)   
print(final_dict)

with open('total_word_count_in_corona_mentioned_files.json', 'w') as outfile:
    json.dump(id_twc_dict, outfile)

with open('all_labels_occurences_in_corona_mentioned_files.json', 'w') as outfile:
    json.dump(final_dict, outfile)
#print(final_dict)
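
Note that the loop above compares labels against single spaCy tokens, so a multi-word label such as "Human enterovirus 72" can never equal one token. One possible way around this, sketched here with a plain whitespace tokenizer instead of spaCy and not used for the results reported in this notebook, is to slide a window of the label's length over the token list. The function name `count_phrase` and the example sentence are made up for illustration.

```python
def count_phrase(tokens, phrase):
    """Count case-insensitive occurrences of a (possibly multi-word) phrase in a token list."""
    target = phrase.lower().split()
    if not target:
        return 0
    lowered = [t.lower() for t in tokens]
    n = len(target)
    # compare every window of n consecutive tokens with the phrase tokens
    return sum(lowered[i:i + n] == target for i in range(len(lowered) - n + 1))


tokens = "Antibodies against hepatitis A virus and Hepatitis A virus RNA were detected".split()
print(count_phrase(tokens, "Hepatitis A virus"))
print(count_phrase(tokens, "virus"))
```

spaCy's `PhraseMatcher` offers a comparable, more efficient mechanism for matching many multi-token terms at once.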

In [None]:
# pandas df for labels and occurences in corona_mentioned_files
labels_occurences_corona_df = pd.read_json (r'all_labels_occurences_in_corona_mentioned_files.json')
print(labels_occurences_corona_df)
labels_occurences_corona_df.to_csv('labels_occurences_corona_df.csv')

In [None]:
# occurrence of all labels in all research papers

final_dict = {}

#id_twc_df = pd.DataFrame(columns=['paper_id','total_word_count'])
id_twc_dict = {}

for file in all_data_files:
    label_dict = {}
    with open(file) as json_file :
        file_name = file.split("/")[-1].split(".")[0]
        #print(file_name)
        data = json.load(json_file)
        # use a name other than `str` so the builtin is not shadowed
        text = ""
        try:
            for k in data['abstract']:
                text = text + k['text']
            doc = nlp(unidecode(text))
            #print (doc,"\n")

            new_all_labels = set(new_all_labels)
            #print(new_all_labels)
            for label in new_all_labels:
                counter=0
                for word in doc:
                    # count the token if it matches the label (case-insensitive)
                    # or one of the label's synonyms (exact match)
                    if word.text.lower() == label.lower() or any(word.text == lbl for lbl in synonyms_dict[label]):
                        counter = counter + 1
                if len(doc) > 0:
                    freq = (counter/len(doc))*100
                else:
                    freq = 0

                if counter > 0:
                    label_dict.update({label:counter})

                    print(label,"appears in",file_name,counter,"times",", i.e.(",freq,"%, as this text has",len(doc),"\"tokens\")")
                    id_twc_dict.update({file_name:len(doc)})
                    final_dict[file_name] = label_dict
                    #final_dict.update({file_name:label_dict})
        except Exception:
            # sys.exc_info is a function and must be called before indexing
            print(sys.exc_info()[0])

print(id_twc_dict)   
print(final_dict)

with open('total_word_count_in_all_files.json', 'w') as outfile:
    json.dump(id_twc_dict, outfile)

with open('all_labels_occurences_in_all_files.json', 'w') as outfile:
    json.dump(final_dict, outfile)
#print(final_dict)

In [None]:
# pandas df for labels and occurences in all files
labels_occurences_all_df = pd.read_json (r'all_labels_occurences_in_all_files.json')
print(labels_occurences_all_df)
labels_occurences_all_df.to_csv('labels_occurences_all_df.csv')

# Statistical Evaluation

In [None]:
import pandas as pd
import math
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import json

In [None]:
# loading data from precomputed files
root="postsubmission/"
# reading noncomm_use_subset
ncu_labelocc = pd.read_csv(root + "ncu-labels_occurences_all_df.csv")
ncu_labelocc.rename(columns={ ncu_labelocc.columns[0]: "term" }, inplace = True)
ncu_labelocc_c=pd.read_csv(root + "ncu-labels_occurences_corona_df.csv")
ncu_labelocc_c.rename(columns={ ncu_labelocc_c.columns[0]: "term" }, inplace = True)
#print(ncu_labelocc_c)

#reading biorxiv
bio_labelocc = pd.read_csv(root +"biorxiv-labels_occurences_all_df.csv")
bio_labelocc.rename(columns={ bio_labelocc.columns[0]: "term" }, inplace = True)
bio_labelocc_c=pd.read_csv(root +"biorxiv-labels_occurences_corona_df.csv")
bio_labelocc_c.rename(columns={ bio_labelocc_c.columns[0]: "term" }, inplace = True)
#print(bio_labelocc_c)
#print(ncu_labelocc_c)

In [None]:
# joining ncu and bio dataframes

labelocc = pd.concat([ncu_labelocc, bio_labelocc], ignore_index=True)
#labelocc=ncu_labelocc
labelocc=labelocc.groupby("term").sum().reset_index()
labelocc.dropna(inplace=True)


labelocc_c = pd.concat([ncu_labelocc_c, bio_labelocc_c], ignore_index=True)
#labelocc_c=ncu_labelocc_c
labelocc_c=labelocc_c.groupby("term").sum().reset_index()
labelocc_c.dropna(inplace=True)


#optional: save joined dfs to files (labelocc holds all papers, labelocc_c the corona subset)
#labelocc.to_csv("labels_occurences_all_df_allds.csv")
#labelocc_c.to_csv("labels_occurences_corona_df_allds.csv")

#import precomputed abstract word counts
with open(root +'ncu-bio-total_word_count_in_all_files.json', 'r') as f:
    wordcountdict = json.load(f)

#print(len(wordcountdict))

In [None]:
# collecting terms
terms = list(labelocc_c["term"])

# collecting IDs of all papers, whole subset
IDList = list(labelocc)
IDList.pop(0)

# collecting IDs of all papers, corona subset
IDList_c = list(labelocc_c)
IDList_c.pop(0)



In [None]:
#print(len(IDList_c))
#print(len(IDList))

# papers of the whole subset that are not part of the corona subset
corona_id_set = set(IDList_c)
IDrest = [paper for paper in IDList if paper not in corona_id_set]
#print(len(IDrest))
                


In [None]:
def getWordCount(term, IDList, sourcedf) :
    wordcount = 0
    for paper in IDList:
        #print(paper)
        wc_paperID = sourcedf[paper][sourcedf["term"] == term].values[0]
        #print(wc_paperID)
        wordcount = wordcount + wc_paperID
    return int(wordcount)


def getTotWords(term, IDListin, sourcedf, sourcedict=wordcountdict) :
    # Sum the abstract token counts of all papers in IDListin.
    # `term` and `sourcedf` are not used; they are kept so existing calls still work.
    # Multi-token terms are not treated specially (no overlapping-window correction).
    total_words = 0
    for paper in IDListin :
        wctot_paperID = sourcedict[paper]
        #print(wctot_paperID)
        total_words = total_words + wctot_paperID
    return total_words


#print(getTotWords("Influenza", IDList_c, labelocc_c))

In [None]:
EFDict = {}; pDict = {}; logEFDict ={};					# Other variables

####### Probability of finding the term corona in entire literature #######

# taken from previous calculations (see above)
corona_count = 11130 # mentions of "corona" or its synonyms in all biorxiv_medrxiv and noncomm_use_subset articles
corona_tot_wc = getTotWords("Corona", IDList, labelocc) # total abstract token count over all papers in IDList (whole subset, not only the corona subset)
 
pCorona = corona_count / corona_tot_wc

print ("Prob of finding the term corona in entire literature = " + str(pCorona))

###############################################################################

for term in terms: 
    if True :
        wc1 = getWordCount(term, IDList, labelocc); #print (wc1)
        wc_tot = getTotWords(term, IDList, labelocc);
        pL = wc1 / wc_tot;  # Prob of finding the term across entire Lit
        #print ("Prob of finding the term "+term+" in the entire literature =" + str(pL))
        #print(wc1)
        #print(wc_tot)
        #print(pL)

        wc2 = getWordCount(term, IDList_c, labelocc_c)
        wc_totC = getTotWords(term, IDList_c, labelocc_c)
        pC = wc2 / wc_totC   # Prob of finding the term across CORONA Lit
        #print ("Prob of finding the term "+term+" in the CORONA literature =" + str(pC))
        #print(wc2)
        #print(wc_totC)
        #print(pC)

    ######## Preferential association of vocab terms with CORONA literature #########
        if pL != 0:
            EF = pC/pL
            #print ("CORONA Lit has "+str(EF)+ " fold higher probability of finding the term "+term)
            if EF>0: 
                #print (math.log10(EF))  # as Log10(0) is undefined
                EFDict[term] = EF
                logEFDict[term] = math.log10(EF)
            posteriorT = (pC*pCorona)/pL
            #print(posteriorT)
            #print(pCorona)
            pDict[term] = posteriorT
        else :
            #print ("The term "+term+" does not appear in any literature and hence will be neglected!\n")
            pass
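
The arithmetic of the cell above can be condensed into a small self-contained function, shown here with made-up counts. This is a sketch under the assumption that all probabilities are simple count ratios; the function name `enrichment` and its parameters are hypothetical and not part of the original analysis.

```python
import math


def enrichment(term_count_corona, tokens_corona, term_count_all, tokens_all, p_corona):
    """Return (EF, log10 EF, posterior), or None when the ratio is undefined."""
    if tokens_corona == 0 or tokens_all == 0 or term_count_all == 0:
        return None  # empty corpus or term never seen: the ratio is undefined
    p_c = term_count_corona / tokens_corona  # P(term | corona-subset)
    p_l = term_count_all / tokens_all        # P(term) over all records
    ef = p_c / p_l
    # log10(0) is undefined, so report minus infinity for EF == 0
    log_ef = math.log10(ef) if ef > 0 else float("-inf")
    return ef, log_ef, ef * p_corona


# Toy numbers: 6 hits in 2,000 corona-subset tokens, 12 hits in 12,000 tokens overall.
result = enrichment(6, 2000, 12, 12000, p_corona=0.02)
print(result)
```

Returning `None` for undefined ratios makes it explicit which terms were skipped, instead of silently leaving them out of the result dictionaries.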



In [None]:
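#collecting results and saving to a dataframe
# DataFrame.append was removed in pandas 2.0, so the frame is built from a list of rows;
# terms that were skipped above (no positive EF) are left out instead of raising a KeyError
rows = [
    {"term": term, "EF": EFDict[term], "EFlog": logEFDict[term], "posterior": pDict[term]}
    for term in terms if term in EFDict
]
resDf = pd.DataFrame(rows, columns=["term", "EF", "EFlog", "posterior"])
resDf = resDf.sort_values("EF", ascending=False).reset_index(drop=True)
print(resDf)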


In [None]:
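plt.figure(figsize=(10,20))

# plot the 50 terms with the highest EF (resDf is sorted by EF in descending order)
ax=sns.barplot(data=resDf.head(50), x="EF", y="term")
ax.set_ylabel("")
ax.set_xlabel("Enrichment Factor")
ax.axvline(1, ls='--')  # EF = 1 marks the expected frequency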
