# We will focus on discovering insights about the abstracts, specifically what they are primarily talking about and use that information to match / group abstracts that are similar 

# We will also attempt to demonstrate how we can automatically extract information about experimental findings and claims made by researchers


The challenge here is dealing with scientific terms. We do not know specifically what to look for that makes sense to us non-virologists or non-epidemiologists. E.g. searching for incubation or virus will return several matches used in different contexts that may not be useful to an expert looking at our analysis.

Given the lack of domain expertise here, rather than looking for specific words such as virus, vaccine or incubation, I would prefer to determine the important words or phrases from the text itself.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
    #for filename in filenames:
        #print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

But first, let's get the documents processed. I am going to reuse the preprocessing code from another kernel in this challenge: https://www.kaggle.com/maksimeren/covid-19-literature-clustering

In [None]:
#ref: https://www.kaggle.com/maksimeren/covid-19-literature-clustering
import json
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            # Abstract
            for entry in content['abstract']:
                self.abstract.append(entry['text'])
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'

In [None]:
#ref: https://www.kaggle.com/maksimeren/covid-19-literature-clustering
import glob
import json

metadata_path = "/kaggle/input/CORD-19-research-challenge/metadata.csv"
metadata_df = pd.read_csv(metadata_path)

all_json = glob.glob("/kaggle/input/CORD-19-research-challenge/**/*.json", recursive=True)

dict_ = {'paper_id': [], 'abstract': [], 'body_text': [], 'authors': [], 'title': [], 'journal': [], 'abstract_summary': []}
for idx, entry in enumerate(all_json):
    if idx % max(1, len(all_json) // 10) == 0:
        print(f'Processing index: {idx} of {len(all_json)}')
    content = FileReader(entry)
    
    # get metadata information
    meta_data = metadata_df.loc[metadata_df['sha'] == content.paper_id]
    # no metadata, skip this paper
    if len(meta_data) == 0:
        continue
    
    dict_['paper_id'].append(content.paper_id)
    dict_['abstract'].append(content.abstract)
    dict_['body_text'].append(content.body_text)
    
    # also create a column for the abstract summary (used for plotting in the referenced kernel)
    if len(content.abstract) == 0: 
        # no abstract provided
        dict_['abstract_summary'].append("Not provided.")
    else:
        # keep the full abstract as the summary
        dict_['abstract_summary'].append(content.abstract)
    
    try:
        # the author list is a single string separated by ';'
        authors = meta_data['authors'].values[0].split(';')
        if len(authors) > 2:
            # more than 2 authors: keep the first 2 and append "..."
            dict_['authors'].append(". ".join(authors[:2]) + "...")
        else: # modified from the kernel referenced above, since we do not need its plotting
            dict_['authors'].append(". ".join(authors))
    except AttributeError:
        # authors is null (NaN), so there is nothing to split
        dict_['authors'].append(meta_data['authors'].values[0])
    
    # add the title information (may be null if no title was provided)
    dict_['title'].append(meta_data['title'].values[0])
    
    # add the journal information
    dict_['journal'].append(meta_data['journal'].values[0])
    
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text', 'authors', 'title', 'journal', 'abstract_summary'])
df_covid.head()

Now that we have processed the documents, we can start the EDA that this kernel focuses on. I will focus on the abstracts and leave the body text alone for now. Let's see if any data clean-up is necessary.

In [None]:
df_covid.isna().sum()

None of the abstract or paper_id fields are null. So we are good to proceed with our EDA. Let's begin by looking at the first abstract we have here.

In [None]:
df_covid.loc[df_covid['paper_id']=='f318f417880d9beb2ce5c8444f3597a8808eae30', ['abstract']]['abstract'].values[0]

A few important observations:
* Scientific terms are used extensively
* Short forms of phrases are used after being defined once, e.g. HBV, PTT22-vector etc.
* Some findings are also reported, e.g. *"Therefore, α-mannosidase I may be a novel drug target..."*

Typically, we approach the problem by extracting words from the documents and using those words to perform several tasks such as
* finding the most frequently used words
* finding documents that are similar to each other based on the words used

We will use a slightly different approach here. If you read the abstract carefully, you will notice that it repeatedly refers to 'bone repair' or 'bone regeneration'. If we build a feature extractor that extracts these important keywords that provide us an insight about what the abstract is 'in general' referring to, we can limit our feature set to a manageable size and also remove a lot of noise.

We will start with a simple approach to find words that are frequently used in an abstract.
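
Before the NLTK version, here is a minimal, dependency-free sketch of the underlying idea: count the words, then keep only those whose count lies far above the average count. The toy text, the tiny stop-word set and the helper name `frequent_words` are made up for illustration and are not part of the dataset or of NLTK.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical stop-word list, for illustration only
TOY_STOP_WORDS = {"the", "of", "in", "and", "a", "is", "to"}

def frequent_words(text, threshold=2.0):
    """Return words whose count is more than `threshold` population
    standard deviations above the mean word count."""
    words = [w.lower() for w in text.split() if w.isalpha()]
    counts = Counter(w for w in words if w not in TOY_STOP_WORDS)
    if not counts:
        return []
    mu = mean(counts.values())
    sigma = pstdev(counts.values())
    if sigma == 0:
        return []  # every word is equally frequent, nothing stands out
    return [w for w, c in counts.items() if (c - mu) / sigma > threshold]

# Hypothetical toy text: 'virus' appears six times, every other word once
toy_text = ("virus virus virus virus virus virus replication in the host cell "
            "and protein binding of the receptor to membrane is studied")
print(frequent_words(toy_text))        # threshold 2 keeps only 'virus'
print(frequent_words(toy_text, 3.0))   # a stricter threshold keeps nothing
```

The threshold plays the same role as in the routine that follows: a higher value returns fewer, more dominant words.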

In [None]:
#nltk.download('stopwords')
#nltk.download('punkt')
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords 

def get_important_words(doc, threshold=3):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(doc)
    #get rid of special characters and numbers, and lower-case the rest
    words = [word.lower() for word in tokens if word.isalpha()]
    #get rid of stop words
    filtered_words = [w for w in words if w not in stop_words]
    if len(filtered_words) == 0:
        return pd.DataFrame(columns=['word'])
    #find frequency distribution of words
    fdist = nltk.FreqDist(filtered_words)
    df = pd.DataFrame(list(fdist.items()), columns=['word', 'count'])
    #find words that are used 'way' more than others:
    #keep those whose count has a z-score above the threshold
    std_1 = np.std(df['count'].values)
    if std_1 == 0:
        #all words are equally frequent, so none stands out
        return pd.DataFrame(columns=['word'])
    z_scores = (df['count'] - df['count'].mean()) / std_1
    return df.loc[z_scores > threshold, ['word']]

Let's see how our routine did on the first abstract

In [None]:
doc = df_covid.loc[df_covid['paper_id']=='f318f417880d9beb2ce5c8444f3597a8808eae30', ['abstract']]['abstract'].values[0]
get_important_words(doc, 3)

For a threshold of 3, it was able to identify the short notation for **hepatitis B virus (HBV)**
If we relax the threshold, can we do a little better?

In [None]:
doc = df_covid.loc[df_covid['paper_id']=='f318f417880d9beb2ce5c8444f3597a8808eae30', ['abstract']]['abstract'].values[0]
get_important_words(doc, 2)

We are now able to identify that the abstract is about HBV and MK886 (PPARα) and also about 'expressions' (gene?). Ideally, we would have preferred to identify the context in which the word 'expression' is used. We will get to that later in the notebook. For now, we will focus on single words only.
### Randomly sample some abstracts from the corpus and see whether the important words seen across abstracts tell us anything useful about the terms used frequently in this type of research

In [None]:
import numpy as np
import random
def get_important_words_per_document(df_all, num_samples):
    random.seed(1234)
    idx = np.arange(df_all.shape[0])
    random.shuffle(idx)
    frames = []
    #sample rows from the dataframe that was passed in
    for indx, (p_id, abstr) in df_all.loc[idx[:num_samples], ['paper_id', 'abstract']].iterrows():
        df_temp = get_important_words(abstr, 4) #increase the threshold
        df_temp = df_temp.assign(p_id=p_id)
        frames.append(df_temp[['p_id','word']])
    if not frames:
        return pd.DataFrame(columns=['p_id','word'])
    return pd.concat(frames)

In [None]:
df_imp_words_all_docs = get_important_words_per_document(df_covid, 500)

Let's see how we did on these randomly sampled 500 abstracts

In [None]:
df_imp_words_all_docs.groupby(['word']).count().reset_index().sort_values(by=['p_id'], ascending=False).head(15)

We find several words that are important in the context of a single abstract and are also important in the context of other abstracts. The intuition here is that the abstracts identified by the p_id in the dataset above refer extensively to the words listed against them.

In [None]:
df_imp_words_all_docs[df_imp_words_all_docs['word'] == 'sars']['p_id']

All these documents (in the randomly sampled 500 abstracts) talk extensively about 'sars'. Let's verify that.

In [None]:
df_covid.loc[df_covid['paper_id']=='ce708dd37870908f94d2e5845c963cfadaa38b0d', 'abstract'].values[0]

In [None]:
df_covid.loc[df_covid['paper_id']=='25cc93bafacf163c6e315809b41ef6d814c15b15', 'abstract'].values[0]

Let's now look at how the important words are distributed among the 500 randomly selected documents

In [None]:
data = df_imp_words_all_docs.groupby(['word']).count().reset_index().sort_values(by=['p_id'], ascending=False)[:50]
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(1,1, figsize=(10,10))
ax.barh(data['word'], data['p_id'])
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Occurrence')
ax.set_title('Important (top 50) abstract words that occur frequently across corpus')

At the outset, the first few words such as 'patients' and 'infection' may not seem very useful, but if you look carefully, there are quite a few important words such as 'pigs', 'respiratory', 'amplification' etc. that may be very useful for researchers who wish to find abstracts that extensively mention those words.

We will now use some similarity detection techniques so that we can find documents that not only refer to the identical word, but similar words so that we can broaden our match.
I will use a similarity computation routine that is available here https://www.programcreek.com/python/example/91606/nltk.corpus.wordnet.wup_similarity

In [None]:
#https://www.programcreek.com/python/example/91606/nltk.corpus.wordnet.wup_similarity
from nltk.corpus import wordnet
import itertools
def similarity(word1, word2):
    allsyns1 = set(ss for ss in wordnet.synsets(word1))
    allsyns2 = set(ss for ss in wordnet.synsets(word2))
    # default=0 covers words that have no WordNet synsets at all
    best = max(((wordnet.wup_similarity(s1, s2) or 0) for s1, s2 in
                itertools.product(allsyns1, allsyns2)), default=0)
    return best

Let's see how this routine does

In [None]:
print(similarity('infection','disease'))
print(similarity('one','four'))
print(similarity('dna','rna'))
print(similarity('pig','bat'))
print(similarity('influenza','virus'))

The routine seems to correctly identify that 'infection' and 'disease' are used in similar contexts. Examples such as 'one' and 'four' or 'dna' and 'rna' further support the usefulness of the routine. However, 'influenza' and 'virus' have low similarity: even though they often appear together in the literature, they name different kinds of things (a disease and an organism), so the measure does not treat them as being used in the same context.
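
To see why two words that co-occur can still score low, it helps to look at what Wu-Palmer similarity computes: twice the depth of the deepest shared ancestor of two concepts, divided by the sum of their depths. The sketch below applies that formula to a tiny hand-made taxonomy. The taxonomy, its node names and the helper names are hypothetical, and depths are counted with the root at depth 1; WordNet's real hierarchy is far larger, so the numbers will not match those printed by the `similarity` routine.

```python
# Hypothetical child -> parent taxonomy; None marks the root
TOY_PARENT = {
    "entity": None,
    "organism": "entity",
    "animal": "organism",
    "pig": "animal",
    "bat": "animal",
    "microorganism": "organism",
    "virus": "microorganism",
    "illness": "entity",
    "influenza": "illness",
}

def path_to_root(node):
    """Return [node, parent, ..., root]; its length is the node's depth."""
    path = []
    while node is not None:
        path.append(node)
        node = TOY_PARENT[node]
    return path

def toy_wup_similarity(a, b):
    """Wu-Palmer score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # walking up from a, the first node shared with b is the deepest common ancestor
    lcs = next(n for n in path_a if n in ancestors_b)
    return 2 * len(path_to_root(lcs)) / (len(path_a) + len(path_b))

print(toy_wup_similarity("pig", "bat"))          # siblings under 'animal'
print(toy_wup_similarity("influenza", "virus"))  # only the root is shared
```

In this toy hierarchy the two animals share a deep ancestor and score high, while the disease and the organism meet only at the root and score low.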

We will go back to the randomly sampled 500 abstracts *df_imp_words_all_docs* and try this routine on the unique important words seen in that dataset.
For this we need to write a routine to find the closest word (based on the *similarity* function) for any given word.

In [None]:
def find_closest_words(wrd, wordlist):
    #one row per candidate word with its similarity to wrd
    rows = [{'word': word, 'similarity': similarity(wrd, word)} for word in wordlist]
    return pd.DataFrame(rows, columns=['word','similarity'])

Let's see if we can figure out what words there are in our *limited* dataset that are similar to the word 'antiviral'

In [None]:
df_closest = find_closest_words('antiviral', df_imp_words_all_docs['word'].unique())
df_closest.sort_values(by=['similarity'], ascending=False).head(10)

We seem to have done well. We have 'lozenge', 'vaccines', 'assay' etc. that seem relevant here. 
We do see some issues, such as 'cat', which may be noise; we can investigate those anomalies to improve our similarity detection.

Let's look at the distribution of the similarity measures with respect to the word 'antiviral'

In [None]:
df_closest.sort_values(by=['similarity'], ascending=False)['similarity'].hist(bins=50)

There are several values that are close to zero or 'small' and only a few that are 'large'. This indicates that we need a mechanism for choosing a similarity cut-off. Let's write a routine to find similar words based on a cut-off or threshold (set to the 98th percentile of the similarity values).
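
As a small illustration of a percentile-based cut-off, the sketch below keeps only the words whose score is at or above a chosen quantile. The scores are invented, and the `quantile` helper is a plain-Python stand-in (linear interpolation between the two nearest ranks) so that the example has no dependencies; the notebook itself uses `np.quantile`.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Hypothetical similarity scores, for illustration only
toy_scores = {"vaccine": 0.9, "drug": 0.8, "cat": 0.3,
              "table": 0.1, "four": 0.1, "sky": 0.0}

cut = quantile(list(toy_scores.values()), 0.8)
kept = {w: s for w, s in toy_scores.items() if s >= cut}
print(cut, kept)
```

A relative cut-off like this always keeps some words, even when every score is low, which is why an absolute minimum similarity can be a useful additional check.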

In [None]:
def find_closest_words_with_cutoff(wrd, wordlist, cut_off):
    rows = [{'word': word, 'similarity': similarity(wrd, word)} for word in wordlist]
    df = pd.DataFrame(rows, columns=['word','similarity'])
    if df.shape[0] <= 0: return df
    #keep only the words at or above the requested quantile of the similarity values
    cut = np.quantile(df['similarity'].values, cut_off)
    return df.loc[df['similarity'] >= cut]

In [None]:
find_closest_words_with_cutoff('virus', df_imp_words_all_docs['word'].unique(), .98)

We have done quite well here as seen from the above results. We may need to further take a second pass on the similarity values if a word is quite unique in a given abstract and has no similar matches elsewhere in the corpus, in which case, even with a cut-off, the similarity measure may be poor.

#### Let's get back to our important words document list and determine other documents that use similar important words
We write a new routine that first finds words across abstracts that are similar to a given word and then, using the information about which abstracts use those words, finds the abstracts that are similar with respect to those words.
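
The two steps can be sketched on toy data without pandas: first look up which words count as similar to the query word, then collect the papers that list any of those words as important. The paper ids, the word pairs, the `toy_similar` lookup and the helper name below are all hypothetical; in the notebook the similar words come from `find_closest_words_with_cutoff`.

```python
from collections import defaultdict

# Hypothetical (paper id, important word) pairs
toy_pairs = [("p1", "transmission"), ("p2", "spreading"),
             ("p3", "contacts"), ("p4", "protein"), ("p5", "spreading")]

# Hypothetical result of a similar-word search
toy_similar = {"transmission": ["transmission", "spreading", "contacts"]}

def papers_for_similar_words(word, pairs, similar):
    """Map each word similar to `word` to the papers that list it as important."""
    word_to_papers = defaultdict(set)
    for paper_id, w in pairs:
        word_to_papers[w].add(paper_id)
    return {w: sorted(word_to_papers[w])
            for w in similar.get(word, []) if w in word_to_papers}

print(papers_for_similar_words("transmission", toy_pairs, toy_similar))
```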

In [None]:
def find_closest_word_and_document(p_id, word, unique_words, all_docs):
    columns = ['p_id','word','sim_word','ref_p_id']
    df_closest = find_closest_words_with_cutoff(word, unique_words, .98)
    if df_closest.shape[0] <= 0: return pd.DataFrame(columns=columns)
    #skip words whose best match is still weak
    if df_closest['similarity'].max() < .7: return pd.DataFrame(columns=columns)
    frames = []
    for wrd in df_closest['word']:
        #all sampled abstracts that list wrd as an important word
        d = pd.DataFrame({'ref_p_id': all_docs.loc[all_docs['word'] == wrd, 'p_id']})
        d['sim_word'] = wrd
        #todo: store the similarity values too, this will help with comparison and possible distance plots
        frames.append(d)
    df_temp = pd.concat(frames)
    df_temp['p_id'] = p_id
    df_temp['word'] = word
    return df_temp[columns]

Get a list of unique words in the corpus of the limited (randomly sampled 500 abstracts) dataset and find documents that are using some specific words or similar words. Let's try with the word 'transmission'.

In [None]:
unique_words = df_imp_words_all_docs['word'].unique()
find_closest_word_and_document('01c47a7e53b4cf4783d55125936061e2ca0d9817', 'transmission', unique_words, df_imp_words_all_docs)

We do find several abstracts that refer to transmission. Note that this is not the entire corpus of 29K abstracts. This is only a 2% sample of the corpus, so the match is not exhaustive. Despite that, we still manage to find several abstracts that refer to terms similar to 'transmission'.

In [None]:
find_closest_word_and_document('01c47a7e53b4cf4783d55125936061e2ca0d9817', 'incubation', unique_words, df_imp_words_all_docs)

We did find several matches for incubation, however not many look very relevant. Even though 'binding' is a word used in the same context as 'incubation', the rest don't look that useful. 
The above routine uses only one word and finds words similar to it across the abstracts in the sub-sample. We will enhance the routine to use all the important words in an abstract and then find the other abstracts that use words similar to any of them.

In [None]:
#enhanced similar word search supporting all imp. words in a document
def find_closest_word_and_document_v2(x, unique_words, all_docs):
    columns = ['word','sim_word','ref_p_id']
    frames = []
    for w in x['word']:
        df_closest = find_closest_words_with_cutoff(w, unique_words, .98)
        #no candidates for this word, move on to the next one
        if df_closest.shape[0] <= 0: continue
        #for very low similarities, skip further processing of this word
        if df_closest['similarity'].min() < .7: continue
        for wrd in df_closest['word']:
            #all sampled abstracts that list wrd as an important word
            d = pd.DataFrame({'ref_p_id': all_docs.loc[all_docs['word'] == wrd, 'p_id']})
            d['sim_word'] = wrd
            d['word'] = w
            frames.append(d[columns])
    if not frames:
        return pd.DataFrame(columns=columns)
    #combine the matches found for every important word of the abstract
    return pd.concat(frames)

#### Find abstracts that are similar to each other with respect to the important words used, using word similarity measures 
An abstract may have more than one important word, and we plan to use all of their similarity measures to compute the similarity between abstracts.

In [None]:
#build the dataframe with all similar abstracts wrt important words
#df_imp_words_all_docs contains information about 500 randomly chosen abstracts
df_wo_index = df_imp_words_all_docs.reset_index(drop=True)
unique_words = df_imp_words_all_docs['word'].unique()
df_sim_matrix = df_wo_index[0:500].groupby('p_id').apply(lambda x: find_closest_word_and_document_v2(x,unique_words,df_imp_words_all_docs)).reset_index()

In [None]:
#imp_word = 'incubation'
data = df_sim_matrix.groupby('word')\
    .apply(lambda x: len(np.unique(x['ref_p_id']))).reset_index().sort_values(by=[0], ascending=False)
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(1,1, figsize=(10,25))
ax.barh(data['word'], data[0])
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Occurrence')
ax.set_title('Important words / similar words that occur frequently across abstracts')

Above, we see the distribution of all important words across the sub-sample of the randomly chosen 500 abstracts. 
Note: This is not a simple distribution of words across the corpus. The length of each bar indicates the number of abstracts that extensively discuss that term or terms similar to it. We will look at an example:

In [None]:
df_sim_matrix[df_sim_matrix['word']=='transmission']

'transmission' and other words used in a similar context have been used extensively in 13 of these 500 abstracts. Let's look at one of them, the one corresponding to 'spreading'.

In [None]:
df_covid.loc[df_covid['paper_id']=='95d070f39f49f5d56d1330a2056f2e953d37af0f', 'abstract'].values[0]

In [None]:
df_covid.loc[df_covid['paper_id']=='d04d63e56673f57ed326ebf2314e5b8192266a79', 'abstract'].values[0]

As seen from the above abstract, it does primarily talk about spreading and transmission. Another example is from the word 'contacts', let's examine it below.

In [None]:
df_covid.loc[df_covid['paper_id']=='473c721f42096f1b8450c669b607486841a5f72a', 'abstract'].values[0]

Clearly, this abstract does talk about studies conducted to determine the nature of human-to-human contact that can promote transmission.
We now have a fairly good approach to extracting abstracts that primarily refer to specific important words, or to similar words used in a similar context. Let's run this at scale and see what we can find about our specific research questions.

In [None]:
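#build the similarity matrix from all the important words collected so far
#note: df_imp_words_all_docs still holds the 500-abstract sample; to cover more of the corpus,
#rebuild it first with a larger num_samples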
df_wo_index_full = df_imp_words_all_docs.reset_index(drop=True)
unique_words_full = df_imp_words_all_docs['word'].unique()
df_sim_matrix_full = df_wo_index_full.groupby('p_id').apply(lambda x: find_closest_word_and_document_v2(x,unique_words_full,df_imp_words_all_docs)).reset_index()

In [None]:
df_sim_matrix_full.shape

In [None]:
#imp_word = 'incubation'
data = df_sim_matrix_full.groupby('word')\
    .apply(lambda x: len(np.unique(x['ref_p_id']))).reset_index().sort_values(by=[0], ascending=False).head(50)
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(1,1, figsize=(10,25))
ax.barh(data['word'], data[0])
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Occurrence')
ax.set_title('Important words / similar words that occur frequently across abstracts')

In [None]:
'incubation' in unique_words_full

The word 'incubation' is not in the list of unique important words, so we will look for words in that list that are similar to 'incubation'.

In [None]:
sim = []
for wrd in unique_words_full:
    sim.append(similarity('incubation',wrd))
df_wrd_sim = pd.DataFrame(columns=['word', 'sim'])
df_wrd_sim['word'] = unique_words_full
df_wrd_sim['sim'] = sim
vals = df_wrd_sim.sort_values(by=['sim'], ascending = False)['sim'].values
cut_off = np.quantile(vals, .98)
relevant_words = df_wrd_sim.loc[df_wrd_sim['sim']>cut_off, ]['word']

We will look for these relevant words in the similarity matrix

In [None]:
frames = []
for word in relevant_words:
    df_select_columns = df_sim_matrix.loc[df_sim_matrix['word']==word, ]
    #referenced abstracts and the similar word that linked them
    frames.append(pd.DataFrame({'p_id': df_select_columns['ref_p_id'],
                                'word': df_select_columns['sim_word']}))
df_abstract_word = pd.concat(frames) if frames else pd.DataFrame(columns=['p_id', 'word'])
print(df_abstract_word)

In [None]:
sim = []
for wrd in unique_words_full:
    sim.append(similarity('incubation',wrd))
df_wrd_sim = pd.DataFrame(columns=['word', 'sim'])
df_wrd_sim['word'] = unique_words_full
df_wrd_sim['sim'] = sim
vals = df_wrd_sim.sort_values(by=['sim'], ascending = False)['sim'].values
cut_off = np.quantile(vals, .995) #changing this value to a higher number
relevant_words = df_wrd_sim.loc[df_wrd_sim['sim']>cut_off, ]['word']

In [None]:
frames = []
for word in relevant_words:
    df_select_columns = df_sim_matrix.loc[df_sim_matrix['word']==word, ]
    #referenced abstracts and the similar word that linked them
    frames.append(pd.DataFrame({'p_id': df_select_columns['ref_p_id'],
                                'word': df_select_columns['sim_word']}))
df_abstract_word = pd.concat(frames) if frames else pd.DataFrame(columns=['p_id', 'word'])
print(df_abstract_word)

Let's look at the abstract that refers to 'contact'

In [None]:
df_covid.loc[df_covid['paper_id']=='5eb34e4b386106962c368bb7c32db8995190e5c6', 'abstract'].values[0]

Apparently, we have an abstract here that discusses the contact patterns that determine transmission, which is only remotely related to incubation.

Let's put this together in a routine

In [None]:
def find_abstracts_discussing_specific_terms(term):
    #similarity of the query term to every unique important word
    df_wrd_sim = pd.DataFrame({'word': unique_words_full,
                               'sim': [similarity(term, wrd) for wrd in unique_words_full]})
    cut_off = np.quantile(df_wrd_sim['sim'].values, .995) #a high cut-off keeps only the closest words
    relevant_words = df_wrd_sim.loc[df_wrd_sim['sim']>cut_off, 'word']
    frames = []
    for word in relevant_words:
        df_select_columns = df_sim_matrix.loc[df_sim_matrix['word']==word, ]
        #referenced abstracts and the similar word that linked them
        frames.append(pd.DataFrame({'p_id': df_select_columns['ref_p_id'],
                                    'word': df_select_columns['sim_word']}))
    if not frames:
        return pd.DataFrame(columns=['p_id', 'word'])
    return pd.concat(frames)

Let's check for the term 'transmission'

In [None]:
df = find_abstracts_discussing_specific_terms('transmission')
print(df['word'].unique())

We seem to have done a fairly good job of finding abstracts that discuss topics similar to transmission.

### Analyzing sentences and the contexts
We will now attempt to extract some context and if possible meaningful phrases from sentences used in the abstracts.
First, let's go back to our sample abstract and figure out what type of analysis we would like to perform.

In [None]:
df_covid.loc[df_covid['paper_id'] == '0015023cc06b5362d332b3baf348d11567ca2fbb', ['abstract']].values[0][0]

We will use the NLTK package to extract some of the important POS (Part-Of-Speech) from the text here.

In [None]:
import nltk
from nltk import word_tokenize
#nltk.download('averaged_perceptron_tagger')
doc = df_covid.loc[df_covid['paper_id'] == '0015023cc06b5362d332b3baf348d11567ca2fbb', ['abstract']].values[0][0]
tokens = word_tokenize(doc)
tagged_tokens = nltk.pos_tag(tokens)
#one row per unique (word, POS tag) pair, in order of first appearance
df = pd.DataFrame(list(dict.fromkeys(tagged_tokens)), columns=['word', 'pos'])
print("Unique POS =", df['pos'].nunique())

There are 25 unique POS tags in the text. Let's look at their distribution.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(1,1, figsize=(6,5))
plot_data = df.groupby('pos').count().reset_index().sort_values(by = ['word'], ascending=False)
ax.invert_yaxis()
ax.set_xlabel("Occurrence of POS tags")
ax.set_title("Words by POS tags")
ax.barh(plot_data['pos'], plot_data['word'])

There are largely nouns ('NN') and very few personal pronouns (PRP). **The PRPs might be used to refer to claims by researchers**, so we may want to take a closer look at them. The number of proper nouns (NNP) is far smaller than the number of NNs. Let's look at those.
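
Before turning to the proper nouns, the remark about personal pronouns can be made concrete with a rough sketch: sentences written in the first person ("We show that ...", "Our results suggest ...") are candidates for claims made by the authors. The example below uses only regular expressions; the naive sentence splitter, the choice of the pronouns 'we' and 'our', the toy abstract and the helper name are all assumptions made for illustration, and a POS-tag based filter on PRP would be the closer match to the analysis in this section.

```python
import re

# Assumption: first-person 'we' / 'our' signals a statement by the authors
FIRST_PERSON = re.compile(r"\b(we|our)\b", re.IGNORECASE)

def candidate_claim_sentences(text):
    """Return the sentences that contain a first-person pronoun."""
    # naive split: sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if FIRST_PERSON.search(s)]

# Hypothetical abstract, for illustration only
toy_abstract = ("Coronaviruses infect many species. We show that the spike protein "
                "binds the receptor. The assay was repeated twice. Our results "
                "suggest a novel drug target.")

for sentence in candidate_claim_sentences(toy_abstract):
    print(sentence)
```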

In [None]:
df[df['pos']=='NNP'].head()

Interestingly, these NNPs are very useful in extracting the scientific notations or short forms of specific scientific terms being discussed in the abstract.

In [None]:
df[df['pos'] == 'NNP'].groupby('word').count().reset_index().sort_values(by=['pos'], ascending=False).head()

We could look only at NNPs and focus on the ones that are used most often in the abstract. However, the count might still be unmanageable as we cover the larger corpus, so we may need to apply a cut-off to select only a few NNPs per abstract. We reuse our previous method of selecting important terms to select important NNPs.

In [None]:
def freq_dist_pos_by_sents(doc): #todo: handle a regex for the POS
    rows = []
    for sent in nltk.sent_tokenize(doc):
        tagged_tokens = nltk.pos_tag(nltk.word_tokenize(sent))
        #one row per unique (word, POS tag) pair within the sentence
        rows.extend(dict.fromkeys(tagged_tokens))
    return pd.DataFrame(rows, columns=['word', 'pos'])

In [None]:
def find_important_pos(doc, pos, threshold=2):
    # grammar = "CHUNK: {<NN|NNP|NNS|NP><NN|NNP|NNS|NP>}  # Chunk two consecutive nouns"
    df = freq_dist_pos_by_sents(doc)
    df = df.loc[df['pos'] == pos,]
    df_ordered = df.groupby(['word']).count().reset_index().sort_values(['pos'], ascending=False)
    df_word_weight = df_ordered\
        .apply(lambda x: pd.Series([x['word'], x['pos']/df_ordered['word'].shape[0],x['pos']], \
                                    index=['word','weight','pos']), axis = 1)
    outliers = []
    if ((df_word_weight.shape[0] > 0) & ('weight' in df_word_weight.columns.tolist())):
        arr = df_word_weight['weight'].values
        #find outliers
        mean_1 = np.mean(arr)
        std_1 = np.std(arr)
        for y in arr:
            z_score= (y - mean_1)/std_1 
            if np.abs(z_score) > threshold:
                outliers.append(y)  
        if(len(outliers)>0):
            df_word_weight['outlier'] = df_word_weight['weight'] >= outliers[-1]
        else:
            df_word_weight['outlier'] = False
        return_df = df_word_weight.loc[df_word_weight['outlier'] == True, ['word']]
    else:
        df_word_weight['word'] = None
        return_df = df_word_weight.loc[:, ['word']]
    return return_df

In [None]:
doc = df_covid.loc[df_covid['paper_id'] == '0015023cc06b5362d332b3baf348d11567ca2fbb', ['abstract']].values[0][0]
find_important_pos(doc, 'NNP', 2)

Find **important proper nouns** across abstracts and see how they are distributed 

In [None]:
import random
random.seed(1234)
idx = np.arange(df_covid.shape[0])
random.shuffle(idx)
frames = []
for i_d, (abstr, p_id) in df_covid.loc[idx[0:1000],['abstract','paper_id']].iterrows():
    #important proper nouns of this abstract, labelled with its paper id
    df_nnp = find_important_pos(abstr, 'NNP', 2)
    df_nnp['p_id'] = p_id
    df_nnp['pos'] = 'NNP'
    frames.append(df_nnp)
df_imp_pos = pd.concat(frames, sort=False)[['p_id','word','pos']]

In [None]:
data_plot = df_imp_pos.groupby('word')\
    .p_id.nunique().reset_index()\
    .sort_values(by=['p_id'], ascending=False).head(20)
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(1,1, figsize=(8,6))
ax.barh(data_plot['word'], data_plot['p_id'])
ax.invert_yaxis()
ax.set_xlabel("Occurrence of sci. proper nouns")
ax.set_title("Occurrence of top 20 sci. terms used as proper nouns across a sample of 1000 abstracts")

We can tell from the above that, in the sub-sample we are examining, RNA, SARS, MERS-CoV, FIPV etc. have been extensively discussed. We can easily identify the abstracts that primarily discuss these terms and help a researcher locate useful information in them. Let's see if we can take a second pass at these abstracts to identify the context in which these terms are used.

In [None]:
def get_usage_context_pos(doc, grammar):
    sentences = nltk.sent_tokenize(doc)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    #tagged_tokens = nltk.pos_tag(tokens)
    #use simple adjective and noun, and noun -> proper noun
    #grammar = "CHUNK: {<VB*>?<JJ*>?<NN|NP|NNP|NNS|NNPS>?<NNP>+}"
    cp = nltk.RegexpParser(grammar)
    phrases = []
    for sent in sentences:
        #tag the tokens in sentence order; wrapping them in nltk.FreqDist would drop repeated words and distort adjacency
        tagged_tokens = nltk.pos_tag(sent)
        tree = cp.parse(tagged_tokens)

        for subtree in tree.subtrees():
            if subtree.label() == 'CHUNK':
                #join words into one n-gram (depending upon how many matched the regex)
                phrases.append(" ".join(leaf[0] for leaf in subtree.leaves()))
    return ", ".join(phrases)

In [None]:
grammar = "CHUNK: {<VB*>?<JJ*>?<NN|NP|NNP|NNS|NNPS>?<NNP>+}"
#looking for abstracts that primarily discuss RNA, but looking at the proper noun used in conjunction with an adjective or other nouns
for p_id in df_imp_pos.loc[df_imp_pos['word']=='RNA', ['p_id']].values:
    print("_________________")
    doc = df_covid.loc[df_covid['paper_id'] == p_id[0], ['abstract']].values[0][0]
    print(get_usage_context_pos(doc, grammar))

There are quite a few important matches here where RNA is used in conjunction with adjectives that give the term a little more context, such as "positive-strand RNA" (which is mentioned more than once in the sub-sample of the abstracts), "single-stranded RNA" etc. We could optionally repeat the above POS tag matching with a pattern of an adjective followed by a noun, which would match such bigrams and perhaps make the information extraction more precise.
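
The adjective-plus-noun pattern can be sketched without a tagger by working on tokens that are already tagged. The example below is a simplified, pure-Python stand-in for a chunk grammar along the lines of `"CHUNK: {<JJ.*><NN.*>+}"`; the function name and the hand-tagged sentence are hypothetical, chosen only to illustrate the matching.

```python
def adjective_noun_phrases(tagged_tokens):
    """Collect runs of one adjective (JJ*) followed by one or more nouns (NN*)."""
    phrases = []
    i = 0
    n = len(tagged_tokens)
    while i < n:
        _, tag = tagged_tokens[i]
        if tag.startswith("JJ"):
            j = i + 1
            # extend the run over every noun tag that follows the adjective
            while j < n and tagged_tokens[j][1].startswith("NN"):
                j += 1
            if j > i + 1:  # at least one noun followed the adjective
                phrases.append(" ".join(w for w, _ in tagged_tokens[i:j]))
                i = j
                continue
        i += 1
    return phrases

# Hand-tagged tokens (hypothetical); in the notebook they would come from nltk.pos_tag
tagged = [
    ("The", "DT"), ("virus", "NN"), ("has", "VBZ"), ("a", "DT"),
    ("positive-strand", "JJ"), ("RNA", "NNP"), ("genome", "NN"),
    ("and", "CC"), ("single-stranded", "JJ"), ("RNA", "NNP"), (".", "."),
]

bigram_phrases = adjective_noun_phrases(tagged)
print(bigram_phrases)
```

Requiring the adjective makes the pattern stricter than the optional `<JJ*>?` used above, so bare proper nouns such as a lone "RNA" are no longer returned.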

In [None]:
#complex phrases when scanned by sentences
def freq_dist_pos_complex_by_sent(doc, grammar, threshold): #todo: handle a regex for the POS; threshold is currently unused
    sentences = nltk.sent_tokenize(doc)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    #use simple adjective and noun, and noun -> proper noun
    #grammar = "CHUNK: {<DT>?<JJ><NN|NP|NNP|NNS|NNPS>?<NN|NP|NNP|NNS|NNPS>}"
    rows = []
    cp = nltk.RegexpParser(grammar)
    for sent in sentences:
        tagged_tokens = nltk.pos_tag(sent)
        tree = cp.parse(tagged_tokens)

        for subtree in tree.subtrees():
            if subtree.label() == 'CHUNK':
                #join words into one n-gram (depending upon how many matched the regex)
                leaves = subtree.leaves()
                rows.append({'phrase': " ".join(leaf[0] for leaf in leaves),
                             'pos': 'CHUNK', 'leaf_count': len(leaves)})
    #build the frame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=['phrase', 'pos', 'leaf_count'])

In [None]:
grammar = "CHUNK: {<VB*>?<JJ*>?<NN|NP|NNP|NNS|NNPS>?<NNP>+}"
import random
random.seed(1234)
idx = np.arange(df_covid.shape[0])
random.shuffle(idx)
frames = []
for i_d, (abstr, p_id) in df_covid.loc[idx[0:1000],['abstract','paper_id']].iterrows():
    df = freq_dist_pos_complex_by_sent(abstr, grammar, 2)
    df['p_id'] = p_id
    frames.append(df)
#DataFrame.append was removed in pandas 2.0, so concatenate the per-abstract frames once
df_imp_pos = pd.concat(frames, sort=False)[['p_id','pos','phrase','leaf_count']]
df_imp_pos.head()

In [None]:
data_plot = df_imp_pos.loc[df_imp_pos['leaf_count']>=2, ].groupby('phrase')\
    .p_id.nunique().reset_index()\
    .sort_values(by=['p_id'], ascending=False).head(30)
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(1,1, figsize=(8,6))
ax.barh(data_plot['phrase'], data_plot['p_id'])
ax.invert_yaxis()
ax.set_xlabel("Occurrence of sci. phrases")
ax.set_title("Occurrence of top 30 sci. phrases built around proper nouns across a sample of 1000 abstracts")

From the distribution we have been able to extract some useful phrases such as 'viral RNA', 'genomic RNA', 'real-time RT-PCR', 'quantitative PCR' etc. (Apparently, close to 1% of the abstracts in our sub-sample refer to the RT-PCR/PCR process.) Some of these may refer to names of specific scientific procedures, and it might be useful to look at what these abstracts commonly refer to.

In [None]:
grammar = "CHUNK: {<VBD|VBP>?<JJ*>?<NN|NP|NNP|NNS|NNPS>?<NNP>+}"
phrase_to_match = "real-time RT-PCR"
for p_id in np.unique(df_imp_pos.loc[df_imp_pos['phrase']==phrase_to_match, ['p_id']].values):
    print("_________________")
    doc = df_covid.loc[df_covid['paper_id'] == p_id, ['abstract']].values[0][0]
    sentences = nltk.sent_tokenize(doc)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    for sent in sentences:
        tagged_tokens = nltk.pos_tag(sent)
        cp = nltk.RegexpParser(grammar)
        tree = cp.parse(tagged_tokens)
        for subtree in tree.subtrees():
            if subtree.label() == 'CHUNK':
                #join words into one n-gram (depending upon how many matched the regex)
                word = ''
                leaf_count = len(subtree.leaves())
                for leaf in subtree.leaves():
                    word += leaf[0] + " "  
                if(phrase_to_match.strip() == word.strip()):
                    print(" ".join(sent))

The output above shows the different types of findings reported with the RT-PCR (reverse transcription polymerase chain reaction) method. The same approach could back a **graphical interface** that lets researchers navigate from an important scientific phrase to the sentences in which it is used. We shall attempt to do that in a subsequent notebook.
One interesting observation is that certain phrases signal that an experiment or procedure produced a finding.
It would be helpful to be able to extract these automatically from the abstracts. Let's take the same example and try to grab such claims from the text.

For this experiment, we will need to pivot on a word. We will choose 'reveal' as a word that commonly introduces findings or claims.
We will use the POS tags to locate a verb (past or present tense), since 'reveal' is one such POS.
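
The pivot-word test can be sketched in isolation. The `similarity` helper used in the next cell is not shown in this section, so the example below swaps in `difflib.SequenceMatcher` from the standard library as an assumed stand-in; the function names and the hand-tagged sentences are hypothetical.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Character-level similarity in [0, 1]; an assumed stand-in for similarity()."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def sentence_mentions_pivot(tagged_sentence, pivot_word, min_sim=0.8):
    """True if any past- or present-tense verb (VBD/VBP) resembles the pivot word."""
    for word, tag in tagged_sentence:
        if tag in ("VBD", "VBP") and string_similarity(word, pivot_word) >= min_sim:
            return True
    return False

# Hand-tagged sentences (hypothetical); the notebook obtains tags from nltk.pos_tag
finding = [("Sequencing", "NN"), ("revealed", "VBD"), ("two", "CD"), ("mutations", "NNS")]
method = [("Samples", "NNS"), ("were", "VBD"), ("collected", "VBN")]

finding_hit = sentence_mentions_pivot(finding, "reveal")
method_hit = sentence_mentions_pivot(method, "reveal")
print(finding_hit, method_hit)
```

A character-level measure lets inflected forms such as "revealed" match the pivot "reveal", but it cannot recognize synonyms; a vector-based similarity would behave differently.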

In [None]:
def find_claims_in_abstracts(phrase_to_match, pivot_word="reveal"):
    grammar = "CHUNK: {<VBD|VBP>+}"
    for p_id in np.unique(df_imp_pos.loc[df_imp_pos['phrase']==phrase_to_match, ['p_id']].values):
        print("_________________")
        doc = df_covid.loc[df_covid['paper_id'] == p_id, ['abstract']].values[0][0]
        sentences = nltk.sent_tokenize(doc)
        sentences = [nltk.word_tokenize(sent) for sent in sentences]
        for sent in sentences:
            tagged_tokens = nltk.pos_tag(sent)
            cp = nltk.RegexpParser(grammar)
            tree = cp.parse(tagged_tokens)
            for subtree in tree.subtrees():
                if subtree.label() == 'CHUNK':
                    #flag the sentence if any verb in the chunk scores at least 0.8 against the pivot word
                    match = False
                    for leaf in subtree.leaves():
                        sim = similarity(leaf[0], pivot_word)
                        if sim >= .8:
                            match = True
                            break
                    if match:
                        print(" ".join(sent))

Let's examine the result for the word 'reveal'

In [None]:
find_claims_in_abstracts("real-time RT-PCR", "reveal")

We have correctly identified the claims / statements referring to findings for abstracts that primarily talk about the "real-time RT-PCR" procedure.

In [None]:
find_claims_in_abstracts("real-time RT-PCR", "discover")

We have even better insights when using the word 'discover'.

In [None]:
find_claims_in_abstracts("viral RNA", "reveal")

In [None]:
find_claims_in_abstracts("viral RNA", "detect")

In [None]:
find_claims_in_abstracts("viral RNA", "discover")

Again, the information seems to provide some clear insights about the findings from various experiments conducted. Use of different pivot words seems to provide slightly different but useful sets of information.
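
Since each pivot word surfaces a somewhat different set of sentences, the results can be merged to see which sentences are shared and which are unique to one pivot. The sketch below assumes the matched sentences have been collected into lists rather than printed; the dictionary contents and the name `merge_pivot_matches` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sentences collected per pivot word (not actual notebook output)
matches_by_pivot = {
    "reveal": [
        "Sequencing revealed two mutations .",
        "RT-PCR revealed that viral RNA persisted .",
    ],
    "detect": [
        "We detected viral RNA in all samples .",
    ],
    "discover": [
        "We discovered a novel strain and detected viral RNA .",
        "Sequencing revealed two mutations .",
    ],
}

def merge_pivot_matches(by_pivot):
    """Invert {pivot: [sentence, ...]} into {sentence: {pivot, ...}}."""
    pivots_by_sentence = defaultdict(set)
    for pivot, sentences in by_pivot.items():
        for sentence in sentences:
            pivots_by_sentence[sentence].add(pivot)
    return dict(pivots_by_sentence)

merged = merge_pivot_matches(matches_by_pivot)

# Sentences supported by several pivot words come first
for sentence, pivots in sorted(merged.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print(sorted(pivots), sentence)
```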