# What do we know about COVID-19 risk factors?
*About this Notebook*

In this notebook, we answer the task questions using a lexical model (TF-IDF). We also cluster the literature to better understand the domains the articles cover and to target searches at a specific domain. We work on paragraphs rather than whole documents, which gave better results for both question answering and literature clustering. The end result, however, is a document, not a paragraph: paragraphs are used only to build the model. The original dataset contains around 50k papers, and many of them are not specifically about COVID-19. The dataset used here was obtained by filtering the original 'CORD-19-research-challenge' dataset for COVID-19 papers/articles, using [this](https://www.kaggle.com/massiq/doctovec) notebook. [This](https://www.kaggle.com/maksimeren/covid-19-literature-clustering) notebook and [this](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089) article helped us with the literature clustering and the lexical model, respectively. Combining clustering with a TF-IDF information retrieval system makes it easier to understand and explore the articles/papers of interest.

*PROS*: 
*     It is a very simple and efficient method.
*     It makes it easy to compute the similarity between documents.
*     It is very effective at retrieving descriptive documents.
*     It underlies the ranking in many search engines.
    
*CONS*: 
*     This method is based on the bag-of-words (BoW) model, so it does not capture word position, semantics, or co-occurrence across documents.

Because our scenario leans towards a lexical approach, this model should work well here.
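
To make the bag-of-words limitation concrete, here is a minimal sketch. The helper `bag_of_words` and the two sentences are illustrative assumptions, not part of the notebook's pipeline.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

a = bag_of_words("smoking increases risk of severe disease")
b = bag_of_words("severe disease increases risk of smoking")

# Both sentences contain the same words, so their bags are identical
# even though the sentences do not mean the same thing.
same_bag = a == b
print(same_bag)
```

Any model built only on these counts, TF-IDF included, cannot tell the two sentences apart.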

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        os.path.join(dirname, filename)

# Any results you write to the current directory are saved as output.

Let's see where our dataset resides.

In [None]:
!ls /kaggle/input//covid19-filtered-dataset/

Importing some basic libraries which will be helpful in coming operations.

In [None]:
import numpy as np 
import pandas as pd 
import glob
import json

import matplotlib.pyplot as plt
plt.style.use('ggplot')

Read dataset file.

In [None]:
meta_df = pd.read_csv('/kaggle/input//covid19-filtered-dataset/covid_19_full_text_files.csv', dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

You can inspect the dataset's meta information (columns, dtypes, non-null counts) with the `.info()` method.

In [None]:
meta_df.info()

Before moving on to information extraction with TF-IDF, let's cluster the data to get an overall picture of the dataset. First we preprocess the text:
* Remove punctuation.
* Convert to lowercase.

For now, we apply only these two operations.

In [None]:
import re,string

meta_df['text'] = meta_df['text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
meta_df.head()

In [None]:
def lower_case(input_str):
    input_str = input_str.lower()
    return input_str

meta_df['text'] = meta_df['text'].apply(lambda x: lower_case(x))
meta_df.head()

**Optional:**
As we do not need all the columns, we keep only the main ones so the results are easier to read and compare.

In [None]:
text = meta_df.drop(["source_x","pmcid","pubmed_id","license","publish_time","Microsoft Academic Paper ID","WHO #Covidence","has_pdf_parse","has_pmc_xml_parse","full_text_file","tag_disease_covid19"], axis=1)
text.head()

The data may contain duplicate entries; it is better to drop those rows.

In [None]:
text.drop_duplicates(['text'], inplace=True)
len(text)

It dropped two rows. Let's look at the current count.

In [None]:
text['text'].describe(include='all')

**Point to be noted**
This is the main operation, and it plays an important role in getting better results in this notebook. An article or research paper does not describe a single topic but a collection of closely related topics. Our dataset is mainly about COVID-19, and every document describes different aspects of it. A paragraph, on the other hand, is more specific: it usually covers a single aspect of a topic. So it is better to look deeper into the documents and treat every paragraph as a document of its own. Each document then has a set of paragraph documents that point back to it, and all other details (title, abstract, etc.) are the same for every paragraph of a given document. We therefore expect better results if we split every document into paragraphs and operate on paragraphs rather than on whole documents. You will understand this better once you see the results. So, let's convert the documents.
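
As an illustration of the idea, the split can also be sketched with `str.split` and `DataFrame.explode`. The frame `docs` and its values are made up for this example; it is an alternative sketch, not the conversion used in this notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `text` frame (values are made up).
docs = pd.DataFrame({
    "paper_id": ["p1", "p2"],
    "title": ["Paper one", "Paper two"],
    "text": ["intro paragraph\nmethods paragraph", "single paragraph"],
})

# Split each body on newlines, then give every paragraph its own row.
# The document-level columns are repeated on each paragraph row.
paragraphs = (
    docs.assign(text=docs["text"].str.split("\n"))
        .explode("text")
        .reset_index(drop=True)
)
print(paragraphs)
```

Each paragraph row keeps the `paper_id` and `title` of its source document, which is what lets a paragraph hit be traced back to a paper later.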

In [None]:
columns = ['cord_uid','sha','paper_id','doi','journal','title','authors','abstract','text','url']
frames = []
for index, bodyText in text.iterrows():
    # One row per paragraph; the document-level metadata is repeated on each row.
    rows = []
    for par in bodyText['text'].split('\n'):
        row = bodyText[columns].to_dict()
        row['text'] = par
        rows.append(row)
    frames.append(pd.DataFrame(rows, columns=columns))
# DataFrame.append was removed in pandas 2.0, so concatenate all frames once.
para_list = pd.concat(frames, ignore_index=True)
print("length: " + str(len(para_list)))

You can see that the 1737 documents are now converted into 53701 paragraph documents. Let's check the data.

In [None]:
para_list.head()

Paragraphs that come from the same document share the rest of the information, i.e. title, abstract, etc.
Documents also contain headings, notes and empty lines. These are not of interest because they are very short and their content also appears in the paragraphs. So we eliminate rows whose text is no longer than 65 characters (threshold: (max words in titles/headings * average characters in a word) + spaces).

In [None]:
para_list=para_list[para_list['text'].map(len) > 65]
len(para_list)

Now the dataset is reduced to 13404 rows. Let's take a look at it.

In [None]:
para_list.head()

From here on, we start the clustering-related operations. Note that we work on the body text only because it is our main concern, but the title and abstract could be added later to improve the results.

First we need to vectorize the text (convert it into vector form). We use the TF-IDF vectorizer for this.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=2**12)
X = vectorizer.fit_transform(para_list['text'].values)
X.shape

Let's get our labels (which text falls under which cluster). We use MiniBatchKMeans to cluster the vectorized text because it is faster on larger datasets (plain KMeans also works but is a bit slower). The number of clusters is 20.
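
K-means starts from a random initialization, so cluster numbers can change from run to run. The sketch below shows one way to pin them with `random_state`. The toy paragraphs, `k=2`, and the seed are assumptions for illustration only.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

toy_paragraphs = [
    "pregnant women and neonates outcomes",
    "neonates born to infected pregnant women",
    "smoking and pulmonary disease risk",
    "pulmonary disease in patients with smoking history",
]

toy_X = TfidfVectorizer().fit_transform(toy_paragraphs)

# random_state fixes the initialization so the labels are stable across runs.
toy_kmeans = MiniBatchKMeans(n_clusters=2, random_state=42, n_init=3)
toy_labels = toy_kmeans.fit_predict(toy_X)
print(toy_labels)
```

Without a fixed seed, a statement such as "cluster 18 dominates" only holds for the run that produced it.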

In [None]:
from sklearn.cluster import MiniBatchKMeans

k = 20
kmeans = MiniBatchKMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
y = y_pred
y

Now that we have the labels, we can plot them. The vectorizer produces high-dimensional vectors, so we first need to reduce them to three dimensions. For this we use PCA, which projects the data onto the directions of greatest variance. Note that `X.toarray()` converts the sparse matrix into a dense one, which can use a lot of memory on larger datasets.
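
If the dense conversion becomes a memory problem, one possible alternative is `TruncatedSVD`, which accepts sparse input directly. This is a sketch on made-up paragraphs, not the method used in this notebook; unlike PCA, `TruncatedSVD` does not center the data, so the projections are not identical.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

toy_paragraphs = [
    "pregnant women and neonates outcomes",
    "neonates born to infected pregnant women",
    "smoking and pulmonary disease risk",
    "pulmonary disease in patients with smoking history",
    "incubation period and serial interval of the virus",
]

toy_X = TfidfVectorizer().fit_transform(toy_paragraphs)  # sparse matrix

# TruncatedSVD works on the sparse matrix, so no .toarray() call is needed.
svd = TruncatedSVD(n_components=3, random_state=0)
toy_3d = svd.fit_transform(toy_X)
print(toy_3d.shape)
```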

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pca_result = pca.fit_transform(X.toarray())

The results are easier to inspect in a 3-dimensional plot, so let's draw one:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

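# gca(projection=...) was removed in Matplotlib 3.6; add_subplot creates the 3D axes instead.
ax = plt.figure(figsize=(16,10)).add_subplot(projection='3d')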
ax.scatter(
    xs=pca_result[:,0], 
    ys=pca_result[:,1], 
    zs=pca_result[:,2], 
    c=y, 
    cmap='tab20'  # 20 distinct colors, one per cluster (tab10 would reuse colors)
)
ax.set_xlabel('pca-one')
ax.set_ylabel('pca-two')
ax.set_zlabel('pca-three')
plt.title("PCA Covid-19 Articles (paragraph) (3D) - Clustered (K-Means,k=20) - Tf-idf with Plain Text")
# plt.savefig("plots/pca_covid19_label_TFID_3d.png")
plt.show()

You can see in the graph that six or seven clusters dominate the others, which suggests that the paragraph documents in those clusters are closely related to each other.

Now we will add those clusters/labels to our dataset to increase our understanding.

In [None]:
para_list['Cluster']=y
para_list.head()

From here, we start working on our TF-IDF model. Special thanks to William Scott, who wrote a complete TF-IDF model (https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089). It helped us a lot. We adapted it to our requirements and changed some parts to fit our method.

First we preprocess the paragraph documents. We already did some preprocessing, but that was for clustering. Since we are now extracting information rather than just clustering, deeper preprocessing is needed so that every text follows the same format and rules.

In [None]:
import nltk
import os
import string
import numpy as np
import copy
import pandas as pd
import pickle
import re
import math

nltk.download("popular")
!pip install num2words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from collections import Counter
from num2words import num2words



def remove_stop_words(data):
    stop_words = stopwords.words('english')
    words = word_tokenize(str(data))
    new_text = ""
    for w in words:
        if w not in stop_words and len(w) > 1:
            new_text = new_text + " " + w
    return new_text

def stemming(data):
    stemmer= PorterStemmer()
    
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        new_text = new_text + " " + stemmer.stem(w)
    return new_text
def convert_numbers(data):
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        try:
            w = num2words(int(w))
        except Exception:
            pass  # not a plain integer token; keep it unchanged
        new_text = new_text + " " + w
    new_text = np.char.replace(new_text, "-", " ")
    return new_text

def convert_lower_case(data):
    return np.char.lower(data)

def remove_punctuation(data):
    symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"
    for i in range(len(symbols)):
        data = np.char.replace(data, symbols[i], ' ')
        data = np.char.replace(data, "  ", " ")
    data = np.char.replace(data, ',', '')
    return data

def remove_apostrophe(data):
    return np.char.replace(data, "'", "")


def preprocess(data):
    data = convert_lower_case(data)
    data = remove_punctuation(data) # commas are removed separately inside this function
    data = remove_apostrophe(data)
    data = remove_stop_words(data)
    data = convert_numbers(data)
    data = stemming(data)
    data = remove_punctuation(data)
    data = convert_numbers(data)
    data = stemming(data) # needed again because the converted number words must be stemmed too
    data = remove_punctuation(data) # needed again because num2words emits hyphens and commas, e.g. forty-one
    data = remove_stop_words(data) # needed again because num2words emits stop words, e.g. 101 -> one hundred and one
    return data

In [None]:
processed_text=[]
i=0
for t in para_list['text']:
    processed_text.append(word_tokenize(str(preprocess(t))))
    if i%1000==0:
        print("text: "+str(i))
    i+=1

print(len(processed_text))

Let's define these two terms so that everyone can follow what is going on.
* **TF of a document:** the term frequency, i.e. how many times a word appears in a document. In this notebook the count is divided by the number of words in the document.
* **IDF:** the inverse document frequency, i.e. the inverse of the number of documents in which a word appears. The fewer documents a word appears in, the higher its score. It indicates how distinctive a word is for a document.
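
The two definitions can be checked by hand on a tiny corpus. The sketch below uses the same smoothed formulas as this notebook, `tf = count / words_in_doc` and `idf = log((N + 1) / (df + 1))`. The corpus and the names `toy_docs`, `toy_df` and `tf_idf_score` are made up for illustration.

```python
import math
from collections import Counter

toy_docs = [
    ["pregnant", "women", "risk"],
    ["smoking", "risk", "risk"],
    ["smoking", "disease"],
]
n_docs = len(toy_docs)

# Document frequency: the number of documents that contain each term.
toy_df = Counter(term for doc in toy_docs for term in set(doc))

def tf_idf_score(term, doc):
    """TF-IDF of `term` in `doc`, with the same smoothing as the notebook."""
    tf = doc.count(term) / len(doc)
    idf = math.log((n_docs + 1) / (toy_df[term] + 1))
    return tf * idf

print(tf_idf_score("risk", toy_docs[0]))
print(tf_idf_score("pregnant", toy_docs[0]))
```

In the first document, "pregnant" scores higher than "risk" although both occur once, because "pregnant" appears in fewer documents.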

To learn more about the approach, you can follow William's article. 
Let's first compute the document frequency (DF) of every word.

In [None]:
N=len(processed_text)
DF = {}

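for i in range(N):
    tokens = processed_text[i]
    for w in tokens:
        # Collect the ids of the documents that contain each word.
        DF.setdefault(w, set()).add(i)
for i in DF:
    # Replace each id set with its size: the document frequency.
    DF[i] = len(DF[i])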

Now we have the document frequency of every unique word in our dataset. Let's store the number of unique words, which is our vocabulary size.

In [None]:
total_vocab_size = len(DF)
total_vocab_size

Let's store our vocabulary in a separate place.

In [None]:
total_vocab = [x for x in DF]
total_vocab[:20]

We define a function that returns the document frequency of a word. We will use it to look up the document frequency at runtime and invert it to obtain the IDF.

In [None]:
def doc_freq(word):
    # Words that are not in the vocabulary have a document frequency of 0.
    return DF.get(word, 0)

Now that we can compute both TF and IDF, we can calculate the TF-IDF score.

In [None]:
from collections import Counter
doc = 0

tf_idf = {}

for i in range(N):
    
    tokens = processed_text[i]
    
    counter = Counter(tokens)
    words_count = len(tokens)
    
    for token in np.unique(tokens):
        
        tf = counter[token]/words_count
        df = doc_freq(token)
        idf = np.log((N+1)/(df+1))
        
        tf_idf[doc, token] = tf*idf

    doc += 1
    
len(tf_idf)

Now that we have the TF-IDF score of every word in every document, we can score a document against a query by adding up the TF-IDF scores of the query words that appear in it. Finally, we sort the documents by score and take the top k.
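
The ranking step can be sketched on a hand-made weight table. The dictionary `toy_tf_idf` mirrors the `(document, word)` key layout of `tf_idf`, but its weights are invented, and `toy_matching_score` is a hypothetical helper for this example only.

```python
from collections import defaultdict

# Hypothetical (doc_id, term) -> TF-IDF weights, keyed like `tf_idf`.
toy_tf_idf = {
    (0, "pregnant"): 0.23, (0, "women"): 0.23, (0, "risk"): 0.10,
    (1, "smoking"): 0.10, (1, "risk"): 0.19,
    (2, "pregnant"): 0.12, (2, "disease"): 0.35,
}

def toy_matching_score(k, query_tokens):
    """Sum the weights of the query terms per document, return the top k ids."""
    scores = defaultdict(float)
    for (doc_id, term), weight in toy_tf_idf.items():
        if term in query_tokens:
            scores[doc_id] += weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

top_docs = toy_matching_score(2, ["pregnant", "women"])
print(top_docs)
```

Documents that contain none of the query terms never enter `scores`, so they cannot appear in the ranking.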

In [None]:
def matching_score(k, query):
    tokens = word_tokenize(str(query))

    print("Matching Score")
    print("\nQuery:", query)
    print("")
    print(tokens)
    
    query_weights = {}

    for key in tf_idf:
        
        if key[1] in tokens:
            try:
                query_weights[key[0]] += tf_idf[key]
            except:
                query_weights[key[0]] = tf_idf[key]
    
    query_weights = sorted(query_weights.items(), key=lambda x: x[1], reverse=True)

    print("")
    
    l = []
    
    for i in query_weights[:k]:
        l.append(i[0])
    
    print(l)
    print(para_list.iloc[l[0]])
    

matching_score(10, "pregnant woman")

This approach to finding query-related documents is simple and works for short queries, but it does not work well when the query becomes long, e.g. 5, 15 or 20 words. That is a problem in our scenario because users may give longer queries, so we switch to cosine similarity.
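
One property that cosine similarity adds is that it compares the direction of two vectors rather than their magnitude. The sketch below illustrates this on made-up vectors; `safe_cosine` is a hypothetical helper that also guards against all-zero vectors.

```python
import numpy as np

def safe_cosine(a, b):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

query = np.array([1.0, 1.0, 0.0])
short_doc = np.array([1.0, 1.0, 0.0])
long_doc = np.array([5.0, 5.0, 0.0])   # same direction, larger magnitude
unrelated = np.array([0.0, 0.0, 2.0])  # shares no terms with the query

print(safe_cosine(query, short_doc), safe_cosine(query, long_doc))
print(safe_cosine(query, unrelated), safe_cosine(query, np.zeros(3)))
```

The short and the long document get the same score because they point in the same direction, while the unrelated and the empty vector score zero.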

Let's define our cosine similarity function.

In [None]:
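def cosine_sim(a, b):
    denom = np.linalg.norm(a)*np.linalg.norm(b)
    # An all-zero vector would give NaN, and argsort ranks NaN last (i.e. as a top hit).
    if denom == 0:
        return 0.0
    return np.dot(a, b)/denom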

In [None]:
len(tf_idf)

Next we build the document matrix `D`, with one TF-IDF vector per paragraph document, and define a function that generates the vector of a given text in the same space.

In [None]:
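D = np.zeros((N, total_vocab_size))
# Map each word to its column once; list.index would rescan the vocabulary on every lookup.
vocab_index = {word: ind for ind, word in enumerate(total_vocab)}
cnt=0
for i in tf_idf:
    if cnt%100000==0:
        print(cnt)
    ind = vocab_index.get(i[1])
    if ind is not None:
        D[i[0]][ind] = tf_idf[i]
    cnt+=1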

In [None]:
def gen_vector(tokens):

    Q = np.zeros((len(total_vocab)))
    
    counter = Counter(tokens)
    words_count = len(tokens)

    query_weights = {}
    
    for token in np.unique(tokens):
        
        tf = counter[token]/words_count
        df = doc_freq(token)
        idf = math.log((N+1)/(df+1))

        try:
            ind = total_vocab.index(token)
            Q[ind] = tf*idf
        except:
            pass
    return Q

Now that we have all the building blocks, here is how a search works. The query is preprocessed in the same way as every document, so that both use the same vocabulary. The query is then converted to a vector in the same space, its cosine similarity with every paragraph document vector is computed, and the top k paragraph documents with the highest similarity scores are kept.

Paragraphs that point to the same document are then merged, and the result shows the document information, i.e. title, abstract, paper_id, text, sha, url/doi, etc. The text field contains the specific paragraphs of the document that relate to your query; you can view the whole document by following the url/doi given in the result.

A 'Cluster' field is also included in the output, so you can see which cluster each document falls in. You can also find the dominating cluster in the results and then search for and read other documents from that cluster.
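
For comparison, the same search-and-merge flow can be sketched with scikit-learn's built-in vectorizer and cosine similarity. This is an illustrative alternative on made-up data, not the implementation used in this notebook; `toy`, `toy_search` and the `" *** "` separator are assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine

toy = pd.DataFrame({
    "paper_id": ["p1", "p1", "p2", "p3"],
    "text": [
        "pregnant women infected with the virus",
        "outcomes for neonates of pregnant women",
        "smoking and pulmonary disease",
        "incubation period and serial interval",
    ],
})

vec = TfidfVectorizer()
para_vectors = vec.fit_transform(toy["text"])

def toy_search(query, k=2):
    # Project the query into the same TF-IDF space as the paragraphs.
    sims = sk_cosine(vec.transform([query]), para_vectors).ravel()
    top = toy.assign(score=sims).nlargest(k, "score")
    # Merge the top paragraphs that point to the same paper.
    return top.groupby("paper_id", sort=False).agg(
        text=("text", " *** ".join), score=("score", "max")
    ).reset_index()

hits = toy_search("pregnant women")
print(hits)
```

Both top paragraphs belong to paper `p1`, so they are returned as a single row whose text joins the two paragraphs.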

In [None]:
def cosine_similarity(k, query):
    # Preprocess the query exactly as the paragraphs were preprocessed.
    preprocessed_query = preprocess(query)
    tokens = word_tokenize(str(preprocessed_query))
    query_vector = gen_vector(tokens)

    # Score every paragraph vector against the query and keep the top k.
    d_cosines = [cosine_sim(query_vector, d) for d in D]
    out = np.array(d_cosines).argsort()[-k:][::-1]

    # Merge paragraphs that belong to the same paper into one result row.
    # (DataFrame.append was removed in pandas 2.0, and the chained
    # result.loc[indx]['text'] = ... assignment only modified a copy.)
    separator = "***************************************************"
    rows = {}
    for file in out:
        row = para_list.iloc[file].to_dict()
        paper_id = row['paper_id']
        if paper_id in rows:
            rows[paper_id]['text'] += separator + row['text']
        else:
            rows[paper_id] = row
    return pd.DataFrame(list(rows.values()))

These are the questions from the task. You can see their results below.

In [None]:
pd.options.display.max_colwidth=70
questions=['Smoking, pre-existing pulmonary disease',
          'Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities',
          'Neonates and pregnant women',
          'Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences',
          'Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors',
          'Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups',
          'Susceptibility of populations',
          'Public health mitigation measures that could be effective for control']
Q = cosine_similarity(10, questions[0])
Q2 = cosine_similarity(10, questions[1])
Q3 = cosine_similarity(10, questions[2])
Q4 = cosine_similarity(10, questions[3])
Q5 = cosine_similarity(10, questions[4])
Q6 = cosine_similarity(10, questions[5])
Q7 = cosine_similarity(10, questions[6])
Q8 = cosine_similarity(10, questions[7])
Q.to_csv('./smokingPreExistingPulmonaryDisease.csv', index=False)
Q2.to_csv('./coinfectionsComorbities.csv', index=False)
Q3.to_csv('./neonatesPregnantWomen.csv', index=False)
Q4.to_csv('./socioeconomicBehavioralFactors.csv', index=False)
Q5.to_csv('./transmissionDynamicsVirus.csv', index=False)
Q6.to_csv('./severityDiseaseIncludingRisk.csv', index=False)
Q7.to_csv('./susceptibilityOfPopulations.csv', index=False)
Q8.to_csv('./publicHealthMitigationMeasures.csv', index=False)
Q8['title']

Let's see the whole result.

In [None]:
Q.head()

In the result above (in my case), the dominating clusters are 18 and 9, so you can also check the documents in both clusters.
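
Rather than reading the dominating cluster off the table by eye, it can be counted. The sketch below assumes a result frame with a `Cluster` column; the frame `toy_result` and its cluster ids are made up.

```python
import pandas as pd

# Hypothetical result frame; the cluster ids are invented for illustration.
toy_result = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "Cluster": [18, 9, 18, 18, 9],
})

# Count how many returned documents fall in each cluster.
cluster_counts = toy_result["Cluster"].value_counts()
dominant_cluster = cluster_counts.idxmax()
print(cluster_counts)
print(dominant_cluster)
```

The same two lines should work on a frame such as `Q`, as long as it carries the `Cluster` column.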

In [None]:
(para_list.loc[para_list['Cluster'] == 18]).head()