<a href="https://colab.research.google.com/github/vishrutmaheshwari/CORD19_ML/blob/master/Cord_Notebook_Team_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Mounting Google Drive (Necessary)

Mount Google Drive so that we can access data from Google Drive.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Importing Libraries (Necessary)

In [0]:
import os
import gc
import sys
import json

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import glob
import pickle
import spacy
from spacy import displacy
from spacy.matcher import Matcher
from sklearn import decomposition
from sklearn import datasets
from sklearn.cluster import KMeans

import nltk
import re
import string
from nltk.corpus import stopwords                   #Stopwords corpus
from nltk.stem import PorterStemmer                 # Stemmer
from nltk.tokenize import RegexpTokenizer

from sklearn.feature_extraction.text import CountVectorizer          
from sklearn.feature_extraction.text import TfidfVectorizer     

import gensim
from gensim.models.doc2vec import TaggedDocument


### Loading Metadata and data into DataFrame (Run only if you want to create the DataFrame again) -- *Approx Runtime = 9 mins*

Run this section only if you want to rebuild the DataFrame `c_df` and export it again as `full_data.csv`.

---
We originally loaded this data on Kaggle. To run this on Colab, make sure the dataset is available under '../input/CORD-19-research-challenge/', with the metadata stored in 'metadata.csv' inside that directory.

---
Loading Metadata



In [0]:
meta = pd.read_csv('../input/CORD-19-research-challenge/metadata.csv', low_memory=False)

Setting INPUT_DIR

In [0]:
INPUT_DIR = '../input/CORD-19-research-challenge'

Loading all documents in DataFrame c_df

In [0]:
rows = []  # collect one dict per paper, then build the DataFrame once

for pdf_file in meta[meta.pdf_json_files.notnull()].pdf_json_files:

    # Some entries have multiple paths separated by ';'
    pdf_file = pdf_file.split(';')[0] # take the first one

    with open(os.path.join(INPUT_DIR, pdf_file), 'rb') as jdata:
        pdf_data = json.load(jdata)

    rows.append({
        'paper_id': pdf_data['paper_id'],        # Paper ID
        'title': pdf_data['metadata']['title'],  # Paper Title
        # Joining all the text of the abstract together
        'abstract': "\n ".join(abstext['text'] for abstext in pdf_data['abstract']),
        # Joining all the text of the body together
        'body_text': "\n ".join(bodtext['text'] for bodtext in pdf_data['body_text']),
    })

# Build the DataFrame in one step: DataFrame.append was removed in pandas 2.0
# and is slow when called once per row
c_df = pd.DataFrame(rows, columns=['abstract', 'body_text', 'paper_id', 'title'])

c_df.to_csv('drive/My Drive/data/full_data.csv')
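The per-paper extraction can also be isolated in a small helper, which makes it easy to try on a single file. The sketch below is illustrative only: `extract_paper` is a hypothetical name, the sample record is made up, and the use of `.get()` for missing sections is an assumption (the loop above expects both `abstract` and `body_text` to be present).

```python
import json

def extract_paper(pdf_data):
    """Flatten one parsed paper dict into the four fields stored in c_df."""
    def join_sections(key):
        # .get() tolerates a paper where the section is absent (assumption)
        return "\n ".join(part["text"] for part in pdf_data.get(key, []))

    return {
        "paper_id": pdf_data["paper_id"],
        "title": pdf_data["metadata"]["title"],
        "abstract": join_sections("abstract"),
        "body_text": join_sections("body_text"),
    }

# A made-up record in the same shape as the JSON files, for illustration only
sample = json.loads("""
{"paper_id": "abc123",
 "metadata": {"title": "A sample title"},
 "abstract": [{"text": "First."}, {"text": "Second."}],
 "body_text": [{"text": "Body paragraph."}]}
""")

row = extract_paper(sample)
print(row["abstract"])
```

Each returned dict has the same keys as the columns of `c_df`, so a list of them can be passed straight to `pd.DataFrame`.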

### Loading the DataFrame from the pre-created CSV file (Run only if you want to create further DataFrames for new tf_idf or word-embedding preprocessing) -- *Approx time 15s*

In [0]:
%%time

c_df = pd.read_csv('drive/My Drive/data/full_data.csv', low_memory=False)

CPU times: user 12 s, sys: 3.2 s, total: 15.2 s
Wall time: 25.4 s


Checking if it has loaded properly

In [0]:
c_df.tail()


Unnamed: 0.1,Unnamed: 0,abstract,body_text,paper_id,title
48919,48919,,An increasing number of xenobiotics are associ...,1c2ec5f7aa7133b58659aa5403e03f14da7ba1a1,2.11 Alterations in Blood Components
48920,48920,,"data. Such has been the case, for example, wit...",7d94ed06736e171ba688a2e72a622d6d9884f88c,Epidemiology in Perspective Descriptive Epidem...
48921,48921,MB78 is a virulent phage of Salmonella typhimu...,allow phages like P22 and 9NA to grow in its p...,45713f9e6085b078e4dc902790cf3e5cdca43151,Synthesis of a bacteriophage MB78 late protein...
48922,48922,,Guide d'ORL © 2016 Elsevier Masson SAS. Tous d...,4534efb66ced62c3f0bf6490732a97194cf50635,
48923,48923,,Aminopeptidase N (CD13) is a widely expressed ...,f616fc58218452c55948631f421b119fda4f60df,


## **tf_idf**

> ### Preprocessing the data and creating a dataframe (Run only if you want to preprocess the data again) -- *Approx Runtime = 15  mins*





 Set of COVID-19-related terms




In [0]:
covid19_terms = {
    '2019-nCoV',
    '2019-nCoV.',
    '2019-n-CoV',
    '2019 n-CoV',
    '2019 ncov',
    '2019 n cov',
    '2019n cov',
    '2019ncov',
    '2019-novel Coronavirus',
    'coronavirus 2',
    'coronavirus 2019',
    'coronavirus-19',
    'covid',
    'COVID-19',
    'COVID19',
    'n cov 2019',
    'ncov 2019',
    'wuhan pneumonia',
    'wuhan virus',
    'wuhan coronavirus',
    'CoViD 19',
    'CoViD-19'
    }

Function to check if body contains covid19_terms

In [0]:
def has_covid19(text):
    for name in covid19_terms:
        if text and name.lower() in text.lower():
            return True
    return False
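A possible variant of this check compiles all terms into one case-insensitive pattern and returns `False` for non-string values (an empty cell read back from CSV arrives as a float `NaN`, which has no `.lower()`). This is a sketch under assumptions: `example_terms` is a small illustrative subset of the set above, and `has_covid19_regex` is a hypothetical name.

```python
import re

# Illustrative subset; in the notebook this would be the full covid19_terms set
example_terms = {"2019-nCoV", "COVID-19", "covid", "wuhan coronavirus"}

# One alternation over all terms; re.escape keeps '-' and '.' literal
term_pattern = re.compile(
    "|".join(re.escape(term) for term in sorted(example_terms, key=len, reverse=True)),
    re.IGNORECASE,
)

def has_covid19_regex(text):
    # Missing values are not strings, so they cannot contain a term
    if not isinstance(text, str):
        return False
    return term_pattern.search(text) is not None

print(has_covid19_regex("A case of Covid-19 pneumonia"))  # True
print(has_covid19_regex("Influenza A in swine"))          # False
print(has_covid19_regex(float("nan")))                    # False
```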

Remove docs if body text doesn't contain covid19_terms

In [0]:
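df = c_df.copy()  # work on a copy so c_df is left unchanged
df['body_has_covid19'] = df.body_text.apply(has_covid19)

# keep only the rows whose body text mentions a COVID-19 term
df = df[df['body_has_covid19']].copy()
df = df.drop(['body_has_covid19'], axis = 1)
df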

Unnamed: 0.1,Unnamed: 0,abstract,body_text,paper_id,title
3,3,The key enzyme in coronavirus polyprotein proc...,Transmissible gastroenteritis virus (TGEV) bel...,cf584e00f637cbd8f1bb35f3f09f5ed07b71aeb0,
12,12,The exact origin of the cause of the Severe Ac...,"Severe Acute Respiratory Syndrome (SARS), a ne...",59ffca26c9d1f881e4e37e7c5af626d72d858dc4,Relationship of SARS-CoV to other pathogenic R...
13,13,Because of the constant threat posed by emergi...,"Over the past two decades, technological advan...",c63c4d58d170136b8d3b5a66424b5ac3f73a92d9,Viral Discovery and Sequence Recovery Using DN...
27,27,Background: Despite the passing of more than a...,"In the spring of 2003, the Severe Acute Respir...",b499a41ad1d85a0e071b59e60065c874e534911d,Coronavirus 3CL pro proteinase cleavage sites:...
34,34,The molecular basis of severe acute respirator...,The causative agent for SARS has been identifi...,45d2d838cb3d5ae2dabd7bb7c82329b398d1c65f,BMC Infectious Diseases A human in vitro model...
...,...,...,...,...,...
48903,48903,An RT-PCR-hybridization was developed that amp...,Human coronaviruses were described initially i...,8d1498b4b95b0bf8f6d9577764fffcecd1bae360,Direct diagnosis of human respiratory coronavi...
48911,48911,,Coronaviruses (CoV) are divided into four gene...,2e7f047e863ea96cf05fe606a7e8163d3217e1ab,Plausible mechanisms of Niclosamide as an anti...
48912,48912,,This appendix has been provided by the authors...,a243da5abdf029948b1fa8beb600db0ab124da63,Supplementary appendix
48914,48914,,A new emerged human coronavirus is reported in...,fbd0883af84e01bf3f6c874e9cb863934995a113,"Anti-HCV, nucleotide inhibitors, repurposing a..."


Preprocessing

In [0]:
from nltk.corpus import stopwords
nltk.download('stopwords')
ps = PorterStemmer()
stop_words = stopwords.words('english')
stop_words.extend(['copyright','peer','holder','preprint','author','funder','https','doi','org','reviewed',
                    'bioRxiv','et','found','sought','wondered','reserved','rights','reuse',
                    'permission','allowed','without','right','using','http','although','create','however',
                    'cc','by','license','biorxiv','medrxiv','fig','http','1101','et','al','figure',
                    'nc','nd','4','0','international','made','available','under','for','this',
                    'preprint','which','was','not','10','2020','01','30','927574','ma','02','03',
                    'could','non','within','il','three','value','shown','pcr','rnl','ml','min','mm','new','would',
                    'related','thus','per','site','like','known','may','also','used','know','e','g','settings','usefulness'])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
# Preprocessing: split into sentences, convert to lower case, remove
# non-alphanumeric characters and stop words, then stem
def preprocess(sentences):
  temp = []
  sentences = sentences.split('.')
  for sentence in sentences:
      # convert to lower case
      sentence = sentence.lower()
      tokenizer = RegexpTokenizer(r'\w+')
      # remove punctuations
      tokens = tokenizer.tokenize(sentence)
      # remove stopwords and perform stemming
      words = [ps.stem(word) for word in tokens if word not in stop_words] 
      t = ' '.join(words)
      temp.append(t)
  return temp      
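One thing to keep in mind with `sentences.split('.')` is that it also splits inside decimals such as "33.1 kDa". A possible alternative, shown here only as a sketch, splits where sentence punctuation is followed by whitespace. `split_sentences` is a hypothetical helper, not part of the notebook, and it will still split after abbreviations such as "et al.".

```python
import re

def split_sentences(text):
    # Split only where '.', '!' or '?' is followed by whitespace,
    # so a decimal such as "33.1" stays in one piece
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = "The proteinase is 33.1 kDa. It processes pp1a."
print(text.split("."))        # plain split: the decimal is cut in two
print(split_sentences(text))  # regex split: two sentences
```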

Applying preprocessing on dataframe

In [0]:
%%time 

df['body_preprocessed'] = df.body_text.apply(preprocess)

CPU times: user 5min 33s, sys: 491 ms, total: 5min 34s
Wall time: 5min 34s


Export DataFrame into csv

In [0]:
df.to_csv("drive/My Drive/data/tf_idf/df_for_tf_idf.csv")

> ### Loading DataFrame from pre-created csv (Necessary to run the main algorithm)

In [0]:
df = pd.read_csv('drive/My Drive/data/tf_idf/df_for_tf_idf.csv', low_memory=False)
df = df.drop(df.columns[0], axis=1)
df

Unnamed: 0,Unnamed: 0.1,abstract,body_text,paper_id,title,body_preprocessed
0,3,The key enzyme in coronavirus polyprotein proc...,Transmissible gastroenteritis virus (TGEV) bel...,cf584e00f637cbd8f1bb35f3f09f5ed07b71aeb0,,['transmiss gastroenter viru tgev belong coron...
1,12,The exact origin of the cause of the Severe Ac...,"Severe Acute Respiratory Syndrome (SARS), a ne...",59ffca26c9d1f881e4e37e7c5af626d72d858dc4,Relationship of SARS-CoV to other pathogenic R...,['sever acut respiratori syndrom sar newli ide...
2,13,Because of the constant threat posed by emergi...,"Over the past two decades, technological advan...",c63c4d58d170136b8d3b5a66424b5ac3f73a92d9,Viral Discovery and Sequence Recovery Using DN...,['past two decad technolog advanc molecular bi...
3,27,Background: Despite the passing of more than a...,"In the spring of 2003, the Severe Acute Respir...",b499a41ad1d85a0e071b59e60065c874e534911d,Coronavirus 3CL pro proteinase cleavage sites:...,['spring 2003 sever acut respiratori syndrom s...
4,34,The molecular basis of severe acute respirator...,The causative agent for SARS has been identifi...,45d2d838cb3d5ae2dabd7bb7c82329b398d1c65f,BMC Infectious Diseases A human in vitro model...,['caus agent sar identifi novel coronaviru 1 2...
...,...,...,...,...,...,...
9840,48903,An RT-PCR-hybridization was developed that amp...,Human coronaviruses were described initially i...,8d1498b4b95b0bf8f6d9577764fffcecd1bae360,Direct diagnosis of human respiratory coronavi...,['human coronavirus describ initi patient infe...
9841,48911,,Coronaviruses (CoV) are divided into four gene...,2e7f047e863ea96cf05fe606a7e8163d3217e1ab,Plausible mechanisms of Niclosamide as an anti...,['coronavirus cov divid four genera α β cov in...
9842,48912,,This appendix has been provided by the authors...,a243da5abdf029948b1fa8beb600db0ab124da63,Supplementary appendix,['appendix provid author give reader addit inf...
9843,48914,,A new emerged human coronavirus is reported in...,fbd0883af84e01bf3f6c874e9cb863934995a113,"Anti-HCV, nucleotide inhibitors, repurposing a...",['emerg human coronaviru report decemb 2019 wu...
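As the table shows, `body_preprocessed` comes back from the CSV as the string form of a Python list rather than as a list. One way to recover the list is `ast.literal_eval` from the standard library; splitting the string on `', '` by hand leaves the quote characters attached to each item. The cell value below is a shortened, illustrative example.

```python
import ast

# What a body_preprocessed cell looks like after the CSV round trip: one string
cell = "['transmiss gastroenter viru', 'coronavirus largest rna viral genom', '1991 cavanagh 1997']"

sentences = ast.literal_eval(cell)  # back to a real list of str
print(type(sentences).__name__, len(sentences))
print(sentences[0])

# Manual splitting keeps the surrounding quotes on every item
manual = cell.strip("][").split(", ")
print(manual[0])
```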


In [0]:
df.tail()

Unnamed: 0,abstract,body_text,paper_id,title
9840,An RT-PCR-hybridization was developed that amp...,['human coronavirus describ initi patient infe...,8d1498b4b95b0bf8f6d9577764fffcecd1bae360,Direct diagnosis of human respiratory coronavi...
9841,,['coronavirus cov divid four genera α β cov in...,2e7f047e863ea96cf05fe606a7e8163d3217e1ab,Plausible mechanisms of Niclosamide as an anti...
9842,,['appendix provid author give reader addit inf...,a243da5abdf029948b1fa8beb600db0ab124da63,Supplementary appendix
9843,,['new emerg human coronaviru report decemb 201...,fbd0883af84e01bf3f6c874e9cb863934995a113,"Anti-HCV, nucleotide inhibitors, repurposing a..."
9844,,['aminopeptidas n cd13 wide express ectoenzym ...,f616fc58218452c55948631f421b119fda4f60df,


In [0]:
df.iat[0, 1]

"['transmiss gastroenter viru tgev belong coronavirida famili posit strand rna virus', 'coronavirus largest rna viral genom known date 28 500 nucleotid case tgev share similar genom organ common transcript translat strategi arterivirida den boon et al', '1991 cavanagh 1997', 'tgev infect associ sever often fatal diarrhoea young pig review see enjuan van der zeijst 1995 saif wesley 1999', 'viral protein requir tgev genom replic transcript encod replicas gene eleouet et al', '1995 penz et al', '2001', 'gene encod two replic polyprotein pp1a 447 kda pp1ab 754 kda process viru encod proteinas produc function subunit replic complex review ziebuhr et al', '2000', 'central c proxim region pp1a pp1ab process 33', '1 kda viral cystein proteinas call main proteinas pro altern thè 3c like proteinas 3cl pro', 'name 3c like proteinas introduc origin similar substrat speci citi coronaviru pro picornaviru 3c proteinas 3c pro identi cation cystein princip catalyt residu context predict two b barrel fo

> ### Main Algorithm -- *Approx Runtime = 15 mins*

Defining Pre-processor

In [0]:
from nltk.corpus import stopwords
nltk.download('stopwords')
ps = PorterStemmer()
stop_words = stopwords.words('english')
stop_words.extend(['copyright','peer','holder','preprint','author','funder','https','doi','org','reviewed',
                    'bioRxiv','et','found','sought','wondered','reserved','rights','reuse',
                    'permission','allowed','without','right','using','http','although','create','however',
                    'cc','by','license','biorxiv','medrxiv','fig','http','1101','et','al','figure',
                    'nc','nd','4','0','international','made','available','under','for','this',
                    'preprint','which','was','not','10','2020','01','30','927574','ma','02','03',
                    'could','non','within','il','three','value','shown','pcr','rnl','ml','min','mm','new','would',
                    'related','thus','per','site','like','known','may','also','used','know','e','g','settings','usefulness'])

# Preprocessing: split into sentences, convert to lower case, remove
# non-alphanumeric characters and stop words, then stem
def preprocess(sentences):
  temp = []
  sentences = sentences.split('.')
  for sentence in sentences:
      # convert to lower case
      sentence = sentence.lower()
      tokenizer = RegexpTokenizer(r'\w+')
      # remove punctuations
      tokens = tokenizer.tokenize(sentence)
      # remove stopwords and perform stemming
      words = [ps.stem(word) for word in tokens if word not in stop_words] 
      t = ' '.join(words)
      temp.append(t)
  return temp      

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Defining Queries 

In [0]:
initial_queries = ["Identification of suitable animal models for testing of COVID-19 post-exposure therapies, the best laboratory animal would not only get infected and get sick, but get sick in the same way that humans do, showing a similar course of disease.", 
                   "Convalescent plasma is one of the crudest forms of antibody therapy, which was used extensively to treat infectious diseases a century ago.", 
                   "Mice, hamsters, ferrets, monkeys, which lab animals can help defeat the new coronavirus? Beloved as pets, Syrian hamsters are winning another kind of attention from scientists trying to understand and defeat COVID-19.", 
                   "hydroxychloroquine have been prescribed to patients to reduce the length of hospital stay and improve the evolution of COVID-19-related pneumonia.", 
                   "Hydroxychloroquine is not expensive, is readily available, and seems to be safe.", 
                   "what's remdesivir? Remdesivir is a drug designed to obstruct the stage of replication, when the virus creates copies of itself, followed endlessly by the copies creating copies of themselves.", 
                   "Is remdesivir the treatment for corona viruses? Although the drug didn't work well against that disease, it later showed promise fighting SARS and MERS, illnesses caused by coronaviruses, in animal studies, which is why researchers thought remdesivir might help fight COVID-19.", 
                   "what is serological test? Antibody tests, also called serological tests, have usually been the time-tested approach to finding out the presence of a virus in the body, they do so by detecting the presence and quantity of antibodies that are produced by the immune system to battle an infection.", 
                   "what do antibody test reveal? Given that they are not useful for directly detecting the presence of the virus, antibody tests can be used to gauge the extent of infection in a community or a large group of people who may have had exposure to the virus.", 
                   "The new Elecsys Anti-SARS-CoV-2 serology test can support the detection of antibodies against SARS-CoV-2 in patients who have been exposed to the virus which causes COVID-19.", 
                   "where did corona virus originate? The coronavirus disease 19 (COVID-19) is a highly transmittable and pathogenic viral infection caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which emerged in Wuhan, China and spread around the world.", 
                   "Initially, interferons-α nebulization, broad-spectrum antibiotics, and anti-viral drugs were used to reduce the viral load, however Remdesivir only and in combination with chloroquine or interferon beta significantly blocked the SARS-CoV-2 replication and patients were declared as clinically recovered.", 
                   "These patients usually present with symptoms of an upper respiratory tract viral infection, including mild fever, cough (dry), sore throat, nasal congestion, malaise, headache, muscle pain, or malaise, Respiratory symptoms such as cough and shortness of breath (or tachypnea in children) are present without signs of severe pneumonia."
                  ]

queries = []

for query in initial_queries:
  query = preprocess(query)
  queries.append(query[0])

print(queries)
print(len(queries))

['identif suitabl anim model test covid 19 post exposur therapi best laboratori anim get infect get sick get sick way human show similar cours diseas', 'convalesc plasma one crudest form antibodi therapi extens treat infecti diseas centuri ago', 'mice hamster ferret monkey lab anim help defeat coronaviru belov pet syrian hamster win anoth kind attent scientist tri understand defeat covid 19', 'hydroxychloroquin prescrib patient reduc length hospit stay improv evolut covid 19 pneumonia', 'hydroxychloroquin expens readili seem safe', 'remdesivir remdesivir drug design obstruct stage replic viru creat copi follow endlessli copi creat copi', 'remdesivir treatment corona virus drug work well diseas later show promis fight sar mer ill caus coronavirus anim studi research thought remdesivir might help fight covid 19', 'serolog test antibodi test call serolog test usual time test approach find presenc viru bodi detect presenc quantiti antibodi produc immun system battl infect', 'antibodi test 

Defining Cosine Similarity Function

In [0]:
from scipy import spatial
def cos_sim(v1, v2):
  # flatten first: spatial.distance.cosine expects 1-D vectors
  return 1 - spatial.distance.cosine(np.ravel(v1), np.ravel(v2))
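`cos_sim` returns one minus the cosine distance, which is the cosine of the angle between the two vectors: the dot product divided by the product of the norms. The dependency-free sketch below spells that out. The function name is illustrative, and the value is undefined for an all-zero vector (for example, a text with no terms in the vocabulary).

```python
import math

def cosine_similarity_plain(u, v):
    # dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity_plain([1, 0, 1], [2, 0, 2]))  # same direction, close to 1
print(cosine_similarity_plain([1, 0, 0], [0, 1, 0]))  # no shared terms, 0
print(cosine_similarity_plain([1, 1, 0], [1, 0, 0]))  # partial overlap, about 0.707
```

Because only the direction matters, a long document and a short query can still score highly when they use the same terms in similar proportions.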

The Algorithm

In [0]:
%%time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# initialise vectoriser
vectorizer = TfidfVectorizer()

body_list = []
body_index = df.columns.get_loc("body_preprocessed")
id_index = df.columns.get_loc("paper_id")

for i in range(df.shape[0]):
  x = df.iat[i, body_index].strip('][').split(', ')
  body_list.extend(x)  # extend in place instead of rebuilding the list each time

# combine queries and body text to train model
vectors = queries + body_list
vectorizer.fit(vectors)  # only the fitted vocabulary and IDF weights are needed

# cosine similarity matrix containing cosine similarity values between all queries and all docs
all_cosine = []
for query in queries:
  query_vector = vectorizer.transform([query]).toarray()
  query_body_list = []
  for i in range(df.shape[0]):
    body = df.iat[i, body_index].strip('][').strip("'")
    body_vector = vectorizer.transform([body]).toarray()
    cosine_similarity = cos_sim(body_vector, query_vector)
    # print(cosine)
    query_body_list.append([df.iat[i, id_index], cosine_similarity])
  all_cosine.append(pd.DataFrame(query_body_list, columns = ["id", "cosine_value"]))  

##print(all_cosine)

CPU times: user 13min 10s, sys: 4min 10s, total: 17min 20s
Wall time: 17min 21s
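Most of the runtime above comes from re-vectorising every document once per query and converting each vector to a dense array. A possible alternative, sketched here on toy data, transforms the documents once and computes all query/document similarities in a single call to scikit-learn's `cosine_similarity`, which accepts sparse matrices. The toy strings and variable names are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy, already-stemmed stand-ins for the document bodies and the queries
docs = [
    "hamster anim model covid infect",
    "remdesivir drug replic viru",
    "antibodi test serolog detect",
]
toy_queries = ["anim model hamster", "serolog antibodi test"]

vectorizer = TfidfVectorizer()
vectorizer.fit(toy_queries + docs)  # same idea as fitting on queries + bodies

doc_matrix = vectorizer.transform(docs)           # sparse, shape (n_docs, n_terms)
query_matrix = vectorizer.transform(toy_queries)  # sparse, shape (n_queries, n_terms)

# Rows are queries, columns are documents
similarities = cosine_similarity(query_matrix, doc_matrix)

# Document positions per query, highest similarity first (top 2 here)
top = np.argsort(-similarities, axis=1)[:, :2]
print(similarities.shape)
print(top[:, 0])
```

The resulting matrix holds the same kind of values as the `cosine_value` column built above, with row `i` corresponding to query `i`.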


Ranking the docs

In [0]:
# all_cosine is a list with one DataFrame per query.
# Each DataFrame has one row per document and two columns:
# "id" (the paper_id) and "cosine_value" (the cosine similarity
# between that query and the document body).

ranked_docs = []
for i in range(len(all_cosine)):
  query_cosine_values = all_cosine[i]
  # sort by similarity (highest first) and keep the top 10 documents
  query_cosine_values.sort_values(by = "cosine_value", ascending = False, inplace = True)
  sorted_list = query_cosine_values.iloc[0:10]
  ranked_docs.append(sorted_list)
  
print(ranked_docs) 

[                                            id  cosine_value
8214  ab0920d9c9b4b36d63943f751a372890ac1fed0e      0.303527
529   c93765c4a3b318802baaf2c6a68f11d1de9ca8f7      0.273709
5984  a99f02efcd401fc017aab682571c7af17c29b634      0.261685
6659  757b5c18ea0d5f36ba9ea742d22cac06e30e371c      0.236913
1738  960ce652ea3f7fb9df40da78e21eb1efbdf577ad      0.218901
1062  898203edb625248406fc805e60a05fabb72eb216      0.218077
2444  e9d64a31d412b2d0c91a37631f1c359b11142304      0.212335
5197  4ccace93e8bd13df5cb19e4211ed66f64cc4891c      0.211581
9576  1d162cd83fb09ad4919ab8445cfdef7c6206fe50      0.207581
9088  d0786f512c50de64c4f50dd3f881ea6f9e2dd343      0.207508,                                             id  cosine_value
4207  82f84285306e119b5994713bf84feb0e018667ca      0.390903
3506  583bd048e2c150cca317e633e43cd04f76e59594      0.338788
577   f242a1a7a7c8f4027c8793025c07885eea2359c5      0.295111
5130  7e61af45f92e097e572bf735c647a31040deb210      0.290815
4145  9614c166792357ae
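The ranked output lists only paper ids, so reading it means looking up each title by hand. A small sketch of one way to attach titles, using `DataFrame.nlargest` and `merge` on toy data (the frames and values below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins: similarity scores for one query, and a paper lookup table
scores = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "cosine_value": [0.12, 0.39, 0.05, 0.30],
})
papers = pd.DataFrame({
    "paper_id": ["p1", "p2", "p3", "p4"],
    "title": ["Title one", "Title two", "Title three", "Title four"],
})

# nlargest returns a new frame and leaves `scores` in its original order
top = scores.nlargest(2, "cosine_value")

# A left merge keeps the ranking order and adds the matching title
top_with_titles = top.merge(papers, left_on="id", right_on="paper_id", how="left")
print(top_with_titles[["id", "title", "cosine_value"]])
```

In the notebook, `scores` would correspond to one element of `all_cosine` and `papers` to the `paper_id` and `title` columns of `df`.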

## **Word Embedding**

> ### Preprocessing the data and creating a DataFrame (Run only if you want to preprocess the data again) -- *Approx Time 20 minutes*

Dropping null values

In [0]:
w_df=c_df.dropna(subset=['body_text'])
w_df=w_df[['paper_id', 'title', 'abstract', 'body_text']].copy()

Set of COVID-19-related terms

In [0]:
covid19_terms = {
    '2019-nCoV',
    '2019-nCoV.',
    '2019-n-CoV',
    '2019 n-CoV',
    '2019 ncov',
    '2019 n cov',
    '2019n cov',
    '2019ncov',
    '2019-novel Coronavirus',
    'coronavirus 2',
    'coronavirus 2019',
    'coronavirus-19',
    'covid',
    'COVID-19',
    'COVID19',
    'n cov 2019',
    'ncov 2019',
    'wuhan pneumonia',
    'wuhan virus',
    'wuhan coronavirus',
    'CoViD 19',
    'CoViD-19',
    }

Function to check if body contains covid19_terms

In [0]:
def has_covid19(text):
    for name in covid19_terms:
        if text and name.lower() in text.lower():
            return True
    return False

Removing body text which doesn't contain covid terms

In [0]:
w_df['body_has_covid19']=w_df.body_text.apply(has_covid19)

In [0]:
w_df=w_df[w_df['body_has_covid19']==True]
w_df=w_df.drop(['body_has_covid19'], axis = 1)

Pre-processing

In [0]:
nltk.download('stopwords')
ps = PorterStemmer()

In [0]:
stop_words = stopwords.words('english')
stop_words.extend(['copyright','peer','holder','preprint','author','funder','https','doi','org','reviewed',
                    'bioRxiv','et','found','sought','wondered','reserved','rights','reuse',
                    'permission','allowed','without','right','using','http','although','create','however',
                    'cc','by','license','biorxiv','medrxiv','fig','http','1101','et','al','figure',
                    'nc','nd','4','0','international','made','available','under','for','this',
                    'preprint','which','was','not','10','2020','01','30','927574','ma','02','03',
                    'could','non','within','il','three','value','shown','pcr','rnl', 'post', 'ml','min','mm','new','would',
                    'related','thus','per','site','like','known','may','also','used','know','e','g','settings', 'april', 'march','usefulness'])

Defining pre-processor

In [0]:
def preprocess(sentences):
    # convert to lower case
    sentences = sentences.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    # remove punctuations
    tokens = tokenizer.tokenize(sentences)
    # remove stopwords and perform stemming
    words = [ps.stem(word) for word in tokens if (word not in stop_words and (len(word)!=1))] 
    return words 

In [0]:
w_df['body_preprocessed'] = w_df.body_text.apply(preprocess)

Saving Dataframe

In [0]:
w_df.to_csv("drive/My Drive/data/word_em/w_df_for_doc2vec.csv")

> ### Loading Pre-processed data (Necessary for training the model, for the output of the algorithm, and for the pre-processing of the LDA Topic Model)

In [0]:
w_df = pd.read_csv("drive/My Drive/data/word_em/w_df_for_doc2vec.csv")

> ### Training the Model (Run only if you want to train the model again) -- TODO: training code still to be added

In [0]:
# TODO: convert body_preprocessed from str back into a list
# TODO: copy the training code here

Make sure to use w_df instead of c_df.

> ### Using the pretrained model for algorithm 

Loading the model

In [0]:
model = gensim.models.doc2vec.Doc2Vec.load("drive/My Drive/data/word_em/model.model")



Defining Pre-processor

In [0]:
nltk.download('stopwords')
ps = PorterStemmer()

stop_words = stopwords.words('english')
stop_words.extend(['copyright','peer','holder','preprint','author','funder','https','doi','org','reviewed',
                    'bioRxiv','et','found','sought','wondered','reserved','rights','reuse',
                    'permission','allowed','without','right','using','http','although','create','however',
                    'cc','by','license','biorxiv','medrxiv','fig','http','1101','et','al','figure',
                    'nc','nd','4','0','international','made','available','under','for','this',
                    'preprint','which','was','not','10','2020','01','30','927574','ma','02','03',
                    'could','non','within','il','three','value','shown','pcr','rnl','ml','min','mm','new','would',
                    'related','thus','per','site','like','known','may','also','used','know','e','g','settings','usefulness'])

def preprocess(sentences):
    # convert to lower case
    sentences = sentences.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    # remove punctuations
    tokens = tokenizer.tokenize(sentences)
    # remove stopwords and perform stemming
    words = [ps.stem(word) for word in tokens if (word not in stop_words and (len(word)!=1))] 
    return words

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Defining Queries 

In [0]:
initial_queries = [
                   "Identification of suitable animal models for testing of COVID-19 post-exposure therapies, the best laboratory animal would not only get infected and get sick, but get sick in the same way that humans do, showing a similar course of disease.", 
                   "Convalescent plasma is one of the crudest forms of antibody therapy, which was used extensively to treat infectious diseases a century ago.", 
                   "Mice, hamsters, ferrets, monkeys, which lab animals can help defeat the new coronavirus? Beloved as pets, Syrian hamsters are winning another kind of attention from scientists trying to understand and defeat COVID-19.", 
                   "hydroxychloroquine have been prescribed to patients to reduce the length of hospital stay and improve the evolution of COVID-19-related pneumonia.", 
                   "Hydroxychloroquine is not expensive, is readily available, and seems to be safe.", 
                   "what's remdesivir? Remdesivir is a drug designed to obstruct the stage of replication, when the virus creates copies of itself, followed endlessly by the copies creating copies of themselves.", 
                   "Is remdesivir the treatment for corona viruses? Although the drug didn't work well against that disease, it later showed promise fighting SARS and MERS, illnesses caused by coronaviruses, in animal studies, which is why researchers thought remdesivir might help fight COVID-19.", 
                   "what is serological test? Antibody tests, also called serological tests, have usually been the time-tested approach to finding out the presence of a virus in the body, they do so by detecting the presence and quantity of antibodies that are produced by the immune system to battle an infection.", 
                   "what do antibody test reveal? Given that they are not useful for directly detecting the presence of the virus, antibody tests can be used to gauge the extent of infection in a community or a large group of people who may have had exposure to the virus.", 
                   "The new Elecsys Anti-SARS-CoV-2 serology test can support the detection of antibodies against SARS-CoV-2 in patients who have been exposed to the virus which causes COVID-19.", 
                   "where did corona virus originate? The coronavirus disease 19 (COVID-19) is a highly transmittable and pathogenic viral infection caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which emerged in Wuhan, China and spread around the world.", 
                   "Initially, interferons-α nebulization, broad-spectrum antibiotics, and anti-viral drugs were used to reduce the viral load, however Remdesivir only and in combination with chloroquine or interferon beta significantly blocked the SARS-CoV-2 replication and patients were declared as clinically recovered.", 
                   "These patients usually present with symptoms of an upper respiratory tract viral infection, including mild fever, cough (dry), sore throat, nasal congestion, malaise, headache, muscle pain, or malaise, Respiratory symptoms such as cough and shortness of breath (or tachypnea in children) are present without signs of severe pneumonia."
                  ]

queries = []

for query in initial_queries:
  query = preprocess(query)
  queries.append(query)

Recommendation

In [0]:
new_vector = model.infer_vector(queries[0])
similar = model.docvecs.most_similar(positive=[new_vector])



In [0]:
print(queries[0])

['identif', 'suitabl', 'anim', 'model', 'test', 'covid', '19', 'post', 'exposur', 'therapi', 'best', 'laboratori', 'anim', 'get', 'infect', 'get', 'sick', 'get', 'sick', 'way', 'human', 'show', 'similar', 'cours', 'diseas']


In [0]:
print(similar)

[(7093, 0.4085010886192322), (2542, 0.393079936504364), (7493, 0.3871307373046875), (381, 0.3821891248226166), (7361, 0.3816722631454468), (2568, 0.3805750012397766), (6805, 0.3755342662334442), (1791, 0.36889898777008057), (1011, 0.36831462383270264), (1879, 0.3601797819137573)]
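Each pair in `similar` is a document tag and its similarity score. Assuming the tags are row positions in `w_df` (the same assumption the `.iat` lookup relies on), they can be turned into a readable table of ids and titles. The frame and the pairs below are toy stand-ins.

```python
import pandas as pd

# Toy stand-ins for w_df and for the (tag, score) pairs returned by most_similar
toy_w_df = pd.DataFrame({
    "paper_id": ["a1", "b2", "c3"],
    "title": ["First paper", "Second paper", "Third paper"],
})
toy_similar = [(2, 0.41), (0, 0.39)]

# Look up rows by position, in ranking order
positions = [tag for tag, _ in toy_similar]
results = toy_w_df.iloc[positions][["paper_id", "title"]].copy()
results["similarity"] = [score for _, score in toy_similar]

print(results.reset_index(drop=True))
```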


In [0]:
for n, (doc_index, score) in enumerate(similar, start=1):
    print("\n Recommendation ", n, "\n")
    print(w_df.body_text.iat[doc_index])


 Recommendation  1 

What clinicians and public health experts need most in the next phase of COVID-19 response is an estimate of the prevalence of SARS-COV-2 in the US population. [1] [2] However, in this period before researchers obtain evidence from population-based studies on infection dynamics with accurate and reliable tests, it is still critical to have an idea of the proportion testing positive for infection among the tested population, even given variations in test accuracy, reliability and differences in testing criteria over time.
 Tracking the number of confirmed positives each day over time for a city or state 3 is problematic since this number (numerator) lacks a defined denominator for estimation.
 Fluctuations in the number of confirmed cases may reflect uneven testing capacity, batched processing or reporting. A city may appear to have more cases just from ramping up testing, or may appear to have a resurgence or reduction of cases due to changes in who can get tested,

## **LDA Topic Modelling**

> ### Creating Pre-processed data (Only run if you want to create the model again) -- *Approx Runtime = 15 seconds*

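We reuse the data that was already pre-processed for tf-idf instead of preprocessing the whole corpus again. Make sure the pre-processed tf-idf data is available before running this cell.

Each document was stored as a list of separate tokens for the word-embedding step; here the tokens are joined back into a single string per document.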


In [0]:

tm_df = pd.read_csv('drive/My Drive/data/tf_idf/df_for_tf_idf.csv', low_memory=False)

# body_preprocessed was saved as the string form of a token list, e.g. "['sever', 'acut']".
# Strip the brackets and quotes, then join the tokens into one space-separated string.
for i in range(tm_df.shape[0]):
    tokens = tm_df.at[i, 'body_preprocessed'].strip('][').split(', ')
    tm_df.at[i, 'body_preprocessed'] = " ".join(tokens).replace("'", "")
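The column holds the string form of a Python list, so an alternative to stripping brackets and quotes by hand is to parse each cell with `ast.literal_eval`. This is a minimal sketch, assuming every cell is a valid list literal; `join_tokens` is a hypothetical helper, not part of the notebook.

```python
import ast

def join_tokens(cell):
    """Parse a stringified token list and join it into one string."""
    tokens = ast.literal_eval(cell)  # safely evaluates literals only
    return " ".join(tokens)

# Toy input (assumed) in the same shape as a body_preprocessed cell.
sample = "['sever', 'acut', 'respiratori', 'syndrom']"
print(join_tokens(sample))
```

With pandas this could be applied as `tm_df['body_preprocessed'].apply(join_tokens)`. Unlike the manual approach, tokens that themselves contain an apostrophe are preserved intact.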

In [0]:
tm_df.head()


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,abstract,body_text,paper_id,title,body_preprocessed
0,3,3,The key enzyme in coronavirus polyprotein proc...,Transmissible gastroenteritis virus (TGEV) bel...,cf584e00f637cbd8f1bb35f3f09f5ed07b71aeb0,,transmiss gastroenter viru tgev belong coronav...
1,12,12,The exact origin of the cause of the Severe Ac...,"Severe Acute Respiratory Syndrome (SARS), a ne...",59ffca26c9d1f881e4e37e7c5af626d72d858dc4,Relationship of SARS-CoV to other pathogenic R...,sever acut respiratori syndrom sar newli ident...
2,13,13,Because of the constant threat posed by emergi...,"Over the past two decades, technological advan...",c63c4d58d170136b8d3b5a66424b5ac3f73a92d9,Viral Discovery and Sequence Recovery Using DN...,past two decad technolog advanc molecular biol...
3,27,27,Background: Despite the passing of more than a...,"In the spring of 2003, the Severe Acute Respir...",b499a41ad1d85a0e071b59e60065c874e534911d,Coronavirus 3CL pro proteinase cleavage sites:...,spring 2003 sever acut respiratori syndrom sar...
4,34,34,The molecular basis of severe acute respirator...,The causative agent for SARS has been identifi...,45d2d838cb3d5ae2dabd7bb7c82329b398d1c65f,BMC Infectious Diseases A human in vitro model...,caus agent sar identifi novel coronaviru 1 2 3...


> ### Creating the model


Installing ktrain

In [0]:
!pip install ktrain

Creating Model

In [0]:
%%time

import ktrain
ktrain.text.preprocessor.detect_lang = ktrain.text.textutils.detect_lang
texts = tm_df["body_preprocessed"] 
tm = ktrain.text.get_topic_model(texts, n_topics=None, n_features=10000)

n_topics automatically set to 70
lang: en
preprocessing texts...
fitting model...
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
done.
CPU times: user 4min 51s, sys: 2min 30s, total: 7min 21s
Wall time: 3min 56s
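To see the moving parts of this kind of model (LDA over bag-of-words counts), a comparable model can be fitted directly with scikit-learn. This is a minimal sketch on a toy corpus, not ktrain's internal implementation; the corpus, `n_components=2`, and all variable names are assumptions chosen for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy pre-processed documents (assumed), in the same stemmed style as the notebook.
toy_texts = [
    "viru protein rna replic cell",
    "rna protein gene express cell",
    "mask ppe staff hospit protect",
    "staff icu hospit surgeri protect",
]

# Bag-of-words counts, capped at 10000 vocabulary terms.
vectorizer = CountVectorizer(max_features=10000)
counts = vectorizer.fit_transform(toy_texts)

# Fit LDA; fit_transform returns one topic distribution per document.
lda = LatentDirichletAllocation(n_components=2, max_iter=5, random_state=0)
doc_topics = lda.fit_transform(counts)

# Map column index -> term, then list the highest-weight terms per topic.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}
for topic_id, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print("topic", topic_id, "|", " ".join(index_to_term[i] for i in top))
```

Each row of `doc_topics` sums to 1, so its largest entry can be read as the document's strongest topic probability.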


Printing Topics

In [0]:
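# Show the top words for each discovered topic.
tm.print_topics()
# Compute document-topic distributions; documents whose highest topic
# probability is below the threshold are marked for filtering.
tm.build(texts, threshold=0.25)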


topic 0 | drug antivir viral trial treat dose inhibit therapi activ chloroquin
topic 1 | mrna bind rna interact protein trna ifit1 ifit viral complex
topic 2 | surgeri manag staff team resourc icu procedur servic physician unit
topic 3 | women pregnant pregnanc infant neonat mother deliveri matern birth newborn
topic 4 | perpetu grant countri display transmiss death post version contact april
topic 5 | anfi die week antibodi type sequenc strain ein epitop rna
topic 6 | fibrot ild ipf rituximab viral cell activ detect nsp13 lung
topic 7 | evd marin crit rss elsevi signal une que hrv est
topic 8 | cell rna gene viral express protein sequenc host sampl replic
topic 9 | pattern frequenc usag codon subtyp bia composit iav bias valu
topic 10 | mhv protein rna cell gene replic ibv express mutant sequenc
topic 11 | residu sequenc genom rna mutat catalyt activ structur domain sti
topic 12 | mer ncov hcov anim coronavirus transmiss syndrom bat market novel
topic 13 | mask protect ppe room proced

In [0]:
%%time

texts = tm.filter(texts)
tm_df = tm.filter(tm_df)

ValueError: ignored
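Filtering of this kind is a boolean-mask operation: documents whose strongest topic does not clear the threshold are dropped, and every parallel object (texts, DataFrame rows) has to be filtered with the same mask. The sketch below illustrates the idea in plain Python with invented topic probabilities. It is not ktrain's code, and `doc_topic_probs`, `keep_mask` and `apply_mask` are hypothetical names. It also shows that applying a mask to an object whose length no longer matches fails, which is one possible reason for a `ValueError` when a filtering cell is run a second time.

```python
def apply_mask(items, keep_mask):
    """Keep only the items whose mask entry is True."""
    if len(items) != len(keep_mask):
        raise ValueError("object length does not match the mask length")
    return [item for item, keep in zip(items, keep_mask) if keep]

# Invented per-document topic probabilities (assumed); each row sums to 1.
doc_topic_probs = [
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.20, 0.20, 0.20, 0.20, 0.20],
    [0.10, 0.60, 0.10, 0.10, 0.10],
]
threshold = 0.25

# A document is kept when its strongest topic clears the threshold.
keep_mask = [max(row) > threshold for row in doc_topic_probs]

toy_texts = ["doc a", "doc b", "doc c"]
filtered = apply_mask(toy_texts, keep_mask)
print(filtered)

# Filtering the already-filtered list again fails: the lengths differ.
try:
    apply_mask(filtered, keep_mask)
except ValueError as err:
    print("ValueError:", err)
```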

Creating Recommender

In [0]:
%%time
tm.train_recommender()

Saving Model as Pickle

In [0]:
%%time
import pickle 
  
# Save the trained topic model to a pickle file on Drive.
Pkl_Filename = "drive/My Drive/data/lda_tm/tmPickle_tm_Model.pkl"  

with open(Pkl_Filename, 'wb') as file:  
    pickle.dump(tm, file)

tm_df.to_csv("drive/My Drive/data/lda_tm/tm_df_for_lda_tm.csv")
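To reuse the saved model in a later session, the pickle can be read back with `pickle.load`. This is a minimal round-trip sketch using a stand-in dictionary and a temporary directory rather than the real model and Drive path (both are assumptions for illustration). Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model (assumed); any picklable object works the same way.
stand_in_model = {"n_topics": 70, "threshold": 0.25}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "model.pkl")

    # Write the object to disk in binary mode.
    with open(pkl_path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # Read it back in binary mode.
    with open(pkl_path, "rb") as file:
        restored = pickle.load(file)

print(restored == stand_in_model)
```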
