
# Goal of this notebook: solve Task 3
**What do we know about virus genetics, origin, and evolution? What do we know about the virus origin and management measures at the human-animal interface?**

# Introduction

>**Our team:**
>>We are an international team of passionate data scientists and AI engineers at Digital Product School at the Center for Business Creation and Innovation at Technical University of Munich, Germany: Afsane Asaei, Anubhav Jain, and Diana Amiri.

>By participating in this challenge, we hope to contribute to the global effort against the Covid-19 pandemic. Our aim is to provide doctors and researchers with valuable insights into the questions of task 3. We do this through text mining of the scientific literature, a proven strategy for handling a fast-growing body of scientific papers, which is the situation now because of the worldwide Covid-19 pandemic. A sequence of steps is needed to find the articles and snippets most relevant to each key question. Our general approach includes feature engineering with TF-IDF and PCA, topic modeling with unsupervised learning, and text classification with KNN to ease the process of finding the most relevant answer.



>>### Medical Genetic Insights
>>As text data grows in scale and complexity, the specific factors that influence how relevant an article is need to be identified. These factors make it possible to uncover valuable insights. To identify them, we interviewed Prof. Dr. Hassan Vahidnezhad, Ph.D., a medical geneticist and faculty member at Jefferson Institute of Molecular Medicine. In addition, we consulted Dr. Sara Sadr, M.D., and Dr. Fateme Jafari, M.D., Faculty of Medicine and University Hospital.

>>The team of medical experts helped us find more relevant keywords, which we used for data preparation and filtering. They also extended the text of Task 3 with relevant variants of the concrete topics currently in the focus of research and analysis on Covid-19. All variants of the fundamental questions in Task 3 are clustered into 4 categories. These categories were crucial for developing a more specific concept for finding and classifying relevant articles and snippets. The categories are as follows:

>>> 1. Characteristics<br/>
>>> 2. Virulence<br/>
>>> 3. Origin<br/>
>>> 4. Role of virus genetics in infection and treatments<br/>

>>### Text Mining using Natural Language Processing Steps 


>>In data preprocessing, we defined characteristic labels for articles based on the specific tasks. This narrowed down the number of articles related to these questions and provided less noisy text data for feature engineering and text classification.

>>In feature engineering, our team ran many experiments to find the vectorization and dimensionality reduction methods that give the highest accuracy on this dataset (details of our research and experiments are in our [github link](https://github.com/DigitalProductschool/covid-19-kaggle)). 

>>In the next step, we clustered the articles with unsupervised K-means to visualize the similarity between articles. The optimum number of clusters was estimated with the Elbow method. **Each cluster is given a particular label. Then, the data is sampled randomly, divided into a training set (80% of the data) and a testing set (20% of the data), and used for classification**.

>>We then evaluated our model with the KNN classification algorithm, which reached 98% accuracy. This approach enabled us to compare the performance of different sets of features, and we chose the best-performing features for text mining. 

>>Finally, we sort the documents by their cosine similarity (values close to 1 are most similar) to each of the key questions in Task 3. We would like to emphasize that several sequences of feature engineering, topic modeling, and similarity measurement methods were examined; the notebook presents the one with the best accuracy and the smallest distance. 

>>>1 Data preprocessing<br/>
>>>2 Text cleaning<br/>
>>>3 Feature engineering (TF-IDF with PCA)<br/>
>>>4 Visualisation and evaluation (with unsupervised k-means and text classification, KNN)<br/>
>>>5 Sorting the trained articles based on each of key questions (KNN and cosine similarity)<br/>

>>### Result of this work
>>We provide a <u>table</u> and a <u>word cloud</u> that together work as a literature review.<br/>
>>The tables relate to the key questions of task 3. The details we present include the **author**, **paper title**, **university** and, most importantly, the **snippets** of the article body text that are most relevant and insightful for the question asked.<br/>
>>In addition, we give the reader a word cloud as a simple way to back up the relevance of the articles presented in the table.

# Loading the data

>1. We extract the 'full body text', 'paper id', and 'abstract' from the JSON files.
>2. We also load the metadata, because some of its fields, such as author, sha and journal name, will be helpful.
>3. Then we merge 1 and 2 into one dataframe.
>4. We drop the duplicated rows from the dataframe.


> This part of our notebook is inspired by a notebook from [MaksimEkin in kaggle](https://www.kaggle.com/maksimeren/covid-19-literature-clustering)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import glob
import json

import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [None]:
### 1-Loading meta data ###

root_path = '/kaggle/input/CORD-19-research-challenge/'
metadata_path = f'{root_path}metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

In [None]:
### 2-Fetch All of JSON File Path ###


all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
len(all_json)

In [None]:
### helper function to extract body text, abstract, and paper id from json files ###

#file reader class
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            try:
                self.paper_id = content['paper_id']
            except Exception as e:
                self.paper_id = ''
            self.abstract = []
            self.body_text= []
            
            # Abstract
            try:
                for entry in content['abstract']:
                    self.abstract.append(entry['text'])
            except Exception as e:
                pass
            # Body text
            
            try:
                for entry in content['body_text']:
                    self.body_text.append(entry['text'])
            except Exception as e:
                pass
            
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}:{self.abstract[:200]}... {self.body_text[:200]}...'
    
first_row = FileReader(all_json[0])

In [None]:
## 3- Load the Data into DataFrame ##

#Using the helper functions, let's read in the articles into a DataFrame that can be used easily:
    
dict_ = {'paper_id': [], 'abstract': [], 'body_text': [], 'authors': [], 'title': [], 'journal': [] }
# Print progress roughly 500 times; max(1, ...) avoids a zero modulus when there are fewer than 500 files
progress_step = max(1, len(all_json) // 500)
for idx, entry in enumerate(all_json):
    if idx % progress_step == 0:
        print(f'Processing index: {idx} of {len(all_json)}')
    content = FileReader(entry)

    # look up the metadata row whose sha matches this paper
    meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
    # no metadata, skip this paper
    if len(meta_data) == 0:
        continue

    dict_['paper_id'].append(content.paper_id)
    dict_['abstract'].append(content.abstract)
    dict_['body_text'].append(content.body_text)

    # title, authors and journal come from the first matching metadata row
    dict_['title'].append(meta_data['title'].values[0])
    dict_['authors'].append(meta_data['authors'].values[0])
    dict_['journal'].append(meta_data['journal'].values[0])
    
df_covid = pd.DataFrame(dict_, columns=['paper_id','abstract', 'body_text', 'authors', 'title', 'journal'])
df_covid.head()
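
As an aside, the per-paper metadata lookup in the loop above can also be expressed as a single pandas join. The following is only a sketch on toy data: the frames `papers` and `meta` and their values are made up for illustration, and the `drop_duplicates` call is an assumption standing in for the loop's use of the first matching row.

```python
import pandas as pd

# Toy stand-ins for the parsed JSON records and for metadata.csv (hypothetical values)
papers = pd.DataFrame({
    "paper_id": ["a1", "b2", "c3"],
    "body_text": ["text a", "text b", "text c"],
})
meta = pd.DataFrame({
    "sha": ["a1", "c3", "c3", "zz"],
    "title": ["Paper A", "Paper C", "Paper C (duplicate row)", "Unparsed paper"],
    "journal": ["J1", "J3", "J3", "J9"],
})

# Keep the first metadata row per sha, like `.values[0]` in the loop
meta_unique = meta.drop_duplicates(subset="sha", keep="first")

# An inner join keeps only papers that have metadata, like the `continue` in the loop
merged = (
    papers.merge(meta_unique, left_on="paper_id", right_on="sha", how="inner")
          .drop(columns="sha")
)
print(merged)
```

Paper `b2` has no metadata row and is dropped, while `zz` has metadata but no parsed text and never enters the result.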



In [None]:
## 4 - dropping the duplicates ##
dict_ = None

df_covid.info()

df_covid.drop_duplicates(['abstract', 'body_text'], inplace=True)

df_covid['abstract'].describe(include='all')
df_covid['body_text'].describe(include='all')


## dropping the null values ##
df_covid.dropna(inplace=True)


# Preprocessing

**After creating a dataframe, we do some cleaning and feature engineering on our data**

1. Adding an extra column to our dataframe that tags the language of each paper<br/>
2. Keeping the articles that are in English<br/>
3. Based on our user interview with Prof. Dr. Vahidnezhad, adding extra columns that label the articles according to the specifications of task 3<br/>
4. Keeping the articles that should definitely be considered for task 3, thus reducing the noise<br/>
5. Preparing the text for vectorization in the next steps<br/>


In [None]:
#import sys
#!{sys.executable} -m pip install --upgrade pip

!pip install langdetect


In [None]:
## 1- labelling the languages of articles in a separate column ##

from langdetect import detect
df_covid['language'] = df_covid['title'].apply(detect)

#checking what languages are present in the dataset
uniqueValues = df_covid['language'].unique()

## plotting top 5 highest number of published languages ##
import seaborn as sns
plt.figure(figsize=(10,5))
sns.set_style('darkgrid')
sns.countplot(x='language', data=df_covid, order=df_covid['language'].value_counts().iloc[:5].index, palette='viridis')
plt.title('language distribution')



**Given the distribution of languages among the articles in this dataset, it is logical to keep only the papers written in English.**

In [None]:
## 2- keeping english language only ##

df_covid = df_covid.loc[df_covid['language'] == 'en']
df_covid.info()

*Following our user interview with Dr. Vahidnezhad, we concluded that we should label the articles based on specific keywords related to **'Covid 19'**, **'SARS'**, **'MERS', 'Proteins'** and **'Genomes'***

>Part of the following cell is inspired by [notebook](https://www.kaggle.com/ajrwhite/covid-19-thematic-tagging-with-regular-expressions) of ajrwhite

In [None]:
## 3- finding papers talking about covid and sars and mers, proteins, genome ##


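# helper that reports whether any of the keyword patterns occurs in a paper
import re
def pattern_searcher(search_str: str, search_list: str) -> bool:
    # search_list is a single regex (the keywords joined with '|');
    # the keywords are lowercase, so match case-insensitively to also catch e.g. 'COVID' or 'SARS-CoV-2'
    return re.search(search_list, search_str, flags=re.IGNORECASE) is not None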
    
###keywords related to covid that we are looking for in literatures  
covid_list = ['covid',
                    'coronavirus disease 19',
                    'sars cov 2', # hyphens are not normalised here; the hyphenated form 'sars-cov-2' is listed below
                    '2019 ncov',
                    '2019ncov',
                    r'2019 n cov\b',
                    r'2019n cov\b',
                    'ncov 2019',
                    r'\bn cov 2019',
                    'novel corona',
                    'coronavirus 2019',
                    'wuhan pneumonia',
                    'wuhan virus',
                    'wuhan coronavirus',
                    r'coronavirus 2\b',
                    'chinese coronavirus', 
                    'sars-cov-2',
                    'novel coronavirus', 
                    'coronavirus', 
             ]
pattern_covid = '|'.join(covid_list)

#adding extra column that tags papers that are related to covid19
df_covid['covid_match'] = df_covid['body_text'].apply(lambda x: pattern_searcher(search_str=x, search_list=pattern_covid))


###keywords related to protein and receptor surface protein that we are looking for in literatures
protein_list = ['protein','surface protein', 'proteome', 'proteomics', 'proteomic','proteins']
pattern_protein = '|'.join(protein_list)

#adding extra column that tags papers that are related to Protein
df_covid['protein_match'] = df_covid['body_text'].apply(lambda x: pattern_searcher(search_str=x, search_list=pattern_protein))



###keywords related to genomic sequences that we are looking for in literatures
genom_list=['genomic','genomics', 'genome', 'genome sequence', 'genome-wide']
pattern_genom = '|'.join(genom_list)

#adding extra column that tags papers that are related to Genomic sequences
df_covid['genom_match'] = df_covid['body_text'].apply(lambda x: pattern_searcher(search_str=x, search_list=pattern_genom))
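
To make the tagging step concrete, here is a minimal sketch of the same idea on a toy dataframe. The three sentences, the shortened keyword list, and the choice of `re.IGNORECASE` are assumptions made for this illustration only.

```python
import re
import pandas as pd

# Shortened keyword list (illustrative subset of the list above)
keywords = ["covid", "sars cov 2", "sars-cov-2", r"coronavirus 2\b"]
pattern = re.compile("|".join(keywords), flags=re.IGNORECASE)

toy = pd.DataFrame({"body_text": [
    "SARS-CoV-2 binds the ACE2 receptor.",
    "Influenza A vaccination outcomes in adults.",
    "Early COVID cases were reported in 2019.",
]})

# One boolean column per topic: True when any keyword occurs in the text
toy["covid_match"] = toy["body_text"].apply(lambda text: bool(pattern.search(text)))
print(toy)
```

Compiling the joined pattern once avoids re-parsing the regular expression for every row.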



In [None]:
df_covid.head()

**Only keeping the papers that should be the most related to task 3**<br/>
After iterative discussions of the final results, we decided to keep only the papers that meet this condition:<br/>
>labelled with **_covid_match_** only




In [None]:
### only keeping the papers that should be most related to task3 and provides good results###

df_covid = df_covid.loc[(df_covid['covid_match'] == True) ] 
df_covid.info()


In [None]:
df_covid.head()

## **Preparing the text to be ready for vectorization in the next steps**


**Data Pre-Processing Steps**

Even though the extracted data is subject-oriented, it still contains noise from irrelevant terms, symbols, and punctuation marks. This undesired data reduces the performance of the classifier, so it should be removed from the corpus via the following steps:<br/>

>1. _Stop words removal:_<br/>
>A text always contains many unrelated terms such as ‘is’, ‘a’, ‘this’, etc., which have no value for feature generation. If we do not remove them from the body text, they increase the dimensionality of the features and therefore the computational complexity of the text classification model.<br/>

>2. _Character transformation:_<br/>
>A word may appear in different forms within one corpus or across corpora. Therefore, letter transformation should be used to convert all tokens into a standard format.<br/>

>3. _Normalization (lemmatization):_<br/>
>Words usually take several inflected forms depending on the grammatical context. In English this happens mostly to verbs, but also to nouns and adjectives. Normalizing tokens to a base form reduces this inflection and thus the dimensionality of the features.<br/>
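
The effect of the first two steps on vocabulary size can be seen in a small, dependency-free sketch. `TOY_STOP_WORDS` and `toy_clean` are hypothetical names with a tiny illustrative stop list, not the notebook's `process` function, and lemmatization is left out so the example needs no NLTK download.

```python
import re

TOY_STOP_WORDS = {"the", "is", "a", "this", "of", "in", "and"}  # tiny illustrative list

def toy_clean(text):
    text = text.lower()                    # character transformation: one standard case
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits and punctuation with spaces
    # stop word removal
    return [token for token in text.split() if token not in TOY_STOP_WORDS]

raw = "The Virus is a Coronavirus; this virus binds ACE2 in the lung."
tokens = toy_clean(raw)
print(tokens)
print(len(set(raw.split())), "distinct raw tokens ->", len(set(tokens)), "distinct cleaned tokens")
```

Removing digits also turns `ACE2` into `ace`, a side effect worth keeping in mind for gene and protein names.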



In [None]:
## 4-preparing the text to be ready for vectorization in the next steps ##

import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords

from nltk.stem.wordnet import WordNetLemmatizer

lemma = WordNetLemmatizer()


# extra stop words (on top of NLTK's English list) that add little value for this corpus
stop_set=['a','at','s','for','was', 'we', 'were', 'what', 'when', 'which', 'while', 'with', 'within', 'without', 'would', 'seem', 'seen','several', 'should','show', 'showed', 'shown', 'shows', 'significantly', 'since', 'so', 'some', 'such','obtained', 'of', 'often', 'on', 'our', 'overall','made', 'mainly', 'make', 'may', 'mg','might', 'ml', 'mm', 'most', 'mostly', 'must', 'each', 'either', 'enough', 'especially', 'etc','had', 'has', 'have', 'having', 'here', 'how', 'however', 'upon', 'use', 'used', 'using', 'perhaps', 'pmid','can', 'could', 'did', 'do', 'does', 'done', 'due', 'during', 'et al', 'found','study','observed','identified','fig','although','reported','group','result','include', 'figure', 'table']

stopword_set = set(stopwords.words('english') + stop_set)

# define a function that cleans the full body text
def process(string):
    string=' '+string+' '
    # drop stop words (case-sensitive match against stopword_set)
    string=' '.join([word if word not in stopword_set else '' for word in string.split()])
    string=re.sub(r'@\w*',' ',string)            # @-mentions
    string=re.sub(r'\.',' ',string)              # full stops
    string=re.sub(r"[,#'():$;?%]",' ',string)    # remaining punctuation
    string=re.sub(r'\d',' ',string)              # digits
    string=re.sub(r'[^\x00-\x7F]+',' ', string)  # non-ASCII characters
    # second pass for stop words that were attached to punctuation before
    for i in stop_set:
        string = re.sub(' ' +i+' ', ' ', string)
    string=" ".join(lemma.lemmatize(word) for word in string.split())
    string=re.sub(r' \w{1,2} ',' ', string)      # words of one or two characters
    string=re.sub(r'\s+',' ',string)             # collapse whitespace
    string=string.replace('[', '')
    string=string.replace(']', '')
    return string


df_covid['processed_text'] = df_covid['body_text'].apply(process)

# Feature engineering (TF-IDF vectorization with PCA)

 **Feature engineering**
>To run machine learning algorithms on textual data, the text needs to be converted to a numeric format. Converting the preprocessed text into feature vectors is called feature engineering. For this section, we have used four different methods.<br/>

> _Term Frequency Inverse Document Frequency (TF-IDF):_<br/>
>After pre-processing the body text, TF-IDF vectorization [[1](https://www.researchgate.net/publication/326425709_Text_Mining_Use_of_TF-IDF_to_Examine_the_Relevance_of_Words_to_Documents)] was used to vectorize the text. The major advantage of TF-IDF based vectorization is that it can be used to score and rank documents against a user query. Moreover, it diminishes the weight of commonly occurring words and increases the weight of words that occur rarely. TF-IDF has also shown excellent results in search engines. <br/>

>The code below uses scikit-learn’s implementation of TF-IDF to perform the task.<br/> 


In [None]:
## 1-Performing TF-IDF Vectorization on the data##

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and returns the sparse TF-IDF matrix in one step
X = vectorizer.fit_transform(df_covid['processed_text'].values)
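
To illustrate how TF-IDF lowers the weight of common terms, here is a small sketch on a made-up three-document corpus. The names `toy_corpus` and `toy_vectorizer` are hypothetical; the vectorizer uses scikit-learn's default settings, as in the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "virus genome sequence",
    "virus spike protein",
    "virus host receptor",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)  # sparse matrix, one row per document

# Map each term to its learned inverse document frequency
idf = {term: toy_vectorizer.idf_[index] for term, index in toy_vectorizer.vocabulary_.items()}

print("shape:", toy_matrix.shape)
print("idf of 'virus' (in every document):", idf["virus"])
print("idf of 'genome' (in one document): ", idf["genome"])
```

The term that occurs in every document receives the lowest IDF, so it contributes least when documents are compared.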


>After TF-IDF vectorization, Principal Component Analysis [[2](https://www.sciencedirect.com/science/article/abs/pii/0169743987800849)] was applied to the feature vectors as an unsupervised dimensionality reduction step. PCA was preferred because it reduces noise in the data by collapsing correlated features that do not contribute to the final decision.<br/>

In [None]:
## 2- Performing Principal Component Analysis (PCA) on the vectorized body text ##

# The TF-IDF matrix is sparse, so TruncatedSVD is used in place of sklearn's PCA
# (TruncatedSVD accepts sparse input; unlike PCA it does not center the data first)
from sklearn.decomposition import TruncatedSVD

PCA = TruncatedSVD(2)
Xpca = PCA.fit_transform(X)
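
How much information two components retain can be checked through `explained_variance_ratio_`. The sketch below uses a made-up six-document corpus; the names `toy_docs`, `toy_tfidf` and `toy_svd` are hypothetical, and `random_state=0` is set only to make the run repeatable.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "bat coronavirus genome sequence",
    "coronavirus genome evolution in bats",
    "spike protein receptor binding",
    "receptor binding domain of spike protein",
    "livestock surveillance on farms",
    "farm animals and livestock infection",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)  # sparse TF-IDF matrix
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_reduced = toy_svd.fit_transform(toy_tfidf)         # dense array with 2 columns

retained = toy_svd.explained_variance_ratio_.sum()
print("reduced shape:", toy_reduced.shape)
print("share of variance kept by 2 components:", round(float(retained), 3))
```

A low retained share suggests that a 2-D projection is useful for plotting but may discard detail needed for ranking.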


# Visualization and Evaluation


**Clustering**

>Since our features come from text, the vector of each document is expected to be placed in space according to the similarity of its content. An interesting way of finding and visualizing these patterns is k-means clustering [[6](https://www.jstor.org/stable/2346830?seq=1)], an unsupervised learning method that takes the feature vectors and the number of clusters as input and outputs the corresponding clusters. The optimal number of clusters was found using the Elbow method. 
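
The Elbow method can be sketched as follows. This is an illustration on synthetic 2-D points with three well-separated groups, not the notebook's actual search; `points` and `inertias` are hypothetical names, and `n_init` and `random_state` are fixed only for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([center + rng.normal(scale=0.5, size=(50, 2)) for center in centers])

# Fit k-means for a range of k and record the within-cluster sum of squares (inertia)
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The "elbow" is the value of k after which the inertia stops dropping sharply; for this synthetic data that happens at k = 3.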

**Quantitative Measure of Performance of the Vectorization Algorithm**

>The first experiment uses the algorithm suggested by S. Vajda et al. [[5](https://www.researchgate.net/publication/316550769_A_Fast_k-Nearest_Neighbor_Classifier_Using_Unsupervised_Clustering)]. To quantitatively compare the feature vectors obtained from different algorithms, it proceeds in the following steps:<br/>
>1. Cluster all the feature vectors obtained from the feature engineering step.<br/> 
>2. Obtain labels for all the feature vectors. This would now act as your labeled data for the following steps.<br/>
>3. Split the data into train and test sets.<br/>
>4. Perform K-Nearest Neighbour classification of the test set using the labels of the training set.<br/>
>5. Measure the accuracy of KNN.<br/>

>Moreover, to better judge the performance of the algorithm, we performed K-fold cross-validation. The results of the different algorithms are compared in the bar graph. TF-IDF vectorization performed better than the other algorithms.<br/> 
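
The five steps listed above can be sketched end to end on synthetic vectors. This is a minimal illustration under stated assumptions: the data, the cluster count of 3, the 80/20 split and `n_neighbors=5` are chosen for the example and are not the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic feature vectors: three blobs of 60 points each
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 4.0]])
vectors = np.vstack([center + rng.normal(scale=0.5, size=(60, 2)) for center in blob_centers])

# Steps 1-2: cluster the vectors and use the cluster ids as pseudo-labels
pseudo_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

# Step 3: split into 80% training and 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    vectors, pseudo_labels, test_size=0.2, random_state=0, stratify=pseudo_labels
)

# Steps 4-5: classify the test vectors with KNN and measure accuracy
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print("KNN accuracy on cluster pseudo-labels:", accuracy)
```

Because the labels come from clustering the same vectors, this accuracy measures how cleanly the feature space separates into clusters, not agreement with human-assigned topics.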



In [None]:
## Helper function for plotting the clustering results ## 
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import seaborn as sns


def plotting(datapoint, labels):
    # sns settings
    sns.set(rc={'figure.figsize':(15,15)})

    # colors
    palette = sns.color_palette("bright", len(set(labels)))

    # plot
    # x and y are passed as keywords; recent seaborn versions no longer accept them positionally
    sns.scatterplot(x=datapoint[:, 0], y=datapoint[:, 1], hue=labels, legend='full', palette=palette)
    plt.title("Covid-19 Articles - Clustered(K-Means) - Using Vectorized body text")

    plt.savefig("/kaggle/working/covid19_label_TFIDF.png")
    plt.show()

In [None]:
## 3 & 4: visualization and evaluation ##
## Helper function for quantitatively comparing and measuring accuracy using a combination of clustering + KNN [5] ##

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score


def acc(data_vectors):
    
    # Cluster the vectors find their labels 
    kmeans_model = KMeans(n_clusters=9, init='k-means++', max_iter=100) 
    kmeans_model.fit(data_vectors)
    labels=kmeans_model.labels_.tolist()
    plotting(data_vectors, labels) # Plot the clusters 
    
    # Perform KNN on using the labels from clustering
    neigh = KNeighborsClassifier(n_neighbors=15)

    # performing cross validation 
    scores = cross_val_score(neigh, data_vectors, labels, cv=5)
    acc = np.average(scores)
    return acc 

print(acc(Xpca))

The table below compares our algorithm, which uses TF-IDF+PCA, with various other popular algorithms on the quantitative measure of performance suggested by S. Vajda et al. [[5](https://www.researchgate.net/publication/316550769_A_Fast_k-Nearest_Neighbor_Classifier_Using_Unsupervised_Clustering)]. 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img=mpimg.imread('/kaggle/input/assets/Comparison of different algorithms(1).png')
plt.figure(figsize=(12,12))
imgplot = plt.imshow(img)
plt.show()

In [None]:
# Dropping irrelevant columns for further processing
# errors='ignore' skips columns that are absent (e.g. "sars_match", "mers_match") instead of raising
df_covid = df_covid.drop(columns=["language", "processed_text", "protein_match", "sars_match", "mers_match", "covid_match", "genom_match", "paper_id"], errors='ignore') # keep "abstract", "authors", "title", "journal", "body_text"


# rearranging the data columns for further visualizations 
cols = ['title', 'abstract', 'authors', 'journal', 'body_text']
df_covid = df_covid[cols]

In [None]:
# Helper function for plotting word clouds for various questions related to task 3

from wordcloud import WordCloud

def word_cloud(df, filename):
    long_string = ','.join(list(df['body_text'].values[:20])) # Create a WordCloud object using the top 20 papers 
    wordcloud = WordCloud(background_color="white", max_words=50000, contour_width=5, contour_color='steelblue') # Generate a word cloud
    wordcloud.generate(long_string)
    wordcloud.to_image()
    wordcloud.to_file(filename + '.png')
    
    


# Task Description Projection and Sorting Documents based on Cosine Similarity 

>To find the most relevant documents, we need the similarity between the document feature vectors and the task projections. To do this, we projected the task description into the same feature space as the body text documents: the task description was pre-processed with the same pre-processing steps, vectorized with the pre-trained TF-IDF model, and projected to a lower dimension with the pre-trained PCA model.<br/>
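
The projection step can be sketched on a toy corpus as shown here. All names (`toy_docs`, `toy_query`, `toy_vec`, `toy_svd`) and texts are made up for illustration, and `sklearn.metrics.pairwise.cosine_similarity` is used as a convenience that the notebook itself does not use. The key point is that the query is only transformed, never fitted.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_docs = [
    "bat coronavirus genome sequence and host range",
    "spike protein receptor binding in human cells",
    "hospital staffing and economic impact of lockdown",
    "livestock surveillance for coronavirus infection on farms",
]
toy_query = ["coronavirus infection in livestock and farm animals"]

# Fit both models on the documents only
toy_vec = TfidfVectorizer().fit(toy_docs)
toy_svd = TruncatedSVD(n_components=2, random_state=0).fit(toy_vec.transform(toy_docs))

# Project documents and query into the same reduced space
doc_vectors = toy_svd.transform(toy_vec.transform(toy_docs))
query_vector = toy_svd.transform(toy_vec.transform(toy_query))

# Rank documents by cosine similarity to the query, most similar first
scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = scores.argsort()[::-1]
for index in ranking:
    print(round(float(scores[index]), 3), toy_docs[index])
```

Query words that the vectorizer never saw during fitting are silently ignored, so a query should share vocabulary with the corpus.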

>To compare documents, cosine similarity was used. The reason behind this choice is that even if two documents are far apart in Euclidean distance (due to the size of the document), they can still be oriented in a similar direction.<br/>
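
A small numeric sketch shows the difference between the two measures. The vectors are made up: `long_doc` points in the same direction as `short_doc` but is ten times longer, while `other_doc` is close in length but orthogonal.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

short_doc = np.array([1.0, 2.0])
long_doc = 10 * short_doc          # same direction, larger magnitude
other_doc = np.array([2.0, -1.0])  # similar magnitude, different direction

print("euclidean short-long: ", np.linalg.norm(short_doc - long_doc))
print("euclidean short-other:", np.linalg.norm(short_doc - other_doc))
print("cosine short-long:    ", cosine_similarity_1d(short_doc, long_doc))
print("cosine short-other:   ", cosine_similarity_1d(short_doc, other_doc))
```

Euclidean distance ranks `other_doc` as the closer neighbour of `short_doc`, whereas cosine similarity ranks `long_doc` first because it ignores vector length.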

>We have also formed word clouds of the top 20 most relevant documents to see the top keywords.<br/>


**Supervised Qualitative Analysis of Results** 

>A supervised evaluation was conducted to better judge and compare the performance of the algorithms. Dr. Hassan Vahidnezhad evaluated the word clouds together with the table containing the top 10 most similar documents. In his expert opinion, based on the table and the keywords he observed, TF-IDF performed better than Doc2Vec and Doc2Vec+TF-IDF.




In [None]:


from scipy import spatial

def get_closest_neighbours_table(question, vector_body_text, vectorizer_model, pca_model, df, filename):

    # Vectorize the task text using the pre-trained TF-IDF model
    task_X = vectorizer_model.transform(question)
    # Project it to 2 dimensions with the pre-trained model; [0] takes the single row
    # as a 1-D vector, which is what spatial.distance.cosine expects
    task_out = pca_model.transform(task_X)[0]
    filename = filename.replace(' ', '_')

    # work on a copy so the caller's dataframe is not modified
    df = df.copy()
    # cosine similarity between every paper vector and the task vector
    df['cosine_similarity'] = [1 - spatial.distance.cosine(a, task_out) for a in vector_body_text]
    # Sorting the papers based on the cosine similarity scores
    df = df.sort_values(by=['cosine_similarity'], ascending=False)

    # picking the top 10 papers based on highest cosine similarity
    df = df.iloc[:10].copy()
    # keep only the first 1000 characters of each body text as the snippet
    df['body_text'] = df['body_text'].apply(lambda x: x[:1000])

    # Save the dataframe to html for better visualization of data.
    with open(filename + ".html", "w") as text_file:
        text_file.write(df.to_html())

    return df


In [None]:
# Increasing the maximum column width to 300 for better visualization of tables 

pd.set_option('display.max_colwidth', 300)

In [None]:
# Finding the most similar documents to questions related to task 3 as suggested by Dr. Hassan Vahidnezhad

question1 = 'How is the corona virus infecting the animals different from the covid-19 virus that is infecting humans'
question_preprocessed = [process(question1)] # Performing the same pre-processing step on the question document
df = get_closest_neighbours_table(question_preprocessed, Xpca, vectorizer, PCA, df_covid, question1) # Finding the closest neighbours 
print(question1)
df.head()



In [None]:
# Make a word cloud from the papers returned above (df holds the top 10, so [:20] uses all of them).

long_string = ','.join(list(df['body_text'].values[:20])) # Create a WordCloud object using the top 20 papers 
wordcloud = WordCloud(background_color="white", max_words=50000, contour_width=5, contour_color='steelblue') # Generate a word cloud
wordcloud.generate(long_string)
wordcloud.to_image()

In [None]:


question2 = 'What part of the covid-19 virus genome determines the suitable host for this coronavirus'
question_preprocessed = [process(question2)]
df = get_closest_neighbours_table(question_preprocessed, Xpca, vectorizer, PCA, df_covid, question2)
print(question2)
df.head()



In [None]:
# Make a word cloud from the papers returned above (df holds the top 10, so [:20] uses all of them).

long_string = ','.join(list(df['body_text'].values[:20])) # Create a WordCloud object using the top 20 papers 
wordcloud = WordCloud(background_color="white", max_words=50000, contour_width=5, contour_color='steelblue') # Generate a word cloud
wordcloud.generate(long_string)
wordcloud.to_image()

In [None]:
# Finding the most similar documents to the complete task 3 description.  

task3 = 'Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time. Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation. Multi-lateral agreements such as the Nagoya Protocol could be leveraged. Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over. Evidence of whether farmers are infected, and whether farmers could have played a role in the origin. Surveillance of mixed wildlife- livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia. Experimental infections to test host range for this pathogen. Animal host(s) and any evidence of continued spill-over to humans. Socioeconomic and behavioral risk factors for this spill-over. Sustainable risk reduction strategies.'
question_preprocessed = [process(task3)]
df = get_closest_neighbours_table(question_preprocessed, Xpca, vectorizer, PCA, df_covid, 'General task3 description')
print('General task3 description')
df.head()



In [None]:
# Make a word cloud from the papers returned above (df holds the top 10, so [:20] uses all of them).

long_string = ','.join(list(df['body_text'].values[:20])) # Create a WordCloud object using the top 20 papers 
wordcloud = WordCloud(background_color="white", max_words=50000, contour_width=5, contour_color='steelblue') # Generate a word cloud
wordcloud.generate(long_string)
wordcloud.to_image()

# Algorithm Overview and Discussion

>In the following subsections, main stages of our analysis and technical approach are explained.

>1- Dataset Analysis<br/>
>2- Pre-processing<br/>
>3- Feature engineering<br/>
>4- Task description projection and sorting documents based on cosine similarity<br/>
>5- Clustering and KNN<br/>
>6- Text classification model building and its evaluation<br/>




> 1 - **Dataset analysis**
>>Defining influential factors based on the specific features of a defined task is crucial as data grows and complexity increases. Hence our group decided to label the provided large dataset by the language of the article, whether the article is related to COVID-19, whether it is related to the genomic sequence of the virus, and whether it discusses the surface receptor proteins of the target virus. We believe this labeling helps us reduce the noise in the dataset, because we drop the articles that do not meet these criteria, and that this affects the performance of the classifier.<br/>
>>In this process, our group followed a semi-supervised approach: after reducing the number of articles, we randomly selected and read some of them to check whether they met the criteria we expected.<br/>


> 2 - **Data pre-processing**
>>Even Though the extracted data is subject-oriented, it still contains noise which is from irrelevant terms, symbols, and punctuation marks. These undesired data will reduce the performance of the classifier therefore, they should be removed from the corpus via following steps:<br/>

>>2.1. **Stop words removal**:<br/>
>>There always exist many unrelated terms such as ‘is’, ‘a’, ‘this’ and etc in a text which do not have any value in terms of feature generation. However if we don't omit them from the body of text, they increase the dimensionality of features, therefore, results in computational complexity of text classification model.<br/>

>>2.2. **Character transformation**:<br/>
>>A word may be written differently (for example, in different letter cases) within one corpus or across corpora. Therefore, letter transformation is used to convert all tokens into a standard format.<br/>

>>2.3. **Token normalization**:<br/>
>>Words usually take several inflected forms depending on the grammatical context. In English, this happens mostly to verbs, but also to nouns and adjectives. Normalizing tokens reduces the inflection of words and thus the dimensionality of the features.<br/>
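
>>The three pre-processing steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: the stop-word set and the `NORMALIZED_FORMS` lookup are hypothetical, deliberately tiny stand-ins for a full stop-word list and a real lemmatizer.<br/>

```python
import re

# Hypothetical, deliberately tiny resources for illustration only; a real
# pipeline would use a full stop-word list and a proper lemmatizer.
STOP_WORDS = {"is", "a", "this", "the", "of", "and", "in", "to", "are"}
NORMALIZED_FORMS = {
    "viruses": "virus",
    "infected": "infect",
    "infects": "infect",
    "proteins": "protein",
}


def preprocess(text):
    """Lowercase, strip symbols and punctuation, drop stop words, normalize tokens."""
    text = text.lower()                                   # 2.2 letter transformation
    tokens = re.findall(r"[a-z0-9]+", text)               # keep alphanumeric runs only
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 2.1 stop words removal
    return [NORMALIZED_FORMS.get(t, t) for t in tokens]   # 2.3 token normalization


print(preprocess("This is a Virus; the viruses INFECTED cells!"))
```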


> 3 - **Feature engineering**
>>To apply machine learning algorithms to textual data, the text must first be converted into a numeric format. The process of converting the pre-processed text data into feature vectors is called feature engineering. In this section, we used four different methods.<br/>

>>3.1. **Term Frequency Inverse Document Frequency (TF-IDF)**:<br/>
>>After pre-processing the body text, TF-IDF vectorization [[1](https://www.researchgate.net/publication/326425709_Text_Mining_Use_of_TF-IDF_to_Examine_the_Relevance_of_Words_to_Documents)] was used to vectorize the text. The major advantage of TF-IDF based vectorization is that it can be used to score and rank documents against a user query. Moreover, it diminishes the weight of commonly occurring words and increases the weight of words that occur rarely. TF-IDF is also widely used in search engines.<br/>
>>We use scikit-learn’s implementation of TF-IDF for this task.<br/>
>>After TF-IDF vectorization, Principal Component Analysis [[2](https://www.sciencedirect.com/science/article/abs/pii/0169743987800849)] was applied to the feature vectors as an unsupervised dimensionality reduction step. PCA was preferred because it reduces noise in the data by collapsing correlated features that contribute little to the final decision.<br/>
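
>>As a rough sketch of this step, the snippet below runs scikit-learn's `TfidfVectorizer` followed by `PCA` on a hypothetical four-document corpus. The documents and the choice of two components are illustrative assumptions, not the notebook's actual data or settings.<br/>

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the pre-processed body texts.
docs = [
    "coronavirus genome sequence analysis",
    "spike protein receptor binding",
    "bat coronavirus origin evolution",
    "receptor binding domain genome mutation",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse matrix, one row per document

# PCA expects a dense array; n_components=2 is an illustrative choice.
pca = PCA(n_components=2, random_state=0)
reduced = pca.fit_transform(tfidf.toarray())

print(tfidf.shape, reduced.shape)
```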

>>3.2. **Doc2Vec**:<br/>
>>Doc2Vec [[3](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)] represents a document as a feature vector based on its content. The algorithm builds on Word2Vec, which is used for word embeddings. The intuition behind using Doc2Vec was that it takes word order into account. It has also been shown to generalize to longer documents and can learn from unlabelled data.<br/>
>>We used the gensim library implementation. The algorithm is fed with the pre-processed text data, encapsulated as tagged documents. The model was trained for 10 epochs.<br/>
>>The parameters used for Doc2Vec are as follows:<br/>
>>Vector size = 100: the dimensionality of the feature vector (this is one of the benefits of Doc2Vec compared to TF-IDF, because it has lower dimensionality).<br/>
>>Min count = 100: the algorithm ignores all words that occur fewer times than this value.<br/>
>>Window size = 10: we allowed a larger number of context words per entry so that the algorithm can generate better feature vectors.<br/>

>>3.3. **Doc2Vec and TF-IDF together**:<br/>
>>We built an ensemble of Doc2Vec and TF-IDF. Doc2Vec and TF-IDF were trained separately on the processed body text. The dimensionality of each was then reduced separately to 2 components using PCA, and the two resulting feature vectors were concatenated.<br/>
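
>>The concatenation can be sketched as follows. The two random matrices are hypothetical stand-ins for the real TF-IDF and Doc2Vec features; only the reduce-then-concatenate pattern is the point here.<br/>

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 6 documents with 50-dim TF-IDF features and
# 100-dim Doc2Vec features.
tfidf_features = rng.random((6, 50))
doc2vec_features = rng.random((6, 100))

# Reduce each representation separately to 2 components.
tfidf_2d = PCA(n_components=2).fit_transform(tfidf_features)
doc2vec_2d = PCA(n_components=2).fit_transform(doc2vec_features)

# Concatenate column-wise: each document ends up with a 4-dimensional vector.
combined = np.hstack([tfidf_2d, doc2vec_2d])
print(combined.shape)
```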

>>3.4. **Bigram phrase modeling and Doc2Vec**:<br/>
>>We implemented bigram phrase detection with the gensim library, which detects frequently used two-word phrases and treats each as a single token. We then transformed our corpus with the bigram model and trained Doc2Vec on this newly generated data. This time, however, our Doc2Vec model was a concatenation of DBOW (Distributed Bag of Words) and DMM (Distributed Memory Mean). Each Doc2Vec model was trained separately for 30 epochs on the bigram-transformed corpus, and the resulting feature vectors were then combined.<br/>


> 4 - **Task description projection and sorting documents based on <u>cosine similarity</u>**
>>To find the most relevant documents, we need to measure the similarity between the document feature vectors and the task projections. To do this, we projected the task description into the same feature space as the body text documents. The task description was pre-processed using the same pre-processing steps, vectorized using the pre-trained TF-IDF model, and then projected to a lower dimension using the pre-trained PCA model.<br/>
>>Cosine similarity was then used to compare documents. The reason behind this choice is that even if two vectors are far apart in Euclidean distance (due to the size of the document), they can still point in a similar direction.<br/>
>>We also formed word clouds of the top 20 most relevant documents to see the top keywords.<br/>
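
>>The projection and ranking can be sketched as below, assuming scikit-learn's `TfidfVectorizer`, `PCA`, and `cosine_similarity`. The documents and the task description are hypothetical examples; the key point is that the task text is transformed with the already fitted models rather than refitted ones.<br/>

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical pre-processed documents and task description.
docs = [
    "coronavirus genome sequence analysis",
    "spike protein receptor binding",
    "bat coronavirus origin evolution",
    "hospital management staffing policy",
]
task_description = "virus genome origin evolution"

# Fit the vectorizer and PCA on the documents only.
vectorizer = TfidfVectorizer()
pca = PCA(n_components=2, random_state=0)
doc_vectors = pca.fit_transform(vectorizer.fit_transform(docs).toarray())

# Reuse the fitted models so the task lands in the same feature space.
task_vector = pca.transform(vectorizer.transform([task_description]).toarray())

# Rank documents from most to least similar to the task description.
scores = cosine_similarity(task_vector, doc_vectors)[0]
ranking = np.argsort(scores)[::-1]
for idx in ranking:
    print(f"{scores[idx]:+.3f}  {docs[idx]}")
```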

> 5 - **Clustering**
>>Since our data is text-based, the vector of each document is expected to be positioned in the feature space according to the similarity of its content. An interesting way of finding and visualizing such patterns is k-means clustering [[6](https://www.jstor.org/stable/2346830?seq=1)], an unsupervised learning method that takes the feature vectors and the number of clusters as input and outputs the corresponding clusters. To find the optimal number of clusters with the elbow method, we had to reduce the dimension of the feature vectors, because otherwise the graph did not show consistent behaviour. To project the high-dimensional data onto a two-dimensional space, we used PCA on the vectors obtained with Doc2Vec and TF-IDF together.<br/> 
>>As you can see, we now have very distinct clusters of vectors. All four vectorization methods show good clustering performance.<br/>
>>Please see image 1 below.<br/>
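
>>The elbow method mentioned above can be sketched as follows. The synthetic blobs are a hypothetical stand-in for the PCA-reduced document vectors, and the range of `k` values is an arbitrary choice; one would look for the `k` after which the inertia stops dropping sharply.<br/>

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-D data standing in for the PCA-reduced document vectors.
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=0)

# Fit k-means for several values of k and record the inertia
# (within-cluster sum of squared distances) of each fit.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```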


> 6 - **Text classification model building and its evaluation**
>>Two experiments, one unsupervised and one supervised, were performed to evaluate the performance of the algorithm. The algorithm was compared with other popular text mining algorithms to show the efficiency of this approach.<br/> 
>>The first experiment uses the algorithm proposed by S. Vajda et al. [[5](https://www.researchgate.net/publication/316550769_A_Fast_k-Nearest_Neighbor_Classifier_Using_Unsupervised_Clustering)] to quantitatively assess the feature vectors obtained from the different algorithms. It can be summarized in the following steps:<br/>

>>6.1. Cluster all the feature vectors obtained from the feature engineering step.<br/> 
>>6.2. Obtain labels for all the feature vectors. This would now act as your labeled data for the following steps.<br/>
>>6.3. Split the data into train and test sets.<br/>
>>6.4. Perform K-Nearest Neighbour classification of the test set using the labels of the training set.<br/>
>>6.5. Measure the accuracy of KNN.<br/>
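
>>The five steps above can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the notebook's evaluation code: the synthetic blobs stand in for the real feature vectors, and the number of clusters, the test split, and `n_neighbors` are arbitrary choices.<br/>

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical feature vectors standing in for the feature engineering output.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# 6.1 and 6.2: cluster the vectors and use the cluster ids as labels.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# 6.3: split the labeled data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)

# 6.4: classify the test set with KNN fitted on the training set.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
predictions = knn.predict(X_test)

# 6.5: measure the accuracy of KNN against the cluster labels.
accuracy = accuracy_score(y_test, predictions)
print(f"KNN accuracy on cluster labels: {accuracy:.3f}")
```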

>>Moreover, to better judge the performance of the algorithm, we performed K-fold cross-validation. The results of the different algorithms are compared in the bar graph. TF-IDF vectorization was observed to perform better than the other algorithms.<br/> 
>>For the second experiment, a supervised evaluation was conducted. After sorting the documents by cosine similarity, word clouds were formed from the top 20 papers. Dr…. evaluated these word clouds together with the table containing the top 10 most similar documents. In his expert opinion, based on the table and the keywords he observed, TF-IDF performed better than Doc2Vec and Doc2Vec+TF-IDF.<br/>
>><br/>
>>**_Based on our evaluation of both the quantitative and qualitative measures, TF-IDF proved superior to the other algorithms. This made it the ideal choice to solve this text mining task and find the research documents most relevant to task 3._**<br/> 





In [None]:
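# Render matplotlib figures inline in the notebook.
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Load the image from the Kaggle input dataset and display it at a large size.
img = mpimg.imread('/kaggle/input/assets2/Assets/IMG_5188.JPG')
plt.figure(figsize=(20, 20))
plt.imshow(img)
plt.show()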


# Acknowledgements

We would like to extend our gratitude to Dr. Hassan Vahidnezhad and his team at the Jefferson Institute of Molecular Medicine and the Department of Dermatology and Cutaneous Biology, Thomas Jefferson University Hospital, Philadelphia, PA, for their expert medical opinions and for helping us achieve the best results. He was instrumental in helping us understand the subtasks that medical experts are currently interested in. We would also like to thank Prof. Dr. Vahidzadeh for his constant support and guidance. He helped us understand the intricate details and the specific areas on which our text mining should focus to show the best results. We would also like to acknowledge Dr. Sara Sadr, M.D., Faculty of Medicine, Mazandaran University of Medical Sciences, and Dr. Fateme Jafari, M.D., Faculty of Medicine, Tehran University of Medical Science, for their valuable insights. They helped us gather relevant keywords and more specific questions to make our text mining results more accurate.

# References 

[1] Qaiser, Shahzad & Ali, Ramsha. (2018). Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents. International Journal of Computer Applications. 181. 10.5120/ijca2018917395.

[2] Wold, Svante, Kim Esbensen, and Paul Geladi. "Principal component analysis." Chemometrics and intelligent laboratory systems 2, no. 1-3 (1987): 37-52.

[3] Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." In International conference on machine learning, pp. 1188-1196. 2014.

[4] Peterson, Leif E. "K-nearest neighbor." Scholarpedia 4, no. 2 (2009): 1883.

[5] Vajda S., Santosh K.C. (2017) A Fast k-Nearest Neighbor Classifier Using Unsupervised Clustering. In: Santosh K., Hangarge M., Bevilacqua V., Negi A. (eds) Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2016. Communications in Computer and Information Science, vol 709. Springer, Singapore

[6] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
