# Task 3 - CORD Hackathon Submission

In [None]:
from IPython.display import Image


In [None]:
Image("/kaggle/input/task3-data/Images/Images/Cover2.PNG")


## Introduction

COVID-19 has plunged the world into an unprecedented crisis, and the global medical community is working tirelessly not only to contain the pandemic but also to prepare to manage the disease effectively in the long term. Key to this endeavor is a robust fundamental understanding of the disease itself: how it originated, what its genetics tell us, and whether we can learn from and predict its evolution. A large body of existing research contains crucial knowledge, but the lack of a standard nomenclature or organizing scheme makes it a difficult resource for the scientific community to leverage. 
To alleviate this constraint, our team has worked towards delivering a scalable solution that makes it easier for medical researchers and the global scientific community to find what they need in this ever-expanding trove of disease-related research. 


## Task Details

What do we know about virus genetics, origin, and evolution? 
What do we know about the virus origin and management measures at the human-animal interface?

Specifically, we want to know what the literature reports about:

Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time.
Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation. Multi-lateral agreements such as the Nagoya Protocol could be leveraged.
Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.
Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.
Surveillance of mixed wildlife- livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.
Experimental infections to test host range for this pathogen.
Animal host(s) and any evidence of continued spill-over to humans
Socioeconomic and behavioral risk factors for this spill-over
Sustainable risk reduction strategies

## What are we delivering?
This notebook provides a data extraction mechanism that lets researchers readily find publications relevant to the overall Task 3 objective.

## Approach
The tool had to be powerful enough to extract data from the nearly 60,000 available publications on coronaviruses in general or COVID-19 specifically. General “Find” or “Google” searches were not powerful enough to support this type of data mining, so the team had to be creative in order to be successful. 
The data set used is the COVID-19 Complete Dataset via Kaggle.

Below is an outline of the team’s effort to provide a viable research tool that scientists can use to cross-reference publications from across the world.

#### Section 1 : Real-time tracking of whole genomes to track variations of the virus over time

#### Section 2: Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences

#### Section 3 : Evidence that livestock could be infected
    3.1 Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over: Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.

    3.2 Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.: Surveillance of mixed wildlife- livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.

    3.3 Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.: Experimental infections to test host range for this pathogen.

#### Section 4:  What do we know about virus genetics, origin, and evolution? What do we know about the virus origin and management measures at the human-animal interface?

    Task 4.1- Animal host(s) and any evidence of continued spill-over to humans
    Task 4.2- Sustainable risk reduction strategies
    Task 4.3- Socioeconomic and behavioral risk factors for this spill-over
    

#### Section 5:  Retrieve relevant articles using BERT



### Note: A few cells are commented out to save time, and their processed outputs are saved to files. To view the processing steps, please uncomment the relevant code.

## Import Packages

Python was chosen as the platform to build the tool for its versatility and ubiquity. The final deliverable involved complex text processing, array manipulation and search, and creative visualizations. For this purpose, several specialized Python libraries were utilized including pandas, numpy, scipy and sklearn. 

In [None]:

!pip install bert-extractive-summarizer
#!pip install nxviz

In [None]:
!pip install -U sentence-transformers

In [None]:
import pandas as pd
import os
import numpy as np
from tqdm.notebook import tqdm
import scipy


import textwrap
import json
import logging
import pickle
import warnings
warnings.simplefilter('ignore')

import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('wordnet') 
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk import tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.translate.bleu_score import sentence_bleu

import json
import glob
import string


from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from IPython.core.display import display, HTML
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
#from sentence_transformers import models, SentenceTransformer
import shutil

import torch
from transformers import BertTokenizer, BertModel

import pandas as pd

from nltk.tokenize import word_tokenize
import numpy as np

# Load the library with the CountVectorizer method
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

from summarizer import Summarizer
import math

import os, stat
import ntpath
from string import ascii_uppercase
#import nxviz


In [None]:
%matplotlib inline
import numpy as np
import math
import matplotlib
import matplotlib.pyplot as plt
import os, stat
import ntpath
from string import ascii_uppercase

In [None]:
import spacy
from spacy.matcher import PhraseMatcher #import PhraseMatcher class


In [None]:
import matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud

## Download Data (In case you want a local copy of the data)

In [None]:
# Specify the Kaggle Username and Key to use the Kaggle Api

# os.environ['KAGGLE_USERNAME'] = '*************'
# os.environ['KAGGLE_KEY'] = '****************'

In [None]:
# from kaggle.api.kaggle_api_extended import KaggleApi

# api = KaggleApi()
# api.authenticate()

# api.dataset_download_files(dataset="allen-institute-for-ai/CORD-19-research-challenge", path=DATA_PATH, unzip=True)

# HTML('''<script>
# code_show=true; 
# function code_toggle() {
#  if (code_show){
#  $('div.input').hide();
#  } else {
#  $('div.input').show();
#  }
#  code_show = !code_show
# } 
# $( document ).ready(code_toggle);
# </script>
# The raw code for this IPython notebook is by default hidden for easier reading.
# To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

## Data Pre-processing

For the ease of searching and indexing, all the research articles were converted into flat text files and stored in a table format. In this step, all the articles were parsed and filtered for relevance. As a framework, we used categories of keywords. For example, to address animal source, the keywords were categorized as animal words, spill-over words and virus words. Through this exercise, we were able to narrow down the number of relevant articles to around 8000. This framework was also utilized for the other two sub-tasks. 
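The category-based filtering described above can be sketched as follows. This is only an illustration: the keyword lists, the rule that an article must match at least one keyword from every category, and the helper name `matches_all_categories` are assumptions, not the team's actual lists or logic.

```python
# Hypothetical keyword categories; the lists actually used by the team are not shown.
KEYWORD_CATEGORIES = {
    "animal": ["bat", "pangolin", "livestock"],
    "spillover": ["spill-over", "spillover", "zoonotic"],
    "virus": ["coronavirus", "sars-cov-2", "covid-19"],
}

def matches_all_categories(text, categories=KEYWORD_CATEGORIES):
    """Return True if the text contains at least one keyword from every category."""
    if not isinstance(text, str):
        # Missing abstracts (None / NaN) are treated as not relevant
        return False
    lowered = text.lower()
    return all(any(word in lowered for word in words) for words in categories.values())

# Toy records standing in for rows of the article table
articles = [
    {"title": "A", "abstract": "Zoonotic spillover of coronavirus from bat populations."},
    {"title": "B", "abstract": "A study of influenza vaccines in adults."},
    {"title": "C", "abstract": None},
]
relevant = [a["title"] for a in articles if matches_all_categories(a["abstract"])]
print(relevant)
```

With a DataFrame such as `fin_df`, the same predicate could be applied to a text column with `Series.apply` to build a boolean mask. Plain substring matching is crude (for example, "bat" also matches "combat"), so a word-boundary regular expression may be preferable in practice.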

### Specify Data Folders

In [None]:
DATA_PATH = os.getcwd()+'/kaggle/input/CORD-19-research-challenge/'

### Combine articles to a dataframe

To save time, the combined data is stored as the CSV file 'fin_df.csv', which is loaded and used below. Uncomment the lines below to combine the articles into a DataFrame.

In [None]:
#Get data from path '/data1/cov19/kaggle_data_0331/'
'''
bio_path = '/kaggle/input/CORD-19-research-challenge//biorxiv_medrxiv/'
comm_path = '/kaggle/input/CORD-19-research-challenge//comm_use_subset/'
non_comm_path = '/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/'
custom_path = '/kaggle/input/CORD-19-research-challenge/custom_license/'
journals = {"BIORXIV_MEDRXIV": bio_path,
             "COMMON_USE_SUB" : comm_path,
             "NON_COMMON_USE_SUB" : non_comm_path,
             "CUSTOM_LICENSE" : custom_path}
'''

In [None]:
'''
def parse_each_json_file(file_path,journal):
    inp = None
    with open(file_path) as f:
        inp = json.load(f)
    rec = {}
    rec['sha'] = inp['paper_id'] or None
    rec['title'] = inp['metadata']['title'] or None
    # Join every abstract paragraph (the previous range stopped one short and dropped the last one)
    abstract = "\n ".join(paragraph['text'] for paragraph in inp['abstract'])
    rec['abstract'] = abstract or None
    full_text = []
    for _ in range(len(inp['body_text'])):
        try:
            full_text.append(inp['body_text'][_]['text'])
        except:
            pass

    rec['full_text'] = "\n ".join(full_text) or None
    rec['source'] =  journal     or None    
    return rec

def parse_json_and_create_csv(journals):
    journal_dfs = []
    for journal, path in journals.items():
        parsed_rcds = []  
        json_files = glob.glob('{}/**/*.json'.format(path), recursive=True)
        for file_name in json_files:
            rec = parse_each_json_file(file_name,journal)
            parsed_rcds.append(rec)
        df = pd.DataFrame(parsed_rcds)
        journal_dfs.append(df)
    return pd.concat(journal_dfs)


fin_df = parse_json_and_create_csv(journals=journals)
fin_df.head()

'''

In [None]:
fin_df = pd.read_csv('/kaggle/input/task3-data/fin_df.csv')

In [None]:
fin_df.head()

## Section 1 : Real-time tracking of whole genomes to track variations of the virus over time
 

### Method to measure variations in different Genome sequences

### 1.1 Approach


To accomplish this task, we follow the steps below:

- Retrieve the genome sequences from publicly available real-time databases such as the China National GeneBank and the GISAID EpiCoV™ Database.
- Use the same algorithm as in Task 3.2 to classify whether the new samples or sequences belong to existing or new groups.
- Determine whether the new sequences are changing over time.

Key points:

- This task tracks temporal variations.
- Bootstrap confidence interval of the measured metric.
- Hierarchical classification of the sample sequence.

This subtask relates to the evolution of the genome sequence in time. To study these variations, we use the same approach as described in section 3.2. The only two differences are as follows:

1. The input sequences must be specified for an individual geographical location, to remove the spatial dependency. These sequences must also be specified for a given period of time and sorted by date in ascending order.
2. The algorithm of section 3.2 is then applied, but modified so that it only analyzes consecutive pairs of sequences, e.g. 1st & 2nd; 2nd & 3rd; 3rd & 4th; ...; (N-1)th & Nth, where N is the number of input sequences.
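As a minimal sketch of the consecutive-pair scheme, the snippet below sorts dated samples and pairs each one with its successor. The helper name `consecutive_pairs`, the toy sequences, and the mismatch count used as a stand-in metric are all illustrative assumptions; the notebook's own metric is the correlation-based one developed in the following cells.

```python
def consecutive_pairs(samples):
    """Sort (date, sequence) samples by date and return adjacent pairs."""
    ordered = sorted(samples, key=lambda s: s[0])
    return list(zip(ordered[:-1], ordered[1:]))

# Toy data for one location; ISO-formatted dates sort correctly as strings
samples = [
    ("2020-03-01", "ACGTT"),
    ("2020-01-15", "ACGTA"),
    ("2020-02-10", "ACGTA"),
]

pairs = consecutive_pairs(samples)
for (d0, s0), (d1, s1) in pairs:
    # Stand-in metric: number of positions at which the two sequences differ
    mismatches = sum(a != b for a, b in zip(s0, s1))
    print(d0, "->", d1, "mismatches:", mismatches)
```

For N input sequences this yields N-1 comparisons instead of the N(N-1)/2 needed when every pair is compared.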


Our approach is to convert the genome sequences to waveforms and determine a metric that measures how different or similar two sequences are. To do this, we find the confidence interval of this metric and compare it to different threshold levels.
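One way to picture the confidence-interval step is to evaluate the metric over many random draws and take an upper percentile as a conservative bound. The sketch below is an assumption-laden illustration: `toy_metric` stands in for the real correlation metric, and the threshold value is hypothetical. Note also that `np.percentile` interpolates between neighbouring order statistics, so it can differ slightly from picking the 90th of 100 sorted values directly.

```python
import numpy as np

def upper_percentile(metric_fn, n_draws=100, q=90, seed=0):
    """Evaluate a randomized metric n_draws times and return its q-th percentile."""
    rng = np.random.default_rng(seed)
    samples = np.array([metric_fn(rng) for _ in range(n_draws)])
    return float(np.percentile(samples, q)), samples

def toy_metric(rng):
    # Placeholder for the correlation-based metric; returns a noisy positive value
    return abs(rng.normal(loc=1.0, scale=0.1))

bound, samples = upper_percentile(toy_metric)

THRESHOLD = 2.0  # hypothetical decision threshold
label = "different" if bound > THRESHOLD else "similar"
print(round(bound, 3), label)
```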

#### Method executed with two genome sequences

Genome sequence conversion: In this step, the input genome sequence is converted to a random signal. This is useful because the new waveform is much easier to visualize and manipulate than letter patterns. Furthermore, it allows us to use signal correlation to find how different or similar the two signals are. The figure below shows the signal conversion for the two input sequences that will be analyzed. Each character of the sequence is translated to a complex random number, and the same random number is assigned to every occurrence of that character.
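A compact, vectorized sketch of this character-to-signal mapping is shown below. The function name `sequence_to_signal` and the use of `np.random.default_rng` are choices made for this illustration (the notebook itself uses `np.random.randn` and explicit loops), and the sketch assumes the sequence contains only the letters A-Z.

```python
import numpy as np

def sequence_to_signal(sequence, rng, alphabet_size=26):
    """Map each letter A-Z to one fixed complex random value.

    Every occurrence of the same letter receives the same value.
    """
    # One unit-variance complex Gaussian value per letter
    random_vec = (rng.standard_normal(alphabet_size)
                  + 1j * rng.standard_normal(alphabet_size)) / np.sqrt(2)
    # Convert letters to integer codes 0..25 (cast first to avoid uint8 wrap-around)
    raw = np.frombuffer(sequence.upper().encode("ascii"), dtype=np.uint8)
    codes = raw.astype(np.int64) - ord("A")
    if codes.size and (codes.min() < 0 or codes.max() >= alphabet_size):
        raise ValueError("sequence contains characters outside A-Z")
    return random_vec[codes], random_vec

rng = np.random.default_rng(0)
signal, random_vec = sequence_to_signal("ACGTAC", rng)
print(signal.shape, signal.dtype)
```

Rejecting characters outside A-Z up front avoids silent negative indexing, which would otherwise map stray characters (such as a FASTA header's `>`) onto unrelated letters.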

In [None]:
Image("/kaggle/input/task3-data/Images/Images/SequenceConversion.PNG")

#### Reading and Assembling Genome sequence file

In [None]:
###-------------------------------------------------------------
### READING AND ASSEMBLING GENOME SEQUENCE FILE
###-------------------------------------------------------------
#File 1 to compare
fname = open("/kaggle/input/task3-data/gnome_data_countries/gnome_data_countries/CHINA_SHENZHEN_MN938384.1.fasta","r")
A = fname.read().replace('\n','')
fname.close()
sequence1 = (list(A))

#File2 to compare
fname2 = open("/kaggle/input/task3-data/gnome_data_countries/gnome_data_countries/italy_MT066156.1.fasta","r")
B = fname2.read().replace('\n','')
fname2.close()
sequence2 = (list(B))

#### Conversion of Sequence into a time domain Complex signal

Since the input sequences are about 30,000 samples long, the signal is divided into equal-length segments. The segment length selected in our study is 1024, but it can be changed to any given value. The figure below shows how the signal is segmented. In this example, we get 30 segments of size 1024 for the ~30,000 samples of the input sequence.
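The segmentation can be sketched with zero-padding followed by a reshape, which is equivalent in effect to filling the rows one at a time. The helper name `segment_signal` and the all-ones test signal are illustrative assumptions.

```python
import math
import numpy as np

def segment_signal(signal, fft_size=1024, rows=None):
    """Split a 1-D signal into rows of length fft_size, zero-padding the tail."""
    signal = np.asarray(signal, dtype=np.complex128)
    if rows is None:
        rows = math.ceil(signal.size / fft_size)
    padded = np.zeros(rows * fft_size, dtype=np.complex128)
    n = min(signal.size, padded.size)
    padded[:n] = signal[:n]
    return padded.reshape(rows, fft_size)

# A 30,000-sample signal gives ceil(30000 / 1024) = 30 segments
segments = segment_signal(np.ones(30000))
print(segments.shape)  # (30, 1024)
```

Passing an explicit `rows` value lets two signals of different lengths share the same number of segments, with the shorter one padded by all-zero rows.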

In [None]:
Image("/kaggle/input/task3-data/Images/Images/SignalSegmentation.PNG")

In [None]:
Image("/kaggle/input/task3-data/Images/Images/SC1.PNG")

In [None]:
Image("/kaggle/input/task3-data/Images/Images/SC2.PNG")

In [None]:
Image("/kaggle/input/task3-data/Images/Images/SC3.PNG")


In [None]:
###-------------------------------------------------------------
### CONVERSION OF SEQUENCE INTO A TIME DOMAIN RANDOM COMPLEX SIGNAL
###-------------------------------------------------------------
from string import ascii_uppercase

seq_len0 = len(sequence1)
seq_len1 = len(sequence2)
    
signal0 = np.zeros(seq_len0, dtype=np.complex_)
signal1 = np.zeros(seq_len1, dtype=np.complex_)
    
    
max_seq_len=max(seq_len0,seq_len1);
### 26 is the number of letters in the alphabet
L = 26


RANDOM_VEC = (1/np.sqrt(2))*((np.random.randn(1,L))+1j*(np.random.randn(1,L)))

prob0 = np.zeros(L)
prob1 = np.zeros(L)

for i in ascii_uppercase:
    prob0[ord(i)-65] =  sequence1.count(i)
    prob1[ord(i)-65] =  sequence2.count(i)
  
       
for i in range (seq_len0):
    signal0[i] = RANDOM_VEC[0][ord(sequence1[i])-65]
    #print('signal0[', i, ']=', signal0[i], '\n')
        
for i in range (seq_len1):
    signal1[i] = RANDOM_VEC[0][ord(sequence2[i])-65]

#### TIME DOMAIN SIGNAL SEGMENTATION - FFT SIZE

In [None]:
###------------------------------------------------
###TIME DOMAIN SIGNAL SEGMENTATION - FFT SIZE
###------------------------------------------------
fft_size = 1024;
import math
rows = math.ceil(max_seq_len/fft_size)
cols = fft_size

###print("seq_len=", seq_len, " rows=", rows, " columns=", cols)

seq_arr0=np.zeros((rows,cols), dtype=np.complex_)
seq_arr1=np.zeros((rows,cols), dtype=np.complex_)


for row_pointer in range (rows):
    start_index = fft_size*(row_pointer)
    end_index = fft_size*(row_pointer+1)
    if(end_index <= len(sequence1)):
        seq_arr0[row_pointer,:] = signal0[start_index:end_index]
    else:
        if (start_index >= len(sequence1)):
            ###fill all 0
            seq_arr0[row_pointer,:] = np.zeros(fft_size)
                
        else:
            ###copy till seq_len and then all 0s
            ###print('row pointer:', row_pointer, ' start:',start_index, ' end:',end_index, ' seq_len:', seq_len0)
            seq_arr0[row_pointer,0:(seq_len0-start_index)] = signal0[start_index:seq_len0]
            seq_arr0[row_pointer,(seq_len0-start_index):] = np.zeros((1,end_index-seq_len0), dtype=np.complex_)
                
for row_pointer in range (rows):
    start_index = fft_size*(row_pointer)
    end_index = fft_size*(row_pointer+1)
    if(end_index <= len(sequence2)):
        seq_arr1[row_pointer,:] = signal1[start_index:end_index]
    else:
        if (start_index >= len(sequence2)):
            ###fill all 0
            seq_arr1[row_pointer,:] = np.zeros(fft_size)
                
        else:
            ###copy till seq_len and then all 0s
            seq_arr1[row_pointer,0:(seq_len1-start_index)] = signal1[start_index:seq_len1]
            seq_arr1[row_pointer,(seq_len1-start_index):] = np.zeros((1,end_index-seq_len1), dtype=np.complex_)


#### SIGNAL IN FREQ DOMAIN 

In [None]:
###------------------------------------------------
### SIGNAL IN FREQ DOMAIN  -- CHECK
###------------------------------------------------

SEQ_ARR0=np.zeros((rows,cols), dtype=np.complex_)
SEQ_ARR1=np.zeros((rows,cols), dtype=np.complex_)
for j in range (rows):
    SEQ_ARR0[j,:]=np.fft.fft(seq_arr0[j,:])
    SEQ_ARR1[j,:]=np.fft.fft(seq_arr1[j,:])


#### SIGNAL CORRELATION IN FREQ DOMAIN

In [None]:
###------------------------------------------------
###SIGNAL CORRELATION IN FREQ DOMAIN
###------------------------------------------------
nrof_segments=rows
    
CORR_CELL=np.zeros((nrof_segments,2,2,cols), dtype=np.complex_)
    
for m in range (nrof_segments):
    CORR_CELL[m, 0, 0, :]= np.multiply( (SEQ_ARR0[m,:]),(np.conj(SEQ_ARR0[m,:])) )
    CORR_CELL[m, 0, 1, :]= np.multiply( (SEQ_ARR0[m,:]),(np.conj(SEQ_ARR1[m,:])) )
    CORR_CELL[m, 1, 0, :]= np.multiply( (SEQ_ARR1[m,:]),(np.conj(SEQ_ARR0[m,:])) )
    CORR_CELL[m, 1, 1, :]= np.multiply( (SEQ_ARR1[m,:]),(np.conj(SEQ_ARR1[m,:])) )

#### BACK TO TIME DOMAIN - correlation of piece wise time domain signal
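Multiplying one spectrum by the conjugate of the other and taking the inverse FFT yields the circular cross-correlation of the two time-domain segments. The small check below, on random toy data, compares this with a direct sum; the variable names are illustrative and not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
y = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Frequency-domain route: ifft( FFT(x) * conj(FFT(y)) )
corr_fft = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(y)))

# Direct circular cross-correlation: r[k] = sum_n x[(n + k) mod N] * conj(y[n])
idx = np.arange(n)
corr_direct = np.array([np.sum(x[(idx + k) % n] * np.conj(y)) for k in range(n)])

print(np.allclose(corr_fft, corr_direct))

# The zero-lag value of the auto-correlation is the signal energy
auto_zero_lag = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(x)))[0]
print(np.isclose(auto_zero_lag, np.sum(np.abs(x) ** 2)))
```

This is why the zero-lag auto-correlation coefficient acts as a reference level: for two identical segments the cross-correlation at lag 0 equals it, and their ratio is 1.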

In [None]:
###------------------------------------------------
### BACK TO TIME DOMAIN - correlation of piece wise time domain signal
###------------------------------------------------

corr_cell=np.zeros((nrof_segments,2,2,cols), dtype=np.complex_)
coef0 = np.zeros(nrof_segments);
coef1 = np.zeros(nrof_segments);
for m in range (nrof_segments):
    for i in range (2):
        for j in range(2):
            corr_cell[m, i, j, :]= np.fft.ifft(CORR_CELL[m,i,j])
    coef0[m]=np.absolute(corr_cell[m, 0, 0, 0])
    coef1[m]=np.absolute(corr_cell[m, 1, 0, 0])

#### Plot Data

In [None]:
###------------------------------------------------
###PLOT DATA
###------------------------------------------------
import matplotlib.pyplot as plt
# Per-segment ratio of zero-lag auto-correlation (coef0) to cross-correlation (coef1)
metric1_arr = np.divide(coef0,coef1)
metric1 = np.sum(np.absolute(metric1_arr) - 1)
plt.rcParams['figure.figsize'] = [15,10]
# One figure with two stacked axes; creating a second figure here would leave it empty
fig, axs = plt.subplots(2,1, constrained_layout=True)

axs[0].plot(range(nrof_segments), metric1_arr, 'cs')
axs[0].set_title('metric1 : sum(coef0/coef1 - 1) : '+str(metric1))
axs[0].set_xlabel('segment index')
axs[0].set_ylabel('coef0/coef1')
axs[0].grid()
fig.suptitle('COVID19 GENOME Variation Plot', fontsize=16)

# Per-segment difference between the two coefficients
metric2_arr = np.subtract(coef0,coef1)
metric2 = np.sum(np.absolute(metric2_arr))
axs[1].plot(range(nrof_segments), metric2_arr, 'gs')
axs[1].set_title('metric2 : sum(|coef0-coef1|) : '+str(metric2))
axs[1].set_xlabel('segment index')
axs[1].set_ylabel('coef0-coef1')
axs[1].grid()


#### Reading Sequence Files

In [None]:
files = []

#currdir = os.getcwd()
#path = currdir + '/' + 'gnome_data'

path = '/kaggle/input/task3-data/gnome_data_countries/gnome_data_countries'
# r=root, d=directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        if '.txt' in file or '.fasta' in file:
            files.append(os.path.join(r, file))

g_fname = files    

In [None]:
for f in g_fname:
    print(f)

In [None]:
out_file_path = os.getcwd()


## Section 2: Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences

#### Processing Genome files

In [None]:
def run_corr_2files(num1, num2):
    
    
    ###-------------------------------------------------------------
    ### CONVERSION OF SEQUENCE INTO A TIME DOMAIN RANDOM COMPLEX SIGNAL
    ###-------------------------------------------------------------
    
    seq_len0 = len(sequence[num1])
    seq_len1 = len(sequence[num2])
    
    
    ###max_seq_len=max(seq_len0,seq_len1);
    min_seq_len = min(seq_len0,seq_len1)
    ### 26 is the number of letters in the alphabet
    L = 26
    
    B=100
    metric_samples=np.zeros(B)
    for b in range(B):
        
        signal0 = np.zeros(seq_len0, dtype=np.complex_)
        signal1 = np.zeros(seq_len1, dtype=np.complex_)
    
        RANDOM_VEC = (1/np.sqrt(2))*((np.random.randn(1,L))+1j*(np.random.randn(1,L)))
        
        prob0 = np.zeros(L)
        prob1 = np.zeros(L)
        
        for i in ascii_uppercase:
            prob0[ord(i)-65] =  sequence[num1].count(i)
            prob1[ord(i)-65] =  sequence[num2].count(i)
         
          
        for i in range (seq_len0):
            signal0[i] = RANDOM_VEC[0][ord(sequence[num1][i])-65]
            #print('signal0[', i, ']=', signal0[i], '\n')
            
            
        for i in range (seq_len1):
            signal1[i] = RANDOM_VEC[0][ord(sequence[num2][i])-65]
        
        ###------------------------------------------------
        ###TIME DOMAIN SIGNAL SEGMENTATION - FFT SIZE
        ###------------------------------------------------
        fft_size = 1024
    
        rows = math.ceil(min_seq_len/fft_size)
        cols = fft_size
        ###print("seq_len=", seq_len, " rows=", rows, " columns=", cols)
        
        seq_arr0=np.zeros((rows,cols), dtype=np.complex_)
        seq_arr1=np.zeros((rows,cols), dtype=np.complex_)
        
        for row_pointer in range (rows):
            start_index = fft_size*(row_pointer)
            end_index = fft_size*(row_pointer+1)
            if(end_index <= len(sequence[num1])):
                seq_arr0[row_pointer,:] = signal0[start_index:end_index]
            else:
                if (start_index >= len(sequence[num1])):
                    ###fill all 0
                    seq_arr0[row_pointer,:] = np.zeros((1,fft_size), dtype=np.complex_);
                    
                else:
                    ###copy till seq_len and then all 0s
                    ###print('row pointer:', row_pointer, ' start:',start_index, ' end:',end_index, ' seq_len:', seq_len0)
                    seq_arr0[row_pointer,0:(seq_len0-start_index)] = signal0[start_index:seq_len0]
                    seq_arr0[row_pointer,(seq_len0-start_index):] = np.zeros((1,end_index-seq_len0), dtype=np.complex_)
                    
        for row_pointer in range (rows):
            start_index = fft_size*(row_pointer)
            end_index = fft_size*(row_pointer+1)
            if(end_index <= len(sequence[num2])):
                seq_arr1[row_pointer,:] = signal1[start_index:end_index]
            else:
                if (start_index >= len(sequence[num2])):
                    ###fill all 0
                    seq_arr1[row_pointer,:] = np.zeros((1,fft_size), dtype=np.complex_);
                    
                else:
                    ###copy till seq_len and then all 0s
                    seq_arr1[row_pointer,0:(seq_len1-start_index)] = signal1[start_index:seq_len1]
                    seq_arr1[row_pointer,(seq_len1-start_index):] = np.zeros((1,end_index-seq_len1), dtype=np.complex_)
                              
    
        ###------------------------------------------------
        ###SIGNAL IN FREQ DOMAIN  -- CHECK
        ###------------------------------------------------
        
        SEQ_ARR0=np.zeros((rows,cols), dtype=np.complex_)
        SEQ_ARR1=np.zeros((rows,cols), dtype=np.complex_)
        for j in range (rows):
            SEQ_ARR0[j,:]=np.fft.fft(seq_arr0[j,:])
            SEQ_ARR1[j,:]=np.fft.fft(seq_arr1[j,:])
        
        ###------------------------------------------------
        ###SIGNAL CORRELATION IN FREQ DOMAIN
        ###------------------------------------------------
        nrof_segments=rows
        
        CORR_CELL=np.zeros((nrof_segments,1,2,cols), dtype=np.complex_)
        
        for m in range (nrof_segments):
            CORR_CELL[m, 0, 0, :]= np.multiply( (SEQ_ARR0[m,:]),(np.conj(SEQ_ARR0[m,:])) )
            CORR_CELL[m, 0, 1, :]= np.multiply( (SEQ_ARR0[m,:]),(np.conj(SEQ_ARR1[m,:])) )
                   
        
        ###------------------------------------------------
        ###BACK TO TIME DOMAIN - correlation of piece wise time domain signal
        ###------------------------------------------------
        
        corr_cell=np.zeros((nrof_segments,1,2,cols), dtype=np.complex_)
        coef0 = np.zeros(nrof_segments);
        coef1 = np.zeros(nrof_segments);
        for m in range (nrof_segments):
            corr_cell[m, 0, 0, :]= np.fft.ifft(CORR_CELL[m,0,0])
            corr_cell[m, 0, 1, :]= np.fft.ifft(CORR_CELL[m,0,1])
            coef0[m]=max(np.absolute(corr_cell[m, 0, 0]))
            coef1[m]=max(np.absolute(corr_cell[m, 0, 1]))
            
        coef = np.divide(coef0,coef1) - 1
        
        metric_samples[b] = np.sum(np.absolute(coef))
        
              
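    # np.sort returns a sorted copy, so the result must be assigned back
    metric_samples = np.sort(metric_samples)
    #print('sorted: ', metric_samples,'\n')
    # 90th of the B=100 sorted samples, used as an upper bound on the metric
    return metric_samples[89]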

In [None]:
#print(g_fname)
    
sequence = [] 
check_var1 = 0


i = 0

for j in range (len(g_fname)):
    A=''
    with open(g_fname[j]) as fhandler:
        first_line = fhandler.readline()
        if (first_line.find('genome') != -1):
            ###print(first_line)
            print("Discarding first line of:", g_fname[j],'\n')
            ##messagebox.showwarning('Warning', 'Discarding first line: '+first_line)
        else :
            A = first_line

        for line in fhandler:
            A += line   

        A = A.replace('\n','') 

        sequence.append([])
        #####-----------------------------
        #####DNA to Amino
        #####-----------------------------          
        if(check_var1 == 1):
            #####-----------------------------
            #####DNA to RNA
            #####-----------------------------
            A = A.replace('T','U') 

            ####------------------------------
            #### RNA to amino Acid
            ####------------------------------                
            key = ''
            for k in range(0,3*math.floor(len(A)/3),3):
                key = A[k]+A[k+1]+A[k+2]
                ##print("here:",k, A[k],A[k+1],A[k+2],'\n')
                # NOTE: rna_to_amino (a codon -> amino acid lookup table) is not defined
                # in this notebook; it must be provided before setting check_var1 = 1
                if rna_to_amino.get(key) is None:
                    raise ValueError('Data corrupted for file: '+g_fname[j]+' => expected only A,C,G and T in the DNA sequence')

                if(rna_to_amino[key] != '*'):
                    sequence[i].append(rna_to_amino[key])
        else :
            sequence[i] = list(A)

    i = i+1

#### Generating Genome Outputs

In [None]:
global out_file_path
nrof_samples = len(g_fname)
nrof_calc_metrics = int((nrof_samples*(nrof_samples-1))/2)

metric_arr = np.zeros(nrof_calc_metrics)
sample_loc_arr = np.chararray((nrof_calc_metrics,1,2), itemsize=100, unicode=True)
sample_number_arr = np.zeros((nrof_calc_metrics,3))
record_arr = np.chararray((nrof_calc_metrics), itemsize=100, unicode=True)
loc_arr = np.chararray((nrof_calc_metrics), itemsize=100, unicode=True)

k=0
for i in range(nrof_samples):
    for j in range((i+1),nrof_samples):
        if(len(sequence[i]) > len(sequence[j])):
            metric_arr[k] = run_corr_2files(i, j)
        else :
            metric_arr[k] = run_corr_2files(j, i)

        sample_loc_arr[k,0,0] = ntpath.basename(g_fname[i]).replace('.fasta','')
        sample_loc_arr[k,0,1] = ntpath.basename(g_fname[j]).replace('.fasta','')
        record_arr[k] = sample_loc_arr[k,0,0]+', '+sample_loc_arr[k,0,1]+' = '+str(metric_arr[k]) 
        loc_arr[k] = sample_loc_arr[k,0,0]+', '+sample_loc_arr[k,0,1]
        print(record_arr[k]+'\n')
        sample_number_arr[k,0] = i;
        sample_number_arr[k,1] = j;
        sample_number_arr[k,2] = metric_arr[k];
        k = k+1

#-----------
# Output file: sort the records by metric value
#-----------
sort_metric_arr = np.sort(metric_arr, axis=0)
index = np.argsort(metric_arr, axis=0)
sort_record_arr = []
sort_loc_arr = []
for i in index:
    sort_record_arr.append(record_arr[i])
    sort_loc_arr.append(loc_arr[i])



In [None]:
# Remove a previously created output folder if needed
#!rmdir 'Output'
# -p avoids an error when the folder already exists
!mkdir -p 'Output'

In [None]:
filename_output= 'Output/output_metrics1.csv'
filename_output_raw= 'Output/output_raw_data1.csv'
filename_output_loc= 'Output/output_loc1.txt'
filename_output_sample_number= 'Output/sample_number1.txt'

fileID = open(filename_output,'w');
fileID2 = open(filename_output_raw,'w');
fileID3 = open(filename_output_loc,'w');
fileID4 = open(filename_output_sample_number,'w');
for k in range(nrof_calc_metrics):    
    fileID.write(str(sort_metric_arr[k])+'\n')  
    fileID2.write(sort_record_arr[k]+'\n')   
    fileID3.write(sort_loc_arr[k]+'\n')
    fileID4.write(str(sample_number_arr[k,:])+'\n')

fileID.close()
fileID2.close()
fileID3.close()
fileID4.close()

In [None]:
filename_output

In [None]:
filename_output_raw

In [None]:
filename_output_loc

In [None]:
filename_output_sample_number

#### Output Description

-------------------------------------------------------------------------------------


**output_metrics1.csv:**

(M,1) vector of metric values sorted in ascending order, where M = N*(N-1)/2 is the number of pairwise combinations of the N input sequences.

    metric
    metric
    ...

**output_raw_data1.csv:**

M records sorted by metric, one per pair of sequences:

    seq1, seq2 = metric
    seq1, seq2 = metric
    ...

**output_loc1.txt:**

M records in the same sorted order, one per pair of sequences:

    seq1, seq2
    seq1, seq2
    ...

**sample_number1.txt:**

(M,3) array in the order in which the pairs were computed (not sorted), with elements (zero-based index of seq1, zero-based index of seq2, metric):

    [0. 1. metric]
    [0. 2. metric]
    ...
    [1. 2. metric]
    [1. 3. metric]
    ...
    [N-2 N-1 metric]


--------------------------------------------------------------------------
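The pair count M = N(N-1)/2 and the order in which the pairs are visited can be reproduced with `itertools.combinations`, which enumerates index pairs in the same order as a nested loop with `j > i`. The helper name `pair_indices` is an assumption made for this sketch.

```python
from itertools import combinations

def pair_indices(n_sequences):
    """Return all zero-based index pairs (i, j) with i < j."""
    return list(combinations(range(n_sequences), 2))

pairs = pair_indices(4)
print(len(pairs))  # 6, i.e. 4 * 3 / 2
print(pairs)       # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```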



In [None]:
files = []

path = 'Output/'

for r, d, f in os.walk(out_file_path):
    for file in f:
        print(file)

### Method applied to samples from the same location at different times

#### Read files

In [None]:
files = []

#currdir = os.getcwd()
#path = currdir + '/' + 'gnome_data'

path = './' + 'gnome_data_time'
# r=root, d=directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        if '.txt' in file or '.fasta' in file:
            files.append(os.path.join(r, file))

g_fname = files    

In [None]:
for f in g_fname:
    print(f)

In [None]:
# Output File Location
out_file_path = 'Output/'

#### Process files

In [None]:
def run_corr_2files(num1, num2):
    
    
    ###-------------------------------------------------------------
    ### CONVERSION OF SEQUENCE INTO A TIME DOMAIN RANDOM COMPLEX SIGNAL
    ###-------------------------------------------------------------
    
    seq_len0 = len(sequence[num1])
    seq_len1 = len(sequence[num2])
    
    
    ###max_seq_len=max(seq_len0,seq_len1);
    min_seq_len = min(seq_len0,seq_len1)
    ### 26 is the number of letters in the alphabet
    L = 26
    
    B=100
    metric_samples=np.zeros(B)
    for b in range(B):
        
        signal0 = np.zeros(seq_len0, dtype=np.complex_)
        signal1 = np.zeros(seq_len1, dtype=np.complex_)
    
        RANDOM_VEC = (1/np.sqrt(2))*((np.random.randn(1,L))+1j*(np.random.randn(1,L)))
        
        prob0 = np.zeros(L)
        prob1 = np.zeros(L)
        
        for i in ascii_uppercase:
            prob0[ord(i)-65] =  sequence[num1].count(i)
            prob1[ord(i)-65] =  sequence[num2].count(i)
         
          
        for i in range (seq_len0):
            signal0[i] = RANDOM_VEC[0][ord(sequence[num1][i])-65]
            #print('signal0[', i, ']=', signal0[i], '\n')
            
            
        for i in range (seq_len1):
            signal1[i] = RANDOM_VEC[0][ord(sequence[num2][i])-65]
        
        ###------------------------------------------------
        ###TIME DOMAIN SIGNAL SEGMENTATION - FFT SIZE
        ###------------------------------------------------
        fft_size = 1024
    
        rows = math.ceil(min_seq_len/fft_size)
        cols = fft_size
        ###print("seq_len=", seq_len, " rows=", rows, " columns=", cols)
        
        seq_arr0=np.zeros((rows,cols), dtype=np.complex_)
        seq_arr1=np.zeros((rows,cols), dtype=np.complex_)
        
        for row_pointer in range (rows):
            start_index = fft_size*(row_pointer)
            end_index = fft_size*(row_pointer+1)
            if(end_index <= len(sequence[num1])):
                seq_arr0[row_pointer,:] = signal0[start_index:end_index]
            else:
                if (start_index >= len(sequence[num1])):
                    ###fill all 0
                    seq_arr0[row_pointer,:] = np.zeros((1,fft_size), dtype=np.complex_);
                    
                else:
                    ###copy till seq_len and then all 0s
                    ###print('row pointer:', row_pointer, ' start:',start_index, ' end:',end_index, ' seq_len:', seq_len0)
                    seq_arr0[row_pointer,0:(seq_len0-start_index)] = signal0[start_index:seq_len0]
                    seq_arr0[row_pointer,(seq_len0-start_index):] = np.zeros((1,end_index-seq_len0), dtype=np.complex_)
                    
        for row_pointer in range (rows):
            start_index = fft_size*(row_pointer)
            end_index = fft_size*(row_pointer+1)
            if(end_index <= len(sequence[num2])):
                seq_arr1[row_pointer,:] = signal1[start_index:end_index]
            else:
                if (start_index >= len(sequence[num2])):
                    ###fill all 0
                    seq_arr1[row_pointer,:] = np.zeros((1,fft_size), dtype=np.complex_);
                    
                else:
                    ###copy till seq_len and then all 0s
                    seq_arr1[row_pointer,0:(seq_len1-start_index)] = signal1[start_index:seq_len1]
                    seq_arr1[row_pointer,(seq_len1-start_index):] = np.zeros((1,end_index-seq_len1), dtype=np.complex_)
                              
    
        ###------------------------------------------------
        ###SIGNAL IN FREQ DOMAIN  -- CHECK
        ###------------------------------------------------
        
        SEQ_ARR0=np.zeros((rows,cols), dtype=np.complex_)
        SEQ_ARR1=np.zeros((rows,cols), dtype=np.complex_)
        for j in range (rows):
            SEQ_ARR0[j,:]=np.fft.fft(seq_arr0[j,:])
            SEQ_ARR1[j,:]=np.fft.fft(seq_arr1[j,:])
        
        ###------------------------------------------------
        ###SIGNAL CORRELATION IN FREQ DOMAIN
        ###------------------------------------------------
        nrof_segments=rows
        
        CORR_CELL=np.zeros((nrof_segments,1,2,cols), dtype=np.complex_)
        
        for m in range (nrof_segments):
            CORR_CELL[m, 0, 0, :]= np.multiply( (SEQ_ARR0[m,:]),(np.conj(SEQ_ARR0[m,:])) )
            CORR_CELL[m, 0, 1, :]= np.multiply( (SEQ_ARR0[m,:]),(np.conj(SEQ_ARR1[m,:])) )
                   
        
        ###------------------------------------------------
        ###BACK TO TIME DOMAIN - correlation of piece wise time domain signal
        ###------------------------------------------------
        
        corr_cell=np.zeros((nrof_segments,1,2,cols), dtype=np.complex_)
        coef0 = np.zeros(nrof_segments);
        coef1 = np.zeros(nrof_segments);
        for m in range (nrof_segments):
            corr_cell[m, 0, 0, :]= np.fft.ifft(CORR_CELL[m,0,0])
            corr_cell[m, 0, 1, :]= np.fft.ifft(CORR_CELL[m,0,1])
            coef0[m]=max(np.absolute(corr_cell[m, 0, 0]))
            coef1[m]=max(np.absolute(corr_cell[m, 0, 1]))
            
        coef = np.divide(coef0,coef1) - 1
        
        metric_samples[b] = np.sum(np.absolute(coef))
        
              
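    # np.sort returns a sorted copy, so the result must be assigned back
    metric_samples = np.sort(metric_samples)
    #print('sorted: ', metric_samples,'\n')
    # Index 89 is the 90th smallest of the B=100 bootstrap draws
    return(metric_samples[89])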

In [None]:
sequence = [] 
check_var1 = 0
i = 0

for j in range (len(g_fname)):
    A=''
    with open(g_fname[j]) as fhandler:
        first_line = fhandler.readline()
        if (first_line.find('genome') != -1):
            ###print(first_line)
            # The first line is a header (it mentions 'genome'), so it is skipped
            print("Discarding first line of:", g_fname[j],'\n')
        else :
            A = first_line

        for line in fhandler:
            A += line   

        A = A.replace('\n','') 

        if ((A.count('A')+A.count('T')+A.count('C')+A.count('G')) != len(A)):
            print ('Error :Data corrupted for file: '+g_fname[j]+' => expected only A,C,G and T in the DNA sequence')
            ##root.destroy()
            ##return()

        sequence.append([])
        #####-----------------------------
        #####DNA to Amino
        #####-----------------------------          
        if(check_var1 == 1):
            #####-----------------------------
            #####DNA to RNA
            #####-----------------------------
            A = A.replace('T','U') 

            ####------------------------------
            #### RNA to amino Acid
            ####------------------------------                
            key = ''
            for k in range(0,3*math.floor(len(A)/3),3):
                key = A[k]+A[k+1]+A[k+2]
                ##print("here:",k, A[k],A[k+1],A[k+2],'\n')
                if(rna_to_amino.get(key) is None):
                    print('Data corrupted for file...abandoning:\n', g_fname[j])
                    #messagebox.showerror('Error', 'Data corrupted for file: '+g_fname[j]+' => expected only A,C,G and T in the DNA sequence')
                    # Stop translating this file: looking up an unknown codon below would raise a KeyError
                    break

                if(rna_to_amino[key] != '*'):
                    sequence[i].append(rna_to_amino[key])
        else :
            sequence[i] = list(A)

    i = i+1

#### Generating Output files

In [None]:
nrof_samples = len(g_fname)
# Number of unordered pairs of samples: N*(N-1)/2
nrof_calc_metrics = (nrof_samples*(nrof_samples-1))//2

metric_arr = np.zeros(nrof_calc_metrics)
sample_loc_arr = np.chararray((nrof_calc_metrics,1,2), itemsize=100, unicode=True)
sample_number_arr = np.zeros((nrof_calc_metrics,3))
record_arr = np.chararray((nrof_calc_metrics), itemsize=100, unicode=True)
loc_arr = np.chararray((nrof_calc_metrics), itemsize=100, unicode=True)

k=0
for i in range(nrof_samples):
    for j in range((i+1),nrof_samples):
        if(len(sequence[i]) > len(sequence[j])):
            metric_arr[k] = run_corr_2files(i, j)
        else :
            metric_arr[k] = run_corr_2files(j, i)

        sample_loc_arr[k,0,0] = ntpath.basename(g_fname[i]).replace('.fasta','')
        sample_loc_arr[k,0,1] = ntpath.basename(g_fname[j]).replace('.fasta','')
        record_arr[k] = sample_loc_arr[k,0,0]+', '+sample_loc_arr[k,0,1]+' = '+str(metric_arr[k]) 
        loc_arr[k] = sample_loc_arr[k,0,0]+', '+sample_loc_arr[k,0,1]
        print(record_arr[k]+'\n')
        sample_number_arr[k,0] = i;
        sample_number_arr[k,1] = j;
        sample_number_arr[k,2] = metric_arr[k];
        k = k+1

#-----------
# Output files
#-----------
# Sort the metrics in ascending order and reorder the records to match
sort_metric_arr = np.sort(metric_arr, axis=0)
index = np.argsort(metric_arr, axis=0)
sort_record_arr = []
sort_loc_arr = []
for i in index:
    sort_record_arr.append(record_arr[i])
    sort_loc_arr.append(loc_arr[i])

# os.path.join avoids the doubled slash produced by 'Output/' + '/output_metrics1.csv'
filename_output = os.path.join(out_file_path, 'output_metrics1.csv')
filename_output_raw = os.path.join(out_file_path, 'output_raw_data1.csv')
filename_output_loc = os.path.join(out_file_path, 'output_loc1.txt')
filename_output_sample_number = os.path.join(out_file_path, 'sample_number1.txt')


In [None]:
filename_output

In [None]:
# Write the four output files; the with-block closes them even if a write fails
with open(filename_output,'w') as fileID, \
     open(filename_output_raw,'w') as fileID2, \
     open(filename_output_loc,'w') as fileID3, \
     open(filename_output_sample_number,'w') as fileID4:
    for k in range(nrof_calc_metrics):
        fileID.write(str(sort_metric_arr[k])+'\n')
        fileID2.write(sort_record_arr[k]+'\n')
        fileID3.write(sort_loc_arr[k]+'\n')
        # sample_number_arr is written in pair-enumeration order, not sorted by metric
        fileID4.write(str(sample_number_arr[k,:])+'\n')

### Output files description

-------------------------------------------------------------------------------------


**output_metrics1.csv:**

(M,1) vector, where N is the number of input sequences and M = N*(N-1)/2 is the number of pairwise combinations of the input sequences. Rows are sorted by metric in ascending order.

[metric ]
[metric ]



**output_raw_data1.csv:**   

(M,3) vector, in the same sorted order
=
[seq1, seq2,metric]
[seq1, seq2,metric]
…


**output_loc1.txt:**

(M,2) vector, in the same sorted order
[seq1,seq2]
[seq1,seq2]
..



**sample_number1.txt:**


 (M,3) vector, with elements:  (ordinal number for seq1, ordinal number for seq2, metric). The ordinal numbers are zero-based, and the rows follow the order in which the pairs are enumerated rather than the sorted order.
=
[0 1 metric]
[0 2 metric]
[0 3 metric]
…

[1 2 metric]
[1 3 metric]
….

…

[N-2 N-1 metric]

In [None]:
Image("/kaggle/input/task3-data/Images/Images/Res1.PNG")

In [None]:
Image("/kaggle/input/task3-data/Images/Images/Res2.PNG")

In [None]:
Image("/kaggle/input/task3-data/Images/Images/Res3.PNG")

In [None]:
Image("/kaggle/input/task3-data/Images/Images/Res4.PNG")

In [None]:
Image("/kaggle/input/task3-data/Images/Images/Res5.PNG")

# Section 1 : Summary

### Section 1: Real-time tracking of whole genomes to track variations of the virus over time


Summary:
- To accomplish this task we follow the steps below:
  - Retrieve the genome sequences from publicly available real-time databases such as China National GeneBank & GISAID EpiCoV™ Database
  - Use the same algorithm as in Task 3.2 to classify whether the new samples or sequences belong to existing or new groups
  - Determine whether the new sequences are changing over time
- This task tracks temporal variations
- Bootstrap confidence interval of the measured metric
- Hierarchical classification of the sample sequence

This subtask relates to the evolution of the genome sequence over time.  To study these variations, we use the same approach as described for Task 3.2, with the following two differences.  

1.	First, the input sequences must come from a single geographical location, to remove the spatial dependency.  They must also fall within a given period of time and be sorted by date in ascending order.  
2.	The algorithm of Task 3.2 is then applied, modified so that it analyzes only consecutive pairs of sequences, e.g. 1st & 2nd; 2nd & 3rd; 3rd & 4th; …; (N-1)th & Nth, where N is the number of input sequences.
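The two modifications above can be sketched as follows. This is a minimal illustration only: the helper `consecutive_pairs` and the sample records are hypothetical, and the notebook cells in this section do not define them.

```python
from datetime import date

def consecutive_pairs(n):
    """Index pairs (0, 1), (1, 2), ..., (n-2, n-1) for n date-sorted sequences."""
    return [(k, k + 1) for k in range(n - 1)]

# Hypothetical (name, collection date) records for one location
samples = [
    ("seq_b", date(2020, 1, 10)),
    ("seq_a", date(2019, 12, 30)),
    ("seq_c", date(2020, 1, 20)),
]

# Step 1: sort by date in ascending order
ordered = sorted(samples, key=lambda s: s[1])

# Step 2: compare only neighbouring sequences
for a, b in consecutive_pairs(len(ordered)):
    print(ordered[a][0], "vs", ordered[b][0])
```

With N sequences this yields N-1 comparisons instead of the N*(N-1)/2 comparisons of the all-pairs case.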

The results below show the genome evolution for China, from the end of December 2019 to the beginning of February 2020. Figure 1 shows the measured metric (which estimates the difference between two consecutive sequences) on the ordinate and the time stamp of the sequence on the abscissa.   
Figure 1 shows similar variation for the first three sequences, which were each taken about 10 days apart.  However, there is a larger variation for the 4th sequence with respect to the 3rd and 5th, which could indicate that the 4th sequence corresponds to a different strain of the virus within the same location in China. Finally, there is almost no variation between the last two sequences.


### Section 2: Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences

Summary:
- Genome sequence conversion to a random signal (sequences obtained from publicly available real-time databases such as China National GeneBank & GISAID EpiCoV™ Database)
- Signal segmentation
- Signal correlation
- Results
- This task tracks geographical variations

Task 3.2: Genome Sequence Conversion
In this step, the input genome sequence is converted to a random signal.  This is useful, since the new waveform is much easier to visualize and manipulate than letter patterns.  Furthermore, this allows us to use signal correlation in order to find how different or similar the two signals are.     

The Figure below shows the signal conversion for the two input sequences that will be analyzed. Each of the characters of the sequence is translated to a complex random number where the same random number is assigned to every character occurrence.
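A minimal sketch of this conversion is given here, assuming a four-letter DNA alphabet and NumPy's `default_rng`; the names `sequence_to_signal` and `codebook` are illustrative and do not appear in the notebook, whose own implementation draws one value for each of the 26 letters A-Z.

```python
import numpy as np

def sequence_to_signal(seq, codebook):
    """Map every character of seq to its complex value in codebook."""
    return np.array([codebook[ch] for ch in seq], dtype=np.complex128)

rng = np.random.default_rng(0)  # fixed seed only so the example is repeatable
alphabet = "ACGT"               # assumption: plain DNA letters

# One unit-variance complex Gaussian value per letter
values = (rng.standard_normal(len(alphabet))
          + 1j * rng.standard_normal(len(alphabet))) / np.sqrt(2)
codebook = dict(zip(alphabet, values))

signal = sequence_to_signal("ACGTAC", codebook)
print(signal.shape)
```

Because the same value is reused for every occurrence of a letter, identical subsequences in two genomes map to identical stretches of signal, which is what makes correlation a meaningful comparison.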

Task 3.2: Signal Segmentation 
Since the input sequences are about 30,000 samples long, the signal is divided into equal-length segments.  The segment length selected in our study is 1024, but it can be changed to any given value.  The figure below shows how the signal is segmented.  In this example, we get 30 segments of size 1024 for the ~30,000 samples of the input sequence, with the last segment padded with zeros.
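The segmentation can be sketched as a zero-pad followed by a reshape. The helper name `segment_signal` is hypothetical; the notebook's `run_corr_2files` fills the rows with explicit loops and sizes the array from the shorter of the two sequences.

```python
import numpy as np

def segment_signal(signal, fft_size=1024):
    """Split a 1-D signal into rows of length fft_size, zero-padding the last row."""
    rows = -(-len(signal) // fft_size)  # ceiling division
    padded = np.zeros(rows * fft_size, dtype=np.complex128)
    padded[:len(signal)] = signal
    return padded.reshape(rows, fft_size)

# A stand-in signal of 30,000 samples
segments = segment_signal(np.ones(30000, dtype=np.complex128))
print(segments.shape)
```

For 30,000 samples and a segment length of 1024, the result has 30 rows, and only the first 304 entries of the last row carry data.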

Task 3.2: Signal Correlation – Part 1

To perform this step, let $s_{1,1}, s_{1,2}, \ldots, s_{1,n}$ be the $n$ segments of signal $s_1$, which corresponds to the first sequence, and let $s_{2,1}, s_{2,2}, \ldots, s_{2,n}$ be the $n$ segments of signal $s_2$, which corresponds to the second sequence. Then, for each segment $i = 1, \ldots, n$:

1.	Calculate the FFT of the segment to obtain the frequency-domain signal (see Figure): $S_{1,1}, S_{1,2}, \ldots, S_{1,n}$ for signal $s_1$ and $S_{2,1}, S_{2,2}, \ldots, S_{2,n}$ for signal $s_2$.
2.	Calculate the following vectors, as shown in Figure 5: $R_{11,i} = S_{1,i}\, S_{1,i}^{*}$ and $R_{12,i} = S_{1,i}\, S_{2,i}^{*}$.
3.	Perform the IFFT to obtain the autocorrelation and cross-correlation vectors rxx and rxy (see Figure), and set rxx, ryy as the ith segment entry of the diagonal of matrix r (figure below). The correlation vectors are calculated as follows (see Figure 6): $r_{11,i} = \mathrm{IFFT}(R_{11,i})$ and $r_{12,i} = \mathrm{IFFT}(R_{12,i})$.
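The per-segment steps above (FFT, product with the conjugate, IFFT) can be sketched as follows. This is an illustration under stated assumptions, not the notebook's exact code: `segment_correlations` is a hypothetical helper, and the resulting correlations are circular because they are computed through the FFT.

```python
import numpy as np

def segment_correlations(seg1, seg2):
    """Circular auto- and cross-correlation of two equal-length segments via the FFT."""
    S1 = np.fft.fft(seg1)
    S2 = np.fft.fft(seg2)
    r11 = np.fft.ifft(S1 * np.conj(S1))  # autocorrelation of seg1
    r12 = np.fft.ifft(S1 * np.conj(S2))  # cross-correlation of seg1 with seg2
    return r11, r12

rng = np.random.default_rng(1)
x = rng.standard_normal(8) + 1j * rng.standard_normal(8)

# Correlating a segment with itself: both outputs coincide
r11, r12 = segment_correlations(x, x)
print(np.abs(r11).max())
```

When the two segments are identical, the cross-correlation equals the autocorrelation and its peak sits at lag 0 with magnitude equal to the segment energy.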

Task 3.2: Correlation Coefficients and Metrics

We obtain $c_1$ and $c_2$, which are given by the sum of the peak values of the correlation vectors over all the segments $i = 1, \ldots, n$ for each sequence.  This is shown below:

$$c_1 = \sum_{i=1}^{n} \max(r_{11,i}), \qquad c_2 = \sum_{i=1}^{n} \max(r_{12,i})$$

Refer to Figure 7.

Finally, we propose the following metric to measure the differences between a pair of sequences:

$$m_{12} = \frac{c_1}{c_2} - 1$$

Furthermore, we calculate the $\alpha = 90\%$ confidence interval of $m_{12}$ using bootstrap; if this value is greater than a given threshold $T$, then the sequences have significant variations:

$$(m_{12})_{\alpha} > T$$

Refer to Figure 8.
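A hedged sketch of the metric and the threshold test follows. The helper names and the toy numbers are hypothetical; the peak is taken on the magnitude of each correlation vector, as the notebook code does with `np.absolute`. Note that the notebook's `run_corr_2files` computes a per-segment variant (the sum of the absolute per-segment ratios minus one) and reads the bootstrap value from a sorted array rather than calling `np.percentile`.

```python
import numpy as np

def difference_metric(r11_segments, r12_segments):
    """m_12 = c_1 / c_2 - 1, with c_1 and c_2 the summed peak magnitudes."""
    c1 = sum(np.abs(r).max() for r in r11_segments)
    c2 = sum(np.abs(r).max() for r in r12_segments)
    return c1 / c2 - 1

def upper_percentile(samples, alpha=90):
    """The alpha-th percentile of a set of bootstrap metric values."""
    return float(np.percentile(samples, alpha))

# Toy correlation vectors for two segments (illustrative values only)
r11 = [np.array([4.0, 1.0]), np.array([6.0, 2.0])]
r12 = [np.array([2.0, 1.0]), np.array([3.0, 0.0])]
m12 = difference_metric(r11, r12)

# Toy bootstrap draws and a hypothetical threshold T
draws = np.arange(1, 101)
T = 50.0
significant = upper_percentile(draws) > T
print(m12, significant)
```

Identical sequences give a metric near zero, since the cross-correlation peaks then match the autocorrelation peaks.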

Figure 9 illustrates the input sequences on the left and the associated sequence pairs used to perform the comparison.  The number of calculated metric values is p = q(q-1)/2, which corresponds to all the possible pairwise combinations of the q input sequences.

Task 3.2: Algorithm 2: Genome Strains Classification
Figure 10 shows different threshold values $T_1, T_2, T_3$, which can be adjusted to see different degrees of variation.  The number of thresholds could also be optimized so that only significant variations are detected; in our case, we have selected three levels.  Finally, Figure 10 shows the tree structure, where each sequence is ordered according to its confidence interval metric estimate and threshold level.  The data is ordered in time, from the oldest at the top of the tree to the most recent at the bottom.  
The largest threshold level captures the most significant differences between a pair of sequences, the medium threshold level captures medium differences, and the lowest threshold level captures small differences between a pair of sequences.
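One possible way to assign a metric to a threshold level is sketched below. The function `threshold_level` and the threshold values are hypothetical; the notebook performs this ordering manually, and a metric exactly equal to a threshold is counted here as reaching that level.

```python
import numpy as np

def threshold_level(metric, thresholds):
    """Return how many of the thresholds the metric reaches (0 = below all of them)."""
    return int(np.digitize(metric, sorted(thresholds)))

# Hypothetical threshold values T_1 < T_2 < T_3
thresholds = [0.1, 0.5, 1.0]

levels = [threshold_level(m, thresholds) for m in (0.05, 0.3, 0.7, 2.0)]
print(levels)
```

Level 3 then corresponds to the most significant differences and level 1 to small ones, matching the three-level scheme described above.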


### Pros and cons of the approach 

Task 3.1
This section describes the pros and cons for Task 3.1.

Pros
- The results are very easy to interpret, since they show the difference between consecutive samples.  The measured data represents the time evolution for the specified geographical location. 

Cons
- The input sequences need to be collected manually from the publicly available databases. It would be useful to have a tool that does this automatically, given the geographical location and dates one is interested in analyzing.

Task 3.2
This section describes the pros and cons for Task 3.2.

Pros 
- The method is flexible: it allows one to specify the tree branch depths, it does not need alignment of the input sequences, and it can be easily adapted to measure different features of the sequence.  We have analyzed samples from seven countries, and we have been able to identify different variations and their evolution paths.

Cons 
- The variations that this method has identified may not coincide with the definition of a strain, but the method can be adapted to satisfy the strain criteria. 
- The current solution does not implement the sorting of the input data sequences into the tree structure.  This is currently done manually, which becomes complex for a large number of input sequences.  
- A post-processing tool that classifies this data would make visualization and interpretation of the results more efficient.
- The results of our study are very limited, since the number of samples used was rather small.  A much larger set of samples is needed to get more meaningful results.

## Section 3 : Evidence that livestock could be infected
#### 3.1
Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over: Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.

#### 3.2 
Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over: Surveillance of mixed wildlife-livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.

#### 3.3 
Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over: Experimental infections to test host range for this pathogen.

### Visualization Functions

In [None]:
def wordcloud_draw(text, color = 'white'):
    """
    Plots a wordcloud of the string text; stopwords are removed by WordCloud
    """
    # Collapse runs of whitespace into single spaces
    cleaned_word = " ".join(text.split())
    wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color=color,
                      width=1000,
                      height=1000
                     ).generate(cleaned_word)
    plt.figure(1,figsize=(15, 15))
    plt.imshow(wordcloud)
    plt.axis('off')
    # plt.show() returns None, so there is nothing to pass to display()
    plt.show()

##### Load animal List

In [None]:
animals = []
animalList = []
with open('/kaggle/input/task3-data/animals.txt', "r") as f:
    animals = f.readlines()
animalList = [s.replace('\n', '') for s in animals]
animalList.append('pangolin')
animalList.append('mice')
animalList.append('animal')
animalList = [string for string in animalList if string != ""]
animalList = list(map(lambda x:x.lower(), animalList))
animalList.remove('human')
animalList.remove('discus')
pluralList = ['{0}s'.format(elem) for elem in animalList]
animalList = animalList + pluralList
#animalList = [' {0} '.format(elem) for elem in animalList]
animalList[0:5]

In [None]:
covid19_list = []
with open('/kaggle/input/task3-data/covid19.txt', "r") as f:
    words = f.readlines()
covid19_list = [s.replace('\n', '') for s in words]

In [None]:
covid19_list[0:5]

### Explore Articles

In [None]:
#Load Meta data
metadata = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")
metadata.head()

In [None]:
#### Merging Articles with Meta data. Loading data from already merged csv

Left_join = pd.merge(fin_df,  
                     metadata,  
                     on ='sha',  
                     how ='left') 
Left_join = Left_join.drop(columns=['cord_uid', 'doi', 'title_y', 'pmcid', 'abstract_y', 'Microsoft Academic Paper ID', 'WHO #Covidence',
                       'full_text_file', 'url', 'pubmed_id'])
Left_join.head()

In [None]:
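# Assign the filled columns back: inplace=True on a selected column may not update the DataFrame in recent pandas
Left_join['title_x'] = Left_join['title_x'].fillna("NoTitle")
Left_join['abstract_x'] = Left_join['abstract_x'].fillna("NoAbstract")
Left_join['full_text'] = Left_join['full_text'].fillna("NoText")
Left_join['combined'] = Left_join['title_x'] + ' ' + Left_join['abstract_x'] + ' ' + Left_join['full_text']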


In [None]:
Left_join.head()

#### Explore Livestock articles

In [None]:
cond2 = Left_join['abstract_x'].str.contains('livestock')
print(sum(cond2))

In [None]:
abstract_livestock = Left_join[cond2]
abstract_livestock.shape
abstract_livestock.head(5)

#### Explore Livestock Articles related to farmer


In [None]:
cond3 = abstract_livestock['abstract_x'].str.contains('farmer')
abstract_livestock_farmer = abstract_livestock[cond3]
abstract_livestock_farmer.head(5)

## Section 4:  What do we know about virus genetics, origin, and evolution? What do we know about the virus origin and management measures at the human-animal interface?

###	Task 4.1- Animal host(s) and any evidence of continued spill-over to humans
###	Task 4.2- Sustainable risk reduction strategies
###	Task 4.3- Socioeconomic and behavioral risk factors for this spill-over



In [None]:
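import re
import string

def clean_text(article):
    # Lower-case the text and strip punctuation; re.escape keeps ']', '\' and '^' literal inside the character class
    clean1 = re.sub(r'[' + re.escape(string.punctuation + '’—”') + ']', "", article.lower())
    # Collapse any remaining runs of non-word characters into single spaces
    return re.sub(r'\W+', ' ', clean1)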

### Tokenize combined data

The following cells take time to execute. After execution, the result is saved as 'animal_articles.csv', which is then loaded again. Uncomment the cells if you need to process the data again.



In [None]:
import re

#Left_join['tokenized'] = Left_join['combined'].map(lambda x: clean_text(x))

In [None]:

import nltk
'''
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
Left_join['tokenized'] = Left_join['tokenized'].apply(lemmatize_text)
'''

## Let's free up some space

In [None]:
## If we need space
#del Left_join


In [None]:
def find_spillover_wds(content, wds = ['transfer','spillover','pass on','transmit','contract',
                                       'distribute','progress','incubate','spread','disseminate','zoonosis','zoonotic']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ' '
    return found

In [None]:
def find_human_wds(content, wds = ['human','people','man','child','kid']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ' '
    return found

In [None]:
def find_covid_wds(content, wds = ['covid-19','covid','cov','coronavirus','corona']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ' '
    return found

In [None]:
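def find_evidence_wds(content, wds = ['domestic animal', 'backyard*livestock', 'wet markets', 'meat markets','seafood markets', 'bites', 'bitten', 
                                      'laboratory*accident', 'fairs', 'petting*zoo',  'trading*wild animal', 
                                      'destruction*habitat', 'wild animal*food', 'livestock pathogens','genetic mutations',
                                      'animal testing', ' hunting ', 'industrial*farming', ' pet ', 'butcher', ' eat ', ' meat ']):
    """Return the evidence keywords found in content, comma-separated; '*' in a keyword acts as a wildcard."""
    found = ''
    for w in wds:
        # A plain substring test would look for a literal '*', so build a regex where '*' matches any characters
        pattern = '.*'.join(re.escape(part) for part in w.split('*'))
        if re.search(pattern, content):
            found += w + ','
    return found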

In [None]:
'''
Left_join['evidence_wds'] = Left_join['combined'].apply(find_evidence_wds)
Left_join.evidence_wds.unique()
'''

### 4.1 Animal Source (Task 4.1)
In this step, we took the narrowed-down list of 8000 articles and applied an additional level of precision to the search. To prevent irrelevant terms from skewing the analysis, we began by pulling together a manually curated list of 22 animals that were most likely related to the origin and spread of COVID-19.
The next logical step was to determine which topics each of the 22 animals was related to. For this purpose, we identified the most frequently used words and phrases for each animal term and showed their relative frequency through wordcloud visualizations. 
We also observed that many articles referenced a pair of animals when describing the traits of the disease. Based on this, we built a circos-plot visualization that represents which animal pairs appeared together most frequently in the articles.
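The pair counts behind such a co-occurrence plot could be gathered along the following lines. This is only a sketch: `animal_pair_counts` and the toy input are hypothetical and do not come from the notebook, and it assumes that the animals mentioned in each article are already available as a list.

```python
from collections import Counter
from itertools import combinations

def animal_pair_counts(animal_lists):
    """Count how many articles mention each unordered pair of animals."""
    counts = Counter()
    for animals in animal_lists:
        # De-duplicate and sort so that each pair has a single canonical form
        unique = sorted(set(animals))
        counts.update(combinations(unique, 2))
    return counts

# Toy input: the animals found in three hypothetical articles
docs = [["bat", "pangolin"], ["bat", "cat", "pangolin"], ["cat"]]
counts = animal_pair_counts(docs)
print(counts.most_common(3))
```

Each pair is counted once per article, so the totals reflect how many articles mention both animals rather than how often the words occur.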


In [None]:
'''
Left_join['evidence_wds'].replace('', np.nan, inplace=True)


Left_join['animal_wds'] = Left_join['tokenized'].apply(find_animal_wds)
Left_join['spillover_wds'] = Left_join['tokenized'].apply(find_spillover_wds)
Left_join['human_wds'] = Left_join['tokenized'].apply(find_human_wds)
Left_join['virus_wds'] = Left_join['tokenized'].apply(find_covid_wds)
Left_join['evidence_wds'] = Left_join['combined'].apply(find_evidence_wds)

Left_join['animal_wds'] = Left_join['animal_wds'].str.replace('discus', '')
Left_join['animal_wds'].replace('', np.nan, inplace=True)
Left_join['spillover_wds'].replace('', np.nan, inplace=True)
Left_join['human_wds'].replace('', np.nan, inplace=True)
Left_join['virus_wds'].replace('', np.nan, inplace=True)
Left_join['evidence_wds'].replace('', np.nan, inplace=True)

articlesAnimal = Left_join[Left_join[['animal_wds', 'spillover_wds', 'human_wds', 'virus_wds','evidence_wds']].notnull().all(1)]
articlesAnimal['animal_wds'] = articlesAnimal['animal_wds'].map(lambda x: clean_text(x))
articlesAnimal['animal_wds'].replace(' ', np.nan, inplace=True)
articlesAnimal = articlesAnimal.dropna(subset=['animal_wds'])
articlesAnimal
'''

In [None]:
'''
print('Total articles containing animal keywords: ' + str(len(articlesAnimal)))
print('Total articles available: ' + str(len(fin_df)))
print(str(float(len(articlesAnimal)/len(fin_df)*100)) + '% of the articles available contains animal keywords')
'''

In [None]:
'''
def find_country_wds(content, wds = ['transfer','spillover','pass on','transmit','contract',
                                       'distribute','progress','incubate','spread','disseminate','zoonosis']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ' '
    return found
'''

In [None]:
#articlesAnimal.to_csv('animal_articles.csv', sep=',', encoding='utf-8')

#### Since we have saved 'animal_articles.csv', in the previous step, we can delete 'Left_join' to free up space. We only need 'animal_articles' going forward

In [None]:
del Left_join

### Import Processed Data

In [None]:
articlesAnalysis = pd.read_csv("/kaggle/input/task3-data/animal_articles.csv")
#articlesAnalysis['combined'] = articlesAnalysis['combined'].str.replace(r'\b(\w{1,2})\b', '')
articlesAnalysis.head()

In [None]:
# Join the animal keywords of all articles into one string
animal_string = ' '.join(list(articlesAnalysis['animal_wds'].values))
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000, contour_width=3, contour_color='steelblue', collocations=False)
# Generate a word cloud
wordcloud.generate(animal_string)
# Visualize the word cloud
wordcloud.to_image()

## Spillover Analysis

In [None]:
def find_animal_wds(content, wds = [
 ' dog ',
 ' cat ',
 'pig',
 'mouse',
 'bird',
 ' bat ',
 'horse',
 ' rat ',
 'sheep',
 'chicken',
 'rabbit',
 'insect',
 'goat',
 'monkey',
 'fox',
 'fish',
 'cow',
 'ferret',
 'deer',
 'fly',
 'raccoon',
 'camel',
 'hamster',
 'bear',
'pangolin',
'tiger']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ','
            if found.count(',') == 5:
                break
    return found

In [None]:
articlesAnalysis['animal_wds'] = articlesAnalysis['combined'].apply(find_animal_wds)
articlesAnalysis['animal_wds'] = articlesAnalysis['animal_wds'].str.replace(',', ' ')
# Assign back instead of using inplace=True on a selected column
articlesAnalysis['animal_wds'] = articlesAnalysis['animal_wds'].replace('', np.nan)
articlesAnalysis = articlesAnalysis.dropna(subset=['animal_wds'])
articlesAnalysis

In [None]:
# Join the animal keywords of all articles into one string
animal_string = ' '.join(list(articlesAnalysis['animal_wds'].values))
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000, contour_width=3, contour_color='steelblue', collocations=False)
# Generate a word cloud
wordcloud.generate(animal_string)
# Visualize the word cloud
wordcloud.to_image()

In [None]:
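def plot_50_most_common_words(count_data, count_vectorizer):
    """Bar plot of the 20 most frequent words (the function name is kept because other cells call it)."""
    import matplotlib.pyplot as plt
    # get_feature_names() was removed in scikit-learn 1.2; fall back to it on older versions
    if hasattr(count_vectorizer, 'get_feature_names_out'):
        words = count_vectorizer.get_feature_names_out()
    else:
        words = count_vectorizer.get_feature_names()
    # Sum the counts of every word over all documents
    total_counts = np.asarray(count_data.sum(axis=0)).ravel()
    
    count_dict = (zip(words, total_counts))
    count_dict = sorted(count_dict, key=lambda x:x[1], reverse=True)[0:20]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words)) 
    
    plt.figure(2, figsize=(10, 8/1.6180))
    plt.subplot(title='20 most common words')
    sns.set_context("notebook", font_scale=1.25, rc={"lines.linewidth": 2.5})
    # Recent seaborn versions require x and y to be passed as keywords
    sns.barplot(x=x_pos, y=counts, palette='husl')
    plt.xticks(x_pos, words, rotation=90) 
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.show()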

In [None]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the animal keywords of each article
count_data2 = count_vectorizer.fit_transform(articlesAnalysis['animal_wds'])
# Visualise the 20 most common words
plot_50_most_common_words(count_data2, count_vectorizer)

In [None]:
import re
def listToString(s):  
    # Join the matched sentences into one space-separated string
    return " ".join(s)

# .copy() makes the new column safe to add to the filtered frame
articlesAnalysisTiger = articlesAnalysis[articlesAnalysis['animal_wds'].str.contains("tiger")].copy()

# Collect the sentences (text up to the next period) that mention " tiger"
articlesAnalysisTiger['animal_sentence'] = articlesAnalysisTiger['combined'].apply(
    lambda text: listToString(re.findall(r"([^.]*? tiger[^.]*\.)", '.' + text)).lower())

articlesAnalysisTiger

In [None]:
import re
def listToString(s):  
    # Join the matched sentences into one space-separated string
    return " ".join(s)

# .copy() makes the new column safe to add to the filtered frame
articlesAnalysisBat = articlesAnalysis[articlesAnalysis['animal_wds'].str.contains("bat")].copy()

# Collect the sentences (text up to the next period) that mention " bats "
articlesAnalysisBat['animal_sentence'] = articlesAnalysisBat['combined'].apply(
    lambda text: listToString(re.findall(r"([^.]*? bats [^.]*\.)", '.' + text)).lower())

articlesAnalysisBat

In [None]:
import re
def listToString(s):  
    # Join the matched sentences into one space-separated string
    return " ".join(s)

# .copy() makes the new column safe to add to the filtered frame
articlesAnalysisPangolin = articlesAnalysis[articlesAnalysis['combined'].str.contains("pangolin")].copy()

# Collect the sentences (text up to the next period) that mention " pangolin"
articlesAnalysisPangolin['animal_sentence'] = articlesAnalysisPangolin['combined'].apply(
    lambda text: listToString(re.findall(r"([^.]*? pangolin[^.]*\.)", '.' + text)).lower())

articlesAnalysisPangolin

In [None]:
import re
def listToString(s):  
    # Join the matched sentences into one space-separated string
    return " ".join(s)

# .copy() makes the new column safe to add to the filtered frame
articlesAnalysisCat = articlesAnalysis[articlesAnalysis['animal_wds'].str.contains("cat")].copy()

# Collect the sentences (text up to the next period) that mention " cat "
articlesAnalysisCat['animal_sentence'] = articlesAnalysisCat['combined'].apply(
    lambda text: listToString(re.findall(r"([^.]*? cat [^.]*\.)", '.' + text)).lower())

articlesAnalysisCat

In [None]:
# `"cov" or "corona"` evaluates to just "cov"; a regex alternation matches either term
articlesAnalysisTigerCov = articlesAnalysisTiger[articlesAnalysisTiger['animal_sentence'].str.contains("cov|corona")]
articlesAnalysisTigerCov

In [None]:
# Join the extracted bat sentences into one string, separated by spaces so words do not run together
animal_string = ' '.join(articlesAnalysisBat['animal_sentence'].values)
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000, contour_width=3, contour_color='steelblue', collocations=False)
# Generate a word cloud
wordcloud.generate(animal_string)
# Visualize the word cloud
wordcloud.to_image()

In [None]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(articlesAnalysisBat['animal_sentence'])
# Visualise the 50 most common words
plot_50_most_common_words(count_data, count_vectorizer)

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
 
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(count_data)

docs_test=articlesAnalysisBat['animal_sentence'].tolist()

def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
 
def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]
 
    score_vals = []
    feature_vals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
 
    #create a tuples of feature,score
    #results = zip(feature_vals,score_vals)
    results= {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]]=score_vals[idx]
    
    return results
feature_names=count_vectorizer.get_feature_names()
 
# get the document that we want to extract keywords from
doc=listToString(docs_test)
 
#generate tf-idf for the given document
tf_idf_vector=tfidf_transformer.transform(count_vectorizer.transform([doc]))
 
#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())
 
#extract only the top n; n here is 10
keywords=extract_topn_from_vector(feature_names,sorted_items,10)
 
# now print the results
print("\n=====Doc=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


In [None]:
from collections import Counter

evidence = ['backyard*livestock', 'wet markets', 'meat markets','seafood markets', 'bit*', 
                                      'laboratory*accident', 'fairs', 'petting*zoo',  'trading*wild animal', 
                                      'destruction*habitat', 'wild animal*food', 'livestock pathogens','genetic mutations',
                                      'animal testing', ' hunting ', 'industrial*farming', ' pet ', 'butcher', ' eat ', ' meat ']
from operator import itemgetter
articlesAnalysis['evidence_wds'] = articlesAnalysis['evidence_wds'].str.replace('domestic animal','')
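flat_list = [item for sublist in articlesAnalysis['evidence_wds'].str.split(',') for item in sublist]
# Count each evidence term, skipping the empty strings left by stray commas
# (list.remove('') would raise ValueError if no empty string were present)
test_dict = Counter(item for item in flat_list if item != '')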

In [None]:
test_dict

In [None]:
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')


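# Offset every wedge slightly; size the list from the data so it always matches the number of wedges
explode = [0.05] * len(test_dict)

# Pass plain lists of counts and labels to the pie chart
plt.pie(list(test_dict.values()), labels=list(test_dict.keys()), autopct='%1.1f%%', startangle=90, pctdistance=0.85, explode=explode)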
#draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
# Equal aspect ratio ensures that pie is drawn as a circle
plt.title('Spillover Source')
plt.tight_layout()
plt.show()


In [None]:
articlesAnalysis['animal_wds'] = articlesAnalysis['animal_wds'].str.replace('animal','')
#articlesAnalysisAnimal = articlesAnalysis[articlesAnalysis['animal_wds'] == "bat "]
#articlesAnalysisAnimal = articlesAnalysis[articlesAnalysis['animal_wds'].str.contains("dog")]
articlesAnalysisAnimal = articlesAnalysis[articlesAnalysis['evidence_wds'].str.contains("meat")]
articlesAnalysisAnimal = articlesAnalysisAnimal[articlesAnalysisAnimal['evidence_wds'].str.contains(" eat ")]
articlesAnalysisAnimal = articlesAnalysisAnimal[articlesAnalysisAnimal['spillover_wds'].str.contains("zoono")]
articlesAnalysisAnimal

In [None]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the processed titles
count_data2 = count_vectorizer.fit_transform(articlesAnalysisAnimal['animal_wds'])
# Visualise the 50 most common words
plot_50_most_common_words(count_data2, count_vectorizer)

In [None]:
import matplotlib as mpl
from matplotlib.pyplot import figure
mpl.rcParams['font.size'] = 9.0
figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')

flat_list = [item for sublist in articlesAnalysisAnimal['evidence_wds'].str.split(',') for item in sublist]
# Count each evidence term, skipping the empty strings left by stray commas
test_dict = Counter(item for item in flat_list if item != '')

# Wedge offsets (unused in this chart: explode=None is passed below)
explode = (0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05)
 
plt.pie(test_dict.values(), labels=test_dict, autopct='%1.1f%%', startangle=90, pctdistance=0.85, explode = None, textprops={'fontsize': 8})
#draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
# Equal aspect ratio ensures that pie is drawn as a circle
plt.title('Spillover Source')
plt.tight_layout()
plt.show()

In [None]:
import datetime as dt
articlesAnalysis['publish_time'] = pd.to_datetime(articlesAnalysis['publish_time'])
articlesAnalysis.head()

In [None]:
import datetime as dt

articlesAnalysis3 = articlesAnalysis[articlesAnalysis['evidence_wds'].str.contains(" eat ")]
articlesAnalysis3 = articlesAnalysis3[articlesAnalysis3['publish_time'].dt.year > 2018]
articlesAnalysis3 = articlesAnalysis3[~articlesAnalysis3['evidence_wds'].str.contains("bit")]
articlesAnalysis3 = articlesAnalysis3[articlesAnalysis3['spillover_wds'].str.contains("zoono")]

articlesAnalysis3


In [None]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the processed titles
count_data4 = count_vectorizer.fit_transform(articlesAnalysis3['animal_wds'])
# Visualise the 50 most common words
plot_50_most_common_words(count_data4, count_vectorizer)

In [None]:
articlesAnalysis2 = articlesAnalysis[articlesAnalysis['combined'].str.contains("zoono")]
articlesAnalysis2

## LDA 

In [None]:

from sklearn.decomposition import LatentDirichletAllocation as LDA

# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(articlesAnalysis2['combined']) 
# Helper function
def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        
# Tweak the two parameters below
number_topics = 10
number_words = 10
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)


In [None]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(articlesAnalysis2['combined'])
# Visualise the 50 most common words
plot_50_most_common_words(count_data, count_vectorizer)

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
 
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(count_data)

In [None]:
docs_test=articlesAnalysis2['combined'].tolist()

In [None]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
 
def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]
 
    score_vals = []
    feature_vals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
 
    #create a tuples of feature,score
    #results = zip(feature_vals,score_vals)
    results= {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]]=score_vals[idx]
    
    return results

In [None]:
feature_names=count_vectorizer.get_feature_names()
 
# get the document that we want to extract keywords from
doc=docs_test[0]
 
#generate tf-idf for the given document
tf_idf_vector=tfidf_transformer.transform(count_vectorizer.transform([doc]))
 
#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())
 
#extract only the top n; n here is 10
keywords=extract_topn_from_vector(feature_names,sorted_items,10)
 
# now print the results
print("\n=====Doc=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])
 

In [None]:
def find_africa_wds(content, wds = ['Africa','Burundi', 'Comoros', 'Djibouti', 'Eritrea', 'Ethiopia', 'Kenya', 'Madagascar', 'Malawi', 'Mauritius', 'Mayotte', 'Mozambique', 'Reunion', 'Rwanda', 'Seychelles', 'Somalia', 'Tanzania', 'Uganda', 'Zambia', 'Zimbabwe', 'Angola', 'Cameroon', 'Chad', 'Congo', 'Algeria', 'Egypt', 'Libyan Arab Jamahiriya', 'Morocco', 'South Sudan', 'Sudan', 'Tunisia', 'Western Sahara', 'Botswana', 'Eswatini', 'Swaziland', 'Lesotho', 'Namibia', 'South Africa', 'Benin', 'Burkina Faso', 'Cape Verde', 'Ivory Coast', 'Gambia', 'Ghana', 'Guinea', 'Guinea-Bissau', 'Liberia', 'Mali', 'Mauritania', 'Niger', 'Nigeria', 'Saint Helena', 'Senegal', 'Sierra Leone', 'Togo']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ' '
    return found

In [None]:
def find_asia_wds(content, wds = ['Asia','Afghanistan', 'Armenia', 'Azerbaijan', 'Bangladesh', 'Bhutan', 'Brunei Darussalam', 'Cambodia', 'China', 'Georgia', 'Hong Kong', 'India', 'Indonesia', 'Japan', 'Kazakhstan', 'North Korea', 'South Korea', 'Kyrgyzstan', 'Laos', 'Macao', 'Malaysia', 'Maldives', 'Mongolia', 'Myanmar', 'Nepal', 'Pakistan', 'Philippines', 'Singapore', 'Sri Lanka', 'Taiwan', 'Tajikistan', 'Thailand', 'Turkmenistan', 'Uzbekistan', 'Vietnam']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ' '
    return found

In [None]:
def find_america_wds(content, wds = ['Bermuda', 'Canada', 'Greenland', 'United States', 'U.S.A.', 'USA', 'US', 'Argentina', 'Bolivia', 'Brazil', 'Chile', 'Colombia', 'Ecuador', 'Guyana', 'Paraguay', 'Peru', 'Uruguay', 'Venezuela', 'Anguilla', 'Antigua', 'Barbuda', 'Aruba', 'Bahamas', 'Barbados', 'Bonaire', 'British Virgin Islands', 'Cayman Islands', 'Cuba', 'Curaçao', 'Dominican Republic', 'Grenada', 'Guadeloupe', 'Haiti', 'Jamaica', 'Martinique', 'Montserrat', 'Puerto Rico', 'Saint Lucia', 'Saint Martin', 'Saint Vincent and the Grenadines', 'Sint Maarten', 'Trinidad and Tobago', 'Turks and Caicos Islands', 'Virgin Islands (US)', 'Belize', 'Costa Rica', 'El Salvador', 'Guatemala', 'Honduras', 'Mexico', 'Nicaragua', 'Panama']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ' '
    return found

def find_europe_wds(content, wds = ['Albania', 'Andorra', 'Belarus', 'Bosnia', 'Croatia', 'European Union', 'Faroe Islands', 'Gibraltar', 'Iceland', 'Jersey', 'Kosovo', 'Liechtenstein', 'Moldova', 'Monaco', 'Montenegro', 'North Macedonia', 'Norway', 'Russia', 'San Marino', 'Serbia', 'Switzerland', 'Turkey', 'Ukraine']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ' '
    return found

def find_middleeast_wds(content, wds = ['Bahrain', 'Iraq', 'Iran', 'Israel', 'Jordan', 'Kuwait', 'Lebanon', 'Oman', 'Palestine', 'Qatar', 'Saudi Arabia', 'Syria', 'United Arab Emirates', 'Yemen']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ' '
    return found

def find_oceania_wds(content, wds = ['Australia', 'Fiji', 'French Polynesia', 'Guam', 'Kiribati', 'Marshall Islands', 'Micronesia', 'New Caledonia', 'New Zealand', 'Papua New Guinea', 'Samoa', 'American Samoa', 'Solomon Islands', 'Tonga', 'Vanuatu']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ' '
    return found

In [None]:
# Tag each article with the region keywords it mentions; articles with no match get NaN
region_finders = {
    'africa': find_africa_wds,
    'asia': find_asia_wds,
    'america': find_america_wds,
    'europe': find_europe_wds,
    'middleeast': find_middleeast_wds,
    'oceania': find_oceania_wds,
}
for region, finder in region_finders.items():
    articlesAnalysis[region] = articlesAnalysis['combined'].apply(finder).replace('', np.nan)
articlesAnalysis

articlesAnalysisAmerica = articlesAnalysis.dropna(subset=['america'])
articlesAnalysisAmerica

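# `'USA' or 'United States'` evaluates to just 'USA'; a regex alternation matches either name
articlesAnalysisUSA = articlesAnalysisAmerica[articlesAnalysisAmerica['america'].str.contains('USA|United States')]
# Keep only articles that mention no other region
articlesAnalysisUSA = articlesAnalysisUSA[articlesAnalysisUSA[['africa', 'asia', 'europe', 'middleeast','oceania']].isnull().all(axis=1)]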
articlesAnalysisUSA

In [None]:
flat_list = [item for sublist in articlesAnalysisUSA['evidence_wds'].str.split(',') for item in sublist]
# Count each evidence term, skipping the empty strings left by stray commas
test_dict = Counter(item for item in flat_list if item != '')
test_dict

In [None]:
figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
# Wedge offsets (unused in this chart: explode=None is passed below)
explode = (0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05)
 
plt.pie(test_dict.values(), labels=test_dict, autopct='%1.1f%%', startangle=90, pctdistance=0.85, explode = None)
#draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
# Equal aspect ratio ensures that pie is drawn as a circle
plt.title('Spillover Source')
plt.tight_layout()
plt.show()

In [None]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the processed titles
count_data4 = count_vectorizer.fit_transform(articlesAnalysisUSA['animal_wds'])
# Visualise the 50 most common words
plot_50_most_common_words(count_data4, count_vectorizer)

### 4.2 Risk Reduction

Risk Reduction Strategies (Task 4.2)
Sustainable risk mitigation strategies are critical to preventing future viral outbreaks that involve animals. To surface research on such strategies, we mined the corpus with keywords specifically related to risk.
- Risk reduction strategy key search terms: education, economy, socioeconomic, unemployment, school closure, population density, religion, discrimination, behavioral risk, racism, financial status

Sentences that contain the phrase “risk reduction” receive additional weight.
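
The weighting idea described above can be sketched as follows. This is a minimal illustration, not the notebook's actual scoring: the helper names (`split_sentences`, `score_sentence`, `rank_sentences`) and the bonus value of 2.0 are assumptions introduced here, and the sentence splitter is deliberately naive.

```python
import re

# Search terms taken from the list above
RISK_TERMS = [
    "education", "economy", "socioeconomic", "unemployment", "school closure",
    "population density", "religion", "discrimination", "behavioral risk",
    "racism", "financial status",
]
# Hypothetical bonus for the phrase "risk reduction" (illustrative value only)
PHRASE_BONUS = 2.0


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence, terms=RISK_TERMS, bonus=PHRASE_BONUS):
    """One point per search term present, plus a bonus for 'risk reduction'."""
    lowered = sentence.lower()
    score = float(sum(1 for term in terms if term in lowered))
    if "risk reduction" in lowered:
        score += bonus
    return score


def rank_sentences(text):
    """Return (score, sentence) pairs with a positive score, highest first."""
    scored = [(score_sentence(s), s) for s in split_sentences(text)]
    return sorted((pair for pair in scored if pair[0] > 0),
                  key=lambda pair: pair[0], reverse=True)
```

Calling `rank_sentences(article_text)` on one article's combined text would list its most risk-relevant sentences first; sentences that match no term are dropped.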


In [None]:
### Filtering Dataset

# Keep only articles that mention "risk reduction"; .copy() so new columns are written to this frame
articlesAnalysisRiskReduction = articlesAnalysis[articlesAnalysis['combined'].str.contains('risk reduction')].copy()
# Extract each sentence containing the phrase, together with the sentence that follows it
articlesAnalysisRiskReduction['risk_reduction_sentence'] = articlesAnalysisRiskReduction['combined'].apply(
    lambda text: listToString(re.findall(r"([^.]*?risk reduction[^.]*\.[^.]*\.)", '.' + text)).lower())
# Drop articles where no complete sentence pair could be extracted
articlesAnalysisRiskReduction['risk_reduction_sentence'] = articlesAnalysisRiskReduction['risk_reduction_sentence'].replace('', np.nan)
articlesAnalysisRiskReduction = articlesAnalysisRiskReduction.dropna(subset=['risk_reduction_sentence'])
articlesAnalysisRiskReduction

In [None]:
### Keywords Visualization

# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the processed titles
count_data10 = count_vectorizer.fit_transform(articlesAnalysisRiskReduction['risk_reduction_sentence'])
# Visualise the 50 most common words
plot_50_most_common_words(count_data10, count_vectorizer)

# Join the extracted sentences into one string, separated by spaces so words do not run together
animal_string = ' '.join(articlesAnalysisRiskReduction['risk_reduction_sentence'].values)
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000, contour_width=3, contour_color='steelblue', collocations=False)
# Generate a word cloud
wordcloud.generate(animal_string)
# Visualize the word cloud
wordcloud.to_image()

### 4.3 Socioeconomic Risk

4.3. Socioeconomic and Behavioral Risk Factors (Task 4.3)
To analyze the socioeconomic impact of a potential future transfer of the virus from animals to humans, we used additional keywords to refine the search results.
- Socioeconomic terms used: socioeconomic, behavioral, economy, population density, school closure, daily wage, crowd, wet market, discrimination, racism, religion, financial status, education

Sentences that contain the words “socioeconomic” or “behavioral” receive additional weight.
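
As a hedged illustration of the keyword search described above, the sketch below counts in how many documents each term occurs. It matches whole words only, which differs from the plain substring test used in the notebook's code, and the function name `count_term_documents` is a hypothetical helper introduced for this example.

```python
import re
from collections import Counter

# Terms taken from the list above
SOCIO_TERMS = [
    "socioeconomic", "behavioral", "economy", "population density",
    "school closure", "daily wage", "crowd", "wet market", "discrimination",
    "racism", "religion", "financial status", "education",
]


def count_term_documents(documents, terms=SOCIO_TERMS):
    """Count the documents in which each term appears as a whole word or phrase."""
    # \b anchors stop "education" from matching inside longer words
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    counts = Counter()
    for doc in documents:
        for term, pattern in patterns.items():
            if pattern.search(doc):
                counts[term] += 1
    return counts
```

The resulting `Counter` has the same shape as the `test_dict` objects plotted in this notebook, so it could feed the same pie chart code.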


In [None]:
### Filtering Dataset

def find_socioeconomic_wds(content, wds = ['socioeconomic','economy', 'behavioral risk', 'unemployment', 'population density', 'school closure', 'daily wage', 'discrimination', 'racism', 'religion', 'financial status', 'education']):
    found = ''
    for w in wds:
        if w in content:
            found += w + ', '
    return found

# Articles with no socioeconomic keyword get NaN so they can be dropped below
articlesAnalysis['social_wds'] = articlesAnalysis['combined'].apply(find_socioeconomic_wds).replace('', np.nan)
articlesAnalysisSocial = articlesAnalysis.dropna(subset=['social_wds'])
articlesAnalysisSocio = articlesAnalysisSocial[articlesAnalysisSocial['social_wds'].str.contains('socioeconomic')]
articlesAnalysisSocio

In [None]:
### Keywords Visualization

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')

from operator import itemgetter
flat_list = [item for sublist in articlesAnalysisSocial['social_wds'].str.split(', ') for item in sublist]
test_dict = Counter(flat_list)
del test_dict['']

# Wedge offsets (unused in this chart: explode=None is passed below)
explode = (0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05)
 
plt.pie(test_dict.values(), labels=test_dict, autopct='%1.1f%%', startangle=90, pctdistance=0.85, explode = None, textprops={'fontsize': 8})
#draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
# Equal aspect ratio ensures that pie is drawn as a circle
plt.title('Socioeconomic Keywords')
plt.tight_layout()
plt.show()

In [None]:
test_dict

In [None]:
import re
def listToString(s):  
    # initialize an empty string 
    str1 = " " 
    # return string   
    return (str1.join(s)) 

# Work on a copy so the new column is written to this frame rather than to a view
articlesAnalysisSocio = articlesAnalysisSocio.copy()
# Extract each sentence containing "socioeconomic" plus the three sentences that follow it
articlesAnalysisSocio['socioeconomic_sentence'] = articlesAnalysisSocio['combined'].apply(
    lambda text: listToString(re.findall(r"([^.]*?socioeconomic[^.]*\.[^.]*\.[^.]*\.[^.]*\.)", '.' + text)).lower())
articlesAnalysisSocio['socioeconomic_sentence'] = articlesAnalysisSocio['socioeconomic_sentence'].replace('', np.nan)
articlesAnalysisSocialEco = articlesAnalysisSocio.dropna(subset=['socioeconomic_sentence'])
articlesAnalysisSocialEco

articlesAnalysisSocio['socioeconomic_sentence'].iloc[0]

# Join the extracted sentences into one string; use the frame with NaN rows dropped,
# because str.join fails on missing values
animal_string = ' '.join(articlesAnalysisSocialEco['socioeconomic_sentence'].values)
# Create a WordCloud object
wordcloud = WordCloud(width = 600, height = 400, background_color="white", max_words=5000, contour_width=3, contour_color='steelblue', collocations=False)
# Generate a word cloud
wordcloud.generate(animal_string)
# Visualize the word cloud
wordcloud.to_image()

## Section 5 : Retrieve Relevant articles using BERT

In the spirit of building a scalable and reusable solution, we used BERT sentence embeddings so that a user can repurpose the code to find relevant articles for any phrase. For each query, the code returns the most relevant articles along with a summary and a cosine similarity score as a measure of relevance.
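
To make the relevance score concrete, here is a minimal pure-Python sketch of ranking documents by cosine similarity to a query vector. It is an illustration only: the notebook computes the same quantity with SciPy on BERT embeddings, and the names `cosine_similarity` and `top_matches` are hypothetical helpers introduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_matches(query_vec, doc_vecs, top_n=1):
    """Return (document index, similarity) pairs for the top_n most similar documents."""
    scored = [(idx, cosine_similarity(query_vec, vec))
              for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

A similarity of 1 means the query and article embeddings point in the same direction; the cosine distance used later in this section is simply one minus this value.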


In [None]:

fin_df = pd.read_csv('/kaggle/input/task3-data/fin_df.csv')
fin_df.head()

fin_df.isnull().sum()

In [None]:
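# Assign the filled columns back rather than calling fillna(inplace=True) on a column selection
fin_df['title'] = fin_df['title'].fillna("NoTitle")
fin_df['abstract'] = fin_df['abstract'].fillna("NoAbstract")
fin_df['full_text'] = fin_df['full_text'].fillna("NoText")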
fin_df['combined'] = 'Title : '+fin_df['title'] + '; Abstract ' + fin_df['abstract'] + '; Full Text ' + fin_df['full_text']

In [None]:


# Load the BERT model. Various models trained on Natural Language Inference (NLI) https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md and 
# Semantic Textual Similarity are available https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md

model = SentenceTransformer('bert-base-nli-mean-tokens')

### Word embeddings on full article text

In [None]:
#Uncomment line below if you want to embed
#content = [re.sub(' \n\n\n ','',x) for x in fin_df['full_text'].to_list()]


***The cell below encodes the articles and takes quite a while to run. The encoded embeddings are then saved as a pickle file. To test new queries, load the pickle file instead and go to the section 'Search Based on New Query'.***

### NOTE : THE CELL BELOW WILL TAKE A LONG TIME TO EXECUTE. SKIP AND LOAD SAVED MODEL
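
The compute-once, load-afterwards pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions: `load_or_compute` is a hypothetical helper that is not part of the notebook, and `compute_fn` stands in for the slow `model.encode(...)` call.

```python
import os
import pickle


def load_or_compute(cache_path, compute_fn):
    """Load a pickled result if the cache file exists; otherwise compute and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = compute_fn()  # the expensive step, e.g. encoding every article
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result
```

With such a helper, the first run pays the encoding cost and later runs read the pickle. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.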


In [None]:
### Embed the article contents
'''
embedding = model.encode(content, show_progress_bar=True)
'''

In [None]:
#Save embeddings
'''

with open('full_text_embeddings.pkl', 'wb') as embed:
    pickle.dump(embedding, embed)
'''

### Load Pickled Embedding File

In [None]:
with open('/kaggle/input/task3-data/full_text_embeddings.pkl','rb') as f:
    embedding = pickle.load(f)

### List Questions

In [None]:
queries = ['How is livestock affected due to Corona virus?',
          'How are farmers affected due to Corona virus?',
          'How is the spread of corona virus?',
          'Has corona virus infected animals?',
          'How does corona virus transfer?']

### Embed Queries

In [None]:
query_embeddings = model.encode(queries)

In [None]:
fin_df.head()

In [None]:
df = pd.DataFrame(columns=['Query','Cosine Similarity','Summary','Article Full Text'])

In [None]:

# Import cdist directly; `import scipy` alone does not guarantee that scipy.spatial is loaded
from scipy.spatial.distance import cdist

top_n_selects = 1
# Build the summarizer once instead of re-creating it for every article
summary_model = Summarizer()

for query, query_embedding in zip(queries, query_embeddings):
    # Cosine distance between the query and every article embedding
    distances = cdist([query_embedding], embedding, "cosine")[0]

    # Sort article indices by increasing distance (most similar first)
    results = sorted(enumerate(distances), key=lambda x: x[1])

    print('Query : ',query)
    print('###########################################')

    for idx, distance in results[0:top_n_selects]:
        similarity = 1 - distance
        print("\nCosine Similarity (Score: %.4f)" % similarity,"\n")

        body = fin_df['combined'][idx].strip()
        result = summary_model(body, min_length=60)
        summary = ''.join(result)
        print('Summary:',summary)
        print('\n')
        print('Article:',body)

        # Append one result row; works across pandas versions, unlike DataFrame.append
        df.loc[len(df)] = [query, similarity, summary, body]

        print('_________________________________________')

In [None]:
df.head()

## Search Based on New Query

In [0]:
'''
queries = []
query = input('Enter your query:')
queries.append(query)
queries
'''

In [0]:
# Load the BERT model. Various models trained on Natural Language Inference (NLI) https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md and 
# Semantic Textual Similarity are available https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md

#model = SentenceTransformer('bert-base-nli-mean-tokens')

### Embed the query

In [0]:
#query_embeddings = model.encode(queries)

## Load the Pickled embedding file

In [0]:
'''
with open('/kaggle/input/task3-data/full_text_embeddings.pkl','rb') as f:
    embedding = pickle.load(f)
'''

## Retrieve Articles Based on New Query

In [0]:
#summary_model = Summarizer()

In [0]:
#df = pd.DataFrame(columns=['Query','Cosine Similarity','Summary','Article'])

In [0]:
'''
top_n_selects = 2
import scipy as sc

for query, query_embedding in zip(queries, query_embeddings):
    distances = sc.spatial.distance.cdist([query_embedding], embedding, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print('Query : ',query)
    print('###########################################')
    
    for idx, distance in results[0:top_n_selects]:
        print("\nCosine Similarity (Score: %.4f)" % (1-distance),"\n")
        
        body = fin_df['combined'][idx].strip() 
        summary_model = Summarizer()
        result = summary_model(body, min_length=60)
        summary = ''.join(result)
        similarity = (1-distance)
        to_append = [query, similarity, summary, body]
        a_series = pd.Series(to_append, index = df.columns)
        df = df.append(a_series, ignore_index=True)
'''

In [0]:
#df.head()

## Summary

The medical and scientific community is at the forefront of the fight against COVID-19, and as technologists the best we can do is to enable and empower them in this critical endeavor. With this motivation, our team has strived to design and deliver a solution that makes existing knowledge more readily accessible through technology. The project required many crucial design and approach decisions, and we relied on a few key tenets for guidance: practicality, user-friendliness and scalability. As we deliver this project, we hope it adds value in the battle against the coronavirus.


## Reference 

#### External References 
1. Centers for Disease Control and Prevention (CDC)

2. Nextstrain: an open-source project to harness the scientific and public health potential of pathogen genome data. It provides a continually updated view of publicly available data alongside powerful analytic and visualization tools for use by the community.

3. China National GeneBank: the web page of the China National GeneBank provides the full coronavirus genome sequence, with data from 2020-01-03 to 2020-03-15.

4. GISAID EpiCoV™ Database: as of 2020-03-27 01:16:12 CST, the total number of viruses is 1836. China National GeneBank DataBase (CNGBdb) is an official partner of the GISAID Initiative. It provides access to EpiCoV and features the most complete collection of hCoV-19 genome sequences along with related clinical and epidemiological data.

5. BERT: https://arxiv.org/abs/1810.04805

6. BERT Code: https://github.com/google-research/bert

7. SciBERT: https://arxiv.org/pdf/1903.10676.pdf

8. Cosine Similarity: https://en.wikipedia.org/wiki/Cosine_similarity

9. BERT Summarizer: https://pypi.org/project/bert-extractive-summarizer/

10. BERT Summarizer (paper): https://arxiv.org/abs/1906.04165

11. neuralcoref: https://github.com/huggingface/neuralcoref
