#  COVID-19 Risk Factor Analysis - Task 2
 
## Introduction:
  
This notebook addresses Task 2: risk factor analysis for COVID-19. Its main purpose is to retrieve the most relevant research papers for a set of given questions about COVID-19 risk factors. The risk factors cover the following aspects:

- Data on potential risks factors
- Smoking, pre-existing pulmonary disease
- Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities
- Neonates and pregnant women
- Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.
- Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
- Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
- Susceptibility of populations
- Public health mitigation measures that could be effective for control

We build several solutions to answer these questions. The models in this notebook are:

- Top-N recommended papers based on Word2Vec keyword similarity: we trained multiple Word2Vec models, which are used to find the words (tokens) most similar to those in a given question and to match the most relevant content in the database (papers).

- Topic modeling using LDA: using LDA topic modeling, we generate a relevance score for each risk factor for a given paper.
  ONE STEP TODO: generate most relevant articles

- Classification-based supervised RNN: we also use the trained Word2Vec models to label (assign) risk factors to papers. We then train an RNN (recurrent neural network) on these labels in a supervised fashion. The output is the most relevant papers, with details including paper ID, title, abstract, and a short summary.

**The notebook depends on the `sumy` package, which is installed with pip. Please turn on the Internet option in the kernel settings before running the notebook.**


# 0. Install some libs
 

In [None]:
!pip install sumy

# 1. Load libs

- Load all necessary libraries used in the analysis
- Select Risk Factor if you want to search for specific risk factor question listed in this section

In [None]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import gensim
import re
import os
import json
import glob
import gc
import pickle
import sys
import tensorflow as tf
import nltk
import matplotlib.pyplot as plt
import matplotlib
import time
import spacy

import pyLDAvis
import pyLDAvis.gensim 
import logging
import seaborn as sns
import wordcloud

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)


from tqdm import tqdm
from copy import deepcopy
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow import keras
from tensorflow.keras import layers

from collections import OrderedDict
from copy import deepcopy
from pathlib import Path
from abc import ABC, abstractmethod
from typing import List, Dict
from collections import Counter

import gensim
import gensim.corpora as corpora
from gensim.models import FastText
from gensim.corpora import Dictionary
from gensim.models import LdaModel, LdaMulticore,CoherenceModel
from gensim.models.fasttext import FastText


from sklearn.manifold import TSNE
from nltk.tokenize import RegexpTokenizer,word_tokenize,sent_tokenize
from nltk.corpus import stopwords,wordnet
from nltk.stem import WordNetLemmatizer,SnowballStemmer
from nltk.stem.porter import *
from nltk import ngrams
from nltk import pos_tag

from nltk import WordPunctTokenizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
from pprint import pprint
from plotly.offline import plot
import plotly.graph_objects as go
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
from scipy.spatial.distance import cosine
from nltk.stem import PorterStemmer
from scipy.spatial import distance
from matplotlib import pyplot as plt
from wordcloud import WordCloud

from string import punctuation
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
 
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.edmundson import EdmundsonSummarizer

from gensim.summarization.summarizer import summarize
from sumy.summarizers.lex_rank import LexRankSummarizer 

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation,GRU, Conv1D,GlobalAveragePooling2D,SpatialDropout1D
from keras.layers import Bidirectional, GlobalMaxPool1D,concatenate,Concatenate,GlobalAvgPool1D,BatchNormalization,Flatten
from keras.layers import GlobalAveragePooling1D,GlobalMaxPooling1D,PReLU,Reshape,Conv2D,MaxPool2D,add,SpatialDropout1D
from keras.optimizers import Adam
from keras.models import Model
from keras import backend as K
from keras.engine.topology import Layer
from keras.callbacks import Callback
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import EarlyStopping,ReduceLROnPlateau,ModelCheckpoint

from IPython.display import display, Markdown, HTML, Javascript
import ipywidgets as widgets

np.random.seed(2018)

%matplotlib inline
print("Python %s" % sys.version)
print("NumPy %s" % np.__version__)
print("matplotlib %s" % matplotlib.__version__)
print("TensorFlow %s" % tf.__version__)

In [None]:
questions_to_ask = ['Corona virus for smoking, pre-existing pulmonary or lung disease', 
                    'Co-infections or co-existing respiratory and viral infections make the virus more transmissible or virulent and other co-morbidities', 
                    'Neonates and pregnant women in corona virus',
                    'Socio-economic and behavioral factors on corona virus',
                    'Transmission dynamics, reproductive number, incubation period, serial interval, modes of transmission and environmental factors on corona virus',
                    'Severity, risk of fatality among symptomatic hospitalized patients on corona virus',
                    'Susceptibility of populations on corona virus',
                    'Public health mitigation measures on corona virus']

In [None]:
# Initiate the dropdown box
dropdown_question = widgets.Dropdown(options = questions_to_ask, description='Risk factor:')

# Display the dropdown boxes
display(dropdown_question)

def execute_analysis(ev):
    display(Javascript('IPython.notebook.execute_cell_range(IPython.notebook.get_selected_index()+1, IPython.notebook.ncells())'))

button = widgets.Button(description="Execute analysis")
button.on_click(execute_analysis)
display(button)

In [None]:
asked_question = dropdown_question.value
asked_question

In [None]:
CL_INPUT_DIRECTORY = '/kaggle/input/'
CL_WORKING_DIRECTORY = '/kaggle/classification/'
W2V_WORKING_DIRECTORY = '/kaggle/w2v/'

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
stopwords_list = stopwords.words('english')
wordnet_lemmatizer = WordNetLemmatizer()
en_stop = set(nltk.corpus.stopwords.words('english'))

In [None]:
summarizer = LexRankSummarizer(Stemmer('eng'))
summarizer.stop_words = get_stop_words('eng')

# 2. Helper functions

- some helper functions for analysis

In [None]:
# some functions used in the work
def format_name(author):
    try:
        middle_name = " ".join(author['middle'])

        if author['middle']:
            return " ".join([author['first'], middle_name, author['last']])
        else:
            return " ".join([author['first'], author['last']])
    except:
        return ""


def format_affiliation(affiliation):
    try:
        text = []
        location = affiliation.get('location')
        if location:
            text.extend(list(affiliation['location'].values()))

        institution = affiliation.get('institution')
        if institution:
            text = [institution] + text
        return ", ".join(text)
    except:
        return ""

def format_authors(authors, with_affiliation=False):
    try:
        name_ls = []

        for author in authors:
            name = format_name(author)
            if with_affiliation:
                affiliation = format_affiliation(author['affiliation'])
                if affiliation:
                    name_ls.append(f"{name} ({affiliation})")
                else:
                    name_ls.append(name)
            else:
                name_ls.append(name)

        return ", ".join(name_ls)
    except:
        return ""

def format_body(body_text):
    try:
        texts = [(di['section'], di['text']) for di in body_text]
        texts_di = {di['section']: "" for di in body_text}

        for section, text in texts:
            texts_di[section] += text

        body = ""

        for section, text in texts_di.items():
            body += section
            body += "\n\n"
            body += text
            body += "\n\n"

        return body
    except:
        return ""

def format_bib(bibs):
    try:
        if type(bibs) == dict:
            bibs = list(bibs.values())
        bibs = deepcopy(bibs)
        formatted = []

        for bib in bibs:
            bib['authors'] = format_authors(
                bib['authors'], 
                with_affiliation=False
            )
            formatted_ls = [str(bib[k]) for k in ['title', 'authors', 'venue', 'year']]
            formatted.append(", ".join(formatted_ls))


        return "; ".join(formatted)
    except:
        return ""


def load_files(dirname):
    filenames = os.listdir(dirname)
    raw_files = []

    for filename in tqdm(filenames):
        # Join with os.path.join so dirname works with or without a trailing slash
        path = os.path.join(dirname, filename)
        with open(path, 'rb') as fh:
            raw_files.append(json.load(fh))
    
    return raw_files

def generate_clean_df(all_files,save_ram=True):
    """
    the main function to parse all json files
    """
    cleaned_files = []
    paper_dict = dict()

    for f in tqdm(all_files):
        file = json.load(open(f,'rb'))
        try:
            if save_ram:
                features = [
                    file['paper_id'],
                    file['metadata']['title'],
                    format_body(file['abstract']),
#                     format_body(file['body_text']),
                ]
            else:
                features = [
                    file['paper_id'],
                    file['metadata']['title'],
                    format_authors(file['metadata']['authors']),
                    format_authors(file['metadata']['authors'], with_affiliation=True),
                    format_body(file['abstract']),
                    format_body(file['body_text']),
                    format_bib(file['bib_entries']),
                    file['metadata']['authors'],
                    file['bib_entries']
                ]
            paper_dict[file['paper_id']] = f
            
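        except Exception:
            # Skip files that lack the expected fields; otherwise a stale or
            # undefined `features` list would be appended below
            continue

        cleaned_files.append(features)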
    if save_ram:
        col_names = ['paper_id', 'title', 'abstract']
    else:
        # Column order must match the order of the feature list built above
        col_names = ['paper_id', 'title', 'authors',
                     'affiliations', 'abstract', 'text', 
                     'bibliography','raw_authors','raw_bibliography'
                    ]

    clean_df = pd.DataFrame(cleaned_files, columns=col_names)
#     clean_df = clean_df.drop(['raw_authors', 'raw_bibliography'], axis=1)    
    return clean_df,paper_dict


def clean_text(x):
    # Lowercase, then strip punctuation and underscores
    regex = re.compile(r'([^\s\w]|_)+')
    sentence = regex.sub('', x.lower()).split(" ")

    # Drop stop words and one-character tokens, then lemmatize
    tokens = [wordnet_lemmatizer.lemmatize(w.strip()) for w in sentence if w not in stopwords_list and len(w)>=2]
    return tokens


def collect_simi_tokens(q_tokens,topn=20):
    simi_words = []
    simi_words += q_tokens
    for w in q_tokens:
        # Test membership on the vocab dict directly (gensim 3.x API);
        # this avoids rebuilding a list of all keys for every token
        if w in covid19_w2v_300.wv.vocab: 
            msword_1 = covid19_w2v_300.wv.most_similar(w,topn=topn)
            simi_words += list(set([v[0] for v in msword_1]))
            
        if w in covid19_w2v_30.wv.vocab:
            msword_2 = covid19_w2v_30.wv.most_similar(w,topn=topn)
            simi_words += list(set([v[0] for v in msword_2]))

        if w in covid19_w2v_50.wv.vocab:
            msword_3 = covid19_w2v_50.wv.most_similar(w,topn=topn)
            simi_words += list(set([v[0] for v in msword_3]))
            
            
    simi_words = list(set([w.lower() for w in simi_words]))
    simi_words_lem = [wordnet_lemmatizer.lemmatize(w) for w in simi_words]
    corona_words = ['corona','covid','coronavirus','2019ncov']
    return set(simi_words+simi_words_lem+corona_words)
    
    
def get_answer(question,features_list,topn,showabs = False,print_info=False):
    q_tokens = clean_text(question)
    q_tokens = set(list(collect_simi_tokens(q_tokens,10)))
    clean_df['title_abs_score'] = clean_df.apply(lambda row:scoring(q_tokens,row,features_list),axis=1)

    top_ind = clean_df.sort_values('title_abs_score',ascending=False)[:topn].index
    ids = []
    for ind in list(top_ind):
        article = clean_df.loc[ind]  # top_ind holds index labels, so look up by label
        pid = article['paper_id']
        
        if pid in paper_dict.keys():
            path = paper_dict[pid]
            f = json.load(open(path,'rb'))
            title = f['metadata']['title']
            if print_info:
                print('paper_id: {} \n'.format(pid))
                print('Title: {}\n'.format(title))
                if showabs:
                    print('Abstract: {}\n'.format(format_body(f['abstract'])))
                print('='*100)
            ids.append(pid)
        else:
            continue
    return ids
    
def get_summerization(paper_id,sentence_count=20):
    try:
        f = paper_dict[paper_id]
        file = json.load(open(f,'rb'))
        txt = format_body(file['body_text'])
        language = "english"
        parser = PlaintextParser(txt, Tokenizer(language))
        summary = summarizer(parser.document, sentence_count)
        # Print the summary sentences; the full body text is returned for the word cloud
        for sentence in summary:
            print(sentence)
        return txt
    
    except:
        return 'No text for this paper'
         
    
def scoring(q_tokens,row,features_list):
    t_tokens = []
    for f in features_list:
        t_tokens+=row[f]
    if len(t_tokens)>10: # only score articles with more than 10 tokens
        t_tokens = [w.lower() for w in t_tokens if w not in stopwords_list and len(w)>2] 
        t_tokens_lem = [wordnet_lemmatizer.lemmatize(w) for w in t_tokens if w not in stopwords_list and len(w)>2]
        t_tokens = t_tokens + t_tokens_lem
        # Score = number of query tokens that also appear in the article tokens
        return len(q_tokens.intersection(set(t_tokens)))
    else:
        return 0
    
class Lemma_Nword_phases():

    def __init__(self,
                 if_single_lemma,
                 sentence_token
                 ):
        '''
        Input for this class should be tokenized, cleaned sentence

        '''
        #Taken from Su Nam Kim Paper...
        grammar = r"""
                        NBAR:
                            {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

                        NP:
                            {<NBAR>}
                            {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
                    """

        self.wnl=WordNetLemmatizer()
        self.chunker=nltk.RegexpParser(grammar)
        self._grammar=grammar
                
        self.result_ = None
        self.tokens_= None
        self.tagged_sent_= None
        
        self._if_single_lemma = if_single_lemma
        self._sentence_token=sentence_token
    
    def sentence_token_fit(self):
        self.tagged_sent_=  pos_tag(self._sentence_token)
        if self._if_single_lemma==False:
            self.chunker = nltk.RegexpParser(self._grammar)

    
    def singleW_lemma(self):
            
        def _get_wordnet_pos(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return None
        ## parse according to pos better->good
        lemmas_sent = []
        for tag in self.tagged_sent_:
            wordnet_pos = _get_wordnet_pos(tag[1]) or wordnet.NOUN
            lemmas_sent.append(self.wnl.lemmatize(tag[0], pos=wordnet_pos).lower()) 

        # Drop stray opening-quote tokens; the original str.remove call always
        # raised and was silently ignored
        lemmas_sent = [w for w in lemmas_sent if w != '``']
        self.result_= lemmas_sent
        
def lemma_main(if_single_lemma,sentence_token):
    Lemma=Lemma_Nword_phases(if_single_lemma,sentence_token)
    Lemma.sentence_token_fit()
    Lemma.singleW_lemma()

    return ' '.join(Lemma.result_)

class Cleaning:
    def __init__(self,
                stopwords_list,
                medical_stopwords,
                sentence_token):
        
        self.sentence = sentence_token
        self.stopwords = stopwords_list
        self.medical_stopwords = medical_stopwords
        
    def _remove_nonalphaW(self):
        self.sentence = self.sentence.lower()
        
        # Keep only ASCII letters and whitespace
        self.sentence = re.sub(r'[^A-Za-z\s]+','',str(self.sentence))
            
    def _remove_stopW(self):
        self.sentence = [x.strip() for x in self.sentence.split() if (x not in self.stopwords + self.medical_stopwords)]
        
def cleaning_call(sentence):
    clean = Cleaning(stopwords_list,['title','introduction','abstract'],sentence)
    clean._remove_nonalphaW()
    clean._remove_stopW()
    return(clean.sentence)

#Helper for getting synonyms for a particular word using WordNet

from nltk.corpus import wordnet as wn

#Get all synonym sets
def get_all_synsets(word, pos=None):
    for ss in wn.synsets(word):
        for lemma in ss.lemma_names():
            yield (lemma, ss.name())

#Get all hyponyms
def get_all_hyponyms(word, pos=None):
    for ss in wn.synsets(word, pos=pos):
            for hyp in ss.hyponyms():
                for lemma in hyp.lemma_names():
                    yield (lemma, hyp.name())

#get all similar words
def get_all_similar_tos(word, pos=None):
    for ss in wn.synsets(word):
            for sim in ss.similar_tos():
                for lemma in sim.lemma_names():
                    yield (lemma, sim.name())
#Get all related 
def get_all_also_sees(word, pos=None):
        for ss in wn.synsets(word):
            for also in ss.also_sees():
                for lemma in also.lemma_names():
                    yield (lemma, also.name())

#Get all synonyms - calls above functions
def get_all_synonyms(word, pos=None):
    for x in get_all_synsets(word, pos):
        yield (x[0])
    for x in get_all_hyponyms(word, pos):
        yield (x[0])
    for x in get_all_similar_tos(word, pos):
        yield (x[0])
    for x in get_all_also_sees(word, pos):
        yield (x[0])
        
#Get synonyms main
def get_synonyms_for(word):        
    res = {word : []}
    for x in get_all_synonyms(word):
        res[word].append(x)
    return res

def preprocess_text(document):
    try:
        # Remove all the special characters
        document = re.sub(r'\W', ' ', str(document))

        # remove all single characters
        document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

        # Remove a single character at the start (anchor with ^, not a literal caret)
        document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

        # Substituting multiple spaces with single space
        document = re.sub(r'\s+', ' ', document, flags=re.I)

        # Removing prefixed 'b'
        document = re.sub(r'^b\s+', '', document)

        # Converting to Lowercase
        document = document.lower()

        # Lemmatization
        tokens = document.split()
        tokens = [wordnet_lemmatizer.lemmatize(word) for word in tokens]
        tokens = [word for word in tokens if word not in en_stop]
        tokens = [word for word in tokens if len(word) > 3]

        preprocessed_text = ' '.join(tokens)

        return preprocessed_text
    
    except:
        return ""

def read_document_metadata(data_path):
    # Get the publish time for the papers

    all_files = glob.glob(data_path + '*.csv')
    
    metadata_csv = [f for f in all_files if 'metadata' in f]
    
    metadata = pd.read_csv(metadata_csv[0])

    metadata = metadata[["sha", "publish_time"]]
    
    metadata['publish_time'] = pd.to_datetime(metadata.publish_time)

    metadata.rename(columns = {'sha':'paper_id'}, inplace = True) 
    
    return metadata

def removeWords(sentence, words_remove):
    sentence_token = sentence.split()
    sentence_token = [x for x in sentence_token if 'http' not in x and len(x) <= 25 and x not in words_remove]
    return(' '.join(sentence_token))

def show_wordcloud(data, title = None):
    """
    show word cloud visualization
    """
    wordcloud = WordCloud(
        background_color='white',
        max_words=1000,
        max_font_size=40, 
        scale=5,
        random_state=1
    ).generate(str(data))

    fig = plt.figure(1, figsize=(15,15))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=14)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

# 3. Parsing raw data:

This section parses the JSON files and generates a data frame with paper_id, title, abstract, text, etc.

- Run the generate_clean_df(all_files) function to parse all JSON files. If the data was parsed and saved before, skip this step and load it as clean_df below to save time.

- Based on the parsed data, we also generate clean_df, which stores tokens for further analysis.
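As a rough illustration of the parsing step, the sketch below extracts the three fields kept in the memory-saving mode (paper_id, title, abstract) from a single record. The sample record and the `parse_record` helper are hypothetical and written only for this example; unlike `format_body`, the sketch does not prepend section headings.

```python
import json

# Hypothetical miniature record; only the fields read by this sketch are included.
sample_json = """
{
  "paper_id": "abc123",
  "metadata": {"title": "An example title", "authors": []},
  "abstract": [
    {"section": "Abstract", "text": "First paragraph. "},
    {"section": "Abstract", "text": "Second paragraph."}
  ]
}
"""

def parse_record(raw):
    """Return (paper_id, title, abstract) from one JSON string."""
    record = json.loads(raw)
    # Concatenate the text of all abstract paragraphs, in order.
    abstract = "".join(part["text"] for part in record.get("abstract", []))
    return record["paper_id"], record["metadata"]["title"], abstract

row = parse_record(sample_json)
print(row)
```

Rows of this shape can be collected in a list and passed to `pd.DataFrame` with matching column names.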

In [None]:
# !ls '/kaggle/input/CORD-19-research-challenge'

In [None]:
DATA_PATH = '/kaggle/input/CORD-19-research-challenge/'
W2V_PATH = '/kaggle/input/ericsson-task2-pretrained-models-data/'

In [None]:
# get all files list
all_files = glob.glob(DATA_PATH+'**/**/pdf_json/*.json')
print(len(all_files))

In [None]:
%%time
merged_df,paper_dict = generate_clean_df(all_files,save_ram=True)

In [None]:
merged_df.columns

In [None]:
# clean_df = merged_df.drop('text',axis=1).copy()
# clean_df.columns

In [None]:
len(paper_dict)

In [None]:
clean_df = merged_df.copy()

In [None]:
%%time
usecols = ['title','abstract']
for f in usecols:
    print(f)
    clean_df[f] = clean_df[f].astype(str)
    clean_df[f] = clean_df[f].apply(lambda x:clean_text(x)) 

In [None]:
print(clean_df.shape)
clean_df.head()

# 4.  Models

This section shows how to categorize risk factors using different models.

- 4.1: Word2Vec (w2v): we use multiple pretrained Word2Vec models to build a recommender that shows the most relevant papers for a given input question

- 4.2: Topic modeling: we use LDA to train a topic model and generate a rank score for each paper based on risk factors

- 4.3: Pseudo-labeling / adaptive training: we use the w2v models from 4.1 to label some papers with risk factors and train an RNN to learn the classification
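The pseudo-labeling idea in 4.3 can be sketched in a few lines, under simplifying assumptions: each risk factor is represented by a keyword set, and a paper receives the label whose keywords overlap most with its tokens. The keyword sets, the `pseudo_label` helper, and the `min_overlap` threshold are hypothetical and not taken from the notebook.

```python
# Hypothetical keyword sets; in the notebook these would come from the
# word2vec-expanded question tokens.
risk_keywords = {
    "smoking": {"smoking", "smoker", "tobacco", "lung"},
    "pregnancy": {"pregnant", "neonate", "maternal", "infant"},
}

def pseudo_label(tokens, keywords, min_overlap=2):
    """Return the best-matching label, or None if the overlap is too small."""
    token_set = set(tokens)
    # Count how many keywords of each label occur in the paper tokens.
    scores = {label: len(token_set & words) for label, words in keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_overlap else None

print(pseudo_label(["tobacco", "smoker", "cohort"], risk_keywords))
print(pseudo_label(["protein", "structure"], risk_keywords))
```

Papers that fall below the threshold stay unlabeled, so a supervised model would be trained only on the more confident assignments.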



## 4.1 train word2vec

We train three Word2Vec models with different vector sizes; we found that combining them captures similar words in the medical domain more robustly.
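The retrieval step built on these models can be sketched without gensim: expand the question tokens with their nearest neighbours, then rank papers by how many expanded tokens they contain. The `neighbours` table below is a hypothetical stand-in for `most_similar` lookups, and `expand` and `rank` are illustrative names, not functions from this notebook.

```python
# Hypothetical neighbour table standing in for most_similar() lookups.
neighbours = {
    "smoking": ["smoker", "tobacco"],
    "lung": ["pulmonary", "respiratory"],
}

def expand(query_tokens, table):
    """Add the neighbours of every query token to the query set."""
    expanded = set(query_tokens)
    for token in query_tokens:
        expanded.update(table.get(token, []))
    return expanded

def rank(query_tokens, docs, table, topn=2):
    """Rank documents by the number of expanded query tokens they contain."""
    q = expand(query_tokens, table)
    scored = [(len(q & set(tokens)), doc_id) for doc_id, tokens in docs.items()]
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:topn] if score > 0]

docs = {
    "p1": ["tobacco", "pulmonary", "outcome"],
    "p2": ["protein", "structure"],
    "p3": ["smoking", "cessation"],
}
top = rank(["smoking", "lung"], docs, neighbours)
print(top)
```

Using several embedding sizes, as the notebook does, simply means merging the neighbour lists from each model before scoring.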

In [None]:
#concatenate all text
# texts = pd.concat([clean_df['title'],clean_df['abstract']]).values

In [None]:
# %%time

# #train w2v with dim=300, epochs should be >=50; if loading pretrained models, skip this
# covid19_w2v_300 = gensim.models.Word2Vec(size=300, window=5, min_count=5, sg=2, workers=4)
# covid19_w2v_300.build_vocab(texts)
# covid19_w2v_300.train(sentences=texts,total_examples=len(texts),epochs=100) 

In [None]:
# %%time
# # train w2v with dim=50,if load pretrained ,skip this
# covid19_w2v_50 = gensim.models.Word2Vec(size=50, window=3, min_count=5, sg=2, workers=4)
# covid19_w2v_50.build_vocab(texts)
# covid19_w2v_50.train(sentences=texts,total_examples=len(texts),epochs=100) 

In [None]:
# %%time
# # train w2v with dim=30,if load pretrained ,skip this
# covid19_w2v_30 = gensim.models.Word2Vec(size=30, window=3, min_count=5, sg=2, workers=4)
# covid19_w2v_30.build_vocab(texts)
# covid19_w2v_30.train(sentences=texts,total_examples=len(texts),epochs=100) 

In [None]:
# working_directory = W2V_WORKING_DIRECTORY
# if not os.path.isdir(working_directory):
#     print(f'Creating directory... {working_directory}')
#     os.makedirs(working_directory)

In [None]:
# with open(W2V_WORKING_DIRECTORY+'covid19_w2v_300.pickle','wb') as handle:
#     pickle.dump(covid19_w2v_300,handle)
# handle.close()

# with open(W2V_WORKING_DIRECTORY+'covid19_w2v_30.pickle','wb') as handle:
#     pickle.dump(covid19_w2v_30,handle)
# handle.close()

# with open(W2V_WORKING_DIRECTORY+'covid19_w2v_50.pickle','wb') as handle:
#     pickle.dump(covid19_w2v_50,handle)
# handle.close()

In [None]:
# The with-statement closes each file automatically, so no explicit close() is needed
with open(W2V_PATH+'ericsson_covid19_w2v_300_2.pickle','rb') as handle:
    covid19_w2v_300 = pickle.load(handle)

with open(W2V_PATH+'ericsson_covid19_w2v_30.pickle','rb') as handle:
    covid19_w2v_30 = pickle.load(handle)

with open(W2V_PATH+'ericsson_covid19_w2v_50.pickle','rb') as handle:
    covid19_w2v_50 = pickle.load(handle)


In [None]:
Factor_1 = asked_question

print('=========================================== QUESTION====================================================== \n')
print('Risk Factor: ',Factor_1)
ids = get_answer(Factor_1,['title'],topn=10,showabs=True,print_info=True)
#check detail text body in top recommended articles
print('=========================================== SUMMARY ========================================================  \n')
print('SUMMARY for article: {}'.format(ids[0]))
txt = get_summerization(ids[0],sentence_count=5)
print('========================================= Visualization ===================================== \n')
wordcloud = WordCloud(background_color="white", max_words=500000, contour_width=3, contour_color='steelblue')
#Generate a word cloud
wordcloud.generate(txt)
#Visualize the word cloud
wordcloud.to_image()

In [None]:
del clean_df # release some memory
gc.collect()

## 4.2. Topic modeling

In this section, we use LDA topic modeling to generate a relevance score for each risk factor.

4.2.1 Configuration parameters that control how the models are built and how the training data is parsed and loaded

*load_pretrained_models* - If False, the models are built from scratch; if True (the default), the notebook locates and loads the pretrained LDA topic model and fastText word embeddings. Loading requires Internet to be enabled; otherwise set *load_pretrained_models* to False.

*fasttext_wv_model_fname* - If *load_pretrained_models* is True, the notebook expects the model to have been added via "Add data" from the URL and to be available under the file name "/kaggle/input/covid19-lda-topics-wv-ft/ericsson_covid19_wv_ft.pickle"

*topic_model_fname* - If *load_pretrained_models* is True, the notebook expects the model to have been added via "Add data" from the URL and to be available under the file name "/kaggle/input/covid19-lda-topics-wv-ft/ericsson_covid19_wv_ft_lda_topic.pickle"

*train_lda_multiple_steps* - If True, several LDA topic models are trained in multiple steps and the best one is chosen based on its coherence value.

In [None]:
load_pretrained_models = True

train_lda_multiple_steps = False

fasttext_wv_model_fname = W2V_PATH +"ericsson_covid19_wv_ft.pickle"

**4.2.1 Data loading and preprocessing**

Read and parse the JSON files, then preprocess and clean the data using a lemmatizer and stemmer. We also remove stop words, punctuation, and other unwanted characters, using custom helper functions for cleaning and preprocessing the text.
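A minimal sketch of the cleaning step, assuming a tiny hand-written stop-word list in place of NLTK's English list: lowercase the text, keep only letters and whitespace, and drop stop words. The `clean` function and `STOP` set are hypothetical names for this illustration, and lemmatization is left out.

```python
import re

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list.
STOP = {"the", "of", "in", "and", "a", "is"}

def clean(sentence, stop_words=STOP):
    """Lowercase, strip non-letters, and remove stop words."""
    lowered = sentence.lower()
    # Remove everything except lowercase letters and whitespace.
    letters_only = re.sub(r"[^a-z\s]+", "", lowered)
    return [tok for tok in letters_only.split() if tok not in stop_words]

print(clean("The risk of COVID-19 in smokers is higher."))
```

Note that stripping non-letters also removes digits, so a token such as "COVID-19" is reduced to "covid".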

In [None]:
merged_df['abstract'].fillna('', inplace=True)
    
merged_df['abstract_cleansed'] = merged_df['abstract'].apply(lambda x: cleaning_call(x))
    
merged_df['abstract_cleansed'] = merged_df['abstract_cleansed'].apply(lambda x: lemma_main(True,x))
    
merged_df['abstract_cleansed'].fillna("", inplace=True)
    
word_to_remove_abstract = ['et al','et all','no info','report','describe','abstract','permission','word','objective','study','background','authorfunder','without','right','reserve','reuse','text','allow','count']

word_to_remove_body = ['introduction','objective','preprint','peerreviewed','copyright','grant','biorxiv','doi','permissionthe','holder','httpsdoiorg10110120200110901801','license',
    'medrxiv','license','perpetuityis','display','perpetuitythe','et al','et','al','info']

words_remove = list(set(word_to_remove_body + word_to_remove_abstract))

merged_df['abstract_cleansed'] = merged_df['abstract_cleansed'].apply(removeWords, words_remove = words_remove)

merged_df.shape

In [None]:
merged_df.head()

4.2.2 Plot the most common key words in the data set as a word cloud 

In [None]:
show_wordcloud(merged_df['abstract_cleansed'])

4.2.3 Feed the preprocessed sentences to a tokenizer function and get text tokens using a word punctuation tokenizer.
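For readers without NLTK at hand, the behaviour of a word-punctuation tokenizer can be approximated with one regular expression that splits text into runs of word characters and runs of punctuation. This is a sketch of the idea only; `word_punct_tokenize` is a hypothetical helper, not the NLTK class used in the cell below.

```python
import re

# Runs of word characters, or runs of characters that are neither word nor space.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return WORD_PUNCT.findall(text)

print(word_punct_tokenize("covid-19 spread, fast"))
```

Because the text has already had its special characters replaced by spaces in `preprocess_text`, the punctuation branch rarely fires in this pipeline and the result is close to whitespace splitting.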


In [None]:
final_corpus = [preprocess_text(sentence) for sentence in merged_df['abstract_cleansed'] if sentence.strip() !='']

word_punctuation_tokenizer = nltk.WordPunctTokenizer()

word_tokenized_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in final_corpus]

4.2.4 Create the dictionary and bag of words corpus to feed to the topic model.
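Conceptually, the dictionary maps each distinct token to an integer id, and the bag-of-words corpus stores each document as (id, count) pairs. The pure-Python sketch below shows that structure; `build_vocab` and `doc2bow` are hypothetical helpers, and the ids gensim assigns may differ from the first-seen order used here.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in tokenized_docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def doc2bow(doc, vocab):
    """Return the document as a sorted list of (token_id, count) pairs."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

docs = [["virus", "spread", "virus"], ["spread", "risk"]]
vocab = build_vocab(docs)
bows = [doc2bow(d, vocab) for d in docs]
print(vocab)
print(bows)
```

Word order is discarded in this representation, which is what LDA expects as input.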

In [None]:
id2word = corpora.Dictionary(word_tokenized_corpus)

corpus = [id2word.doc2bow(text) for text in word_tokenized_corpus]

4.2.5 Build the LDA model. The function below searches for an optimum model by building several LDA models, varying the number of topics each time, and choosing the one with the highest coherence value. Check the models and their coherence values for different k in the plot below.
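The selection rule itself is a simple argmax over the coherence values. The sketch below shows it on made-up numbers; the scores are purely illustrative and are not results from this notebook, and `pick_best` is a hypothetical helper.

```python
def pick_best(num_topics_options, coherence_values):
    """Return (num_topics, coherence) for the highest coherence value."""
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return num_topics_options[best_index], coherence_values[best_index]

# Candidate topic counts for start=2, limit=16, step=2.
ks = list(range(2, 16, 2))
# Made-up coherence scores, one per candidate, for illustration only.
scores = [0.41, 0.52, 0.48, 0.47, 0.45, 0.44, 0.40]

best_k, best_score = pick_best(ks, scores)
print(best_k, best_score)
```

When several values of k score similarly, preferring the smaller k is a common tie-breaker because the topics are easier to interpret.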

In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus,  #corpus
                                           id2word=dictionary, #dictionary passed to this function (not the global id2word)
                                           num_topics=num_topics, #num of topics
                                           random_state=2020, #
                                           update_every=1, # how often the model parameters need to be updated 
                                           chunksize=100,   # number of documents to be used in each training chunk
                                           passes=10,   #total number of training passes
                                           alpha='auto',  #parameter affecting the sparsity of the topics 
                                           per_word_topics=True)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

In [None]:
if train_lda_multiple_steps == True:
    model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=word_tokenized_corpus, start=2, limit=16, step=2)
    optimal_model = model_list[1] # index 1 is num_topics=4 (start=2, step=2): the model with the highest coherence value, found on a sufficiently large data set
else:
    optimal_model = gensim.models.ldamodel.LdaModel(corpus=corpus,  #corpus
                                           id2word=id2word, #dictionary
                                           num_topics=4, #num of topics
                                           random_state=2020, #
                                           update_every=1, # how often the model parameters need to be updated
                                           chunksize=100,   # number of documents to be used in each training chunk
                                           passes=10,   # total number of training passes
                                           alpha='auto',  # parameter affecting the sparsity of the topics
                                           per_word_topics=True)
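
The index `1` above is hard-coded. As a hedged alternative, the sketch below picks the position of the highest coherence score programmatically. `pick_best_num_topics` is a hypothetical helper, and the scores in the example are made up for illustration; they are not results from this notebook.

```python
def pick_best_num_topics(topic_counts, coherence_values):
    """Return (index, num_topics, coherence) for the highest coherence score."""
    if len(topic_counts) != len(coherence_values):
        raise ValueError("topic_counts and coherence_values must have the same length")
    best_index = max(range(len(coherence_values)), key=lambda i: coherence_values[i])
    return best_index, topic_counts[best_index], coherence_values[best_index]

# Made-up scores for illustration only.
example_counts = list(range(2, 16, 2))  # 2, 4, ..., 14, matching start=2, limit=16, step=2
example_scores = [0.41, 0.52, 0.48, 0.47, 0.45, 0.44, 0.40]

best_index, best_k, best_score = pick_best_num_topics(example_counts, example_scores)
print(best_index, best_k, best_score)
```

With real values, `model_list[best_index]` would then be the candidate for `optimal_model`.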

4.2.6 Plot the coherence values for the models built for different k

In [None]:
if load_pretrained_models == False and train_lda_multiple_steps == True:  # coherence_values and model_list only exist after the multi-step training above
    limit=16; start=2; step=2;
    x = range(start, limit, step)
    plt.plot(x, coherence_values)
    plt.xlabel("Num Topics")
    plt.ylabel("Coherence score")
    plt.legend(["coherence_values"], loc='best')  # a list, so the whole string is used as one label
    plt.show()

    # Print the coherence score of the first model in the list (perplexity not printed)
    coherence_model_lda = CoherenceModel(model=model_list[0], texts=word_tokenized_corpus, dictionary=id2word, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print('\nCoherence Score: ', coherence_lda)

    for m, cv in zip(x, coherence_values):
        print("Num Topics =", m, " has Coherence Value of", round(cv, 4)) # higher coherence is better; from the output choose the model with the highest coherence or the least k

4.2.7 Get the topics for the optimum model.

In [None]:
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))

4.2.8 Plot the topics. It might take a while for it to plot based on the size of the data.

Notes:
****
1. Each bubble on the left-hand side plot represents a topic. The larger the bubble, the more prevalent that topic is.
2. A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.
3. A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.

In [None]:
def plot_topics(optimal_model, corpus, id2word):
    pyLDAvis.enable_notebook()
    vis = pyLDAvis.gensim.prepare(optimal_model, corpus, id2word)
    pyLDAvis.save_html(vis, '/kaggle/working/lda.html')
    return vis

plot_topics(optimal_model, corpus, id2word)

4.2.9 Add a column to the data set to indicate the topic each document belongs to. Adding the topic info to the data set is useful because we can use it to test our model's accuracy or to build an analyser that identifies the relevant papers on a particular topic.
For example, if a topic is about virus risks, we can use that topic and the papers corresponding to it to answer our query (such as: what are the potential risks of the virus for neonates and pregnant women?).

In [None]:
def format_topics_sentences(doc_ids, ldamodel=optimal_model, corpus=corpus, texts=merged_df):
    """Get the dominant topic, its percentage contribution and its keywords for each document,
    and add them to the original text.
    
    Parameters
    ----------
    doc_ids: Series
        The paper_id column, whose values are unique
        
    ldamodel: LDA
        The optimum model, i.e. the one with the highest coherence value
    
    corpus: corpus
        The bag-of-words representation created from the dictionary of word tokens in the text documents

    texts: Series
        The text the model was fitted on
        
    Returns
    -------
    sent_topics_df: DataFrame
        The document ids and text, with the dominant topic, its percentage contribution
        to the document and the topic keywords added as columns.
    """
    
    rows = []  # one [topic, contribution, keywords] record per document

    for i, row in enumerate(ldamodel[corpus]):
        # row[0] holds the (topic, probability) pairs because the model uses per_word_topics=True
        row = sorted(row[0], key=lambda x: (x[1]), reverse=True)
        
        if row:
            topic_num, prop_topic = row[0]  # dominant topic = highest probability
            wp = ldamodel.show_topic(topic_num)
            topic_keywords = ", ".join([word for word, prop in wp])
            rows.append([int(topic_num), round(prop_topic,4), topic_keywords])

    # DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    ids = pd.Series(doc_ids)
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([ids, sent_topics_df, contents], axis=1)
    return(sent_topics_df)


df_topic_sents_keywords = format_topics_sentences(merged_df['paper_id'], ldamodel=optimal_model, corpus=corpus, texts=merged_df['abstract_cleansed'])

df_dominant_topic = df_topic_sents_keywords.reset_index()

df_dominant_topic.columns = ['Row_id','Paper_ids', 'DominantTopic', 'TopicPercContrib', 'Keywords','Text']

df_dominant_topic.head(10)
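
The core step of `format_topics_sentences` is choosing the topic with the highest probability for a document. A minimal sketch of that step on its own, assuming a topic distribution given as (topic id, probability) pairs; `dominant_topic` and the example values are hypothetical:

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the highest probability, or None if empty."""
    if not topic_distribution:
        return None
    return max(topic_distribution, key=lambda pair: pair[1])

# Made-up distribution for one document (illustration only).
example_distribution = [(0, 0.10), (1, 0.65), (3, 0.25)]
print(dominant_topic(example_distribution))
```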

In [None]:
del model_topics #Do some cleaning
del id2word
del corpus
gc.collect()

# 4.3 Feature selection using semantic similarity, synonyms and tfidf transform.

In this section we create vector representations to find semantic similarities between words, and couple them with word synonyms and a tf-idf transform to build a relevant feature extractor.

* Vector representations to find semantic similarities between the words
* A utility to find synonyms of features
* A tf-idf transform and a significant feature extractor based on the tf-idf score
* In the end we can use the result to find the cluster of topics (relevant paper ids) that belong to a particular query, and validate and compare it against the LDA topics in section 4.2

4.3.1 Create a word-embedding model and use its vector representations to find semantic similarities between words in the text.

We train for 50 iterations and use the model to find the most similar words for a given input word.

The other parameters of the word model are the embedding size, which corresponds to the dimension of each word vector (60); the window size, which is the number of words considered before and after the target word (40); the minimum frequency a word must have in the corpus for a representation to be generated (5); and the down-sampling threshold for the most frequently occurring words (1e-2).

In [None]:
embedding_size = 60    
window_size = 40   
min_word = 5 
down_sampling = 1e-2 

In [None]:
%%time
#Load the pretrained model if available (enable internet if downloading from an external source). Otherwise build the model.
if load_pretrained_models == True:
    ft_model = pickle.load(open(fasttext_wv_model_fname, 'rb'))
else:
    ft_model = FastText(word_tokenized_corpus,
                        size=embedding_size,
                        window=window_size,
                        min_count=min_word,
                        sample=down_sampling,
                        sg=1,
                        iter=50)

4.3.2 Check the word representation for a word of interest ("virus" in this example), which is represented as a 60-dimensional vector.

In [None]:
print(ft_model.wv['virus'])

4.3.3 Check the similarity metric (cosine similarity) for two words that might be related, using the model. A high value indicates high similarity, so we use this metric as the basis for identifying the most similar words for a given set of words.

In [None]:
print(ft_model.wv.similarity(w1='virus', w2='infection'))
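
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in pure Python on toy vectors. `cosine_sim` is a hypothetical helper written for illustration, not the model's own implementation.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_sim([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
print(cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.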

4.3.4 Helper functions for building a dictionary of related words, and functions that look up relevant features based on that custom dictionary.

* For a given text, get all similar words from the model and build a dictionary; also extend the dictionary with the synonyms of the words.

* Use tf-idf to transform the text into a numerical representation. Then filter on the tf-idf weights to form a dominant subset of text (COVID-19 papers) as the search target.

* Use the tf-idf features and their scores together with the custom dictionary (built using the FastText model for the most similar words and WordNet for synonyms) to extract the relevant papers from the COVID data set. The related words of the input query, which is a given sentence or word, are matched against the tf-idf features of each paper.
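
As a rough illustration of the tf-idf step, the sketch below computes smoothed, L2-normalised tf-idf weights in pure Python, using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`. `tfidf_rows` and `example_docs` are hypothetical names, and the sketch leaves out the stop-word, `min_df` and `max_df` filtering that the notebook's vectorizer applies.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per tokenized document (smoothed IDF, L2-normalised)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

# Toy tokenized documents (illustration only).
example_docs = [["virus", "risk"], ["virus", "pregnant", "women"]]
example_rows = tfidf_rows(example_docs)
print(example_rows[0])
```

A term that appears in every document ("virus") ends up with a lower weight than a term specific to one document ("risk"), which is what makes the score useful for ranking papers.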

In [None]:
def fit_idf_get_word_counts(count_vectorizer, fullDocPreprocessed):
    """Build the vocabulary of words from the full text as a matrix of word counts (feature names as columns),
    then compute the IDF (inverse document frequency) by fitting a tfidf transformer on the word count sparse matrix.
     
    Parameters
    ----------
    count_vectorizer: CountVectorizer
        The count_vectorizer created with min document frequency of 2, max 95%, and stop words removed. 

    fullDocPreprocessed: Series
        The preprocessed document text used to build the vocabulary.
    
    Returns
    -------
    tfidf_transformer: TfidfTransformer
        The transformer fitted on the word count matrix, holding the IDF weight of each feature name in the vocabulary.
    """
    sparse_matrix = count_vectorizer.fit_transform(fullDocPreprocessed)

    tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)

    tfidf_transformer.fit(sparse_matrix)
    
    return tfidf_transformer

    
def preprocess_tokenize_text(query):
    """Preprocess a list of sentences and get the word tokens of the first non-empty one.
    Parameters
    ----------
    query: List
        The input query text as a list of strings; callers wrap a single query string in a list. 
    
    Returns
    -------
    word_tokens: List
        List of word tokens
    """
    preprocessed_sentences = [preprocess_text(sentence) for sentence in query if sentence.strip() !='']
    word_punctuation_tokenizer = nltk.WordPunctTokenizer()
    word_tokens = [word_punctuation_tokenizer.tokenize(sent) for sent in  preprocessed_sentences]

    return word_tokens[0]


def sort_tfid_score(tfidf_vector):
    """Utility function to sort the values of the tfidf vector in descending order.
    Parameters
    ----------
    tfidf_vector: Matrix
        The tfidf transform of one document in COO format, with feature indices as columns and tfidf scores as values. 
    
    Returns
    -------
    col_index_score: List
        (column index, score) tuples sorted by score in descending order
    """
    col_index_score = zip(tfidf_vector.col, tfidf_vector.data)
    return sorted(col_index_score, key=lambda x: (x[1], x[0]), reverse=True) 


def extract_topn_from_vector(vocabulary, sorted_tfidf_scores,  all_related_features, top_n=1000):
    
    """Look up feature names (a word or its related words) among the top n tfidf features of one document
    and return each feature name found together with its tfidf score.
     
    Parameters
    ----------
    vocabulary: List
        The list of feature names
        
    sorted_tfidf_scores: List
        (feature index, score) tuples for one document, sorted by score in descending order
        
    all_related_features: List
       All related features for a given feature.
       
    top_n: int
        The number of top-scoring features to consider for the search.
    
    Returns
    -------
    Dict
        The matching feature names as keys and their tfidf scores (summed per feature name) as values.
    """
   
    sorted_tfidf_scores = sorted_tfidf_scores[:top_n] #Top n items (it is important to tune this so that we have a decent representation of the document)

    related_features = set(all_related_features)  # set for fast membership tests

    results = {}
    for index, score in sorted_tfidf_scores:
        feature_name = vocabulary[index]
        if feature_name in related_features:
            # sum the scores if the same feature name is found more than once
            results[feature_name] = results.get(feature_name, 0) + score
    
    return results


def extract_features_from_tfidf_vector(document, look_for, tf_idf_vector, feature_names, topn = 1000):
    
    """Process the tf-idf score matrix row by row: rank each document's features by tf-idf score,
    keep the top n, and sum the scores of the features that match the look_for list.
     
    Parameters
    ----------
    document: Series
        The input text for training, which has the full vocabulary of feature names (not used in the computation)
        
    look_for: List
        List of feature names to look for in each row
        
    tf_idf_vector: Matrix
       tfidf transformation of the document text

    feature_names: List
        List of features in the vocabulary
       
    topn: int
        The number of top-scoring features per document to consider for the search.
    
    Returns
    -------
    List
        One value per document: the summed tfidf score of the matching features.
    """
    
    results = []

    for i in range(tf_idf_vector.shape[0]):
    
        tfidf_vecto_for_doc = tf_idf_vector[i]
    
        sorted_items=sort_tfid_score(tfidf_vecto_for_doc.tocoo())

        keywords=extract_topn_from_vector(feature_names,sorted_items,look_for, topn)
    
        results.append(sum(keywords.values()))
        
    return results

def get_scores_for_word_group(documents, 
                              words_input,
                             tf_idf_vector,
                              feature_names,
                             topn = 1000):  #extract_relevant_text_on_keywords
    
    """Use the feature extractor to score each document for each input word.
    Iterates through each input word and its related words; the word and its related words are matched
    against the tfidf feature names, and the matching scores are summed per document.
     
    Parameters
    ----------
    documents: Series
        The input text for training, which has the full vocabulary of feature names
        
    words_input: Dict
        word as key and list of related words as value

    tf_idf_vector: Matrix
        tfidf transformation of the document text
        
    feature_names: List
        List of features in the vocabulary
    topn: int
        The number of top-scoring features per document to consider for the search.
    
    Returns
    -------
    results: DataFrame
        One column per input word and one row per document, holding the summed tfidf score of the matching features
    """
    
    results = {}
    
    for word_group, word_list_to_search in  words_input.items():
        temp_list = word_list_to_search
        search_for_features = list(set(temp_list + [word_group] ))
        sum_scores_matching_words = extract_features_from_tfidf_vector(documents, 
                                                           search_for_features, 
                                                           tf_idf_vector, 
                                                           feature_names,
                                                           topn)
        
        results[word_group] = sum_scores_matching_words
        
    return pd.DataFrame(results)

def get_most_similar_keywords_from_model_for_sentence(sentence, topn_similar_words = 5, cosine_similarity = 0.6):
    """Get the most similar words (only the topn_similar_words) for the input sentence that have a similarity score
    above cosine_similarity.
     
    Parameters
    ----------
    sentence: String
        The sentence text for which to find the most similar words 
        
    topn_similar_words: int
        Get the top n similar words out of all the similar words
        
    cosine_similarity: float
        Keep only the similar words with a score above this threshold
    
    Returns
    -------
    Dict
        The input sentence as key and the list of related words as value
    """
    result = {}
    result[sentence] = [item[0] for item in ft_model.wv.most_similar(sentence)[:topn_similar_words] 
                        if item[1] > cosine_similarity] 
    return result

def get_most_similar_keywords_from_model_and_wordnet_for_sentence(sentence, topn_similar_words = 5, cosine_similarity = 0.6):
    """Get the most similar words (only the topn_similar_words) for the input sentence that have a similarity score
    above cosine_similarity. Also add the synonyms of each word in the sentence.
     
    Parameters
    ----------
    sentence: String
        The sentence text for which to find the most similar words 
        
    topn_similar_words: int
        Get the top n similar words out of all the similar words
        
    cosine_similarity: float
        Keep only the similar words with a score above this threshold
    
    Returns
    -------
    Dict
        The input sentence and each of its word tokens as keys, and the list of related words as value
    """
    result = {}
    result[sentence] = [item[0] for item in ft_model.wv.most_similar(sentence)[:topn_similar_words] 
                        if item[1] > cosine_similarity] 
    for word in preprocess_tokenize_text([sentence]):  # the tokenizer expects a list of sentences
        synonyms_set = set(get_synonyms_for(word).get(word)) # Enrich the semantically similar word list further: combine the FastText model results with WordNet synonyms to generate a full list of related words
    
        related_words = set(result.get(word, []))  # a token may not have an entry yet
    
        result[word] = list(synonyms_set.union(related_words))  
        
    return result

def get_most_similar_keywords_from_model(word_tokens, topn_similar_words = 5, cosine_similarity = 0.6):
    
    """Get the most similar words (only the topn_similar_words) for the words in word_tokens that have a similarity score
    above cosine_similarity.
     
    Parameters
    ----------
    word_tokens: List
        The list of tokens, feature names or words for which to find the most similar words 
        
    topn_similar_words: int
        Get the top n similar words out of all the similar words
        
    cosine_similarity: float
        Keep only the similar words with a score above this threshold
    
    Returns
    -------
    Dict
        Each input token as key and the list of related words as value
    """
    
    semantically_similar_words = {words: [item[0] 
                                          for item in ft_model.wv.most_similar(words)[:topn_similar_words] 
                                          if item[1] > cosine_similarity] #Uses the fast text model
                                          for words in word_tokens}
    return semantically_similar_words

def get_most_similar_keywords_from_model_and_wordnet(input_tokens, topn_similar_words=5, cosine_similarity = 0.6):
    """Get the most similar words (only the topn_similar_words) for the words in input_tokens that have a similarity score
    above cosine_similarity. Also add the synonyms of each word to its similar words list.
     
    Parameters
    ----------
    input_tokens: List
        The list of tokens, feature names or words for which to find the most similar words 
        
    topn_similar_words: int
        Get the top n similar words out of all the similar words
        
    cosine_similarity: float
        Keep only the similar words with a score above this threshold
    
    Returns
    -------
    Dict
        Each input token as key and the list of related words and synonyms as value
    """
    semantically_similar_words = get_most_similar_keywords_from_model(input_tokens, topn_similar_words, cosine_similarity) 

    for word in input_tokens:
        synonyms_set = set(get_synonyms_for(word).get(word)) # Enrich the semantic similar word list further - Combine the fast text model results(semantically similar) with wordnet synonyms and generate a full list of related words    
    
        related_words = set(semantically_similar_words.get(word))
    
        semantically_similar_words[word] = list(synonyms_set.union(related_words))  
        
    return semantically_similar_words

def filter_documents_by_score(intermediate_results_df, scores_above = 0):
    """Function to filter scores and get the subset with scores above the threshold.
     
    Parameters
    ----------
    intermediate_results_df: Dataframe
        The processed dataframe that has the relevant papers got from the analyser and score per feature
        
    scores_above: int
        score per feature threshold default 0
    
    Returns
    -------
    DataFrame
        The dataframe with rows having Score greater than threshold
    """
    
    all_numeric_cols_above_threshold = lambda col_name:  col_name > scores_above if np.isreal(col_name) else True 

    intermediate_results_df = intermediate_results_df[intermediate_results_df.applymap(all_numeric_cols_above_threshold).all(1)]
    
    return intermediate_results_df

def get_combined_word_scores_sorted(intermediate_results_df):
    """Function to sum up the individual per-feature scores into a total score per document and sort by it.
     
    Parameters
    ----------
    intermediate_results_df: Dataframe
        The processed dataframe that has the relevant papers from the analyser and the score per feature
    
    Returns
    -------
    DataFrame
        The dataframe with a new column Score, which holds the summed per-feature scores of each document, sorted in descending order.
    """

    intermediate_results_df['Score'] = intermediate_results_df[intermediate_results_df.select_dtypes(include=np.number).columns].sum(axis=1)

    intermediate_results_df = intermediate_results_df.sort_values(by = ['Score'],  ascending=False, inplace = False)
    
    return intermediate_results_df

def join_docs_with_scores(merged_df, intermediate_results_df):
    """Function to merge two data frames based on the index of the right argument.
    Gets all the documents (papers and their paper_id) from the original dataframe that are contained in intermediate_results_df

    Parameters
    ----------
    merged_df : DataFrame
        The input document text dataframe with paper_id and all relevant columns of the training text
    intermediate_results_df: Dataframe
        The processed dataframe that has the relevant papers from the analyser
    
    Returns
    -------
    DataFrame
        The relevant papers from the original dataframe (left) that are contained in the data frame on the right, joined on row position
        
    """
    
    doc_subset = merged_df.iloc[intermediate_results_df.index][['paper_id','title', 'abstract']] #Choose columns to show to end user
        
    return pd.concat([doc_subset.reset_index(drop=True), intermediate_results_df.reset_index(drop=True)], axis=1)


def get_documents_sorted_by_relevance(queries, document, tf_idf_vector, feature_names):
    """Get the documents which have a non-zero tfidf score for the features extracted from the input query.
    Each input query is parsed, and the features to search for in the documents are
    the query words, their semantically similar words and their synonyms.
    The tfidf scores of the matching features are then added up and returned as the total score for the input query.

    Parameters
    ----------
    queries : List
        The input search query list, parsed one by one if the list contains multiple queries
    document: Series
        The preprocessed document text
    tf_idf_vector: Matrix
        The tfidf transformation of the document text; the number of top features considered per document is limited to 10000 
    feature_names: List
        The list of features in the vocabulary
    
    Returns
    -------
    results: Dict
        For each query, a DataFrame of the relevant papers ordered by score, with paper_id, text, score and metadata such as publish time
    all_words_dict: Dict
        For each query, the dictionary of query words and their related words
        
    """
    
    results = {} 
    
    all_words_dict = {}

    meta_data = read_document_metadata(DATA_PATH)
    
    for query in queries:
        
        word_tokens = preprocess_tokenize_text([query])

     #   all_related_keywords_dict = get_most_similar_keywords_from_model_for_sentence(query, topn_similar_words = 20, 
      #                                                                              cosine_similarity = 0.5)
        
        #all_related_keywords_dict = get_most_similar_keywords_from_model_and_wordnet_for_sentence(query, topn_similar_words = 20, 
         #                                                                           cosine_similarity = 0.5)
        
        all_related_keywords_dict = get_most_similar_keywords_from_model_and_wordnet(word_tokens, topn_similar_words = 20, 
                                                                                   cosine_similarity = 0.6)
        
        #all_related_keywords_dict = get_most_similar_keywords_from_model(word_tokens, topn_similar_words = 20, 
                                                                             #    cosine_similarity = 0.6)
 
        scores_for_each_word_group_df = get_scores_for_word_group(document, 
                                                                  all_related_keywords_dict,
                                                                 tf_idf_vector,
                                                                 feature_names,
                                                                 topn = 10000)

        scores_for_each_word_group_df = filter_documents_by_score(scores_for_each_word_group_df, scores_above = 0)

        final_result = get_combined_word_scores_sorted(scores_for_each_word_group_df)
        
        final_result = join_docs_with_scores(merged_df, final_result)

        final_result =  final_result.merge(meta_data, how='inner', left_on='paper_id', right_on='paper_id')
        
        results[query] = final_result
        
        all_words_dict[query] = all_related_keywords_dict
            
    return results, all_words_dict


def display_results(relevant_papers, year_filter = 2016):
    """Prints the relevant papers for each query

    Parameters
    ----------
    relevant_papers : Dict
        For each query, the DataFrame of relevant papers retrieved for the search, with scores and publish time
    
    year_filter : int
        The filter for publish time: show papers with a publish year greater than or equal to this year
        
    Returns
    -------
    None
        
    """
        
    for query_input, relevant_paper in relevant_papers.items(): 
        #relevant_paper['publish_time'] = relevant_paper[relevant_paper.publish_time.dt.year > year_filter]
        relevant_paper = relevant_paper[pd.to_datetime(relevant_paper.publish_time).dt.year >= year_filter]
        print(f"Retrieved {relevant_paper.shape[0]} papers related to - \"{query_input}\"")
        display(relevant_paper[['paper_id', 'title', 'abstract', 'publish_time', 'Score']].
                 sort_values(by=['Score'], ascending = False))
    
    return None

4.3.5 Get the list of questions to get the relevant papers from the model

In [None]:
queries = questions_to_ask

4.3.6 Use tf-idf to transform the text into a numerical representation. Then filter on the tf-idf weights to form a dominant subset of results (COVID-19 papers) for the input query.

In [None]:
%%time

count_vectorizer = CountVectorizer(max_df=0.95, min_df = 2, stop_words='english') #adjust min_df, max_df accordingly

#Build the vocabulary of words from the full text (learn): a document-term matrix of word counts on which the IDF weights are fitted
tfidf_transformer = fit_idf_get_word_counts(count_vectorizer, merged_df['abstract_cleansed'])

#Compute the tfidf weights for the documents from which we extract the relevant features
tf_idf_vector = tfidf_transformer.transform(count_vectorizer.transform(merged_df['abstract_cleansed']))

#Use the tfidf vector to extract the relevant papers and sort them by tfidf score
relevant_papers, all_words_dict = get_documents_sorted_by_relevance(queries, 
                                                     merged_df['abstract_cleansed'], 
                                                     tf_idf_vector, 
                                                     count_vectorizer.get_feature_names())
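
The ranking logic of the helper functions can be summarised in a few lines. The sketch below is a simplified pure-Python restatement under stated assumptions: each document is a hypothetical `{feature: tf-idf score}` dict and each query word maps to a list of related words. `score_documents` and the example data are illustrative names and values, not part of the notebook.

```python
def score_documents(doc_scores, word_groups):
    """Rank documents by the summed tf-idf score of each query word and its related words.

    A document is kept only if every word group matches with a positive score,
    mirroring the filter-then-sum-then-sort steps of the helper functions.
    """
    ranked = []
    for doc_index, row in enumerate(doc_scores):
        group_scores = []
        for word, related in word_groups.items():
            features = set(related) | {word}
            group_scores.append(sum(row.get(feature, 0.0) for feature in features))
        if all(score > 0 for score in group_scores):
            ranked.append((doc_index, sum(group_scores)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Made-up scores for three documents (illustration only).
example_doc_scores = [
    {"smoking": 0.5, "lung": 0.5},
    {"smoke": 0.3, "pulmonary": 0.6, "risk": 0.2},
    {"pregnant": 0.9},
]
example_word_groups = {"smoking": ["smoke"], "lung": ["pulmonary"]}
example_ranking = score_documents(example_doc_scores, example_word_groups)
print(example_ranking)
```

The third document is dropped because neither word group matches it, and the remaining two are ordered by their total score.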

4.3.7 Display the results, i.e. the relevant papers for each query. The results can be filtered to display only the latest papers: the call below shows all papers published in or after 2018. Change the year_filter accordingly to get the required list of papers.

In [None]:
display_results(relevant_papers, year_filter = 2018)

4.3.8 Highlight the key features that match in the relevant papers retrieved from the analysis

In [None]:
# For each query, collect the keywords (related words of the query tokens) to highlight

words_to_color_dict = {}

for query_input, relevant_paper in relevant_papers.items():
    # Create a list of the keywords
    words_to_color_list = []
    for words in all_words_dict.get(query_input).values():
            for word in words:
                words_to_color_list.append(word)
    #[[words_to_color_list.add(word) for word in words] for words in all_words_dict.get(query_input).values()]
    
    words_to_color_dict[query_input] = words_to_color_list

In [None]:
# Print the recommended papers.
for query_input, words_to_color in words_to_color_dict.items():
    display(Markdown('# <span style="color:#1174E6;font-weight:bold">Result</span>'))

    articles_topn_df = relevant_papers.get(query_input).head(5)
    
    if len(articles_topn_df) == 0:
        display(Markdown('No relevant papers found on <span style="background-color:#FADA5E;font-weight:bold">' + query_input + '</span>.'))
    else:
        if len(articles_topn_df) == 1:
            display(Markdown('To learn more about <span style="background-color:#FADA5E;font-weight:bold">' + query_input + '</span> we recommend exploring the following paper.'))
        else:
            display(Markdown('To learn more about <span style="background-color:#FADA5E;font-weight:bold">' + query_input + '</span> we recommend exploring the following ' + str(len(articles_topn_df)) + ' papers.'))

        for index, row in articles_topn_df.iterrows():
            print(row['title'])

        for index, row in articles_topn_df.iterrows():
            display(Markdown('### <span style="color:#1174E6;font-weight:bold"> Paper ' + str(index+1) + ': ' + row['title'] + '</span>'))
            print(row['abstract'])
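
The cell above prints the abstracts as plain text. A hedged sketch of how matching keywords could be wrapped in a highlighted `<span>`, using the same background colour as the query text above: `highlight_keywords` is a hypothetical helper, and it assumes whole-word, case-insensitive matching is what is wanted.

```python
import re

def highlight_keywords(text, keywords, color="#FADA5E"):
    """Wrap whole-word, case-insensitive keyword matches in a highlighted <span>."""
    # Longest keywords first so that longer alternatives win over their prefixes.
    words = sorted({word for word in keywords if word}, key=len, reverse=True)
    if not words:
        return text
    pattern = re.compile(r"\b(" + "|".join(re.escape(word) for word in words) + r")\b", re.IGNORECASE)
    return pattern.sub(
        lambda match: f'<span style="background-color:{color};font-weight:bold">{match.group(0)}</span>',
        text,
    )

example_html = highlight_keywords("Smoking raises risk; risky habits matter.", ["smoking", "risk"])
print(example_html)
```

In the notebook, the result could be passed to `display(Markdown(...))` with `words_to_color` as the keyword list instead of printing the raw abstract.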

In [None]:
del relevant_papers # release some memory
del articles_topn_df
del words_to_color_dict
gc.collect()

## 4.4. Classification of questions based on RNN

This section uses the pretrained word2vec models from section 4 to generate tags (classes) for supervised learning based on an RNN. It has to be run separately because its processing schedule is different. After completing this section, we can save the NN model as model.h5.

The general pipeline is:

- Use the pretrained w2v models to find relevant keywords for the basic risk factors, then create a dictionary for each class

- Label each article, using its abstract and conclusion, based on the above dictionary

- Build an RNN (LSTM, GRU) to learn the classification using the tensorflow.keras APIs



`NOTE: this section should be completed individually and save the model.h5`
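
The labelling step of the pipeline can be pictured with a small sketch. It assumes a simple rule: a text gets the class whose keywords occur most often among its tokens. The actual rule used later in the notebook may differ. `label_text` and `example_classes` are hypothetical names with a shortened keyword list.

```python
def label_text(text, keyword_classes):
    """Return the class whose keywords occur most often in the text, or None if nothing matches."""
    tokens = text.lower().split()
    best_label, best_hits = None, 0
    for label, keywords in keyword_classes.items():
        keyword_set = set(keywords)
        hits = sum(1 for token in tokens if token in keyword_set)
        if hits > best_hits:  # ties keep the class that was listed first
            best_label, best_hits = label, hits
    return best_label

# Shortened, illustrative keyword dictionary.
example_classes = {
    "lung disease": ["lung", "pulmonary", "smoking"],
    "neonate and pregnant women": ["pregnant", "neonate", "newborn"],
}
print(label_text("smoking damages lung tissue", example_classes))
```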



In [None]:
# define all constants 
# data path
CL_DATA_NAME = "CORD-19"
CL_TRAINING_DATA_RATIO = 0.8 # Training data ratio
CL_VALIDATION_DATA_RATIO = 1.0 - CL_TRAINING_DATA_RATIO

# Preprocessing
CL_SEQUENCE_LENGTH = 256  # Feature fixed sequence length
CL_SHUFFLE_BUFFER_SIZE = 1024 # Dataset shuffling buffer size
CL_NUM_PARALLEL_CALLS = 4 # number of CPUs to use for training

# Training
CL_MODEL_ARCHITECTURE_NAME = "cls" # Model name used in the checkpoint file name
CL_MAX_VOCABULARY_SIZE = 500000 # Maximum vocabulary size
CL_EMBEDDING_DIM = 256 # Embedding dimension
CL_ACTIVATION_LAYER_UNITS = 128 # Number of GRU units
CL_BATCH_SIZE = 128 # Batch size
CL_EPOCHS = 30 # Number of epochs
CL_NUM_WORKERS = 4

> ### 4.4.1 Create a dictionary for 8 classes

- Create labels for 8 classes (the 8 risk factors) by mapping keywords found with the word2vec models pretrained in section 4


- Save the dictionary for future processing
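
Saving the dictionary could look like the minimal sketch below. It uses a toy dictionary and a hypothetical file name (`keywords_classes.json` in the system temporary directory); neither is taken from the notebook. JSON is only one possible format.

```python
import json
import os
import tempfile
from collections import OrderedDict

# Toy stand-in for keywords_classes; the real lists are defined in the next cell.
example_classes = OrderedDict()
example_classes['lung disease'] = ['lung', 'pulmonary', 'smoking']
example_classes['transmission dynamics'] = ['transmission', 'incubation']

# Hypothetical output location; adjust to the notebook's working directory.
example_path = os.path.join(tempfile.gettempdir(), 'keywords_classes.json')

with open(example_path, 'w', encoding='utf-8') as fh:
    json.dump(example_classes, fh, indent=2)

# object_pairs_hook keeps the class order, which determines the label indices.
with open(example_path, 'r', encoding='utf-8') as fh:
    reloaded_classes = json.load(fh, object_pairs_hook=OrderedDict)
```

Keeping the order matters because the integer label of each class is derived from its position in the dictionary.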

In [None]:
keywords_classes = OrderedDict()

keywords_classes['lung disease'] = ['lung','pulmonary','smoke','vascular','smoking','nicotine','asthma',\
                            'airflow','cigar','corona','covid','coronavirus','2019ncov']

keywords_classes['co-infection and co-morbidity'] = ['coinfection','coinfections','coinfected','coinfecting',\
                'codetection','copathogen','codetections','comorbidity','comorbidities',
                'corona','covid','coronavirus','2019ncov']

keywords_classes['neonate and pregnant women'] = ['woman', 'women','pregnant', 'neonate','newborn',\
                             'infant','gestation','neonatal','corona','covid','coronavirus','2019ncov']

keywords_classes['socioeconomic'] = ['socioeconomic', 'economic', 'social', 'socio','society','demographic',
    'behavioral', 'finance', 'psychosocial', 'behavioural','corona','covid','coronavirus','2019ncov'
]

keywords_classes['transmission dynamics'] = [
    'transmission', 'transmissible', 'crosstransmission','humantohuman','incubation','persontoperson',\
    'geographic', 'geographical','temporal', 'spatial','spatio','prevalence',\
    'reproductive','reproduce','asymptomatic','seasonality','airborne','aerosol','corona','covid','coronavirus',\
    '2019ncov','droplets','secretions','latent','R0'  # R0 is matched case-sensitively as a plain substring
]

keywords_classes['severity of disease'] = ['risk', 'risks','fatality','mortality','symptomatic','older','aged','senior',
                                           'morbidity','death','corona','covid','coronavirus','2019ncov'
                                           ]


keywords_classes['susceptibility of populations'] = ['susceptibility', 'population','sensitivity','sensitive','susceptible',
                                                     'corona','covid','coronavirus','2019ncov']


keywords_classes['public control'] = [
                'public', 'health','control', 'mitigation','mitigating','planning','prevention','publichealth',\
                'preparedness','intervention','corona','covid','coronavirus','2019ncov'
]

### 4.4.2 Generate label dictionaries

In [None]:
label_int_to_str = OrderedDict()
for i, label in enumerate(keywords_classes.keys()):
    label_int_to_str[i] = label

label_str_to_int = OrderedDict()
for i, label in enumerate(keywords_classes.keys()):
    label_str_to_int[label] = i

In [None]:
print(label_int_to_str,label_str_to_int)

In [None]:
total_classes = len(label_str_to_int)

### 4.4.3 Parsing the dataset


In [None]:
for dirname, _, _ in os.walk(CL_INPUT_DIRECTORY):
    print(dirname)

In [None]:
publication_directory = os.path.join(CL_INPUT_DIRECTORY,'CORD-19-research-challenge')

In [None]:
publication_names = [
    'custom_license',
    'biorxiv_medrxiv',
    'comm_use_subset',
    'noncomm_use_subset'
]

> ### 4.4.4 Create publication class

In [None]:
class PublicationModel(object):
    """Publication class to store and process publication information."""
    
    def __init__(self, name: str, directory: str):
        """Initialize publication object.
        
        Arguments:
            name {str} -- the name of the publication.
            directory {str} -- the directory of the publication.
        """
        self.name = name
        self.directory = directory
        self.article_filenames: List[str] = []
        self.articles: List['ArticleModel'] = []
            
    def collect_articles(self):
        """Collect all article filenames in publication directory."""
        self.article_filenames = os.listdir(os.path.join(self.directory,
                                                         self.name, self.name,
                                                         'pdf_json'))

    def __repr__(self):
        """Return publication short information."""
        return (f'{self.name} with {len(self.article_filenames)} articles.')

In [None]:
# loop and find all articles
publications: List[PublicationModel] = []
for publication_name in publication_names:
    publication = PublicationModel(publication_name, publication_directory)
    publication.collect_articles()
    print(publication)
    publications.append(publication)

In [None]:
total_articles = 0
for publication in publications:
    total_articles += len(publication.article_filenames)

print('total articles:', total_articles)

### 4.4.5 Create Article model 

- ArticleModel class: holds the information for each paper: paper_id, title, abstract, conclusion (if present), normalized keyword frequency and the assigned label
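
The labeling rule can be illustrated with a small standalone sketch. The functions `score_text` and `pick_label` and the toy data are hypothetical names introduced here; they mirror the idea of the class below (the fraction of a class's keywords found in the text, then the best-scoring class) and are not the notebook's own code.

```python
from typing import Dict, List, Optional, Tuple

def score_text(text: str, classes: Dict[str, List[str]]) -> Dict[str, float]:
    """Fraction of each class's keywords that occur in the text as substrings."""
    scores = {}
    for class_name, keywords in classes.items():
        hits = sum(1 for keyword in keywords if keyword in text)
        scores[class_name] = hits / len(keywords)
    return scores

def pick_label(scores: Dict[str, float]) -> Tuple[Optional[str], float]:
    """Return the best-scoring class, or (None, 0.0) if nothing matched."""
    label, best = max(scores.items(), key=lambda item: item[1])
    return (label, best) if best > 0 else (None, 0.0)

toy_classes = {
    'lung disease': ['lung', 'smoking', 'asthma', 'pulmonary'],
    'severity of disease': ['fatality', 'death', 'aged', 'risk'],
}
toy_abstract = 'Smoking history and asthma were managed as lung comorbidities.'
toy_scores = score_text(toy_abstract, toy_classes)
toy_label = pick_label(toy_scores)
```

Two properties of plain substring matching show up in this toy case: the match is case-sensitive, so `'smoking'` does not match `'Smoking'`, and it ignores word boundaries, so `'aged'` matches inside `'managed'`. Lowercasing the text or matching whole tokens would be possible refinements.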

In [None]:
class ArticleModel(object):
    """Article class to store the entire article information."""
    
    def __init__(self, paper_id: str, title: str, abstract: str, conclusion: str):
        """Initialize article object.
        
        Arguments:
            paper_id {str} -- 40-character sha1 of the PDF.
            title {str} -- the title of the article.
            abstract {str} -- the text abstract of the article.
            conclusion {str} -- the text conclusion of the article.
        """
        self.paper_id = paper_id
        self.title = title
        self.abstract = abstract
        self.conclusion = conclusion
        self.normalized_keywords_frequency: Dict[str, float] = OrderedDict()
        self.label_nfreq: (str, float) = (None, 0.0)
            
        
    def get_normalized_keywords_frequency(self, keywords_classes: Dict[str, List[str]]) -> Dict[str, float]:
        """Get normalized keywords frequency.
        
        Arguments:
            keywords_classes {Dict[str, List[str]]} -- keywords classes.
        Returns:
            Dict[str, float] -- pair of class and the normalized frequency.
        """
        for class_name, keywords in keywords_classes.items():
            frequency = 0
            for keyword in keywords:
                if keyword in self.abstract or keyword in self.conclusion:
                    frequency += 1

            normalized_frequency = float(frequency) / len(keywords)
            self.normalized_keywords_frequency[class_name] = normalized_frequency
            
        return self.normalized_keywords_frequency
    
    
    def assign_label(
        self, keywords_classes: Dict[str, List[str]]) -> (str, float):
        """Assign label from keywords classes.
        
        Arguments:
            keywords_classes {Dict[str, List[str]]} -- keywords classes.
        Returns:
            (str, float) -- class label and the normalized frequency.
        """
        normalized_keywords_frequency = \
            self.get_normalized_keywords_frequency(keywords_classes)
        # get the label based on the highest frequency
        label, nfreq = sorted(normalized_keywords_frequency.items(),
                              key=lambda x: x[1],
                              reverse=True)[0]
        if nfreq:
            self.label_nfreq = (label, nfreq)
        else:
            self.label_nfreq = (None, 0.0)
        
        return self.label_nfreq
        

    def __repr__(self):
        """Return article in readable string."""
        return (f'{self.paper_id}\n{self.title}\n\n' +
                f'Abstract:\n{self.abstract}\n\n' +
                f'Conclusion:\n{self.conclusion}\n')

### 4.4.6 Article Ingestor 

- This class parses the JSON file of each article and stores the result in an ArticleModel instance
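
As a rough illustration of the fields involved, the sketch below extracts the title, abstract and conclusion from a made-up record. The record only imitates the keys that the ingestor reads (`paper_id`, `metadata.title`, `abstract`, `body_text`); its values are invented for the example.

```python
import json

# Minimal record shaped like the fields the ingestor reads; the values are made up.
sample_record = json.loads('''
{
  "paper_id": "example-id",
  "metadata": {"title": " An example title "},
  "abstract": [{"text": "First abstract paragraph."},
               {"text": "Second abstract paragraph."}],
  "body_text": [{"section": "Introduction", "text": "Background."},
                {"section": "Conclusions", "text": "Main finding."}]
}
''')

title = sample_record.get('metadata', {}).get('title', '').strip()
abstract = ' '.join(p.get('text', '') for p in sample_record.get('abstract', []))
# Keep only body paragraphs whose section name mentions "conclusion".
conclusion = ' '.join(
    p.get('text', '')
    for p in sample_record.get('body_text', [])
    if 'conclusion' in p.get('section', '').lower())
```

Articles without a section named like "conclusion" simply end up with an empty conclusion string.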

In [None]:
class ArticleIngestorInterface(object):
    """Abstract class for ingestor classes."""

    allowed_extensions: List[str] = []

    @classmethod
    def can_ingest(cls, path: str) -> bool:
        """Check if file extension is supported for ingestion.

        Arguments:
            path {str} -- article file location.
        Returns:
            bool -- True if the extension is supported and False otherwise.
        """
        ext = path.split('.')[-1].lower()
        return ext in cls.allowed_extensions

    @classmethod
    @abstractmethod
    def parse(cls, path: str) -> ArticleModel:
        """Parse article from file and store them in ArticleModel object.

        Arguments:
            path {str} -- article file location.
        Returns:
            ArticleModel -- article object.
        """
        pass

In [None]:
class JsonArticleIngestor(ArticleIngestorInterface):
    """Object that ingests JSON file into article model."""

    allowed_extensions: List[str] = ['json']

    @classmethod
    def parse(cls, path: str) -> ArticleModel:
        """Parse article from file and store them in ArticleModel object.
        
        Arguments:
            path {str} -- article file location.
        Returns:
            ArticleModel -- article object.
        """
        if not cls.can_ingest(path):
            raise ValueError(f'Cannot ingest {path}: unsupported file extension')

        with open(path, 'r', encoding='utf-8-sig') as fh:
            article_json = json.load(fh)
            title = article_json.get('metadata', {}).get('title', '')
            abstract = ' '.join([
                p.get('text', '') 
                for p in article_json.get('abstract', [])])
            conclusion = ' '.join([
                p.get('text', '')
                for p in article_json.get('body_text', [])
                if 'conclusion' in p.get('section', '').lower()])
            article = ArticleModel(paper_id=article_json.get('paper_id', ''),
                                   title=title.strip(),
                                   abstract=abstract.strip(),
                                   conclusion=conclusion.strip())

        return article

In [None]:
# check one sample json 
sample_json_article_path = os.path.join(
    publication_directory,
    'custom_license',
    'custom_license',
    'pdf_json',
    '0a52a3d2793f8ca8a4d6f6630e986ea1da115f80.json')

In [None]:
sample_article = JsonArticleIngestor.parse(sample_json_article_path)

In [None]:
# check a sample article to show its content: paper_id, title, abstract and conclusion (if present)
sample_article

In [None]:
# check the normalized keyword frequencies: they show how much weight this sample article has for each category we created
sample_article.get_normalized_keywords_frequency(keywords_classes)

In [None]:
# check the assigned label and its score
sample_article.assign_label(keywords_classes)

### 4.4.7 Ingest articles

In [None]:
def ingest_articles(publications: List[PublicationModel],keywords_classes: Dict[str, List[str]]) -> int:
    """Ingest articles.

    Arguments:
        publications {List[PublicationModel]} -- list of publications.
        keywords_classes {Dict[str, List[str]]} -- keywords classes.
    Returns:
        int -- total articles.
    """
    total_articles: int = 0
    for publication in publications:
        publication.articles = []
        article_filenames = publication.article_filenames
        for article_filename in article_filenames:
            # Get and parse article
            article_path = os.path.join(publication.directory,
                                        publication.name, publication.name,
                                        'pdf_json',
                                        article_filename)
            article = JsonArticleIngestor.parse(article_path)
            article.assign_label(keywords_classes)
            publication.articles.append(article)
            total_articles += 1
            
    return total_articles

In [None]:
%%time
total_articles = ingest_articles(publications, keywords_classes)

In [None]:
total_articles

### 4.4.8 Calculate class Frequency

- Count how many articles were assigned to each class
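
The same count can be sketched with `collections.Counter`. The list `assigned_labels` is hypothetical and stands for the labels produced by the labeling step, with `None` for articles in which no keyword matched.

```python
from collections import Counter

# Hypothetical labels from the labeling step; None means no keyword matched.
assigned_labels = ['lung disease', None, 'public control', 'lung disease', None]

label_counts = Counter(label for label in assigned_labels if label is not None)
labeled_total = sum(label_counts.values())
unlabeled_total = len(assigned_labels) - labeled_total
```

Unlike the function below, a `Counter` omits classes with zero articles, which is why the notebook initializes every class to 0 first.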

In [None]:
def get_classes_frequency(
    publications: List[PublicationModel],
    keywords_classes: Dict[str, List[str]]) -> Dict[str, int]:
    """Get classes frequency.

    Arguments:
        publications {List[PublicationModel]} -- list of publications.
        keywords_classes {Dict[str, List[str]]} -- keywords classes.
    Returns:
        Dict[str, int] -- class and the frequency pairs.
    """
    classes_frequency: Dict[str, int] = OrderedDict()
    for class_name, _ in keywords_classes.items():
        classes_frequency[class_name] = 0
    
    for publication in publications:
        for article in publication.articles:
            # Get class label
            label = article.label_nfreq[0]
            if label:
                classes_frequency[label] += 1
            
    return classes_frequency

In [None]:
%%time
classes_frequency = get_classes_frequency(publications, keywords_classes)

In [None]:
classes_frequency

In [None]:
total_labeled_articles = sum([freq for _, freq in classes_frequency.items()])

In [None]:
total_labeled_articles

### 4.4.9 Build data set using data generator API in TensorFlow 2.0 
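
The random split into training and validation data can be sketched in plain Python. The helper `split_examples` and its `seed` parameter are assumptions made for this illustration; the notebook draws from `np.random.random()` without a fixed seed, so its split differs between runs.

```python
import random
from typing import List, Tuple

def split_examples(examples: List[str], training_ratio: float,
                   seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly assign each example to training or validation."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    training, validation = [], []
    for example in examples:
        if rng.random() <= training_ratio:
            training.append(example)
        else:
            validation.append(example)
    return training, validation

toy_examples = [f'article {i}' for i in range(1000)]
toy_training, toy_validation = split_examples(toy_examples, training_ratio=0.8)
```

Because each article is assigned independently, the training share is only approximately 80%.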

In [None]:
training_data_ratio = CL_TRAINING_DATA_RATIO #use 80% of data as training
print(training_data_ratio)

In [None]:
def build_dataset(publications: List[PublicationModel],keywords_classes: Dict[str, List[str]]):
    """
    Build dataset with text and label pairs.

    Arguments:
        publications {List[PublicationModel]} -- list of publications.
        keywords_classes {Dict[str, List[str]]} -- keywords classes.
    Returns:
        tuple -- training, validation and testing datasets (tf.data.Dataset),
            each followed by its list of articles; testing holds the unlabeled articles.
    """
    training_texts: List[str] = []
    training_articles: List[ArticleModel] = []
    validation_texts: List[str] = []
    validation_articles: List[ArticleModel] = []
    testing_texts: List[str] = []
    testing_articles: List[ArticleModel] = []

    for publication in publications:
        for article in publication.articles:
            # Combine abstract and conclusion into text
            text = []
            if article.abstract:
                text.append(article.abstract)
            if article.conclusion:
                text.append(article.conclusion)
            if not text:
                continue
            text = ' '.join(text)
            # Get class label
            label = article.label_nfreq[0]
            if label in keywords_classes:
                if np.random.random() <= training_data_ratio:
                    # Training
                    training_texts.append((text, label))
                    training_articles.append(article)
                else:
                    # Validation
                    validation_texts.append((text, label))
                    validation_articles.append(article)
            else:
                # Testing
                testing_texts.append(text)
                testing_articles.append(article)

    training_dataset = tf.data.Dataset.from_generator(
        lambda: training_texts,
        (tf.string, tf.string),
        (tf.TensorShape([]), tf.TensorShape([])))
    validation_dataset = tf.data.Dataset.from_generator(
        lambda: validation_texts,
        (tf.string, tf.string),
        (tf.TensorShape([]), tf.TensorShape([])))
    testing_dataset = tf.data.Dataset.from_generator(
        lambda: testing_texts,
        (tf.string),
        (tf.TensorShape([])))

    return (training_dataset, training_articles,
            validation_dataset, validation_articles,
            testing_dataset, testing_articles)

In [None]:
%%time
(training_dataset, training_articles,validation_dataset, validation_articles,testing_dataset, testing_articles)\
    = build_dataset(publications,label_str_to_int)

In [None]:
# check that it works: take 2 items from training_dataset
for i, (text, label) in enumerate(training_dataset.take(2)):
    print(i, '\n')
    print(f'text:\n{text}\n')
    print(f'label:\n{label}')
    print('='*100)

In [None]:
len(training_articles)

### 4.4.10 Data pipeline 

- Tokenize the data

- get sequence for sentences

- pad sequences for sentences
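
The three steps can be sketched without TensorFlow. `simple_tokenize` is a simplified, hypothetical stand-in for the Keras `text_to_word_sequence` helper used below, and `toy_vocab` is invented for the example. Short texts are padded on the left with 0, the same index this notebook uses for unknown words.

```python
import re
from typing import Dict, List

def simple_tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (simplified stand-in)."""
    return re.findall(r'[a-z0-9]+', text.lower())

def encode(text: str, vocab: Dict[str, int], length: int) -> List[int]:
    """Map tokens to ids, truncate to `length`, and left-pad with 0."""
    ids = [vocab.get(token, 0) for token in simple_tokenize(text)][:length]
    return [0] * (length - len(ids)) + ids

toy_vocab = {'the': 1, 'virus': 2, 'spreads': 3}
short_ids = encode('The virus spreads.', toy_vocab, length=5)
long_ids = encode('the virus spreads the virus spreads', toy_vocab, length=4)
```

Here `short_ids` is padded to five entries and `long_ids` keeps only the first four tokens, so every text yields a fixed-length integer sequence.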

In [None]:
def tokenize(words: str) -> List[str]:
    """Tokenize text of words into a list of individual words.

    Arguments:
        words {str} -- text of words.
    Returns:
        List(str) -- list of words.
    """
    return text_to_word_sequence(words)

def get_sequence_lengths(dataset) -> List[int]:
    """Get sequence length of all texts.

    Arguments:
        dataset {tf.data.Dataset} -- texts dataset.
    Returns:
        List(int) -- sequence lengths.
    """
    sequence_lengths: List[int] = []
    for text in dataset:
        if isinstance(text, tuple):
            # Remove label
            text, _ = text

        if not isinstance(text, str):
            # Convert tensor to string
            text = text.numpy().decode('utf-8')
        sequence_lengths.append(len(tokenize(text)))
        
    return sequence_lengths

In [None]:
# for training set
tic = time.time()
training_sequence_lengths = get_sequence_lengths(training_dataset)
toc = time.time()
print(f'Runtime: {int(toc-tic)} seconds')

In [None]:
# for validation set
tic = time.time()
validation_sequence_lengths = get_sequence_lengths(validation_dataset)
toc = time.time()
print(f'Runtime: {int(toc-tic)} seconds')

In [None]:
# for test set
tic = time.time()
testing_sequence_lengths = get_sequence_lengths(testing_dataset)
toc = time.time()
print(f'Runtime: {int(toc-tic)} seconds')

### 4.4.11 Word count 

- Count the words in the dataset and save them, sorted by frequency, to words.txt in the working directory

In [None]:
working_directory = CL_WORKING_DIRECTORY
print(working_directory)

In [None]:
if not os.path.isdir(working_directory):
    print(f'Creating directory... {working_directory}')
    os.makedirs(working_directory)

In [None]:
words_path = os.path.join(working_directory, 'words.txt') # the sorted vocabulary is saved to words.txt

In [None]:
def perform_word_count(publications: List[PublicationModel],words_path: str):
    """Perform word count, sort it and save it to file."""
    print('Performing word count, sorting and saving to file...')
    tic = time.time()
    # Perform word count of dataset
    text = []
    for publication in publications:
        for article in publication.articles:
            # Combine abstract and conclusion into text
            if article.abstract:
                text.append(article.abstract)
            if article.conclusion:
                text.append(article.conclusion)
    text = ' '.join(text)
    word_count = Counter(tokenize(text))
    # Sort word count.
    sorted_word_count = word_count.most_common()
    # Save words of sorted word count to file
    sorted_words, sorted_counts = zip(*sorted_word_count)
    with open(words_path, 'w',encoding="utf-8") as fh:
        for word in sorted_words:
            fh.write(word + '\n')

    toc = time.time()
    print(f'Word count done. Runtime: {int(toc-tic)} seconds.')
    
    return word_count, sorted_word_count

In [None]:
word_count, sorted_word_count = perform_word_count(publications, words_path)

In [None]:
total_unique_words = len(word_count)

In [None]:
total_unique_words

In [None]:
for cls, keywords in keywords_classes.items():
    print(f'{cls}:')
    for keyword in keywords:
        print(f'    {keyword} appears {word_count.get(keyword, 0)} times')

In [None]:
total_words = sum([count for word, count in sorted_word_count])

In [None]:
total_words

In [None]:
sorted_words, sorted_counts = zip(*sorted_word_count)

> ### 4.4.12 Words dictionary

- word to index 

- index to word
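
A toy version of the two mappings is sketched below. The sentence and the cap `max_size` are made up; the cap plays the role of the maximum vocabulary size, and index 0 is reserved for unknown words as in the notebook.

```python
from collections import Counter

toy_tokens = 'the virus spreads the virus the mutates'.split()
toy_counts = Counter(toy_tokens)

max_size = 2  # hypothetical cap, playing the role of the maximum vocabulary size
kept_words = [word for word, _ in toy_counts.most_common(max_size)]

# Index 0 is reserved for unknown words; real words start at 1.
index_to_word = {0: '<UNK>'}
index_to_word.update({i + 1: word for i, word in enumerate(kept_words)})
word_to_index = {word: i for i, word in index_to_word.items()}

encoded = [word_to_index.get(token, 0) for token in toy_tokens]
```

The most frequent word receives index 1, and any word that falls outside the cap is encoded as 0.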

In [None]:
max_vocabulary_size = CL_MAX_VOCABULARY_SIZE
print(max_vocabulary_size)

In [None]:
def generate_words_dictionaries(sorted_word_count):
    """Generate integer to word dictionary and vice versa."""
    print(f'Generating dictionaries...')
    sorted_words, sorted_counts = zip(*sorted_word_count)
    # Dictionary is up to maximum vocabulary size
    if len(sorted_words) > max_vocabulary_size:
        sorted_words = sorted_words[:max_vocabulary_size]

    # Vocabulary size with unknown
    vocabulary_size = len(sorted_words) + 1

    # Generate dictionaries
    int_to_word = {i+1: word for i, word in enumerate(sorted_words)}
    int_to_word[0] = '<UNK>'
    word_to_int = {word: i for i, word in int_to_word.items()}
    
    return int_to_word, word_to_int, vocabulary_size

In [None]:
int_to_word, word_to_int, vocabulary_size = generate_words_dictionaries(sorted_word_count)

In [None]:
assert(int_to_word[word_to_int['the']] == 'the')

In [None]:
assert(word_to_int[int_to_word[len(sorted_word_count)-1]] == len(sorted_word_count)-1)

### 4.4.13 Preprocess DataSet

In [None]:
sequence_length = CL_SEQUENCE_LENGTH
print(sequence_length)

In [None]:
def preprocessing(text: str, label: str) -> (np.ndarray, np.ndarray):
    """Pad/truncate and encode text of words from strings into integers.

    Arguments:
        text {str} -- input text.
        label {str} -- input label.
        
    Returns:
        (np.ndarray, np.ndarray) -- integer-encoded features and one-hot label.
    """
    if type(text) is not str:
        # Convert tensor to string
        text = text.numpy().decode('utf-8')
    # Tokenize text
    tokenized_text = tokenize(text)
    # Truncate text
    if len(tokenized_text) >= sequence_length:
        features = tokenized_text[:sequence_length]
    # Pad text
    else:
        features = ['<UNK>'] * (sequence_length-len(tokenized_text))
        features += tokenized_text
    # Encode text
    features_int = [word_to_int.get(feature, 0) for feature in features]
    features_int = np.array(features_int)
    
    # Encode Label
    if label != '':
        if type(label) is not str:
            # Convert tensor to string
            label = label.numpy().decode('utf-8')
            
        if label in label_str_to_int:
            label_int = label_str_to_int[label]
        else:
            raise Exception(f'{label} is not in {label_str_to_int}')
        labels_float = [0.0] * total_classes
        labels_float[label_int] = 1.0
        labels_float = np.array(labels_float)
    else:
        labels_float = np.array([])

    return features_int, labels_float

In [None]:
features_int, labels_float = preprocessing(text=' '.join(['the'] * (sequence_length-26)), label='lung disease')

### 4.4.14 Build data generator for training a NN 

In [None]:
num_parallel_calls = CL_NUM_PARALLEL_CALLS
shuffle_buffer_size = CL_SHUFFLE_BUFFER_SIZE

### For training data

In [None]:
training_preprocessing = lambda x, y: tf.py_function(preprocessing,(x, y), ((tf.int32), tf.float32))
training_features = (
    training_dataset
    .shuffle(shuffle_buffer_size)
    .map(training_preprocessing,
         num_parallel_calls=num_parallel_calls)
    .batch(len(training_articles))
    .prefetch(buffer_size=num_parallel_calls)
    .cache()
    .repeat()
)
training_features_iter = iter(training_features)
features_int, labels_float = training_features_iter.get_next()
assert(features_int.shape == (len(training_articles), sequence_length))
assert(labels_float.shape == (len(training_articles), total_classes))

### For validation data 

In [None]:
validation_preprocessing = lambda x, y: tf.py_function(preprocessing,(x, y), ((tf.int32), tf.float32))
validation_features = (
    validation_dataset
    .shuffle(shuffle_buffer_size)
    .map(validation_preprocessing,
         num_parallel_calls=num_parallel_calls)
    .batch(len(validation_articles))
    .prefetch(buffer_size=num_parallel_calls)
    .cache()
    .repeat()
)
validation_features_iter = iter(validation_features)
features_int, labels_float = validation_features_iter.get_next()
assert(features_int.shape == (len(validation_articles), sequence_length))
assert(labels_float.shape == (len(validation_articles), total_classes))

### For test data 

In [None]:
testing_preprocessing = lambda x: tf.py_function(preprocessing,(x, ''), ((tf.int32), (tf.float32)))
testing_features = (
    testing_dataset
    .map(testing_preprocessing,
         num_parallel_calls=num_parallel_calls)
    .batch(len(testing_articles))
    .prefetch(buffer_size=num_parallel_calls)
    .cache()
    .repeat()
)
testing_features_iter = iter(testing_features)
features_int, labels_float = testing_features_iter.get_next()
assert(features_int.shape == (len(testing_articles), sequence_length))
assert(labels_float.shape == (len(testing_articles), 0))

### 4.4.15 Create RNN model based on GRU 

The model structure:

- Embedding layer

- Bidirectional GRU layer

- Global average pooling layer

- Dense layers
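
To get a feeling for the size of such a network, the trainable parameters per layer can be estimated by hand. The sketch assumes the GRU bias layout that Keras uses with `reset_after=True` (two bias vectors per gate) and a hypothetical vocabulary of 10,000 words; the real vocabulary size comes from the word count above, and `model.summary()` remains the authoritative source.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim

def gru_params(input_dim: int, units: int) -> int:
    # Three gates, each with input weights, recurrent weights and two bias
    # vectors (the layout assumed here, matching reset_after=True in Keras).
    return 3 * (units * input_dim + units * units + 2 * units)

def dense_params(input_dim: int, units: int) -> int:
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

toy_vocab_size = 10_000  # hypothetical; the real value comes from the word count
toy_embedding_dim, toy_gru_units, toy_hidden_units, toy_classes = 256, 128, 32, 8

param_counts = {
    'embedding': embedding_params(toy_vocab_size, toy_embedding_dim),
    # A bidirectional layer holds one GRU per direction.
    'bidirectional_gru': 2 * gru_params(toy_embedding_dim, toy_gru_units),
    # Pooling has no parameters; its output has 2 * units features.
    'dense_hidden': dense_params(2 * toy_gru_units, toy_hidden_units),
    'dense_output': dense_params(toy_hidden_units, toy_classes),
}
```

With these numbers the embedding layer dominates the parameter count, so the vocabulary cap has the largest effect on model size.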

In [None]:
embedding_dim = CL_EMBEDDING_DIM
print(embedding_dim)

In [None]:
activation_layer_units = CL_ACTIVATION_LAYER_UNITS
print(activation_layer_units)

In [None]:
# create RNN model based on bidirectional GRU layer
model = keras.Sequential([
    keras.layers.Embedding(vocabulary_size,
                           embedding_dim,
                           embeddings_initializer='uniform'),
    keras.layers.Bidirectional(keras.layers.GRU(activation_layer_units, return_sequences=True)),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(32, activation='relu'),

    keras.layers.Dense(total_classes)
])

In [None]:
model.summary()

In [None]:
# labels are one-hot with a single class per article, so use categorical cross-entropy
model.compile(optimizer='adam',
              loss=tf.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [None]:
model_directory = os.path.join(working_directory, 'models')
model_directory

In [None]:
if not os.path.isdir(model_directory):
    print(f'Creating directory... {model_directory}')
    os.makedirs(model_directory)

In [None]:
model_checkpoint_directory = os.path.join(model_directory,
                                          'checkpoint')
model_checkpoint_directory

In [None]:
if not os.path.isdir(model_checkpoint_directory):
    print(f'Creating directory... {model_checkpoint_directory}')
    os.makedirs(model_checkpoint_directory)

In [None]:
model_architecture_name = CL_MODEL_ARCHITECTURE_NAME
model_architecture_name

In [None]:
data_name = CL_DATA_NAME
data_name

In [None]:
model_checkpoint_file_name = "%s_%s.h5" % (
    model_architecture_name, data_name)
model_checkpoint_file_name

In [None]:
model_checkpoint_path = os.path.join(model_checkpoint_directory,
                                     model_checkpoint_file_name)
model_checkpoint_path

In [None]:
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=model_checkpoint_path,
    monitor='val_accuracy', verbose=1, 
    save_best_only=True, save_weights_only=False,
    mode='max', save_freq='epoch')

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=2,
                                                           verbose=1,
                                                           mode='auto')
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=1,
                                                          verbose=0, mode='auto', min_delta=0.0001, cooldown=0, min_lr=0)

callbacks = [checkpoint,early_stopping,reduce_lr]

> ### 4.4.17 Train the model 

In [None]:
batch_size = CL_BATCH_SIZE
print(batch_size)

In [None]:
epochs = CL_EPOCHS
print(epochs)

In [None]:
num_workers = CL_NUM_WORKERS
print(num_workers)

In [None]:
# fetch one batch each, holding the full training and validation sets
training_features_int, training_labels_float =  training_features_iter.get_next()
validation_features_int, validation_labels_float = validation_features_iter.get_next()

In [None]:
history = model.fit(x=training_features_int, y=training_labels_float, 
                    batch_size=batch_size,
                    epochs=4,
                    callbacks=callbacks,
                    validation_data=(validation_features_int,
                                     validation_labels_float),
                    workers=num_workers,
                    use_multiprocessing=True,                   
                   )

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
loss = history_dict['loss']
val_loss = history_dict['val_loss']
hist_epochs = range(1, len(loss) + 1)

plt.plot(hist_epochs, loss, 'b-.', label='Training Loss')
plt.plot(hist_epochs, val_loss, 'r-.', label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
hist_epochs = range(1, len(acc) + 1)

plt.plot(hist_epochs, acc, 'b--', label='Training Accuracy')
plt.plot(hist_epochs, val_acc, 'r--', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

### 4.4.18 Evaluation and prediction
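
The ranking idea used in this section (turn logits into probabilities, then list the highest-scoring articles per class) can be sketched in plain Python. The functions `softmax` and `top_k_per_class` and the toy logits are illustrative assumptions, not the notebook's implementation, which relies on a Keras `Softmax` layer and `np.argsort`.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def top_k_per_class(probabilities: List[List[float]], k: int) -> List[List[int]]:
    """For each class, return the indices of the k highest-scoring articles."""
    num_classes = len(probabilities[0])
    ranking = []
    for c in range(num_classes):
        order = sorted(range(len(probabilities)),
                       key=lambda i: probabilities[i][c], reverse=True)
        ranking.append(order[:k])
    return ranking

toy_logits = [[2.0, 0.0], [0.5, 1.5], [3.0, -1.0]]  # 3 articles, 2 classes
toy_probabilities = [softmax(row) for row in toy_logits]
toy_ranking = top_k_per_class(toy_probabilities, k=2)
```

In this toy case article 2 ranks first for class 0 and article 1 ranks first for class 1.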

In [None]:
model.load_weights(model_checkpoint_path)

In [None]:
loss, accuracy = model.evaluate(
    x=validation_features_int, y=validation_labels_float,
    batch_size=batch_size,
    workers=num_workers,
    use_multiprocessing=True)

In [None]:
print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')

In [None]:
probability_model = tf.keras.Sequential([model,tf.keras.layers.Softmax()])

In [None]:
def make_prediction(dataset, dataset_size, preprocessing, model):
    """Run the model on the whole dataset and return its predictions."""
    # Make data pipeline
    features = (
        dataset
        .map(preprocessing,
             num_parallel_calls=num_parallel_calls)
        .batch(dataset_size)
        .prefetch(buffer_size=num_parallel_calls)
        .cache()
        .repeat()
    )
    features_iter = iter(features)
    features_int, _ = features_iter.get_next()
    # Make prediction
    predictions = model.predict(
        x=features_int,
        batch_size=batch_size,
        verbose=1,
        workers=num_workers, use_multiprocessing=True)
    
    return predictions

In [None]:
def plot_prediction_result_per_class(predictions):
    """Plot, for each class, the prediction scores sorted in descending order."""
    predictions_t = predictions.transpose()
    # Argsort predictions
    sorted_predictions_indices = np.array([
        sorted_prediction[::-1] 
        for sorted_prediction in np.argsort(predictions_t)
    ])
    # Plot
    labels = list(label_str_to_int.keys())
    fig = plt.figure()
    ax = fig.add_subplot(111)
    for i, sorted_predictions_index in enumerate(sorted_predictions_indices):
        value = [predictions_t[i][j] for j in sorted_predictions_index]
        ax.plot(value, label=labels[i])
    plt.legend()
    plt.title('Sorted Prediction Result per Class')
    plt.xlabel('Articles')
    plt.ylabel('Prediction Result')
    plt.show()

In [None]:
def print_top_articles_per_class(predictions, articles, top_k=3):
    """Print the top_k highest-scoring articles for each class."""
    predictions_t = predictions.transpose()
    # Argsort predictions
    sorted_predictions_indices = np.array([
        sorted_prediction[::-1] 
        for sorted_prediction in np.argsort(predictions_t)
    ])
    # Print top articles
    labels = list(label_str_to_int.keys())

 

    for i, sorted_predictions_index in enumerate(sorted_predictions_indices):
#         if i==3:
#             print(sorted_predictions_index)
        print("*" * 100)
        print(f'category: {labels[i]}')
        print("*" * 100)
        for j in sorted_predictions_index[:top_k]:
            print(f'Score: {predictions_t[i][j]:.2f}')
            print(articles[j])
            print('\n')
            print("=" * 100)

In [None]:
training_dataset_predictions = make_prediction(
    dataset=training_dataset,
    dataset_size=len(training_articles),
    preprocessing=training_preprocessing,
    model=probability_model
)

In [None]:
print_top_articles_per_class(
    predictions=training_dataset_predictions,
    articles=training_articles,
    top_k=5
)

In [None]:
valid_dataset_predictions = make_prediction(
    dataset=validation_dataset,
    dataset_size=len(validation_articles),
    preprocessing=validation_preprocessing,
    model=probability_model
)

In [None]:
print_top_articles_per_class(
    predictions=valid_dataset_predictions,
    articles=validation_articles,
    top_k=5
)

In [None]:
test_dataset_predictions = make_prediction(
    dataset=testing_dataset,
    dataset_size=len(testing_articles),
    preprocessing=testing_preprocessing,
    model=probability_model
)

In [None]:
print_top_articles_per_class(
    predictions=test_dataset_predictions,
    articles=testing_articles,
    top_k=5
)

In [None]:
from IPython.display import display, Markdown

top_k = 5

# Transpose so that each row holds one class's scores across all training articles
predictions_t = training_dataset_predictions.transpose()
# For each class, article indices sorted by descending score
sorted_predictions_indices = np.array([
    sorted_prediction[::-1]
    for sorted_prediction in np.argsort(predictions_t)
])
labels = list(label_str_to_int.keys())

for i, sorted_predictions_index in enumerate(sorted_predictions_indices):
    display(Markdown('# <span style="color:#1174E6;font-weight:bold;font-size: 20pt">List of Articles </span>'))
    for j in sorted_predictions_index[:top_k]:
        article = training_articles[j]
        display(Markdown('# <span style="color:#FADA5;font-weight:bold;font-size: 10pt"> Sub Task '+str(i+1)+ ': '+ str(labels[i]) +'</span>'))
        display(Markdown('# <span style="color:#FADA5;font-weight:bold;font-size: 10pt"> Weightage'+ ': '+ str(predictions_t[i][j]) +'</span>'))
        display(Markdown('# <span style="color:#FADA5;font-weight:bold;font-size: 10pt"> Paper Id'+ ': '+ str(article.paper_id) +'</span>'))
        display(Markdown('# <span style="color:#FADA5;font-weight:bold;font-size: 10pt"> Title'+ ': '+ str(article.title) +'</span>'))
        display(Markdown('# <span style="color:#FADA5;font-size: 10pt;font-weight:normal"><p style="font-weight:bold;">Abstract'+ ': '+ '</p> '+ str(article.abstract) +'</span>'))
        display(Markdown('# <span style="color:#FADA5;font-size: 10pt;font-weight:normal"><p style="font-weight:bold;">Conclusion'+ ': '+ '</p> '+ str(article.conclusion) +'</span>'))
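
The repeated `display(Markdown(...))` calls could be collapsed into one string per article. The following is a minimal sketch under stated assumptions: the helper name `format_article_markdown`, the `fields` dictionary, and the sample values are hypothetical and stand in for the attributes the notebook reads from each article. It also escapes the text so that characters such as `<` in a title are shown literally rather than parsed as HTML.

```python
from html import escape

def format_article_markdown(sub_task_number, label, score, fields):
    """Build one HTML/Markdown string for a single article.

    `fields` is a hypothetical dict holding the values that the notebook
    reads from each article (paper_id, title, abstract, conclusion).
    """
    rows = [
        ("Sub Task " + str(sub_task_number), label),
        ("Weightage", f"{score:.2f}"),
        ("Paper Id", fields["paper_id"]),
        ("Title", fields["title"]),
        ("Abstract", fields["abstract"]),
        ("Conclusion", fields["conclusion"]),
    ]
    # Escape every value so markup characters in the text are displayed as-is.
    return "<br>".join(
        f"<b>{name}:</b> {escape(str(value))}" for name, value in rows
    )

# Placeholder values for illustration only.
example = format_article_markdown(
    sub_task_number=1,
    label="smoking",
    score=0.91234,
    fields={
        "paper_id": "abc123",
        "title": "Risk in patients aged <18",
        "abstract": "Placeholder abstract.",
        "conclusion": "Placeholder conclusion.",
    },
)
print(example)
```

In the notebook, the returned string could be passed to `display(Markdown(...))` once per article instead of issuing six separate calls.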