# <center> CORD-19 Challenge
# <center> TASK 9
    
![Task%209%20banner.JPG](attachment:Task%209%20banner.JPG)
    
## The Critical Challenge
The 2019 novel coronavirus (COVID-19) has caused a public health crisis across the U.S. and around the world. The number of related research publications is increasing at a rapid pace and consequently, the best practices on how to address the COVID-19 crisis are quickly evolving. 
How do medical communities, researchers, and health care policy makers quickly and easily find the most current and accurate research related to intersectoral collaboration and information sharing regarding COVID-19, so that they can focus their valuable time on developing protocols, policies, and vaccines to address the crisis?

## Summary of our Solution
This notebook provides an AI-powered literature review of CORD-19 (COVID-19 Open Research Dataset) focused on the question of “what has been published about information sharing and intersectoral collaboration?” We have developed an Intelligent Publication Retrieval Engine (IPRE) using NLP text and data mining methods to generate four interactive ways for users to quickly and easily locate the most relevant articles that address intersectoral collaboration and information regarding COVID-19. The results are dynamic to provide the most current publications from the current contents of CORD-19 with regard to these sub tasks:
    
* Methods for coordinating data-gathering with standardized nomenclature.
* Sharing response information among planners, providers, and others.
* Understanding and mitigating barriers to information-sharing.
* How to recruit, support, and coordinate local (non-Federal) expertise and capacity relevant to public health emergency response (public, private, commercial and non-profit, including academic).
* Integration of federal/state/local public health surveillance systems.
* Value of investments in baseline public health response infrastructure preparedness
* Modes of communicating with target high-risk populations (elderly, health care workers).
* Risk communication and guidelines that are easy to understand and follow (include targeting at risk populations’ families too).
* Communication that indicates potential risk of disease to all population groups.
* Misunderstanding around containment and mitigation.
* Action plan to mitigate gaps and problems of inequity in the Nation’s public health capability, capacity, and funding to ensure all citizens in need are supported and can access information, surveillance, and treatment.
* Measures to reach marginalized and disadvantaged populations.
* Data systems and research priorities and agendas incorporate attention to the needs and circumstances of disadvantaged populations and underrepresented minorities.
* Mitigating threats to incarcerated people from COVID-19, assuring access to information, prevention, diagnosis, and treatment.
* Understanding coverage policies (barriers and opportunities) related to testing, treatment, and care

## Summary of Key Insights from Our Solution

As part of our validation process, each team member reviewed the responses to each of the subtasks. Here are some of the insights we gleaned from our validation of the most relevant articles on intersectoral collaboration and information. These are key sentences from the highest ranking articles resulting from our Intelligent Publication Retrieval Engine (IPRE):
    
_WeChat Chinese social network platform is used by IMWs in Hong Kong and Macau for sharing key health messages and official information to the community and providing one another with emotional support_
     
_However the power of community might prove to be crucial during this epidemic. Family elders and religious leaders have major role as health-care providers for both adult and adolescent members of the African community.
African migrants residing in Wuhan Hubei province-the epicenter of the outbreak-might be more worried than ever. Community unity is the primary strategy in coping with barriers to health care-eg the Ghanaian community in Guangzhou Guangdong province made monetary donations and arranged health care for their community.
African community organisations also compile and manage information about visits of health-care providers from Africa for their members and encourage these visiting specialists to informally consult on voluntary basis_

## Notebook Contents
The remainder of this notebook includes:
* **Our Overall Approach**
* **Pros and Cons of Our Approach**
* **Our Solution: Code and Results**
* **Acknowledgement and Licenses**

## Our Overall Approach
Because the volume of text available in the COVID-19 dataset is so large and spans such a wide scope of topics, we used a multi-step, iterative approach to find the most relevant articles and make them available in an easy way to quickly bring current and critical information to serve the needs of our healthcare and research communities.
The picture below shows the overall process we used to develop our solution. A detailed description of each step follows the picture.

![Overall%20Process%20Updated.jpg](attachment:Overall%20Process%20Updated.jpg)

### Summary of Approach
1. We started by preparing the COVID-19 dataset for efficient processing and analysis. This included consolidating data sources, keeping the most relevant columns of data, and then cleansing the data for easier analysis and input to the data models.
2. Next, we minimized the number of articles to analyze by eliminating those that were not relevant to the subtasks on which we were focused. 
3. From there, we identified the dominant topics across the subtasks and used them to identify the most relevant articles for each sub task.
4. Lastly, we identified the top three sentences across the top ranked articles to provide an important and quick glimpse into the relevancy of the articles relative to the sub tasks.
5. We display the responses to the sub tasks in four interactive ways. Two are shown here:

    **Scatter Plot** -- A scatter plot is an interactive way to find top sentences from relevant articles that are responses to the subtasks.
    We show the results in three categories 'High', 'Medium' and 'Low' based on their scores. We then compare the resulting text between the 'High' category of articles to the rest. 
   * The ScatterTextPlot shows the frequency of words between these two categories within four quadrants on the plot. 
   * The upper left corner holds words that were found frequently in 'High' scoring articles but not in the others, i.e. 'Medium' and 'Low'. 
   * The lower right corner holds words that occur frequently in lower scoring articles. 
   * The bottom left corner holds words that are infrequent in both sets of articles.
   * You can also use the Scatter Plot to get responses based on key word searches. You can search for any word or simply click a word on the plot and immediately see the exact sentences where those words are part of the articles. For example, in the Search bar, you could search for the text 'Information' and it would auto complete with any further words it found in the articles, like ' Information sharing'. 
   * The search results will show all sentences where 'Information sharing' is mentioned. 
     Here is an actual result related to "Geographic Location" in the context of the overall sub task about information sharing:
                                          
![Scatter2_1.gif](attachment:Scatter2_1.gif)
   
    
    
*    **t-SNE Plot** (T-distributed Stochastic Neighbor Embedding) – The dominant topics related to the subtasks are clustered together. You can click on a topic to see related articles. If you find one article to be relevant to a topic, you can then choose to see other articles that address that topic. 
        Here is an actual result that shows a cluster of articles on the same search text. Hovering above any of the dots displays the information about that article.

        
![TSNE1.JPG](attachment:TSNE1.JPG)
        
    
    

### 1. Prepare the Dataset for Processing
As with any data analysis project, the first step is to clean the data: we removed unnecessary words and punctuation and dropped any data not needed for the final solution, which makes processing and model input more efficient and accurate. This included:
* **Reducing the size of source datasets** to include only the data needed for final output
     * Integrating directly with the Kaggle APIs to download the latest available dataset and parsing it into a single csv dataset that includes everything we need to make the most relevant articles available for each sub task.
     * Bringing together all of the JSON files, which contain the articles, with the metadata from the downloaded datasets and converting them to one csv file.
     * Opening each JSON file (containing one article) and extracting the valuable text from it so we have all of the data in one place, including:
        *   Dropping duplicate rows based on title, author, etc. 
        *   Dropping duplicate columns – keeping 22 columns containing the most needed data for the final solution to save on processing time 
        *   Merging all files with meta data (publish date, author, etc.) based on selected unique fields 
* **Eliminating unneeded data from dataset**
     *   Removing punctuation, digits, whitespace, stop words (library of words in English that are common words and not relevant to the article) 
     *   Making all words lowercase for exact matching of words
     *   Tokenizing words so that each word can be used as an independent entity; each word is a token and additional preprocessing can be done; for example, bigrams and trigrams to breakdown words to be passed into subsequent algorithms

### 2.	Train the Model
Our next step was to find the articles most relevant to the subtasks within Task 9. To do this, we used BM25 (BM stands for Best Match; 25 refers to the 25th iteration of the computations used within it). BM25 processes all of the text in all of the articles in the source dataset and creates an index of keywords and related articles. This is how BM25 is trained on the COVID-19 data. BM25 then ranks the articles for a set of keywords based on the occurrences of those keywords in each article, which gives us a ranked index.
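
To make the ranking idea concrete, here is a small pure-Python sketch of Okapi BM25 scoring. It is an illustration under stated assumptions, not the `rank_bm25` implementation the notebook uses: the function name `bm25_scores`, the toy documents, the default `k1` and `b` values, and the smoothed IDF variant are all choices made for this example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with an Okapi BM25 variant."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps the weight non-negative
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation (k1) with document-length normalization (b)
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

toy_docs = [
    "information sharing between public health agencies".split(),
    "vaccine trial results in mice".split(),
    "barriers to information sharing during outbreaks".split(),
]
toy_scores = bm25_scores("information sharing".split(), toy_docs)
print(toy_scores)
```

In this formula `k1` controls how quickly repeated occurrences of a term stop adding to the score, and `b` controls how strongly long documents are penalized; a `b` close to 0 switches length normalization almost completely off.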

### 3.	Topic Modeling
With a ranked index of articles based on keywords, the next step is to match the most relevant articles to the subtasks. The relevant articles are passed to the LDA Topic Modeling algorithm (latent Dirichlet allocation model). LDA identifies the dominant topics found in those articles. A coherency score is calculated to indicate the degree to which the topic is related to the article.  The higher the score, the greater the relevance. This output is used to create the cluster of articles by dominant topics shown in the t-SNE plot as one of the four responses to the subtasks. 

### 4. Sentence Smoothing
To provide deeper insight into the top ranked articles for each subtask, we used a trained language model, BERT (Bidirectional Encoder Representations from Transformers), to provide the top three sentences from among the short list of most relevant articles to give the user more specific information about the contents of the most relevant articles for each subtask.
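
One common way to pick top sentences with sentence embeddings is to rank them by cosine similarity to the query embedding. The sketch below assumes that approach and uses hand-made two-dimensional vectors as stand-ins for real BERT embeddings; `cosine_similarity`, `top_sentences`, and the toy data are hypothetical names introduced for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_sentences(query_vec, sentence_vecs, sentences, n=3):
    """Return the n sentences whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), sent)
              for vec, sent in zip(sentence_vecs, sentences)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:n]]

# Toy stand-ins: real embeddings would come from a sentence encoder
toy_sentences = ["s0", "s1", "s2", "s3"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
best = top_sentences([1.0, 0.0], toy_vectors, toy_sentences)
print(best)
```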

# Pros and Cons of the Approach
**Pros**
* Our approach provides visual and interactive ways to quickly see key information about relevant publications and articles for multiple subtasks and provides the opportunity to go directly to the source, when the user wants to see more detailed information.
* The use of t-SNE plots and Wall charts provides two different ways to quickly see clusters of keywords and their relationships among other high value and low frequency topics, again with the opportunity to go directly to the source for more detail.
* Providing the top three sentences of the top ranked articles for each sub task gives a quick glimpse to see if the publication is relevant to the user's need.

**Cons**
* Our approach uses a fixed set of keywords to identify key insights from the articles in relation to a sub task. This method could possibly exclude other relevant insights.
* We chose to start our approach by using the TF-IDF word frequency model to create clusters of articles to find those related to our task 9 subtasks so we could then focus on that cluster of articles.
* We quickly saw that no clear clusters were formed. With 35,000 plus articles, there were not enough relationships among the keywords from the articles to form clear clusters.
* Our new focus was on how to filter or shrink the number of articles to work with to just those relevant to our subtasks. 
* With a reduced number of articles to work with, it became easier to focus on getting unique topics and sentences to provide relevant responses to the subtasks. However, this could also exclude relevant insights from articles filtered out.


# Packages used within the scope of this analysis

In [None]:
# Packages used within the scope of this analysis
!pip install rank_bm25
!pip install pyLDAvis
!pip install nltk 
!pip install scispacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_md-0.2.4.tar.gz
!pip install spacy

!pip install wordcloud
!pip install kneed
!pip install torch
!pip install sentence_transformers
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
!pip install scattertext
!pip install gensim
!pip install yattag
!pip install bokeh
!pip install ipywidgets  # provides interact, interactive, fixed, interact_manual imported below

In [None]:
# Modules that need to be imported for this analysis

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download("punkt")
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import string
import re
import numpy as np

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary
#Create Biagram & Trigram Models 
from gensim.models import Phrases

# spacy for lemmatization
import spacy

#sklearn
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#bm25 imports
from rank_bm25 import BM25Okapi

#nltk (the nltk package itself is imported at the top of this cell)
from nltk.corpus import stopwords, wordnet
from textblob import Word
from nltk.tokenize import word_tokenize, sent_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from wordcloud import WordCloud, STOPWORDS

from difflib import SequenceMatcher , get_close_matches, Differ
from sentence_transformers import SentenceTransformer
import scipy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

from kneed import KneeLocator
import matplotlib.colors as mcolors
import seaborn as sns

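# pandas.Panel no longer exists in pandas 1.0 and later, so tolerate its absence
try:
    from pandas import Panel
except ImportError:
    Panel = None
from tqdm import tqdm_notebook as tqdm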

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

from pprint import pprint

import en_core_sci_md
import scattertext as st
import en_core_web_sm
from IPython.display import IFrame
# from kaggle.api.kaggle_api_extended import KaggleApi
import glob
import json

from IPython.display import HTML
from yattag import Doc, indent

from sklearn.manifold import TSNE
from gensim import corpora, models

# Bokeh (interactive plots)
from bokeh.io import output_file, output_notebook, show
from bokeh.layouts import column, gridplot, widgetbox
from bokeh.models import (ColumnDataSource, CustomJS, Div, HoverTool, LinearColorMapper,
                          Paragraph, RadioButtonGroup, Slider, TextInput)
from bokeh.palettes import Category20, all_palettes
from bokeh.plotting import figure
from bokeh.transform import linear_cmap, transform

from ipywidgets import interact, interactive, fixed, interact_manual
import gc
import os
import pickle

# Step 1: Collect COVID-19 articles from Kaggle

The latest allen-institute-for-ai/CORD-19-research-challenge dataset from Kaggle is pulled using the Kaggle API. 
To perform this, a Kaggle account is required.
The process involves building a data extraction pipeline, where data is pulled from four source directories and a metadata file.
Here are the files:
- BIORXIV_MEDRXIV
- COMMON_USE_SUB
- NON_COMMON_USE_SUB
- CUSTOM_LICENSE
- metadata.csv

In [None]:
# The files pulled from different directories 
# ## Metadatafile
meta_file = '/kaggle/input/CORD-19-research-challenge/metadata.csv'
meta_df = pd.read_csv(meta_file)

# ## 4 json files
bio_path =  '/kaggle/input/CORD-19-research-challenge/biorxiv_medrxiv/'
comm_path = '/kaggle/input/CORD-19-research-challenge/comm_use_subset/'
non_comm_path = '/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/'
custom_path = '/kaggle/input/CORD-19-research-challenge/custom_license/'

In [None]:
# Categories of journals
journals = {"BIORXIV_MEDRXIV": bio_path,
              "COMMON_USE_SUB" : comm_path,
              "NON_COMMON_USE_SUB" : non_comm_path,
              "CUSTOM_LICENSE" : custom_path}

# Step 2: Extract text from each article

We will use the text in the abstract and body of each research article to perform our analysis.

## Steps performed on each article.
- The extracted text is appended as a column to the original dataset that is loaded using standard pandas functions.
- The updated dataset is then saved as a csv file.
- Duplicate articles are removed, if the Document ID and title match between two or more articles.
- Valuable information from the meta data file is added to the dataframe eg. Publish date of the article.
- Duplicate columns between the JSON files and meta data are dropped and column headers renamed.
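
As a rough illustration of the extraction described above, the sketch below parses one toy record that mimics the fields the notebook reads from each CORD-19 JSON file (`paper_id`, `metadata.title`, `abstract[].text`, `body_text[].text`). The record contents and the helper name `extract_text_fields` are made up for this example.

```python
import json

# Toy record with the same field layout the notebook reads; the values are invented
toy_json = json.dumps({
    "paper_id": "abc123",
    "metadata": {"title": "A toy article"},
    "abstract": [{"text": "First abstract paragraph."},
                 {"text": "Second abstract paragraph."}],
    "body_text": [{"text": "Intro paragraph."},
                  {"text": "Methods paragraph."}],
})

def extract_text_fields(raw_json):
    """Flatten one article into a dict with one string per text field."""
    article = json.loads(raw_json)
    return {
        "document_id": article["paper_id"],
        "title": article["metadata"]["title"],
        # Join all paragraphs of a section into a single string
        "abstract": "\n ".join(p["text"] for p in article.get("abstract", [])),
        "full_text": "\n ".join(p["text"] for p in article.get("body_text", [])),
    }

record = extract_text_fields(toy_json)
print(record["title"], "|", record["full_text"])
```

A list of such dicts converts directly into a dataframe with one row per article.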

In [None]:
# Function to parse one json file and collect its title, abstract and body text ('full_text')

def parse_each_json_file(file_path, journal):
    with open(file_path) as f:
        inp = json.load(f)
    rec = {}
    rec['document_id'] = inp['paper_id'] or None
    rec['title'] = inp['metadata']['title'] or None
    # Join every abstract paragraph (a missing or empty abstract becomes None)
    abstract = "\n ".join(para['text'] for para in inp.get('abstract') or [])
    rec['abstract'] = abstract or None
    # Join every body paragraph that carries a 'text' field
    full_text = [para['text'] for para in inp['body_text'] if 'text' in para]
    rec['full_text'] = "\n ".join(full_text) or None
    rec['source'] = journal or None
    return rec

In [None]:
# Function to merge extracted data from json files into a pandas dataframe 

def parse_json_and_create_csv(journals):
    journal_dfs = []
    cnt = 0
    for journal, path in journals.items():
        print(journal,path)
        parsed_rcds = []  
        json_files = glob.glob('{}/**/*.json'.format(path), recursive=True)
        for file_name in json_files:
            cnt = cnt + 1
            #print('processing {} file {}'.format(cnt,file_name))
            rec = parse_each_json_file(file_name,journal)
            parsed_rcds.append(rec)
        print("Total Records in list = {}".format(len(parsed_rcds)))
        df = pd.DataFrame(parsed_rcds)
        journal_dfs.append(df)
        #print(journal_dfs)
    return pd.concat(journal_dfs)

In [None]:
# Create a csv file with extracted data from json files

all_df = parse_json_and_create_csv(journals=journals)

# Save the dataframe into a csv:
#all_df.to_csv("covid19_latest.csv",index=False)

In [None]:
# Drop Duplicates for the df from 4 json files based on document id and title keys and from metadata file
all_df = all_df.drop_duplicates(subset=['document_id', 'title'])

# Drop Duplicates for the meta data df based on sha key
meta_df = meta_df.drop_duplicates(subset=['sha'])

In [None]:
# Display dimensions of text and metadata dataframes
print('all_df:',all_df.shape)
print('meta_df:',meta_df.shape)

In [None]:
# Merge the metadata columns into all_df (left join on document_id = sha)

covid_df = pd.merge(left=all_df, right=meta_df, how='left', left_on='document_id', right_on='sha')
covid_df['publish_time'] = covid_df['publish_time'].fillna('1900')
covid_df['publish_time'] = covid_df['publish_time'].str[:4]
covid_df['publish_time'] = covid_df['publish_time'].astype(int)
covid_df.fillna("",inplace=True)

# Optionally save the final merged dataframe to csv (covid19_final.csv)
#covid_df.to_csv("covid19_final.csv",index=False)

In [None]:
%%time
# Duplicate columns between the json files and meta data are dropped and column headers renamed

covid_df=covid_df.drop(columns=[ 'title_y', 'abstract_y','source_x','sha', 'Microsoft Academic Paper ID','license', 'WHO #Covidence', 'has_pdf_parse', 'has_pmc_xml_parse', 'full_text_file'])
covid_df=covid_df.rename(columns={"title_x":'title',"abstract_x":"abstract",'full_text':'body','document_id':'paper_id','source':'dataset'})

print('Dataset shape before BM25 scoring:', covid_df.shape)

In [None]:
print(covid_df.columns)
print('\n covid_df Shape:', covid_df.shape)

covid_df = covid_df[covid_df['body'].str.lower().str.contains('corona|sars|ncov|covid|ncovid|novel')]
print('\n covid_df after filtering Shape:', covid_df.shape)

covid_df= covid_df.drop_duplicates(subset=['title'])
print('\n covid_df after Title duplicate drop Shape:', covid_df.shape)

covid_df = covid_df.drop(['dataset', 'cord_uid', 'doi','pmcid', 'pubmed_id','journal'], axis = 1)
print('\n covid_df columns after drop:', covid_df.columns)

covid_df = covid_df.reset_index(drop=True)

In [None]:
del [[meta_df,all_df]]
gc.collect()

# Step 3: Pre-process the data
Before we start our analysis, we prepare the text that feeds the `all_text` column (article title plus body) so that it is better suited for text analysis. 

## Data cleansing steps
- Remove digits and punctuation
- Remove common English stop words, e.g. the, at, on
- Remove words that are 3 characters or less in length
- Replace white space characters
- Replace special characters, e.g. |, :, >
- Make all text lower case
- Tokenize each sentence into a list of words.
- Capture bi-grams and tri-grams:
    Bigrams are two words that frequently occur together in a document. Trigrams are three words that frequently occur together. 
    Generic examples of such tokens are: ‘back_bumper’, ‘oil_leakage’, ‘maryland_college_park’.
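
The idea of joining frequent word pairs can be sketched without any library. The example below is a deliberately simplified stand-in for gensim's `Phrases`: it joins a pair whenever its raw count reaches a threshold, whereas `Phrases` also scores each pair against the counts of its individual words. The function name `join_frequent_bigrams` and the toy documents are hypothetical.

```python
from collections import Counter

def join_frequent_bigrams(docs, min_count=2):
    """Replace adjacent word pairs seen at least min_count times with 'word1_word2'."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        # Greedy left-to-right pass: a word joins at most one bigram
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

toy_docs = [
    ["public", "health", "response"],
    ["public", "health", "surveillance"],
    ["health", "surveillance", "data"],
]
merged = join_frequent_bigrams(toy_docs)
print(merged)
```

Running the same pass a second time over the merged output is one way to obtain trigram-like tokens.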

In [None]:
# Pre-processing functions for cleaning text 

exclude_list = string.digits + string.punctuation
table = str.maketrans(exclude_list, len(exclude_list)*" ")
stop = stopwords.words('english')
english_stopwords = list(set(stop))
SEARCH_DISPLAY_COLUMNS = ['paper_id', 'title', 'body', 'publish_time', 'url', 'all_text']

nlp_x = en_core_web_sm.load()   

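def clean_text(txt):
    # Turn literal "\n" sequences and real newlines into spaces so words are not glued together
    t = txt.replace("\\n", ' ').replace('\n', ' ')
    # Strip brackets and special characters but keep periods for sentence splitting
    t = re.sub(r'[()\[\]:,;|’”“?%><]', '', t)
    t = re.sub('/', ' ', t)
    # Drop single-character tokens and collapse repeated whitespace
    t = ' '.join([word for word in t.split() if len(word) > 1])
    # Split the cleaned text into sentences
    return sent_tokenize(t)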

def preprocess_with_ngrams(docs):
    # Add bigrams and trigrams to docs; min_count=5 keeps only bigrams that appear 5 times or more.
    bigram = Phrases(docs, min_count=5)
    trigram = Phrases(bigram[docs])

    for idx in range(len(docs)):
        for token in bigram[docs[idx]]:
            if '_' in token:
                # Token is a bigram, add to document.
                docs[idx].append(token)
        for token in trigram[docs[idx]]:
            if '_' in token:
                # Token is a trigram, add to document.
                docs[idx].append(token)
    return docs

class SearchResults:
    
    def __init__(self, 
                 data: pd.DataFrame,
                 columns = None):
        self.results = data
        if columns:
            self.results = self.results[columns]
            
    def __getitem__(self, item):
        return Paper(self.results.loc[item])
    
    def __len__(self):
        return len(self.results)
        
    def _repr_html_(self):
        return self.results._repr_html_()
    
    def getDf(self):        
        return self.results 
    
def strip_characters(text):
    t = re.sub(r'\(|\)|:|,|;|\.|’|”|“|\?|%|>|<', '', text)
    t = re.sub('/', ' ', t)
    t = t.replace("'",'')
    return t

def clean(text):
    t = text.lower()
    t = strip_characters(t)
    t = str(t).translate(table)
    return t

def tokenize(text):
    words = nltk.word_tokenize(text)
    return list(set([word for word in words 
                     if len(word) > 1
                     and not word in english_stopwords
                     and not (word.isnumeric() and len(word) != 4)
                     and (not word.isnumeric() or word.isalpha())] )
               )

def preprocess(text):
    t = clean(text)    
    tokens = tokenize(t)
    
    return tokens



# Step 4: Identify relevant articles based on the BM25 algorithm

![BM25_1.JPG](attachment:BM25_1.JPG)

In [None]:
# Functions defining the BM25 Algorithm

class WordTokenIndex:
    
    def __init__(self, 
                 corpus: pd.DataFrame, 
                 columns=SEARCH_DISPLAY_COLUMNS):
        self.corpus = corpus
        raw_search_str = self.corpus.title.fillna('') + ' ' + self.corpus.body.fillna('')
        # Preprocess once and reuse the tokens for both the 'all_text' column and the search index
        terms = raw_search_str.apply(preprocess)
        self.corpus['all_text'] = terms
        self.index = terms.to_frame(name='terms')
        self.index.index = self.corpus.index
        self.columns = columns
       
    def search(self, search_string):
        search_terms = preprocess(search_string)
        result_index = self.index.terms.apply(lambda terms: any(i in terms for i in search_terms))
        results = self.corpus[result_index].copy().reset_index().rename(columns={'index':'paper'})
        return SearchResults(results, self.columns + ['paper'])
    
class RankBM25Index(WordTokenIndex):
    
    def __init__(self, corpus: pd.DataFrame, columns=SEARCH_DISPLAY_COLUMNS):
        super().__init__(corpus, columns)
        #self.bm25 = BM25Okapi(self.index.terms.tolist())
        self.bm25 = BM25Okapi(self.index.terms.tolist(),k1=3,b=0.001)
        
    def search(self, search_string, n=4):
        search_terms = preprocess(search_string)
        doc_scores = self.bm25.get_scores(search_terms)
        ind = np.argsort(doc_scores)[::-1][:n]
        results = self.corpus.iloc[ind][self.columns].copy()  # copy so the score column is not written into the corpus
        results['BM25_Score'] = doc_scores[ind]
        results = results[results.BM25_Score > 0]
        return SearchResults(results.reset_index(), self.columns + ['BM25_Score'])
    
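def show_task(taskTemp,taskId):
    # taskTemp holds the keyword string for a sub task; taskId is currently unused
    keywords = taskTemp
    print(keywords)
    # Retrieve up to 200 of the highest-scoring articles from the global bm25_index
    search_results = bm25_index.search(keywords, n=200)
    return search_results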

In this step, covid_task9_bm25index.pkl, a pickle file that contains the BM25 index, is loaded.
It holds the BM25 index created on April 15th, so you can use the solution without rebuilding the index from the latest COVID-19 dataset.
To analyze the latest dataset instead, set rebuild_index = True.
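
The load-or-rebuild pattern described above can be sketched as a small helper. This is an illustrative variant, not the cell that follows: `load_or_build` and its parameters are hypothetical names, and unlike the notebook it keeps a single cache path and rebuilds automatically when the file is missing.

```python
import os
import pickle

def load_or_build(cache_path, build_fn, rebuild=False):
    """Return the cached object if present, otherwise build it and cache it."""
    if not rebuild and os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    obj = build_fn()  # the expensive step, e.g. indexing the whole corpus
    with open(cache_path, "wb") as fh:
        pickle.dump(obj, fh)
    return obj

# Hypothetical usage with a cheap stand-in for the expensive build step:
# index = load_or_build("toy_index.pkl", lambda: {"built": True})
```

Only unpickle files you created yourself or otherwise trust, because loading a pickle can execute arbitrary code.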

In [None]:
%%time

rebuild_index = False

#BM25 algorithm getting trained on the text from the dataset

bm25index_file_create = '/kaggle/working/covid_task9_bm25index.pkl'
bm25index_file_load = '/kaggle/input/cord-19-bm25index/covid_task9_bm25index.pkl'
if rebuild_index:
        print("Running the BM25 index...")
        bm25_index = RankBM25Index(covid_df)
        print("Creating pickle file for the bm25 index...")
        with open(bm25index_file_create, 'wb') as file:
            pickle.dump(bm25_index, file)
        with open(bm25index_file_create, 'rb') as corpus_pt:
            bm25_index = pickle.load(corpus_pt)
        print("Completed load of the BM25 index from", bm25index_file_create, '...')
else:
    with open(bm25index_file_load, 'rb') as corpus_pt:
        bm25_index = pickle.load(corpus_pt)
    print("Completed load of the BM25 index from", bm25index_file_load, '...')


print("Shape of BM25: ", bm25_index.corpus.shape)

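In [None]:
%%time
# Build the BM25 index directly from the current covid_df (this replaces any index loaded above)
bm25_index = RankBM25Index(covid_df)
print("Shape of BM25: ", bm25_index.corpus.shape)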

# Step 5: Identify articles based on dominant topics
Now that we have retrieved a smaller set of articles related to our sub task, we further analyze the text in these articles to find dominant topics.
We use LDA as a topic modeling technique to identify topics from each article and then measure the dominant effect of the topic using coherence measures. 
This enables the ranking of articles based on the highly dominant topics. The ones that score high are then extracted for the next stage of analysis.

## LDA technique for topic modelling

We use LDA to obtain a list of topics for each article. 
LDA is an unsupervised technique, meaning that we don’t know prior to running the model how many topics exist in our corpus.

![LDA.JPG](attachment:LDA.JPG)

## Coherence score
Topic coherence helps us estimate a suitable number of topics. 
Coherence measures score a single topic by the degree of semantic similarity between its high-scoring words. 
These scores help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference.  

We use the c_v measure to score the coherence of our LDA model. 
The c_v measure is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity.
c_v values typically fall between 0 and 1.
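
To make the NPMI building block more concrete, here is a minimal sketch that computes NPMI for a word pair. It is a simplification and not the c_v implementation: it counts co-occurrence per document instead of per sliding window, and it skips the cosine similarity step. The `npmi` helper and the toy documents are assumptions made for illustration.

```python
import math

def npmi(word_a, word_b, documents):
    """Normalized pointwise mutual information of two words.

    Probabilities are estimated from boolean document co-occurrence,
    which is a simplification of the sliding-window estimate used by c_v.
    """
    n_docs = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n_docs
    p_b = sum(word_b in doc for doc in documents) / n_docs
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n_docs
    if p_ab == 0:
        return -1.0  # the words never appear together
    if p_ab == 1:
        return 1.0   # both words are in every document; avoids dividing by log(1) = 0
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)

# Toy corpus (hypothetical): each document is a set of tokens
toy_docs = [
    {"public", "health", "surveillance"},
    {"public", "health", "funding"},
    {"prison", "inmate", "health"},
    {"prison", "inmate", "custody"},
]

score_related = npmi("public", "health", toy_docs)    # words that co-occur
score_unrelated = npmi("public", "prison", toy_docs)  # words that never co-occur
```

Word pairs that tend to appear together score above 0, and pairs that never appear together score -1, which is why a topic whose top words co-occur often receives a higher coherence.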


## Rank articles with dominant topics
The last step is to find the optimal number of topics. We run the LDA model with different values of the number of topics (k) and pick the one that gives the highest coherence value. Choosing a ‘k’ that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics. If you see the same keywords being repeated in multiple topics, it’s probably a sign that the ‘k’ is too large.

![Coherence_score.JPG](attachment:Coherence_score.JPG)
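
The selection rule described above can be sketched in a few lines. This is only an illustration with made-up coherence values; `pick_num_topics` is a hypothetical helper, not one of the functions defined in this notebook.

```python
def pick_num_topics(topic_counts, coherence_values):
    """Return the (k, coherence) pair with the highest coherence.

    On ties, max() keeps the first pair, i.e. the smallest k.
    """
    return max(zip(topic_counts, coherence_values), key=lambda pair: pair[1])

# Candidate values of k and made-up coherence scores, one per trained model
topic_counts = list(range(2, 12, 2))          # k = 2, 4, 6, 8, 10
coherence = [0.31, 0.42, 0.47, 0.46, 0.44]    # illustrative numbers only

best_k, best_score = pick_num_topics(topic_counts, coherence)
print("Best k:", best_k, "with coherence", best_score)
```

In this toy series the coherence grows quickly up to k = 6 and then flattens, so k = 6 is selected.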

In [None]:
# LDA & Coherence functions

# Find the optimal topic model
def findOptimalTopicModel(start_num_topics, noOfMaxTopics, step_num_topics,model_list, coherence_values): 
    temp_coherenace = 0
    best_model_idx = 0
    idx = 0
    number_of_topics = 0
    x = range(start_num_topics, noOfMaxTopics, step_num_topics)
    for m, cv in zip(x, coherence_values):
        #print("Num Topics =", m, " has Coherence Value of", round(cv, 4))    
        if(temp_coherenace<cv):
            temp_coherenace = cv
            best_model_idx = idx
            number_of_topics = m
        idx += 1
 
    # Select the model and print the topics
    optimal_model = model_list[best_model_idx]
    model_topics = optimal_model.show_topics(num_topics=number_of_topics,formatted=False)
    return (optimal_model, model_topics, number_of_topics, temp_coherenace)

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=2):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model=LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())        
    return model_list, coherence_values

def format_topics_sentences(df,ldamodel=None, corpus=None, texts=None):
    # Init output
    sent_topics_df = pd.DataFrame()
    
    # Get main topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list            
        
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic                
                wp = ldamodel.show_topic(topic_num)
                
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic_num', 'Topic_Perc_Contrib', 'Topic_Keywords']

    # Add original text to the end of the output
    #contents = pd.Series(texts)
    sent_topics_df = pd.concat([df,sent_topics_df], axis=1)
    return(sent_topics_df)

# Step 6: Find top N sentences from relevant articles

We use a BERT Sentence Transformer model, fine-tuned on Natural Language Inference (NLI) data, to extract the top N most relevant sentences. 

We feed BERT with the sub task text and the articles filtered from the previous step, to find relevant responses.

The result is a ranked set of articles with the top scores and the top sentences found within the article relevant to the sub task.

![BERT3.JPG](attachment:BERT3.JPG)



## Steps in our approach
- Pass the ALL_TEXT data from each of the filtered articles 
- Get contextualized embeddings from a pretrained BERT model that was fine-tuned on Natural Language Inference (NLI) data 
- Compute the same contextualized embedding for the text of the sub task
- Compute the cosine similarity between the ALL_TEXT embeddings and the sub task embedding to get the most similar sentences, along with the articles they come from
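
The ranking step in the list above can be sketched as follows. To keep the example self-contained, the embeddings are small hand-written vectors (an assumption for illustration); in the notebook they come from the Sentence Transformer model. The helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_sentences(query_vec, sentence_vecs, sentences, n):
    """Rank sentences by similarity to the query and keep the best n."""
    scored = [(cosine_similarity(query_vec, vec), sentence)
              for vec, sentence in zip(sentence_vecs, sentences)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]

# Toy sentences with made-up 3-dimensional "embeddings"
sentences = [
    "Agencies shared case data daily.",
    "The virus has a spike protein.",
    "Hospitals reported to the state.",
]
sentence_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.7, 0.3, 0.1]]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded sub task text

ranked = top_n_sentences(query_vec, sentence_vecs, sentences, n=2)
for score, sentence in ranked:
    print(round(score, 4), sentence)
```

The notebook code works with cosine distance instead, so it sorts in ascending order and converts back with 1 - distance when scoring.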


In [None]:
%%time
# BERT Sentence Transformer model based on bert-base-nli-mean-tokens

model = SentenceTransformer('bert-base-nli-mean-tokens')
embedder = SentenceTransformer('bert-base-nli-mean-tokens')
top_N_sentence = 20


In [None]:
# BERT Functions for sentence embeddings

def sentenceEmbed(corpus_sentence, tcount):
    task = [task9_keywords[tcount]]
    corpus_sentence_embeddings = model.encode(corpus_sentence)
    query_embeddings = model.encode(task)

    for query, query_embedding in zip(task, query_embeddings):
        distances = scipy.spatial.distance.cdist([query_embedding], corpus_sentence_embeddings, "cosine")[0]

        results = zip(range(len(distances)), distances)
        results = sorted(results, key=lambda x: x[1])

        #print("\n\n======================\n\n")
        #print("Query:", task)
        #print("\nTop 5 most similar sentences in corpus:")
        all_sentence =''
        top_3_sentence=''
        score = 0
        top3 = 1
        #print(results[0:top_N_sentence])
        #print("\n\n======================\n\n")
        for idx, distance in results[0:top_N_sentence]:
            #print("\n\n",corpus_sentence[idx].strip(), "(Score: %.4f)" % (1-distance),"***END***")
            if(top3<4):                
                top_3_sentence += corpus_sentence[idx].strip().capitalize() +"--- \n\n"
                top3 += 1
            
            all_sentence += corpus_sentence[idx].strip()
            score += (1-distance)   
        score = score/top_N_sentence
        scoreStr =''
        if(score<=0.3):
            scoreStr = 'Low'
        elif(score<0.7):  # covers (0.3, 0.7); the old lower bound of 0.31 sent scores in (0.3, 0.31] to 'High'
            scoreStr = 'Medium'
        else:
            scoreStr = 'High'
    return top_3_sentence,all_sentence,scoreStr;


# Step 7: Optimize results based on Elbow and Knee cut-off methods

We added steps to measure the effectiveness of the results and further trim the results based on certain cut-off criteria. 

These methods are used to trim the resulting articles from BM25 and LDA to further optimize the results.

## BM25 optimization using elbow cut-off

The results from BM25 are further optimized by plotting the BM25 score against each article ranked in descending order. 

![BM25_elbow.JPG](attachment:BM25_elbow.JPG)

The resulting plot has an elbow shape. The elbow joint shows the cut-off where articles having a lower score than the cut-off can be ignored.

We use this method to further trim the results of BM25.
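
As a rough illustration of the idea, the sketch below finds the elbow of a descending score curve by normalizing both axes to [0, 1] and picking the point that lies farthest below the straight line joining the first and last points. This is a simplified stand-in, not the algorithm of the KneeLocator class used in this notebook, and the scores are made up.

```python
def elbow_index(scores):
    """Index of the elbow in a list of scores sorted in descending order.

    Both axes are scaled to [0, 1]; the elbow is the point with the largest
    gap below the chord from the first point (0, 1) to the last point (1, 0).
    """
    n = len(scores)
    score_range = scores[0] - scores[-1]
    if n < 3 or score_range == 0:
        return 0  # too short or completely flat: no elbow to find
    best_idx, best_gap = 0, 0.0
    for i, score in enumerate(scores):
        x_norm = i / (n - 1)
        y_norm = (score - scores[-1]) / score_range
        gap = 1.0 - x_norm - y_norm  # vertical gap between chord and curve
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx

# Made-up BM25 scores, already sorted in descending order
bm25_scores = [42.0, 30.0, 21.0, 9.0, 8.0, 7.5, 7.0, 6.8]

cut = elbow_index(bm25_scores)
kept = bm25_scores[:cut + 1]  # articles at or above the cut-off score
print("Cut-off score:", bm25_scores[cut], "articles kept:", len(kept))
```

Scores after the elbow differ very little from one another, so dropping those articles loses little ranking information.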

## LDA optimization using knee cut-off
The topic modeling scores for each article are plotted in order to determine the optimum number of topics to consider. 

![Dominant_Topic.JPG](attachment:Dominant_Topic.JPG)

The resulting plot has a knee shape, where the relevance of topics drops off after a certain number of topics within an article.

This drop-off point is used to determine the optimum number of topics to consider.

In [None]:
# Function to determine Cut-off point for identifying optimum value

def kneeLocator(x,y):
    kn = KneeLocator(x, y,curve='concave',direction='decreasing',online=True)
    return y[kn.knee-1],kn;

In [None]:
# Plot of the curve that shows optimum cut-off value

def plotKnee(kn,x,y,minval,maxval,x_label,y_label,axes,idx):
    #plt.xlabel(x_label)
    #plt.ylabel(y_label)
    #plt.plot(x, y, 'bx-')
    #plt.vlines(kn.knee, plt.ylim()[0], plt.ylim()[1], linestyles='dashed')
    
    axes[idx].plot(x, y,'bx-')    
    axes[idx].vlines(kn.knee, minval,maxval, linestyles='dashed')
    axes[idx].set(xlabel =x_label, ylabel = y_label, title = 'Knee Scoring')

    #plt.show()
    

In [None]:
# Visualization Functions used to show Topic coherence and word cloud formed from dominant topics

def plotCoherence(start_num_topics, noOfMaxTopics, step_num_topics,coherence_values,axes,idx):
    x = range(start_num_topics, noOfMaxTopics, step_num_topics)
    axes[idx].plot(x, coherence_values)
    #plt.plot(x, coherence_values)
    #plt.xlabel("Num Of Topics")
    #plt.ylabel("Coherence score")
    #plt.legend(("coherence_values"), loc='best')
    axes[idx].set(xlabel ="Num Of Topics", ylabel = "Coherence score", title = 'Coherence Scoring')
    #plt.show()
    #return plt;

def plotWordCloud(number_of_topics,model_topics):
    if(number_of_topics>10):
        number_of_topics = 10
    N_rows = int(number_of_topics/2)
    N_cols = int(2)
    i = 0
    #Word cloud
    cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]  # more colors: 'mcolors.XKCD_COLORS'

    cloud = WordCloud(background_color='white',
                  width=2500,
                  height=1800,
                  max_words=10,
                  colormap='tab10',
                  color_func=lambda *args, **kwargs: cols[i],
                  prefer_horizontal=1.0)

    fig, axes = plt.subplots(N_rows,N_cols, figsize=(10,10), sharex=True, sharey=True)    
    for i, ax in enumerate(axes.flatten()):
        fig.add_subplot(ax)
        topic_words = dict(model_topics[i][1])    
        cloud.generate_from_frequencies(topic_words, max_font_size=300)
        plt.gca().imshow(cloud)
        plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))
        plt.gca().axis('off')

    plt.subplots_adjust(wspace=0, hspace=0)
    plt.axis('off')
    plt.margins(x=0, y=0)
    plt.tight_layout()
    plt.show()
    return plt;

We display the results in multiple ways.

## ScatterTextPlot

**What is a ScatterTextPlot?**

It is an interactive tool that visually represents high-frequency words found in the most relevant articles.

**What does it do?**

We split the results into three categories 'High', 'Medium' and 'Low' based on their scores. We then compare the resulting text between the 'High' category of articles versus the rest.

The ScatterTextPlot shows the frequency of words in these two categories across four quadrants of the plot. The upper left corner holds words that were found frequently in 'High' scoring articles but not so much in the others, i.e. 'Medium' and 'Low'. They are likely to be rare and unique finds of high value text. The reverse holds for the lower right corner: these are words occurring frequently in lower scoring articles. They are also distinctive, but probably not so rare, as they occur in many lower ranked articles.

The bottom left corner holds words that are infrequent in both sets of articles and may not be worth looking into. We say that, but you never know what you might find, so do check them out. The top right corner holds words found frequently in both sets. As you may imagine, a lot of common English words may show up. If you skip past them, you may find words that are highly relevant and occur frequently in all articles.
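
The quadrant logic can be illustrated with plain word counts. This is a simplified sketch: the actual plot positions words by percentile-scaled frequency rather than a hard threshold, and the `min_count` cut-off, the helper name, and the toy texts are all assumptions for illustration.

```python
from collections import Counter

def quadrant_for_terms(high_texts, other_texts, min_count=2):
    """Assign each word to a plot quadrant based on how often it appears
    in 'High' scoring texts (vertical axis) versus all other texts
    (horizontal axis)."""
    high_counts = Counter(w for text in high_texts for w in text.lower().split())
    other_counts = Counter(w for text in other_texts for w in text.lower().split())
    quadrants = {}
    for term in set(high_counts) | set(other_counts):
        frequent_high = high_counts[term] >= min_count
        frequent_other = other_counts[term] >= min_count
        if frequent_high and frequent_other:
            quadrants[term] = "upper right"   # frequent everywhere
        elif frequent_high:
            quadrants[term] = "upper left"    # characteristic of 'High' articles
        elif frequent_other:
            quadrants[term] = "lower right"   # characteristic of the other articles
        else:
            quadrants[term] = "lower left"    # infrequent in both sets
    return quadrants

# Toy texts standing in for the top sentences of each category
high_texts = ["public health surveillance data", "public health data sharing"]
other_texts = ["virus protein structure data", "virus protein data binding"]

quadrants = quadrant_for_terms(high_texts, other_texts)
print(quadrants["public"], "|", quadrants["data"], "|", quadrants["virus"])
```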

**How do I use the ScatterTextPlot?**

You can search for any word, or simply click a word on the plot, and immediately see the exact sentences in the articles that contain it. For example, in the Search bar, you could search for the text 'Public' and it would autocomplete with any further words it found in the articles, like 'Public health'. The search results will show all sentences where 'Public health' is mentioned.

Similarly, you could find patterns in the words displayed on the plot. The closeness of words is based on how frequently they are mentioned in articles. Simply clicking any of the words will show all sentences found in both categories of articles.

This method can be used to get answers based on key word searches.

## t-SNE Plot
We group together articles that talk about the same topics. These are topics we extracted based on the relevance of the words found in the articles.

The t-SNE plot shows these clusters of articles in different colors. Hovering above any of the circles shows information about that article.

These clusters can be used to find a group of articles that may be related and have relevant information about a given sub task in Task 9.
How far (or close) each cluster is from the other also shows the relative similarity between the different topics.
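
The color of each circle comes from the article's dominant topic, that is, the topic with the highest probability in its LDA topic distribution. A minimal sketch of that grouping step, using made-up topic probabilities and a hypothetical helper name:

```python
def dominant_topics(doc_topic_probs):
    """For each document, return the index of its most probable topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic_probs]

# Made-up topic distributions for three articles over three topics
doc_topic_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
]

clusters = dominant_topics(doc_topic_probs)

# Group article indices by cluster, as the plot does with colors
articles_by_cluster = {}
for article_idx, topic_idx in enumerate(clusters):
    articles_by_cluster.setdefault(topic_idx, []).append(article_idx)
print(articles_by_cluster)
```

The plot positions, by contrast, come from t-SNE applied to the full topic distributions, so two articles with the same dominant topic can still sit some distance apart.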

In [None]:
# Interactive visualizations including Scatterplot and TSNE 

def scatterPlot(df,fileName,topNStr):
    categoryName = ''
   
    if((df[df['SentenceScore'] == 'High']).shape[0] >0):
        categoryName = 'High'
    elif((df[df['SentenceScore'] == 'Medium']).shape[0] >0):
        categoryName = 'Medium'
    else:
        categoryName = 'Low'
    
    if(len(df.SentenceScore.unique()) >1):
        corpusOfNsentence = st.CorpusFromPandas(df,category_col="SentenceScore",text_col=topNStr,nlp=nlp_x).build()
        html = st.produce_scattertext_explorer(corpusOfNsentence,
                                        category=categoryName,
                                        category_name="Research Papers with "+categoryName+" Score",
                                        not_category_name='Others',
                                        width_in_pixels=1000,
                                        minimum_term_frequency=2,
                                        transform=st.Scalers.percentile)

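        # Use a context manager so the file is flushed and closed before the IFrame loads it
        with open(fileName, 'wb') as html_file:
            html_file.write(html.encode('utf-8'))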
        display(IFrame(src=fileName, width = 1800, height=700))
    else:
        print('No more than one category to produce scatter plot')

def tsneplot(df,topNStr):
    print(topNStr)

    tsne = TSNE(verbose=1, perplexity=5)
    np.random.seed(2017)
    texts = df['all_text'].values
    dictionary2 = corpora.Dictionary(texts)
    corpus2 = [dictionary2.doc2bow(text) for text in texts]

    ldamodel2 = models.ldamodel.LdaModel(corpus2, id2word=dictionary2, 
                                    num_topics=15, passes=20, minimum_probability=0)

    hm = np.array([[y for (x,y) in ldamodel2[corpus2[i]]] for i in range(len(corpus2))])
    tsne = TSNE(n_components=2)
    embedding = tsne.fit_transform(hm)
    embedding = pd.DataFrame(embedding, columns=['x','y'])
    embedding['hue'] = hm.argmax(axis=1)
    
    output_notebook()

    source = ColumnDataSource(
            data=dict(
            x = embedding.x,
            y = embedding.y,
            colors = [all_palettes['Category20'][15][i] for i in embedding.hue],
            title = df.title,
            year = df.publish_time,
            Top_Sentences = df[topNStr],
            SubTask = df.SubTask,
            url=df.url,
            alpha = [0.9] * embedding.shape[0],
            size = [14] * embedding.shape[0]
        )
    )
    hover_tsne = HoverTool(names=["final_results"], tooltips="""
        <div style="margin: 10">
            <div style="margin: 0 auto; width:300px;">
                <span style="font-size: 12px; font-weight: bold;">Title:</span>
                <span style="font-size: 12px">@title</span>
                <span style="font-size: 12px; font-weight: bold;">Year:</span>
                <span style="font-size: 12px">@year</span>
                <span style="font-size: 12px; font-weight: bold;">SubTask:</span>
                <span style="font-size: 12px">@SubTask</span>
                <span style="font-size: 12px; font-weight: bold;">URL:</span>
                <span style="font-size: 12px">@url</span>
            </div>
        </div>
        """)
    tools_tsne = [hover_tsne, 'pan', 'wheel_zoom', 'reset']
    plot_tsne = figure(plot_width=700, plot_height=700, tools=tools_tsne, title='Papers')
    plot_tsne.circle('x', 'y', size='size', fill_color='colors', 
                 alpha='alpha', line_alpha=0, line_width=0.01, source=source, name="final_results")


    callback = CustomJS(args=dict(source=source), code=
    """
    var data = source.data;
    var f = cb_obj.value
    x = data['x']
    y = data['y']
    colors = data['colors']
    alpha = data['alpha']
    title = data['title']
    year = data['year']
    size = data['size']
    for (i = 0; i < x.length; i++) {
        if (year[i] <= f) {
            alpha[i] = 0.9
            size[i] = 7
        } else {
            alpha[i] = 0.05
            size[i] = 4
        }
    }
    source.change.emit();
    """)

    layout = column(plot_tsne)
    show(layout)

In [None]:
# HTML table of the final results

def generate_html_table(df):

    css_style = """table.paleBlueRows {
      font-family: "Trebuchet MS", Helvetica, sans-serif;
      border: 1px solid #FFFFFF;
      width: 100%;
      height: 150px;
      text-align: center;
      border-collapse: collapse;
    }
    table.paleBlueRows td, table.paleBlueRows th {
      text-align: center;
      border: 1px solid #FFFFFF;
      padding: 3px 2px;
      
    }
    table.paleBlueRows tbody td {
      text-align: center;
      font-size: 11px;
      
    }
    table.paleBlueRows tr:nth-child(even) {
      background: #D0E4F5;
    }
    table.paleBlueRows thead {
      background: #0B6FA4;
      border-bottom: 5px solid #FFFFFF;
    }
    table.paleBlueRows thead th {
      font-size: 17px;
      font-weight: bold;
      color: #FFFFFF;
      border-left: 2px solid #FFFFFF;
    }
    table.paleBlueRows thead th:first-child {
      border-left: none;
    }

    table.paleBlueRows tfoot {
      font-size: 14px;
      font-weight: bold;
      color: #333333;
      background: #D0E4F5;
      border-top: 3px solid #444444;
    }
    table.paleBlueRows tfoot td {
      font-size: 14px;
    }
    div.scrollable {width:100%; max-height:150px; overflow:auto; text-align: center;}
    """
    urlColIdx = df.columns.get_loc('url') 
    titleColIdx = df.columns.get_loc('title')
    pubColIdx = df.columns.get_loc('publish_time')
    
    doc, tag, text, line = Doc().ttl()

    with tag("head"):
        with tag("style"):
            text(css_style)


    with tag('table', klass='paleBlueRows'):
        with tag("tr"):
            for col in list(df.columns):
                if(col != 'url'):  # ('url') is a plain string, so the old 'not in' was a substring test
                    with tag("th"):
                         with tag("div", klass = "scrollable"):
                            text(col)
                        
        for idx, row in df.iterrows():
            with tag('tr'):
                for i in range(len(row)):
                    if(i==titleColIdx):                       
                        with tag('td'):
                            with tag("div", klass = "scrollable"):                            
                                if "http" in row[urlColIdx]:
                                    with tag("a", href = str(row[urlColIdx])):
                                        text(str(row[i]))
                                else:
                                    text(str(row[i]))
                    elif(i==pubColIdx):
                        with tag('td'):
                            with tag("div", klass = "scrollable"):                           
                                if(row[i]=="1900"):
                                    text("Not Available")
                                else:
                                    text(str(row[i]))
                    elif(i==urlColIdx):
                        pass  # the URL is rendered as the hyperlink on the title cell
                    else:
                        with tag('td'):
                            with tag("div", klass = "scrollable"):                            
                                text(str(row[i]))

    display(HTML(doc.getvalue()))

# Step 8: Call all functions 

This is where all the ingredients are brought together for execution. 

1. We list text from the sub tasks related to Task 9 of the COVID-19 challenge. Since Task 9 has multiple sub tasks, each is comma delimited so we can identify best responses for each one separately.

2. We define a function that can be called by providing the text of a sub task. It returns a list of articles and sentences that are found to have the most relevant answers.

In [None]:
# This section is where we list text from the questions related to Task 9 of the COVID-19 challenge.
# Since Task 9 has multiple sub-questions, each is comma delimited so we can identify best responses for each individually.

task9_keywords = ["testing covid-19 sars-cov-2 coronavirus studies lab testing pandemic research data collection data standards nomenclature data gathering 2019-nCov SARS MERS",
"coronavirus insurance companies hospitals emergency room schools nursing homes workplaces covid-19 sars-cov-2 state officials local officials mitigation strategies telehealth colleges universities 2019-nCov MERS",
"coronavirus COVID19 COVID-19 SARS-CoV-2 at-risk  Understanding mitigating barriers information-sharing information sharing   information source insight discernment recognition shackles constraints hindrances impediments obstacles",
"coronavirus COVID19 COVID-19 SARS-CoV-2 recruit support coordinate local non-Federal expertise capacity relevant public health emergency response public private commercial non-profit academic",
"2020 2019 SARS-CoV-2 surveillance trace contact transmission public state interview evaluation monitor address interview symptoms",
"2020 2019 SARS-CoV-2 capacity interventions actionable prevent prepare funding investments financing public government future potential",
"aging population COVID19 Novel 2019 COVID-19 SARS-CoV-2 at-risk communication medical professionals critical workers relaying information social media",
"COVID19 Novel 2019 COVID-19 SARS-CoV-2 at-risk conveying information mitigation measures child care advice parents families children communications transparent protocol measures mitigation",
"risk disease population communications Coronavirus SARS-CoV-2 2019-nCov COVID-19 COVID19 COVID messaging notification contagion",
"misunderstanding containment mitigation misinterpretation regulation COVID-19 SARS-CoV-2 pathogens epidemiology coronavirus disease COVID19 2019",
"Action plan mitigate gaps problems inequity public health capability capacity funding citizens needs access surveillance treatment COVID19  2019 COVID-19 SARS-CoV-2",
"2020 2019 COVID-19 traditional approaches community-based interventions digital Inclusion develop local response ensuring communications marginalized disadvantaged populations research priorities",
"2020 2019 COVID-19 prison correctional federal state local inmate jail sheriff officer facility non-violent offenders guards deputy penal authorities locked incarcerated security custody",
"2020 2019 COVID-19 benefit patient client rejection insurance coverage care consultation eligibility plan risk factors policy therapy treatment payment in-network out-of-network deductible"]

In [None]:
task9 = ["1. Methods for coordinating data-gathering with standardized nomenclature.",
"2. Sharing response information among planners, providers, and others.",
"3. Understanding and mitigating barriers to information-sharing.",
"4. How to recruit, support, and coordinate local (non-Federal) expertise and capacity relevant to public health emergency response (public, private, commercial and non-profit, including academic).",
"5. Integration of federal/state/local public health surveillance systems.",
"6. Value of investments in baseline public health response infrastructure preparedness",
"7. Modes of communicating with target high-risk populations (elderly, health care workers).",
"8. Risk communication and guidelines that are easy to understand and follow (include targeting at risk populations’ families too).",
"9. Communication that indicates potential risk of disease to all population groups.",
"10. Misunderstanding around containment and mitigation.",
"11. Action plan to mitigate gaps and problems of inequity in the Nation’s public health capability, capacity, and funding to ensure all citizens in need are supported and can access information, surveillance, and treatment.",
"12. Measures to reach marginalized and disadvantaged populations. Data systems and research priorities and agendas incorporate attention to the needs and circumstances of disadvantaged populations and underrepresented minorities.",
"13. Mitigating threats to incarcerated people from COVID-19, assuring access to information, prevention, diagnosis, and treatment.",
"14. Understanding coverage policies (barriers and opportunities) related to testing, treatment, and care"]

In [None]:
dict_task=dict(zip(task9,task9_keywords))

In [None]:
final_dict = {}
#header_dict = {'': '','All': 'All'}
header_dict = {'All': 'All'}
final_dict = dict(header_dict, **dict_task)

In [None]:
# Function that finds the most relevant articles and sentences for a given input text. 

def bm25res(tcount,visualization):
    print('\033[1m' + '*********************************Start*****************************************')
    print('\033[1m' + 'Subtask: ' + task9[tcount] + '\n') 
    
    bm25_results = bm25_index.search(task9_keywords[tcount],n=covid_df.shape[0])
    bm25_df = bm25_results.getDf()
    
    bm25_score = bm25_df['BM25_Score'].sort_values(ascending=False).tolist()
    #print('Max BM25 Score = ',bm25_df['BM25_Score'].max())
    print('\033[1m' + 'BM25 Score Selection')
    bm25_x_idx = range(1, len(bm25_score)+1)    
    bm25_kn = KneeLocator(bm25_x_idx, bm25_score,curve='concave',direction='decreasing',online=True)
    #optimal_bm25_score ,kn_bm25= kneeLocator(bm25_x_idx,bm25_score)
    optimal_bm25_score = bm25_score[bm25_kn.knee-1]
    
    covid_df_bm25_filter =bm25_df[bm25_df['BM25_Score'] >=optimal_bm25_score]
    
    if (covid_df_bm25_filter.shape[0] <10):
        covid_df_bm25_filter= bm25_df.iloc[:25,:]

    print('\033[1m' + 'Number of Papers Selected after BM25 Scoring = ', covid_df_bm25_filter.shape[0])        
    docs = covid_df_bm25_filter.all_text
    dictionary = Dictionary(docs)
    if(len(docs)<10):
        dictionary.filter_extremes(no_below=1)
    else:
        dictionary.filter_extremes()
    #Create dictionary and corpus required for Topic Modeling
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    noOfDocs = len(corpus)
    start_num_topics = 0
    step_num_topics = 2
    if(noOfDocs>=200):
        noOfMaxTopics = int(noOfDocs*0.1)
        if(noOfMaxTopics>100):
            noOfMaxTopics = 100
        start_num_topics = 5
        step_num_topics = 5
    elif(noOfDocs>=50 and noOfDocs<200):
        noOfMaxTopics = 20
        start_num_topics = 2
        step_num_topics = 2
    else:
        noOfMaxTopics = 10
        start_num_topics = 2
        step_num_topics = 2
    #print('Number of unique tokens: %d' % len(dictionary))
    #print('Number of documents: %d' % noOfDocs)
    #print('Number of max topics: %d' % noOfMaxTopics)
    print('\033[1m' + 'Finding optimal number of topics; maximum number of topics = ',noOfMaxTopics)
    model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=docs, start=start_num_topics, limit=noOfMaxTopics, step=step_num_topics)
    optimal_model, model_topics, number_of_topics, temp_coherenace= findOptimalTopicModel(start_num_topics, noOfMaxTopics, step_num_topics, model_list, coherence_values)
    print('\033[1m' + 'Optimal Number of Topics = ',number_of_topics)
    print('\033[1m' + 'Coherence Score = ', temp_coherenace)
    
    df_topic_sents_keywords = format_topics_sentences(covid_df_bm25_filter,ldamodel=optimal_model, corpus=corpus, texts=docs)
    df_dominant_topic = df_topic_sents_keywords.reset_index()
    

    topicPercContrib = df_dominant_topic.Topic_Perc_Contrib.sort_values(ascending=False).tolist()    
    print('\033[1m' + 'Dominant Score Selection')
    topic_contrib_x_idx = range(1, len(topicPercContrib)+1)    
    topic_kn = KneeLocator(topic_contrib_x_idx, topicPercContrib,curve='concave',direction='decreasing',online=True)
    
    #optimal_topic_score,kn_topic = kneeLocator(topic_contrib_x_idx ,topicPercContrib)
    
    
    #optimal_bm25_score = bm25_score[bm25_kn.knee-1]
    optimal_topic_score = topicPercContrib[topic_kn.knee-1]
    
    dominant_topic_filtered_df = df_dominant_topic[df_dominant_topic['Topic_Perc_Contrib']>=optimal_topic_score]
    
    print('\033[1m' + 'Number of Papers Selected after Dominant Topic Scoring = ', dominant_topic_filtered_df.shape[0])    
    
    topNStr = 'Top_'+str(top_N_sentence)+'_Sentence'
        
    title_body_str = dominant_topic_filtered_df.title.fillna('') +' ' + dominant_topic_filtered_df.body.fillna('')
    dominant_topic_filtered_df['title_body_clean'] = title_body_str.apply(clean_text).to_frame()

    sentence_embed_df = dominant_topic_filtered_df.title_body_clean.apply(sentenceEmbed,args=[tcount])
    dominant_topic_filtered_df[['Top_3_Sentence',topNStr,'SentenceScore']] = pd.DataFrame(sentence_embed_df.to_list(), columns=['Top_3_Sentence',topNStr,'SentenceScore'], index=sentence_embed_df.index)    
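    # Sorting the labels as strings in descending order gives Medium, Low, High, so rank them explicitly
    score_rank = dominant_topic_filtered_df['SentenceScore'].map({'High': 0, 'Medium': 1, 'Low': 2})
    dominant_topic_filtered_df = dominant_topic_filtered_df.loc[score_rank.sort_values(kind='mergesort').index]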
    
    results = dominant_topic_filtered_df[['paper_id','title',topNStr,'Top_3_Sentence','SentenceScore','publish_time','url','all_text']]    
    
    if(visualization):
        finalfig, finalaxis = plt.subplots(1,3,figsize=(20,5))
        plotKnee(bm25_kn,bm25_x_idx,bm25_score,min(bm25_score),max(bm25_score),'BM25 Doc#','BM25 Score',finalaxis,0)
        plotCoherence(start_num_topics, noOfMaxTopics, step_num_topics,coherence_values,finalaxis,1)
        plotKnee(topic_kn,topic_contrib_x_idx,topicPercContrib,min(topicPercContrib),max(topicPercContrib),'Topic Doc ID','Topic % Contribution',finalaxis,2)
        plt.show()
        print('\033[1m' + '\nTop N Topics Word Cloud' + '\033[0m')
        plotWordCloud(number_of_topics,model_topics)

    print('\033[1m' + 'Final Number of Papers Selected = ', results.shape[0])
    print('\033[1m' + 'Subtask: ' + task9[tcount] + '\n')
    print('\033[1m' + '*********************************Completed*****************************************')
    return results;

# Step 9: Execute

This is where the magic happens. For those who want to use our code, this is the section where you choose the sub task for which you want responses, and let the code run. You can either choose the "All" option, which will identify answers for all 14 sub tasks that are part of Task 9, or specify a single sub task.

The results are provided in four different ways:
- Relevant sentences from relevant articles that best answer each question
- A ScatterTextPlot, which is an interactive way to view and search for key words and sentences from the resulting articles
- A t-SNE plot that shows how articles are clustered together based on the topics found relevant for a sub task
- A Word wall that highlights the most frequent words showing in the resulting articles.


In [None]:
def run_task(val):
    # Map the selected value back to its index in the task list.
    tcount = list(dict_task.values()).index(val)
    visualization = False
    taskResults = pd.DataFrame(columns=[])
    fileName = task9[tcount]
    results = bm25res(tcount, visualization)
    #results.to_csv(fileName + '.csv', index=False)
    #scatterPlot(results,fileName+'.html','Top_'+str(top_N_sentence)+'_Sentence')
    # Tag every result row with the sub task it answers.
    temp = results
    temp['SubTask'] = fileName
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
    taskResults = pd.concat([taskResults, temp])
    taskResults = taskResults.reset_index(drop=True)
    final_visualization(taskResults)

    
def run_task_all():
    taskscount = len(task9)
    visualization = False
    taskResults = pd.DataFrame(columns=[])
    for tcount in range(taskscount):
        fileName = task9[tcount]
        results = bm25res(tcount, visualization)
        #results.to_csv(fileName + '.csv', index=False)
        #scatterPlot(results,fileName+'.html','Top_'+str(top_N_sentence)+'_Sentence')
        # Tag every result row with the sub task it answers.
        temp = results
        temp['SubTask'] = fileName
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
        taskResults = pd.concat([taskResults, temp])
    taskResults = taskResults.reset_index(drop=True)
    final_visualization(taskResults)

def final_visualization(taskResults):
    topNColumnName = 'Top_'+str(top_N_sentence)+'_Sentence'
    taskResults.fillna('',inplace=True)
    #taskResults.to_csv('all_task_final_output.csv', index=False)
    scatterPlot(taskResults,'Task9_Scatterplot.html',topNColumnName)
    tsneplot(taskResults,topNColumnName)
    taskResults['publish_time'] = taskResults['publish_time'].astype(str)
    generate_html_table(taskResults[['SubTask','title',topNColumnName,'publish_time','url']])

The text is specific to each sub task in Task 9.

If you are not much of a visual person, then this section may help. 

We put together the top three sentences found in the top articles related to each sub task.

These sentences are derived from the various algorithms and calculations performed in the code and are our best-effort match to answer the sub tasks.
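
To make the idea of picking the top sentences concrete, here is a minimal sketch that ranks sentences against a query with a plain BM25 score and keeps the best ones. It is an illustration under stated assumptions, not the notebook's actual selection logic: the helpers `tokenize`, `bm25_scores`, and `top_n_sentences`, the parameter values `k1=1.5` and `b=0.75`, and the sample sentences are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, sentences, k1=1.5, b=0.75):
    """Score each sentence against the query with a basic BM25 formula."""
    if not sentences:
        return []
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of sentences that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def top_n_sentences(query, sentences, n=3):
    """Return the n highest-scoring sentences, best first."""
    scores = bm25_scores(query, sentences)
    ranked = sorted(zip(scores, sentences), key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in ranked[:n]]

# Made-up sentences standing in for text extracted from the top articles.
sentences = [
    "Standardized nomenclature helps coordinate data gathering.",
    "The weather was mild in spring.",
    "Barriers to information sharing delay the response.",
    "Information sharing among planners and providers improves coordination.",
]
query = "information sharing barriers"
top = top_n_sentences(query, sentences, n=2)
print(top)
```

Sentences that share no terms with the query score zero, so they only appear in the result when fewer than `n` sentences match.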

#Enable the code below to choose a sub task from a drop-down menu

@interact
def drop_down(x=final_dict):
    # Look up the menu label that corresponds to the selected value.
    dropdown = list(final_dict.keys())[list(final_dict.values()).index(x)]
    if x == "All":
        run_task_all()
    else:
        run_task(x)
    return dropdown

In [None]:
# Run all sub tasks for Task 9
run_task_all()

**Ericsson, the world’s leading telecommunications company, cares about doing good. This task was completed as part of our Ericsson for Good program, which allows our 90,000+ employees to contribute to their communities.**

© This Notebook has been released under the OSI-approved GNU LGPLv2.1 license; Google BERT is under Apache License 2.0 (https://github.com/google-research/bert/blob/master/LICENSE); Facebook’s fairseq is under MIT license: https://github.com/pytorch/fairseq/blob/master/LICENSE