# NPI-Context: Using intervention context to inform literature search with case study

1. We make literature search for new interventions in specific environments easier by incorporating the context of each intervention into the search.

2. We demonstrate this method on a newly constructed dataset of Canadian NPIs. To show the benefit, we compare our method against a baseline that uses the general Oxford intervention categories as search terms, and show that it surfaces new, relevant research.

## Introduction

The COVID-19 Open Research Dataset (CORD-19) challenge was launched to help healthcare experts get quick, accurate answers to their scientific questions about coronaviruses. NLP and ML tools can improve how relevant research is found, which in turn can guide policy actions taken by governments and organizations around the world. CORD-19 encompasses 40,000 articles about coronaviruses. The competition proposes 10 tasks, each covering fundamental questions related to COVID-19. In this submission, we focus on the task related to non-pharmaceutical interventions (NPIs). In particular, we aim to answer:

- What do we know about the effectiveness of non-pharmaceutical interventions?
- What is known about equity and barriers to compliance for non-pharmaceutical interventions?

## Method

### Searching the CORD-19 Dataset

We make use of the [covidex.io](https://covidex.io) project, which uses the Anserini information retrieval toolkit via pyserini. All the documents in CORD-19 are indexed in Lucene. We build on the demonstration notebook found [here](https://colab.research.google.com/drive/1mrapJp6-RIB-3u6FaJVa4WEwFdEBOcTe) to set up the Lucene index and search functionality.

Thanks to [Jimmy Lin](https://cs.uwaterloo.ca/~jimmylin/) from the University of Waterloo and [Kyunghyun Cho](http://www.kyunghyuncho.me/) from NYU and their team for building this.

### Building a Dataset of Intervention Events

Policy makers and researchers around the world use literature review to help each team, organization, and country understand the effectiveness of non-pharmaceutical interventions and barriers to compliance *for their specific circumstances*. Observing the leading countries in COVID-19 response like South Korea and China we see drastically different methods used to intervene. **Making use of country-specific context is an important part of improving search quality.**

To show the effectiveness of this approach, we need an up-to-date and thorough picture of each country's current interventions and how they are being implemented. Such a dataset has been created for Canada, which we use as a case study.

The [howsmyflattening.ca](https://howsmyflattening.ca) team has compiled a dataset of non-pharmaceutical interventions in Canada with 60 intervention labels, 1838 events, and more than 900 unique information sources. Some of the authors of this notebook are contributors to the Canadian non-pharmaceutical interventions dataset. The dataset can be retrieved on Kaggle [here](https://www.kaggle.com/howsmyflattening/covid19-challenges#npi_canada.csv).

### Intervention Context using Topic Modeling

We use Latent Dirichlet Allocation [(Blei et al., 2003)](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) to find topics in the full-text announcements recorded for all interventions in the input dataset. We then use keywords from these topics to guide the search for relevant documents, comparing and augmenting the search results obtained from the labeled interventions themselves. **Crucially, we are not just modeling topics in existing research, but also in actual interventions to understand the relationships between them.**
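
As a rough illustration of the keyword step, the sketch below turns one topic, given as (word, weight) pairs, into a plain-text query string. The helper name `topic_to_query`, the example topic, and the `top_n` cutoff are hypothetical and only serve to show the idea; splitting underscore-joined bigrams back into words is likewise an assumption of this sketch, not something the notebook does.

```python
def topic_to_query(topic_terms, top_n=5):
    """Join the top_n highest-weighted words of a topic into one query string."""
    ranked = sorted(topic_terms, key=lambda pair: pair[1], reverse=True)
    # Assumption: bigrams are joined with "_"; split them back into separate words.
    words = [word.replace("_", " ") for word, _ in ranked[:top_n]]
    return " ".join(words)

# Hypothetical topic, not taken from the fitted model
example_topic = [("school", 0.08), ("closure", 0.05), ("child_care", 0.04), ("students", 0.03)]
query = topic_to_query(example_topic, top_n=3)
print(query)
```

Keeping only the highest-weighted words is one simple way to keep queries short; the cutoff would need tuning against real search results.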


## Putting it all together - A Case Study in Canada

Below we use Canadian intervention data as a case study for our approach. We apply topic modeling to the intervention text and compare the search results with our baseline approach and show that the context-keyword generation leads to new, relevant results.

# Setup

In [None]:
import os
# Thanks to https://www.kaggle.com/dirktheeng/anserini-bert-squad-for-semantic-corpus-search/
# for the code on how to setup Java 11
!curl -O https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
!mv openjdk-11.0.2_linux-x64_bin.tar.gz /usr/lib/jvm/; cd /usr/lib/jvm/; tar -zxvf openjdk-11.0.2_linux-x64_bin.tar.gz
!update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk-11.0.2/bin/java 1
!update-alternatives --set java /usr/lib/jvm/jdk-11.0.2/bin/java
os.environ["JAVA_HOME"] = "/usr/lib/jvm/jdk-11.0.2"

In [None]:
!pip install pyserini==0.8.1.0
!pip install transformers
!pip install geopandas
!pip install pyLDAvis

In [None]:
!jupyter nbextension enable --py --sys-prefix widgetsnbextension


In [None]:
%%capture


from IPython.display import display, HTML
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import re
import json
import os

from pyserini.search import pysearch


from ipywidgets import interactive
from ipywidgets import interact, interact_manual
from ipywidgets import Layout, Button, Box, FloatText, Textarea, Dropdown, Label, IntSlider

from matplotlib import cm
import seaborn as sns
import matplotlib.patches as mpatches
import ipywidgets as widgets
from mpl_toolkits.axes_grid1 import AxesGrid
import geopandas as gpd

import nltk
nltk.download('stopwords')

Let's grab the pre-built index:

In [None]:
%%capture
!wget https://www.dropbox.com/s/d6v9fensyi7q3gb/lucene-index-covid-2020-04-03.tar.gz
!tar xvfz lucene-index-covid-2020-04-03.tar.gz

Sanity check of index size (should be 1.5G):

In [None]:
!du -h lucene-index-covid-2020-04-03

In [None]:
searcher = pysearch.SimpleSearcher('lucene-index-covid-2020-04-03/')

def search(search_strings, topk=5):
  """Run each query against the index and return the top-k hits per query as a DataFrame."""
  columns = ['search', 'rank', 'title', 'score']
  rows = []
  for query in search_strings:
    hits = searcher.search(query)
    print("")
    print("Search term: ", query)
    print("  hits:", len(hits))
    scores = [h.score for h in hits]
    # Guard against empty result lists, where np.mean would warn and return nan
    print("  mean score:", np.mean(scores) if scores else float('nan'))
    print("")
    for i in range(0, min(topk, len(hits))):
      doc = hits[i].lucene_document
      print(f'{i+1:2} {hits[i].docid} {hits[i].score:.5f} {doc.get("title")} {doc.get("doi")}')
      rows.append([query, i+1, doc.get("title"), hits[i].score])

  # Build the frame once; DataFrame.append was removed in pandas 2.0
  return pd.DataFrame(rows, columns=columns)

# Oxford Government Response Tracking

To aid efforts to fight the pandemic, we use a standard metric for analyzing government responses. The Oxford COVID-19 Government Response Tracker (OxCGRT) dataset is collected and updated in real time by a team of dozens of students and staff at Oxford University, and is available at this [link](https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker).
It provides 13 indicators of government response. Nine of these (S1-S7, S12, and S13) are non-financial policies such as event cancellation, and the other four (S8-S11) are financial indicators such as monetary measures. The Canadian NPI dataset links each eligible intervention type to one of the Oxford categories.
The indicators are listed in the `intervention_categories` cell under Interactive Visualization Setup.

Each indicator takes a range of values. For the full list of indicators and their encodings, see the [Encodings](https://www.bsg.ox.ac.uk/sites/default/files/2020-04/BSG-WP-2020-031-v4.0_0.pdf) document.
Averaging the normalized indicator values gives a composite index that lets us understand how quickly and strictly different areas of Canada reacted to COVID-19 over time. This Stringency Index is calculated using only the policy indicators S1-S7. Further details on the calculation are provided at: [Calculation details](https://www.bsg.ox.ac.uk/sites/default/files/Calculation%20and%20presentation%20of%20the%20Stringency%20Index.pdf)
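
To make the averaging step concrete, here is a minimal sketch that assumes each indicator has already been normalized to a 0-100 score. The function name `stringency_index` and the example scores are made up for illustration, and treating a missing indicator as 0 is an assumption of the sketch; the Oxford documentation linked above remains the reference for the exact normalization.

```python
STRINGENCY_INDICATORS = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]

def stringency_index(scores):
    """Average the normalized (0-100) scores of indicators S1-S7.

    Indicators missing from `scores` count as 0, which is an assumption
    of this sketch rather than a documented rule.
    """
    total = sum(scores.get(code, 0.0) for code in STRINGENCY_INDICATORS)
    return total / len(STRINGENCY_INDICATORS)

# Hypothetical scores for one region on one day; S12 is ignored by the index
example_scores = {"S1": 100.0, "S2": 50.0, "S3": 100.0, "S5": 100.0, "S7": 0.0, "S12": 50.0}
index_value = stringency_index(example_scores)
print(round(index_value, 2))
```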

Note that this index simply records the number and strictness of government policies and should not be interpreted as ‘scoring’ the appropriateness or effectiveness of a country’s response.

The following map summarizes the intervention policies taken in each country. Selecting an intervention type and a specific day between January 1st and April 2020 shows the level of government response on the map. Hovering over a country shows its number of positive cases and intervention level.

In [None]:
%%html
<div class='tableauPlaceholder' id='viz1587056618048' style='position: relative'><noscript><a href='https:&#47;&#47;www.bsg.ox.ac.uk&#47;research&#47;publications&#47;variation-government-responses-covid-19'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ox&#47;Oxford-COVID-19&#47;Geo&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Oxford-COVID-19&#47;Geo' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ox&#47;Oxford-COVID-19&#47;Geo&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1587056618048');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

# Canadian non-pharmaceutical interventions dataset

We seek to improve on the Oxford dataset by finding individual intervention announcements at the city, province, and national level. This provides the content we can learn from to add context to our search.

To this end, we helped construct the Canadian non-pharmaceutical interventions dataset, which compiles information from **January 1st to March 31st, 2020** across 13 provinces and territories as well as the 20 largest census metropolitan areas in Canada.
The 60 individual intervention types in this dataset include (but are not restricted to) government announcements, initiatives, and orders, such as social distancing measures or social and fiscal measures. They are tagged with the corresponding Oxford intervention indicators where applicable. The rest of the notebook uses this as the reference dataset for visualization and topic modeling.

In [None]:
!wget https://raw.githubusercontent.com/jajsmith/COVID19NonPharmaceuticalInterventions/master/npi_full.csv

In [None]:
full_df = pd.read_csv('npi_full.csv')
full_df['start_date'] = pd.to_datetime(full_df['start_date'])
full_df['end_date'] = pd.to_datetime(full_df['end_date'])
full_df['oxford_fiscal_measure_cad'] = full_df['oxford_fiscal_measure_cad'].replace(r'[\$,]', '', regex=True).astype(float)

## Understanding Canadian Interventions

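The Oxford intervention types (S1-S13) used below are introduced in the Oxford Government Response Tracking section; details on how they are tracked are available on the [tracker page](https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker).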

### Interactive Visualization Setup

In [None]:
!wget https://flatteningthecurve.herokuapp.com/data/covid

In [None]:
full_cases = pd.read_csv('covid')

In [None]:
intervention_categories = ['S1 School Closing',
                           'S2 Workplace closing',
                           'S3 Cancel public events',
                           'S4 Close public transport',
                           'S5 Public info campaigns',
                           'S6 Restrictions on internal movements',
                           'S7 International travel controls',
                           'S8 Fiscal measures',
                           'S9 Monetary measures (interest rate)',
                           'S10 Emergency investment in health care',
                           'S11 Investment in vaccines',
                           'S12 Testing policy',
                           'S13 Contact tracing']

In [None]:
def parse_rate(string):
    if type(string) == float:
        return string
    cad = string[:-1]
    return float(cad)

def impute_intervention(prov):
    for interv_cat in intervention_categories:
        prov[interv_cat] = 0
    closure_geo = ['S1', 'S2', 'S3', 'S4', 'S6']
    public_geo = ['S5']
    travel = ['S7']
    rate = ['S9']
    fiscal = ['S8', 'S10', 'S11']
    test = ['S12']
    trace = ['S13']
    for idx, row in prov.iterrows():
        interv = row['oxford_government_response_category']
        if interv in intervention_categories:
            interv_prefix = str(interv).split(' ')[0]
            subset = prov.iloc[:idx+1]
            subset = subset[subset['oxford_government_response_category'] == interv]
            if interv_prefix in closure_geo:
                prov.at[idx, interv] = (np.nanmax(subset['oxford_closure_code']) + np.nanmax(subset['oxford_geographic_target_code'])) * 100 / 3
            elif interv_prefix in public_geo:
                prov.at[idx, interv] = (np.nanmax(subset['oxford_geographic_target_code']) + np.nanmax(subset['oxford_public_info_code'])) * 100 / 2
            elif interv_prefix in travel:
                prov.at[idx, interv] = (subset['oxford_travel_code'].max()) * 100 / 3
            elif interv_prefix in rate:
                prov.at[idx, interv] = subset['oxford_monetary_measure'].apply(parse_rate).sum()
            elif interv_prefix in fiscal:
                prov.at[idx, interv] = pd.to_numeric(subset['oxford_fiscal_measure_cad']).sum()
            # Compare the prefix (e.g. 'S12'), not the full label, against the code lists
            elif interv_prefix in test:
                prov.at[idx, interv] = subset['oxford_testing_code'].max() * 100 / 2
            elif interv_prefix in trace:
                prov.at[idx, interv] = subset['oxford_tracing_code'].max() * 100 / 2
            if idx > 0:
                for i in intervention_categories:
                    if i != interv:
                        prov.at[idx, i] = prov.at[idx-1, i]
        else:
            if idx > 0:
                for i in intervention_categories:
                    prov.at[idx, i] = prov.at[idx-1, i]
    return prov

In [None]:
def construct_positive(prov):
    # Running total of daily cases; equivalent to accumulating row by row
    prov['Cumulative Cases'] = prov['Daily Cases'].cumsum()
    return prov

In [None]:
def generate_cases_province(full_npi, pn, pn_short):
    prov = full_npi[(full_npi['region'] == pn)]
    
    # mb_list = ['Region', pn, pn+'.1', pn+'.2', pn+'.3', pn+'.4', pn+'.5']
    # mb = mobility[mb_list]
    # mb = mb.rename(columns={mb_list[0]: 'start_date',
    #                         mb_list[1]: mobility_list[0],
    #                         mb_list[2]: mobility_list[1],
    #                         mb_list[3]: mobility_list[2],
    #                         mb_list[4]: mobility_list[3],
    #                         mb_list[5]: mobility_list[4],
    #                         mb_list[6]: mobility_list[5]
    #                         })
    # mb = mb.iloc[1:]
    # mb['start_date'] = pd.to_datetime(mb['start_date'], format='%d-%m-%Y')
    # for mb_type in mobility_list:
    #     mb[mb_type] = pd.to_numeric(mb[mb_type])
    #     mb[mb_type] = mb[mb_type].apply(lambda x: scale_mob(x, mb[mb_type].min(), mb[mb_type].max()))
       
    prov = prov[['start_date', 'region', 'end_date', 'oxford_government_response_category', 'oxford_closure_code',
       'oxford_public_info_code', 'oxford_travel_code',
       'oxford_geographic_target_code', 'oxford_fiscal_measure_cad',
       'oxford_monetary_measure', 'oxford_testing_code', 'oxford_tracing_code']]
    
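    # prov is a selection from full_npi, so assign() is used instead of writing into it;
    # infer_datetime_format expects a boolean, so the stray format string is dropped
    prov = prov.assign(start_date=pd.to_datetime(prov['start_date']))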
    
    # Copy the filtered rows so the new column is not written to a view of full_cases
    cs = full_cases[full_cases['province'] == pn_short].copy()
    cs['start_date'] = pd.to_datetime(cs['date'], format='%Y-%m-%d')
    cs = cs.groupby('start_date')['id'].agg('count').reset_index().rename(columns={'id': 'Daily Cases'})
    cs = construct_positive(cs)
    
    # prov = pd.merge(prov, mb[['start_date'] + mobility_list], on='start_date', how='outer')
    prov = pd.merge(prov, cs, on='start_date', how='outer')
    prov = prov.sort_values(by='start_date',ascending=True).reset_index(drop=True)
    
    prov = impute_intervention(prov)
    prov = prov.drop_duplicates(subset ="start_date", 
                     keep = 'last').reset_index(drop=True) 
    
    
    return prov

In [None]:
full_viz = full_df.copy()
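# Assign back rather than calling fillna(inplace=True) on a column selection,
# which may operate on a temporary copy in recent pandas versions
fill_cols = ['oxford_geographic_target_code', 'oxford_closure_code', 'oxford_public_info_code']
full_viz[fill_cols] = full_viz[fill_cols].fillna(0)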

In [None]:
on = generate_cases_province(full_viz, 'Ontario', 'Ontario')
qb = generate_cases_province(full_viz, 'Quebec', 'Quebec')
bc = generate_cases_province(full_viz, 'British Columbia', 'BC')
sk = generate_cases_province(full_viz, 'Saskatchewan', 'Saskatchewan')
nb = generate_cases_province(full_viz, 'New Brunswick', 'New Brunswick')
ns = generate_cases_province(full_viz, 'Nova Scotia', 'Nova Scotia')
mb = generate_cases_province(full_viz, 'Manitoba', 'Manitoba')
ab = generate_cases_province(full_viz, 'Alberta', 'Alberta')
pei = generate_cases_province(full_viz, 'Prince Edward Island', 'PEI')
nwt = generate_cases_province(full_viz, 'Northwest Territories', 'NWT')
nl = generate_cases_province(full_viz, 'Newfoundland and Labrador', 'NL')
yt = generate_cases_province(full_viz, 'Yukon', 'Yukon')
nv = generate_cases_province(full_viz, 'Nunavut', 'Nunavut')

In [None]:
prov_dict = {'Ontario': on,
             'Quebec': qb,
             'British Columbia': bc,
             'Saskatchewan': sk,
             'New Brunswick': nb,
             'Nova Scotia': ns,
             'Manitoba' : mb,
             'Alberta' : ab,
             'Prince Edward Island': pei,
             'Northwest Territories' : nwt,
             'Newfoundland and Labrador' : nl,
             'Yukon': yt,
             'Nunavut': nv}
prov_list = ['Ontario', 'Quebec', 'British Columbia', 'Saskatchewan', 
             'New Brunswick', 'Nova Scotia', 'Manitoba', 'Alberta',
             'Prince Edward Island', 'Northwest Territories', 'Newfoundland and Labrador', 'Yukon', 'Nunavut']

In [None]:
w_prov1 = widgets.Select(description="Province 1", options=prov_list)
w_prov2 = widgets.Select(description="Province 2", options=prov_list)
w_intervention_multi = widgets.SelectMultiple(description="Different intervention types",
                                             options=intervention_categories)
w_stats = widgets.Select(description="Different COVID-19 indicators",
                         options=['Daily Cases', 'Cumulative Cases'])
w_stringency = widgets.Select(description="Stringency Index", options=[1, 33, 50, 66, 100])

In [None]:
def compare_provinces_cases(prov1_str, prov2_str, interv_type, stat, stringency_idx):
    prov1 = prov_dict[prov1_str]
    prov2 = prov_dict[prov2_str]
    
    if stat == "Daily Cases":
        height = 50
    else:
        height = 500
    
    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, sharey=True, figsize=(15, 10))
    colors = sns.color_palette("Set2", n_colors=8)
    legend_labels = [mpatches.Patch(color=colors[0], label=stat)]
    
    begin_date = np.datetime64('2020-02-15')
    
    if prov1['start_date'].values[-1] > prov2['start_date'].values[-1]:
        end_date = prov2['start_date'].values[-1]
    else:
        end_date = prov1['start_date'].values[-1]
        
    ax1.set_xlim(left=begin_date, right=end_date)
    
    ax1.bar(prov1['start_date'], prov1[stat], color=colors[0])
    ax1.set(ylabel=stat, title=prov1_str)
    
    ax2.bar(prov2['start_date'], prov2[stat], color=colors[0])
    ax2.set(xlabel = 'Date', ylabel=stat, title=prov2_str)
    
    for idx in range(len(interv_type)):
        if len(prov1[prov1[interv_type[idx]] >= stringency_idx]['start_date']) :
            first_date_1 = prov1[prov1[interv_type[idx]] >= stringency_idx]['start_date'].values[0]
            second_date_1 = first_date_1 + np.timedelta64(14,'D')
            
            ax1.axvline(x=first_date_1, linestyle='-', color=colors[1+idx])
            ax1.text(first_date_1+ np.timedelta64(8, 'h'), height, interv_type[idx],rotation=90)
            ax1.axvspan(second_date_1, end_date, facecolor= colors[1+idx], alpha=0.4)
            ax1.text(second_date_1+ np.timedelta64(8, 'h'), height,
                    interv_type[idx].split(' ')[0] + ' - after 14 days',rotation=90)
        
        if len(prov2[prov2[interv_type[idx]] >= stringency_idx]['start_date']) :
            first_date_2 = prov2[prov2[interv_type[idx]] >= stringency_idx]['start_date'].values[0]
            second_date_2 = first_date_2 + np.timedelta64(14,'D')
            
            ax2.axvline(x=first_date_2, linestyle='-', color=colors[1+idx])
            ax2.text(first_date_2+ np.timedelta64(8, 'h'),height, interv_type[idx],rotation=90)
            
            ax2.axvspan(second_date_2, end_date, facecolor= colors[1+idx], alpha=0.4)
            ax2.text(second_date_2+ np.timedelta64(8, 'h'), height,
                    interv_type[idx].split(' ')[0] + ' - after 14 days',rotation=90)
        
    
        legend_labels.append(mpatches.Patch(color=colors[1+idx], label=interv_type[idx]))  

    ax1.legend(handles=legend_labels, loc=2)
    ax2.legend(handles=legend_labels, loc=2)
    
    plt.show()

In [None]:
def view_case_provinces(p1, p2, i, nb, s):
    display(compare_provinces_cases(p1, p2, i, nb, s))

### Interactive Interventions and Case Data

In [None]:
interactive(view_case_provinces, p1=w_prov1, p2=w_prov2, i=w_intervention_multi,
            nb=w_stats, s=w_stringency)

### Intervention Impact on Mobility Explorer

In [None]:
%%html
<div class='tableauPlaceholder' id='viz1586995220180' style='position: relative'  ><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;On&#47;OntarioInterventions&#47;Sheet1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='OntarioInterventions&#47;Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;On&#47;OntarioInterventions&#47;Sheet1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1586995220180');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='90%';vizElement.style.height=(divElement.offsetWidth*0.65)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>


### Intervention Stringency by Geography

In [None]:
%%html
<div class='tableauPlaceholder' id='viz1587073081316' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;In&#47;Interventions_All_Provinces&#47;Sheet1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Interventions_All_Provinces&#47;Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;In&#47;Interventions_All_Provinces&#47;Sheet1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1587073081316');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

# Baseline Search: Oxford Response Labels

These are the Oxford indicators found in the Canadian NPI dataset. We expect them to provide a strong baseline for search results that can inform the Canadian response to COVID-19.

In [None]:
full_df['oxford_government_response_category'].astype(str).unique()

In [None]:
results_df = search(full_df['oxford_government_response_category'].astype(str).unique())

# Keyword Selection from LDA Topics on Canadian NPI Full Text (gensim)

The topic modeling algorithm used in this approach is a well-known generative probabilistic model called **Latent Dirichlet Allocation (LDA)**.

LDA takes the words of each document as input and produces topics, each of which is a probability distribution over words. It defines a joint probability distribution over the observed and hidden random variables and computes the posterior (conditional) distribution of the hidden variables given the observed ones. **The fundamental assumption of LDA is that documents can be assigned to multiple topics.** A second assumption is that topics are hidden variables, while the words in documents are observed variables. LDA therefore inverts a generative process: from the words (*observed variables*) it infers the topics (*hidden variables*), which are **probability distributions over words**.
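
The generative story that LDA assumes can be sketched in a few lines of standard-library Python. This is a toy illustration, not the training algorithm: the vocabulary, the two topic-word distributions, and the helper names `sample_dirichlet` and `generate_document` are all invented for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        z = rng.choices(range(len(topic_word)), weights=theta)[0]  # hidden topic
        w = rng.choices(vocab, weights=topic_word[z])[0]           # observed word
        words.append(w)
    return theta, words

# Toy vocabulary and topic-word distributions, invented for this example
vocab = ["school", "closure", "travel", "border"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0: schools
    [0.0, 0.0, 0.5, 0.5],  # topic 1: travel
]
rng = random.Random(0)
theta, doc = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, doc)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and topic-word distributions that most plausibly produced them.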


## Import Packages
The main packages we use are *gensim*, *nltk*, and *pandas*.

In [None]:
import os
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import gensim
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.models.phrases import Phrases, Phraser
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
import pandas
import re
from pprint import pprint

## Read Data and Preprocessing
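In this step we read the dataset and select the full text of each intervention announcement. Quebec records are left out, as the `engl_df` name indicates, to keep the corpus English-only.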

In [None]:
engl_df = full_df[full_df['region'] != 'Quebec']
full_text = engl_df['source_full_text'].drop_duplicates().astype(str)
data = full_text.values
data = [re.sub(r'\s+', ' ', text) for text in data] # collapse newlines and repeated whitespace
data = [re.sub("'", "", text) for text in data] # remove single quotes
print("Total number of documents: ", len(data))

In [None]:
words = [row.split() for row in data]

print(words[13:14])



## Bigram and Trigram
We build bigram and trigram phrase models from the original texts, so that frequent word pairs and triples can be treated as single tokens.
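
For intuition about how phrase detection decides which word pairs to merge, the sketch below scores adjacent pairs by how much more often they co-occur than their individual frequencies would suggest. It is a simplified collocation score written for this example, not the library's exact implementation; the function name `score_bigrams` and the tiny corpus are hypothetical.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs: higher means the pair co-occurs more than chance."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), count in bigrams.items():
        # Only pairs seen more than min_count times are candidates
        if count > min_count:
            scores[(a, b)] = (count - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores

# Tiny made-up corpus
sentences = [
    ["public", "health", "officials", "close", "schools"],
    ["public", "health", "order", "issued"],
    ["officials", "issued", "public", "health", "advice"],
]
scores = score_bigrams(sentences, min_count=1)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

Pairs whose score exceeds a chosen threshold would be merged into a single token such as `public_health`.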

In [None]:

bigram = Phrases(words, min_count=30, progress_per=10000)
trigram = Phrases(bigram[words], threshold=100)
bigram_mod = Phraser(bigram)
trigram_mod = Phraser(trigram)

bigrams = [b for l in data for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]



## NLTK Stop words

In [None]:

stop_words = stopwords.words('english')
#stop_words.extend(['school', 'Nonpharmaceutical Interventions', 'education'])


In [None]:
r = [x.split() for x in full_df['region'].dropna().unique().tolist()]
r = np.hstack([np.array(x) for x in r])
sr = [x.split() for x in full_df['subregion'].dropna().unique().tolist()]
sr = np.hstack([np.array(x) for x in sr])
geo_stop_words = np.append(r, sr)
geo_stop_words = [x.lower() for x in geo_stop_words]
geo_stop_words = [x.replace('(','').replace(')','') for x in geo_stop_words]


In [None]:
stop_words.extend(geo_stop_words)


In [None]:
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

words_nostops = remove_stopwords(words)

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]
words_bigrams = make_bigrams(words_nostops)

## Create the Dictionary and Corpus needed for Topic Modeling
The LDA function receives a *Dictionary* and a *Corpus* as input and provides *Topics* as output.


We create the dictionary using *corpora.Dictionary*.
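
Conceptually, the dictionary maps each token to an integer id, and the corpus stores each document as (id, count) pairs. The standard-library sketch below mimics that idea; `build_dictionary` and `doc_to_bow` are hypothetical helpers, and the ids gensim assigns may differ from the ones produced here.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two made-up tokenized documents
docs = [["school", "closure", "school"], ["travel", "ban", "closure"]]
token2id = build_dictionary(docs)
bow = [doc_to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow)
```

Word order is discarded in this representation, which is why phrase detection is applied beforehand to preserve multi-word terms.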

In [None]:
id2word = corpora.Dictionary(words_bigrams)



## Building the Topic Model
To train the LDA model, we need to define 1) the corpus, 2) the dictionary, and 3) the number of topics. We also need to set hyperparameters such as *alpha* and *eta*, whose default values are $1/\text{num\_topics}$. Another parameter, *chunksize*, determines the number of documents used in each training chunk. Finally, *passes* is the total number of training passes over the corpus.


In [None]:
texts = words_bigrams
corpus = [id2word.doc2bow(text) for text in texts]

lda_model = gensim.models.LdaModel(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=35, 
                                       random_state=10,
                                       chunksize=100,
                                       passes=1,
                                       per_word_topics=True)
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

In [None]:
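# log_perplexity returns the per-word likelihood bound; perplexity is 2 ** (-bound)
per_word_bound = lda_model.log_perplexity(corpus)
print("Per-word likelihood bound:", per_word_bound)
print("Perplexity estimate:", 2 ** (-per_word_bound))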


## Compute Coherence Score
A good LDA model produces coherent topics, so a higher topic coherence score indicates a better model.
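
To give a feel for what a coherence score measures, the sketch below implements a UMass-style coherence, which rewards topics whose top words tend to appear in the same documents. Note that this is a different and simpler measure than the `c_v` coherence used in this notebook; the function name `umass_coherence` and the toy documents are assumptions made for the example.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """UMass-style coherence: sum of log((D(wi, wj) + 1) / D(wj)) over word pairs.

    `topic_words` is ordered from most to least probable, and D counts documents.
    Every topic word is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for j, i in combinations(range(len(topic_words)), 2):  # pairs with j < i
        w_j, w_i = topic_words[j], topic_words[i]
        score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score

# Toy corpus, invented for this example
docs = [
    ["school", "closure", "students"],
    ["school", "closure"],
    ["travel", "border"],
    ["travel", "school"],
]
coherent_score = umass_coherence(["school", "closure"], docs)
mixed_score = umass_coherence(["school", "border"], docs)
print(coherent_score, mixed_score)
```

The topic whose words co-occur in documents scores higher than the one mixing unrelated words, which is the behavior any coherence measure is meant to capture.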

In [None]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=words_bigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

## What is an optimal LDA model?
How can we find the best values for the model hyperparameter (the number of topics) and for the Dirichlet hyperparameters alpha and beta (called *eta* in gensim), which control document-topic density and word-topic density, respectively?
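
One common way to choose the number of topics is to train a model for each candidate value and keep the one with the highest coherence. The sketch below shows only the selection step, mapping a position in the list of scores back to a topic count; `best_topic_count` and the scores are hypothetical, and the mapping assumes the sweep was produced with `range(start, limit, step)`.

```python
def best_topic_count(coherence_values, start, step):
    """Return (num_topics, coherence) for the highest coherence in a sweep.

    coherence_values[i] is assumed to belong to num_topics = start + i * step.
    """
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return start + best_index * step, coherence_values[best_index]

# Made-up coherence scores for a sweep over 1, 3, 5, 7, 9 topics
example_scores = [0.31, 0.42, 0.47, 0.45, 0.40]
num_topics, score = best_topic_count(example_scores, start=1, step=2)
print(num_topics, score)
```

In practice the highest score is often weighed against interpretability, since neighboring topic counts can score almost identically.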

### 1. The number of Topics 

In [None]:
start=1
limit=50
step=2
coherence_values = []
model_list = []
for num_topics in range(start, limit, step):
    print('Topics: ', num_topics)
    lda_model = gensim.models.LdaModel(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics,
                                       random_state=10,
                                       chunksize=100,
                                       passes=4,
                                       per_word_topics=True)
    model_list.append(lda_model)
    coherence_model_lda_c_v = CoherenceModel(model=lda_model, texts=words_bigrams, corpus=corpus, dictionary=id2word, coherence="c_v")
    coherence_values.append(coherence_model_lda_c_v.get_coherence())




In [None]:
# Show graph
import matplotlib.pyplot as plt
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["Coherence values"], loc='best')  # a list, so the label is not split into characters
plt.show()



In [None]:
PT=coherence_values[10:40]
print(PT)

In [None]:
Optimal_N_Topic = 35  # set by hand; the best-scoring value in PT would be start + step * (10 + PT.index(max(PT)))
print(Optimal_N_Topic)

### Run LDA with the optimal number of topics 

In [None]:
texts = words_bigrams
corpus = [id2word.doc2bow(text) for text in texts]

Optimal_lda_model = gensim.models.LdaModel(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=Optimal_N_Topic, 
                                       random_state=10,
                                       chunksize=100,
                                       passes=1,
                                       per_word_topics=True)

pprint(Optimal_lda_model.print_topics())
doc_lda = Optimal_lda_model[corpus]

## Visualization
*pyLDAvis* is a Python package that provides interactive, web-based visualizations of the topics produced by the LDA model.

In [None]:
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt



In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(Optimal_lda_model, corpus, id2word)
vis

# Run Search

In [None]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)


In [None]:
topics = Optimal_lda_model.print_topics(num_topics=Optimal_N_Topic, num_words=20)
topics = [topic[1] for topic in topics]
topics = [topic.split('"') for topic in topics]
search_inputs = [" ".join(topic_keys[1::2]) for topic_keys in topics]
search_inputs = [''.join([i if ord(i) < 128 else ' ' for i in text]) for text in search_inputs]

In [None]:
results_lda_df = search(search_inputs)

# Comparison of Search Results


In [None]:
common_results = np.intersect1d(results_df['title'].unique(), results_lda_df['title'].unique())
print("results in common: ", len(common_results))
common_results

In [None]:
print("Baseline average score: ", results_df['score'].mean())
print("NPI-Context average score: ", results_lda_df['score'].mean())

In [None]:
new_results = results_lda_df[~results_lda_df['title'].isin(results_df['title'].unique())]

print("Baseline # results: ", results_df.shape[0])
print("NPI-Context results: ", results_lda_df.shape[0])
print("New results above baseline: ", new_results.shape[0])

## New Research Discovered through Context-aware search

The results below are additional discoveries: they were surfaced by adding NPI context to the document search and were not found with the baseline method.

In [None]:
new_results.sample(4)