# What has been published about ethical and social science considerations?
[COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)


**The end users of this Kaggle task are very likely to be decision-makers and policy-makers who need to better understand and navigate the ethical and social science considerations during the COVID-19 crisis response efforts.**

While researching the challenging and open-ended [questions posed by this Kaggle task](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=563), we took an approach inspired by qualitative social science. We conducted a few unstructured interviews with policy-makers, ethics counselors, and front-line medical practitioners to better understand what their biggest needs are and how our work could be most helpful to them. The details that these experts shared with us informed the data exploration and modeling presented in this notebook.

The goal of this notebook is to link public health measures to the social and ethical concerns discussed in individual articles in the corpus. To extract this information, we use a language grammar. We have defined a grammar that aims to capture how the literature discusses **factors such as the enablers, barriers, and implications of different public health measures**. Our current grammar is in English, but the approach could be extended to cover other languages as well.

Ultimately, we hope that we are able to **highlight how others have been able to address social and ethical considerations during critical pandemic response efforts**. In this way we aim to inform the decision-makers, ethicists who might be part of a hospital ethics committee, or other individuals who need to navigate difficult decisions.

## Contents

1. [Findings](#Findings)
2. [Approach](#Approach)
3. **[Utilizing a Grammar Annotation Pattern Engine for Natural Language Processing](#Grammar_Engine)** - Utilize regular expressions and language grammars in order to extract information about the relationship between public health measures and the social and ethical concerns in the corpus.
4. **[Calculating semantic similarities using the Universal Sentence Encoder embeddings](#Semantic_Similarieties)**
5. **[Saving the results into CSV](#Results)** - exporting CSV files with the paper excerpts which relate to the corresponding research questions  
6. **[Data Visualization](#Visualization)** - Summarizing the results in a data visualization
7. [Next Steps](#Next_Steps) - What we plan to work on next



# Findings

We summarize our results in the data visualization below. It shows how many of the over 40,000 full text research articles in the dataset discuss the social science and ethical considerations related to the public health measures listed below. Hovering over the data shows you the names and authors of the top 3 most related research papers in each category. To inspect the full list of results see the next subsection or download the CSV files from the notebook output directory.

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

oxford_indicators = [
    "School or university closing",
    "Workplace closing",
    "Cancel public events",
    "Close public transport",
    "Public information campaigns",
    "Restrictions on internal movement",
    "International travel restrictions and control",
    "Fiscal measures",
    "Monetary measures",
    "Emergency investment in healthcare",
    "Investment in vaccines",
    "Testing policy",
    "Contact tracing"
]

# Read cached results
cached_plot_df = pd.read_csv("../input/cached-outputs/cached_plot_matches.csv", na_values='')

title = 'Barriers and enablers (RED), and implications (BLUE) of the uptake of public health measures <br>given the variation in policy responses, <i>hover over dots for top matches</i>'
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=list(cached_plot_df['count_matches'])[::-1],
    y=list(oxford_indicators)[::-1],
    marker=dict(color="crimson", size=12),
    mode="markers",
    hovertemplate = '<b>%{x}: %{y} %{text}',
    text = list(cached_plot_df['top_matches_uptake'])[::-1],  # reversed to stay aligned with x and y
    showlegend = False))

fig.add_trace(go.Scatter(
    x=list(cached_plot_df['count_matches_impl'])[::-1],
    y=list(oxford_indicators)[::-1],
    marker=dict(color="blue", size=12),
    mode="markers",
    hovertemplate = '<b>%{x}: %{y} %{text}',
    text = list(cached_plot_df['top_matches_impl'])[::-1],  # reversed to stay aligned with x and y
    showlegend = False))


fig.update_layout(title=title,
              hoverlabel_align = 'left',
              hovermode='closest',
              xaxis_title='Counts of references in the corpus',
              yaxis_title='Policy Responses')

fig.show()

(please wait for the graph to load)

We found that the [Oxford COVID-19 Government Response Tracker (OxCGRT) white paper](https://www.bsg.ox.ac.uk/sites/default/files/2020-04/BSG-WP-2020-031-v4.0_0.pdf) by the Blavatnik School of Government, University of Oxford, provides a very helpful categorization of public health measures. We have adopted their categorization in order to present our results and allow decision-makers to inspect them at a more granular level. The OxCGRT project collects information about the variation in government responses to COVID-19 through a metrics framework consisting of 13 indicators. In our proposal, we aim to further identify the social and ethical concerns related to each of these indicators. The table below gives a short summary of our findings; see the output CSV files for full details.

| Indicator | Description (provided by OxCGRT) | Social & Ethical Concerns | Number of papers<br>discussing the concerns | Example paper excerpt discussing the indicator | Name of the CSV file with<br>all paper details, excerpts,<br>and similarity scores |
|---|---|---|---|---|---|
| School or university closing | Record closings of schools and universities | Efforts to support sustained education, access, and capacity building in the area of ethics  | 60  | Public health measures can be highly effective, especially when they are part of a structured programme that includes instruction and education and when they are delivered together  | policy_measures_<br>indicator_0_findings<br>_plus_excerpts.csv |
| Workplace closing | Record closings of workplaces  | Efforts to identify how the burden of responding to the outbreak and implementing public health measures affects the livelihood of people  | 69  |  Therefore, to identify factors that may lead to disproportionate vulnerability in the event of a serious outbreak, we used multivariable logistic regression to model the predicted probability that some groups of working adults (delineated by employment characteristics such as inability to work from home, lack of pay when absent from work, and self-employment) may be less able than identified referent groups to comply with pandemic influenza mitigation strategies that require voluntary isolation from work  | policy_measures_<br>indicator_1_findings<br>_plus_excerpts.csv |
| Cancel public events | Record cancelling public events  |  Efforts to embed ethics across all thematic areas, engage with novel ethical issues that arise | 63  |  Examples of social distancing include closing of schools, non-essential workplace closures, and avoiding large gatherings (public transport, concerts, religious gatherings) in which a large number of individuals are in close contact and facilitating the contagion process | policy_measures_<br>indicator_2_findings<br>_plus_excerpts.csv |
| Close public transport | Record closing of public transport  | Efforts to identify how the burden of responding to the outbreak and implementing public health measures affects the livelihood of people  | 26  |    As summarized in the Appendix, partial lockdown includes "closed-off management" on highways, railways and public transport systems; and sets up checkpoints to control the inflow population, and implements surveillance and tighter controls in each neighborhood | policy_measures_<br>indicator_3_findings<br>_plus_excerpts.csv |
| Public information campaigns |  Record presence of public information campaigns | Efforts to support awareness and sustained education, access, and capacity building in the area of ethics  |  214 |  Especially in major emergencies involving national security and the safety of lives and property, the government should decisively intervene in publicity control and information dissemination | policy_measures_<br>indicator_4_findings<br>_plus_excerpts.csv |
| Restrictions on internal movement | Record restrictions on internal movement | Efforts to support awareness and sustained education, access, and capacity building in the area of ethics    | 461  |  Very few jurisdictions have articulated explicit procedures and policies to determine whether or not an individual should be subject to quarantine  | policy_measures_<br>indicator_5_findings<br>_plus_excerpts.csv |
| International travel restrictions and control | Record restrictions on international travel  |  Efforts to identify how the burden of responding to the outbreak and implementing public health measures affects the livelihood of people | 98  | Travel-related policy and public health responses to epidemics take several approaches, and include travel restrictions to and from affected areas, and passenger screening  | policy_measures_<br>indicator_6_findings<br>_plus_excerpts.csv |
| Fiscal measures |  What economic stimulus policies are adopted? |  Social concerns about job loss and the exacerbation of income inequality |  38 |  The focused measures include single parent allowance, transportation subsidy for low-income families, job training programs for the unemployed, and so on  | policy_measures_<br>indicator_7_findings<br>_plus_excerpts.csv |
| Monetary measures |  Monetary economic stimulus |  Social concerns about the value of interest rate | 38  |  In the event of a large-scale or acute emergency response, HHS may request supplemental appropriations from the US Congress specifically for that emergency or may consider reprogramming funds  | policy_measures_<br>indicator_8_findings<br>_plus_excerpts.csv |
| Emergency investment in healthcare | Short-term spending on, e.g., hospitals, masks, etc.  | Social and ethical concerns related to the healthcare system | 435  | The inability to obtain the masks specified in the government directives led to considerable staff alarm and raised the question whether staff should be allowed to see patients if they could not be provided the appropriate mask  | policy_measures_<br>indicator_9_findings<br>_plus_excerpts.csv |
| Investment in vaccines | Announced public spending on vaccine development  | Social and ethical concerns related to vaccines | 309  | About one-third of the patients who did not receive the vaccination reported that there was no need to receive the vaccine, suggesting more intensive intervention is needed to understand the reasons behind such thoughts and to motivate this group to consider the vaccination  | policy_measures_<br>indicator_10_findings<br>_plus_excerpts.csv |
| Testing policy | Who can get tested?  | Efforts to identify issues in the access to tests  | 138  |  The vulnerability of emergency response resources to natural events and hazardous material releases should be assessed | policy_measures_<br>indicator_11_findings<br>_plus_excerpts.csv |
| Contact Tracing | Are governments doing contact tracing? |  Efforts to identify issues related to surveillance and privacy related to the use of contact tracing technology | 175  |   The key differences (revealed in the comments) were the extent to which participants were concerned about the lack of a clear legal and social mandate and the potential for negative Panellist's preferences as to how public health authorities should contact someone flagged by an online event-based communicable disease surveillance system public reactions | policy_measures_<br>indicator_12_findings<br>_plus_excerpts.csv |

 
---

Our conversations with experts as well as qualitative research and overview of the CORD-19 dataset highlighted that there's **a need for management of scarce resources relative to outcome predictions**.

The following ethical issues exist:
* Ethical concerns related to the health and well-being of patients: 
  * Who gets admitted to the hospital?
  * Who gets a ventilator?
  * Who makes the decision whether a Cardiopulmonary Resuscitation (CPR) is performed? 
  CPR uses mouth-to-mouth or machine breathing and chest compressions to restore the work of the heart and lungs when someone's heart or breathing has stopped. 
* Ethical concerns related to the well-being of caregivers:
  * Should healthcare practitioners be treated in the same way? 
  Nurses may be forced to work without adequate personal protective equipment to keep them safe. Should they be expected to risk their lives?
  * Who makes the decision on who are the "critical workers" that need to be working?


# Approach

Here we describe the technical approach that led to these findings. We first load the required Python libraries and the corpus metadata file.

In [None]:
import covid19_tools as cv19
import pandas as pd
import re
from IPython.display import display, HTML
import html
import numpy as np
import json
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import glob
import os

METADATA_FILE = '../input/CORD-19-research-challenge/metadata.csv'

# Load metadata
meta = cv19.load_metadata(METADATA_FILE)

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

# Grammar_Engine
## Defining the grammar

We've built the grammar using [Unitex/GramLab](https://unitexgramlab.org/), an open source, cross-platform, multilingual, lexicon- and grammar-based corpus processing tool. We then apply the grammar from Python using the [grapenlp-core](https://github.com/GrapeNLP/grapenlp-core) library.

You can reproduce our results by using the existing grammar file which is located in the "input/grammar" directory or by creating your own grammar in Unitex.

We first set up the [GrapeNLP](https://github.com/GrapeNLP/pygrapenlp) library, which we use to process the grammar graph file in Python.

In [None]:
!dpkg -i ../input/libgrapenlp/libgrapenlp_2.8.0-0ubuntu1_xenial_amd64.deb
!dpkg -i ../input/libgrapenlp/libgrapenlp-dev_2.8.0-0ubuntu1_xenial_amd64.deb
!pip install pygrapenlp

from collections import OrderedDict
from pygrapenlp import u_out_bound_trie_string_to_string
from pygrapenlp.grammar_engine import GrammarEngine

Here is our [Unitex](https://unitexgramlab.org/) grammar graph. The grammar aims to help us answer the following questions:
* What are the enablers and barriers for the uptake of public health measures?
* What is the impact of public health measures for prevention and control? This includes the rapid identification of the secondary impacts of these measures (e.g. use of surgical masks, modification of health-seeking behaviors for SRH, school closures).

In [None]:
from ipywidgets import Image

# Read the grammar graph image and display it in the notebook
with open("../input/grammar/grammar.png", "rb") as f:
    image = f.read()
Image(value=image)

The grammar takes text input and tags it according to the graph. In our case, the tags which we seek to apply are "list_phm_uptake" and "list_phm_impl". We could have achieved similar results with a large number of regular expressions; however, grammars are a much more flexible approach to complex natural language processing. We experimented with a number of different grammars and achieved the best results with the grammar described here.
* the tag "list_phm_uptake" aims to extract a list of factors on which public health measures depend
* the tag "list_phm_impl" aims to extract a list of implications of different public health measures

In the next iterations of this project we aim to create more grammars exploring the rest of the task questions.
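
To give a feel for the kind of segment such a tag captures, here is a minimal regular-expression sketch. It is only an illustration: the actual grammar graph is defined in Unitex and is not reproduced here, and the pattern, the function name `tag_uptake_factors`, and the sample sentence are all assumptions made for this example. The returned dictionary loosely mirrors the value/start/end shape used later in the notebook.

```python
import re

# Hypothetical pattern: "<measure> depend(s) on <list of factors>"
UPTAKE_PATTERN = re.compile(
    r"(?P<measure>public health measures?|interventions?)\s+"
    r"depends?\s+on\s+"
    r"(?P<list_phm_uptake>[^.;]+)",
    re.IGNORECASE,
)

def tag_uptake_factors(sentence):
    """Return a dict of tagged segments, or an empty dict if nothing matches."""
    match = UPTAKE_PATTERN.search(sentence)
    if match is None:
        return {}
    start, end = match.span("list_phm_uptake")
    return {"list_phm_uptake": {"value": sentence[start:end], "start": start, "end": end}}

example = "The uptake of public health measures depends on trust, cost and clear communication"
print(tag_uptake_factors(example))
```

A single pattern like this only covers one phrasing; a grammar graph can branch over many phrasings without multiplying the number of expressions, which is the flexibility referred to above.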

## Processing

We first split each document into a list of sentences and then process each sentence through the grammar.

Next, we define a helper function that splits a single document into a list of sentences. Because we use the '.' character as the sentence separator, we first remove patterns whose periods do not end a sentence:
 - mentions of 'et al.', 'i.e.', and 'e.g.'
 - ellipses ('...')
 - table and figure references, for example: (Table 11 .4), ( Fig. 11 .1B)
 - numbered list markers such as '1.'
 - single-letter abbreviations, for example: 'while some agents such as F. rodentium and C. piliforme'
 - periods immediately before a closing parenthesis
 - 'etc.' will most likely appear at the end of a sentence and therefore might not be a problem
 - other abbreviations, for example 'Scalpels with no. 11 or no. 22', are not yet handled

In [None]:
def parse_into_sentences(text):
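    # Table and figure references, e.g. "(Table 11 .4)" or "( Fig. 11 .1B)"
    regex_list = [re.compile(r"([ (\[]%s([.1-9 ]+[A-Z]?[a-z]?)*[ )\]])" % name, re.IGNORECASE) for name in ["fig", "figure", "table"]]
    # Abbreviations, numbered-list markers, ellipses, and periods before ')'
    regex_list.extend([re.compile(term, re.IGNORECASE) for term in [r"et al\.", r"i\.e\.", r"e\.g\.", r" [1-9]\.", r"\.\.\.", r"\.(?=\))"]])
    # Single-letter abbreviations such as " F." in "F. rodentium"
    regex_list.append(re.compile(r" [A-Z]\."))

    # Remove every matched pattern so that the remaining periods mark sentence ends
    preprocessed_text = text
    for r in regex_list:
        preprocessed_text = r.sub("", preprocessed_text)

    sentences_list = preprocessed_text.split('.')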
    
    return sentences_list
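
As a quick sanity check of the idea, the sketch below applies a reduced set of the clean-up patterns to a made-up string. The function `split_sentences_sketch` and the sample text are assumptions for illustration; it is a simplified stand-in for `parse_into_sentences`, not a replacement for it.

```python
import re

# Reduced set of patterns whose periods do not end a sentence
CLEANUP_PATTERNS = [
    re.compile(r"et al\.", re.IGNORECASE),
    re.compile(r"i\.e\.", re.IGNORECASE),
    re.compile(r"e\.g\.", re.IGNORECASE),
    re.compile(r"\.\.\."),
]

def split_sentences_sketch(text):
    """Remove the patterns above, then split on the remaining periods."""
    for pattern in CLEANUP_PATTERNS:
        text = pattern.sub("", text)
    # Drop empty fragments and surrounding whitespace
    return [part.strip() for part in text.split(".") if part.strip()]

sample = "Smith et al. studied school closures. Results varied, e.g. by region."
print(split_sentences_sketch(sample))
```

Without the clean-up step, the same string would be split after "et al" and inside "e.g.", producing fragments that the grammar could not match as whole sentences.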

Next we define helper functions which process the grammar (according to the pygrapenlp library):

In [None]:
def native_results_to_python_dic(sentence, native_results):
    top_segments = OrderedDict()
    if not native_results.empty():
        top_native_result = native_results.get_elem_at(0)
        top_native_result_segments = top_native_result.ssa
        for i in range(0, top_native_result_segments.size()):
            native_segment = top_native_result_segments.get_elem_at(i)
            native_segment_label = native_segment.name
            segment_label = u_out_bound_trie_string_to_string(native_segment_label)
            segment = OrderedDict()
            segment['value'] = sentence[native_segment.begin:native_segment.end]
            segment['start'] = native_segment.begin
            segment['end'] = native_segment.end
            top_segments[segment_label] = segment
    return top_segments

def execute_phm_soc_grammar():
    base_dir = os.path.join('..', 'input', 'grammar')
    grammar_pathname = os.path.join(base_dir, 'soc_grammar.fst2')
    bin_delaf_pathname = os.path.join(base_dir, 'test_delaf.bin')
    grammar_engine = GrammarEngine(grammar_pathname, bin_delaf_pathname)

    # Collect matches in a list; DataFrame.append was removed in pandas 2.0
    rows = []

    for record_id in range(len(full_text_repr)):
        record_text = "".join([item['text'] for item in full_text_repr[record_id]['body_text']])
        record_sentences = parse_into_sentences(record_text)
        processed = 0

        for sentence in record_sentences:
            context = {}
            native_results = grammar_engine.tag(sentence, context)
            matches = native_results_to_python_dic(sentence, native_results)

            if 'list_phm_impl' in matches or 'list_phm_uptake' in matches:
                rows.append({
                    'index': record_id,
                    'list_phm_impl': matches['list_phm_impl']['value'] if 'list_phm_impl' in matches else '',
                    'list_phm_uptake': matches['list_phm_uptake']['value'] if 'list_phm_uptake' in matches else '',
                    'sentence': sentence
                })
                processed += 1

        # print("record %d, processed %d sentences, found %d matches" % (record_id, len(record_sentences), processed))

    # Build the dataframe once, after all documents have been processed
    df = pd.DataFrame(rows, columns=['index', 'list_phm_impl', 'list_phm_uptake', 'sentence'])
    print("Processed %d documents, found %d matches" % (len(full_text_repr), df.shape[0]))
    df.to_csv('output.csv')
    return df

We load the documents in the corpus which mention any of the specific terms required by the grammar. In this way we also narrow down the documents we're working on instead of operating on the whole corpus. If we change the grammar we need to consider changing the grammar terms below in order to ensure we're not excluding documents which might contain matches.

In [None]:
grammar_terms = ['public health', 'intervention', 'policy', 'policies', 'quarantine', 'lockdown', 'contact tracing', 'distancing', 'emergency']

meta, soc_ethic_counts = cv19.count_and_tag(meta,
                                               grammar_terms,
                                               'soc_ethic')
number_of_SOC_articles = len(meta[meta.tag_soc_ethic == True])
print('Number of articles is ', number_of_SOC_articles) 

print('Loading raw data ...')
metadata_filter = meta[meta.tag_soc_ethic == True] 
full_text_repr = cv19.load_full_text(metadata_filter,
                                     '../input/CORD-19-research-challenge')
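
The keyword pre-filter described above can be sketched with plain pandas. This is a minimal illustration under stated assumptions: `tag_by_terms` is a hypothetical helper, not the implementation of `cv19.count_and_tag`, and the choice of the `title` and `abstract` columns as the searched text is assumed for the example.

```python
import re
import pandas as pd

def tag_by_terms(df, terms, tag, text_columns=("title", "abstract")):
    """Add a boolean column `tag_<tag>` marking rows that mention any term."""
    pattern = "|".join(re.escape(term) for term in terms)
    # Concatenate the text columns, treating missing values as empty strings
    combined = df[list(text_columns)].fillna("").apply(lambda row: " ".join(row), axis=1)
    out = df.copy()
    out["tag_" + tag] = combined.str.contains(pattern, case=False, regex=True)
    return out

# Toy metadata standing in for the real metadata file
toy_meta = pd.DataFrame({
    "title": ["Quarantine ethics in outbreaks", "Protein folding dynamics"],
    "abstract": ["We review lockdown policies.", None],
})
tagged = tag_by_terms(toy_meta, ["quarantine", "lockdown"], "soc_ethic")
print(tagged["tag_soc_ethic"].tolist())
```

Filtering on a short term list before running the grammar trades a small risk of missed documents for a much smaller set of full texts to load, which is why the term list should be revisited whenever the grammar changes.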

We then define a function which will pass all documents in the corpus through the grammar and return the tagged matches in a dataframe.

In [None]:
def execute_phm_extract_grammar():
    base_dir = os.path.join('..', 'input', 'grammar')
    bin_delaf_pathname = os.path.join(base_dir, 'test_delaf.bin')
    grammar_pathname = os.path.join(base_dir, 'phm_list.fst2')
    grammar_engine = GrammarEngine(grammar_pathname, bin_delaf_pathname)

    # Collect matches in a list; DataFrame.append was removed in pandas 2.0
    rows = []

    for record_id in range(len(full_text_repr)):
        record_text = "".join([item['text'] for item in full_text_repr[record_id]['body_text']])
        record_sentences = parse_into_sentences(record_text)
        processed = 0

        for sentence in record_sentences:
            context = {}
            native_results = grammar_engine.tag(sentence, context)
            matches = native_results_to_python_dic(sentence, native_results)

            if 'list_phm' in matches:
                rows.append({
                    'index': record_id,
                    'list_phm': matches['list_phm']['value'],
                    'sentence': sentence
                })
                processed += 1

        # print("record %d, processed %d sentences, found %d matches" % (record_id, len(record_sentences), processed))

    # Build the dataframe once, after all documents have been processed
    df = pd.DataFrame(rows, columns=['index', 'list_phm', 'sentence'])
    print("Processed %d documents, found %d matches" % (len(full_text_repr), df.shape[0]))

    return df

Finally we execute the grammar and save the results.

In [None]:
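# The cells below expect the 'list_phm_uptake' and 'list_phm_impl' columns,
# which are produced by execute_phm_soc_grammar()
df_grammar_output = execute_phm_soc_grammar()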
df_grammar_output.to_csv('grammar_output_uptake_impl.csv')

# Semantic_Similarieties

Download and load the TensorFlow [Universal Sentence Encoder model](https://tfhub.dev/google/universal-sentence-encoder-large/5), which encodes text into high-dimensional vectors that we use for calculating semantic similarity.

In [None]:
import tensorflow as tf

# download the Universal Sentence Encoder module and uncompress it to the destination folder. 
!mkdir -p /kaggle/working/universal_sentence_encoder/
!curl -L 'https://tfhub.dev/google/universal-sentence-encoder-large/5?tf-hub-format=compressed' | tar -zxvC /kaggle/working/universal_sentence_encoder/

In [None]:
import tensorflow_hub as hub

# load the Universal Sentence Encoder module
use_embed = hub.load('/kaggle/working/universal_sentence_encoder/')

In [None]:
# Read cached grammar output and transform to add the paper id / SHA value
df = pd.read_csv("../input/cached-outputs/grammar_output_uptake_impl.csv", na_values='')
df = df.drop(columns=['Unnamed: 0'])
df_full_text_repr = pd.DataFrame(full_text_repr)
paper_ids = df_full_text_repr['paper_id']
df['paper_id'] = df.apply(lambda row: paper_ids[row['index']], axis=1)

df.head()

We further annotate the grammar output with some of the data attributes from the metadata file.

In [None]:
processed = pd.melt(df, id_vars=["paper_id", "sentence"], value_vars=["list_phm_impl", "list_phm_uptake"], value_name="match").dropna()

# Metadata attributes to copy onto each grammar match
meta_columns = ['cord_uid', 'title', 'doi', 'url', 'authors', 'authors_short', 'journal']
for col in meta_columns:
    # Use object dtype so that string values can be assigned later
    processed[col] = pd.Series(np.nan, index=processed.index, dtype=object)

for i, row in processed.iterrows():
    meta_raw = metadata_filter.loc[metadata_filter['sha'] == row['paper_id']]
    # Skip papers that have no matching metadata row
    if not meta_raw.empty:
        for col in meta_columns:
            processed.loc[i, col] = meta_raw[col].values[0]

processed = processed.rename(columns={"sentence": "excerpt", "variable": "type"})
processed.head()
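
The reshaping step at the top of the cell above relies on `pd.melt` to turn one column per tag into one row per match. A minimal sketch on toy data (the values below are made up for illustration):

```python
import pandas as pd

# Toy grammar output: one row per sentence, one column per tag
wide = pd.DataFrame({
    "paper_id": ["a1", "b2"],
    "sentence": ["Sentence one", "Sentence two"],
    "list_phm_impl": ["job loss", None],
    "list_phm_uptake": [None, "public trust"],
})

# Reshape to one row per (sentence, tag) pair and drop tags without a match
long_df = pd.melt(
    wide,
    id_vars=["paper_id", "sentence"],
    value_vars=["list_phm_impl", "list_phm_uptake"],
    value_name="match",
).dropna()

print(long_df[["paper_id", "variable", "match"]])
```

The `variable` column produced by `pd.melt` holds the tag name, which is why it is renamed to `type` in the notebook.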

We now want to group our findings into categories which relate to social and ethical issues.

We found that the [Oxford COVID-19 Government Response Tracker (OxCGRT) white paper](https://www.bsg.ox.ac.uk/sites/default/files/2020-04/BSG-WP-2020-031-v4.0_0.pdf) by the Blavatnik School of Government, University of Oxford, is helpful in providing a categorization of public health measures. We have adopted their categorization in order to present our results to policymakers. The OxCGRT project collects information about the variation in government responses to COVID-19. They do that through a metrics framework consisting of 13 indicators (defined below). 

In [None]:
oxford_indicators = [
    "School or university closing",
    "Workplace closing",
    "Cancel public events",
    "Close public transport",
    "Public information campaigns",
    "Restrictions on internal movement",
    "International travel restrictions and control",
    "Fiscal measures",
    "Monetary measures",
    "Emergency investment in healthcare",
    "Investment in vaccines",
    "Testing policy",
    "Contact tracing"
]
oxford_indicators_descriptive = [
    "school university and educational facilities",
    "workplace work job",
    "events gatherings of people",
    "public transport bus train",
    "public information campaigns to raise awareness",
    "strict lockdown quarantine",
    "international travel restrictions airport",
    "economy job loss low-income policies adopted fiscal tax spending",
    "value of interest rate",
    "investment hospitals masks healthcare public health patient",
    "vaccine development",
    "testing policy",
    "contact tracing surveillance privacy"
]

oxford_indicators_embeddings = use_embed(oxford_indicators_descriptive)

Calculate the semantic similarity between each of the policy responses and the matches returned by the language grammar.

In [None]:
CORR_THRESHOLD = 0.2
BATCH_SIZE = 100
NUM_TOP_MATCHES_TO_RETURN = 3
MAX_CHARS_TOOLTIP = 50

def extract_grammar_matches_uptake_for_public_health_measures(df_input, grammar_label, column_to_match):
    phm_uptake = df_input[df_input['type'] == grammar_label]

    count_matches = np.zeros(len(oxford_indicators))
    df_matches = []

    # Step through the matches in batches, including the final partial batch
    for idx_start in range(0, len(phm_uptake), BATCH_SIZE):
        idx_end = min(idx_start + BATCH_SIZE, len(phm_uptake))

        phm_uptake_val = phm_uptake[idx_start:idx_end]
        phm_uptake_embeddings = use_embed(phm_uptake_val[column_to_match])
        corr_uptake = np.inner(oxford_indicators_embeddings, phm_uptake_embeddings)

        for i in range(len(phm_uptake_embeddings)):
            for j in range(len(oxford_indicators_embeddings)):
                if corr_uptake[j][i] > CORR_THRESHOLD:
                    count_matches[j] += 1
                    df_matches.append([j, 
                                       corr_uptake[j][i], 
                                       phm_uptake_val.iloc[i]['paper_id'], 
                                       phm_uptake_val.iloc[i]['excerpt'],
                                       phm_uptake_val.iloc[i]['cord_uid'],
                                       phm_uptake_val.iloc[i]['title'],
                                       phm_uptake_val.iloc[i]['doi'],
                                       phm_uptake_val.iloc[i]['url'],
                                       phm_uptake_val.iloc[i]['authors'],
                                       phm_uptake_val.iloc[i]['authors_short'],
                                       phm_uptake_val.iloc[i]['journal']]
                                     )

                    # print(" (%d, %d) Found macth %f\n ---\t%s\n ---\t%s\n ---\t%s" % (i, j, corr_uptake[j][i], oxford_indicators[j], phm_uptake_val.iloc[i], phm_uptake.iloc[i]['sentence']))
        # print("processed batch %d - %d" % (idx_start, idx_end))

        del corr_uptake
        del phm_uptake_val
        del phm_uptake_embeddings
        
    return count_matches, df_matches 
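
The core of the matching step is an inner product between indicator embeddings and match embeddings, followed by a threshold. The sketch below shows that computation in isolation, with hand-made unit vectors standing in for the sentence embeddings; the function name `count_matches_sketch` and the toy vectors are assumptions for illustration only.

```python
import numpy as np

THRESHOLD = 0.2   # same value as CORR_THRESHOLD above
BATCH = 2         # small batch size chosen for the toy data

def count_matches_sketch(indicator_vecs, match_vecs, threshold=THRESHOLD, batch=BATCH):
    """Count, per indicator, how many match vectors exceed the similarity threshold."""
    counts = np.zeros(len(indicator_vecs), dtype=np.int64)
    # range(start, stop, step) also covers the final, possibly shorter, batch
    for start in range(0, len(match_vecs), batch):
        chunk = match_vecs[start:start + batch]
        scores = np.inner(indicator_vecs, chunk)   # shape: (indicators, chunk size)
        counts += (scores > threshold).sum(axis=1)
    return counts

# Hand-made unit vectors standing in for sentence embeddings
indicators = np.array([[1.0, 0.0], [0.0, 1.0]])
matches = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
print(count_matches_sketch(indicators, matches))
```

For unit-length vectors the inner product equals the cosine similarity, so a single match can count towards several indicators when it is close to more than one of them, as the middle vector does here.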

In [None]:
count_matches_uptake, df_matches = extract_grammar_matches_uptake_for_public_health_measures(processed, 'list_phm_uptake', 'match')
df_matches_uptake = pd.DataFrame(df_matches)
df_matches_uptake.columns = ["policy_response", "match_score", "paper_id", "excerpt", "cord_uid", "title", "doi", "url", "authors", "authors_short", "journal"]

# Results

Inspect the results for a specific category of policy response measures.

In [None]:
# Select the matches for the first indicator and rank them by similarity score
indicator_matches = df_matches_uptake[df_matches_uptake['policy_response'] == 0]
indicator_matches = indicator_matches.sort_values(by='match_score', ascending=False)
print("Found %d entries corresponding to the indicator %s" % (len(indicator_matches), oxford_indicators[0]))
indicator_matches.head()

Persist the results to the output directory.

In [None]:
for indicator_id in range(len(oxford_indicators)): 
    # Select the matches for this indicator and rank them by similarity score
    indicator_matches = df_matches_uptake[df_matches_uptake['policy_response'] == indicator_id]
    indicator_matches = indicator_matches.sort_values(by='match_score', ascending=False)

    to_save = indicator_matches.drop(['paper_id'], axis=1)
    to_save.to_csv('policy_measures_indicator_%d_findings_plus_excerpts.csv' % indicator_id)
    print("Saved %d entries corresponding to the indicator %s" % (len(to_save), oxford_indicators[indicator_id]))

Define a helper function to create the summaries which we will use in the data visualization below.

In [None]:

def generate_plot_labels(df_matches, df_input):
    res = []
    df_input = df_input.set_index('paper_id')

    for ox_ind in range(len(oxford_indicators)):
        df_ = df_matches[df_matches['policy_response'] == ox_ind]
        df_ = df_.sort_values(by='match_score', ascending=False)
        combined = ''
        num_included = 0
        included = []

        # Walk the ranked matches until enough distinct papers are collected
        for p in range(len(df_)):
            if num_included == NUM_TOP_MATCHES_TO_RETURN: 
                break
            record_id = df_['paper_id'].iloc[p]
            if record_id not in included:
                included.append(record_id)
                record_details = ""
                
                if 'title' in df_input.loc[record_id]:
                    if type(df_input.loc[record_id]['title']) == str :
                        record_details = "- \"%s\" by %s" % (df_input.loc[record_id]['title'], df_input.loc[record_id]['authors_short'])
                    else:
                        record_details = "- \"%s\" by %s" % (df_input.loc[record_id]['title'].values[0], df_input.loc[record_id]['authors_short'].values[0])
                    num_included += 1
                    
                    if len(record_details) > MAX_CHARS_TOOLTIP:
                        r = ""
                        out = [(record_details[i:i+MAX_CHARS_TOOLTIP]) for i in range(0, len(record_details), MAX_CHARS_TOOLTIP)] 
                        for o in out:
                            r = "%s<br>%s" % (r, o)
                        record_details = r[4:]
                        
                    combined = "%s<br>%s" % (combined, record_details)
                        

        res.append(combined)
        
    return res

top_matches_uptake = generate_plot_labels(df_matches_uptake, processed)

Extract grammar matches about the implications of policy response measures:

In [None]:
count_matches_impl, df_matches = extract_grammar_matches_uptake_for_public_health_measures(processed, 'list_phm_impl', 'match')
df_matches_impl = pd.DataFrame(df_matches)
df_matches_impl.columns = ["policy_response", "match_score", "paper_id", "excerpt", "cord_uid", "title", "doi", "url", "authors", "authors_short", "journal"]
top_matches_impl = generate_plot_labels(df_matches_impl, processed)

In [None]:
# Save outputs for quick access

cached_plot = pd.DataFrame()
cached_plot['count_matches'] = count_matches_uptake
cached_plot['count_matches_impl'] = count_matches_impl
cached_plot['top_matches_uptake'] = top_matches_uptake
cached_plot['top_matches_impl'] = top_matches_impl

cached_plot.to_csv('/kaggle/working/cached_plot_matches.csv', index = False)

# Visualization

We summarize the results in the data visualization below. The plot includes two traces. One relates to the "list_phm_impl" tag in the grammar, which explores the implications of the public health measures. The other relates to the "list_phm_uptake" tag, which explores the factors that are barriers and/or enablers for the uptake of public health measures. Through the data exploration step we found that many factors can be seen as a barrier and as an enabler at the same time.

For each policy measure, the hover text lists only the top-ranked matches (at most `NUM_TOP_MATCHES_TO_RETURN` distinct papers), ranked by the semantic similarity between the raw matches and the corresponding policy measure.
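
The selection rule described above can be sketched without pandas. This is only an illustration under stated assumptions, not the notebook's implementation (that is `generate_plot_labels`): the function name `top_matches_per_measure` and the toy records are hypothetical, while the keys `policy_response`, `match_score`, and `paper_id` mirror the column names used in this notebook.

```python
from collections import defaultdict


def top_matches_per_measure(matches, top_n=3):
    """For each policy measure, keep the top_n highest-scoring matches,
    with at most one match per paper."""
    by_measure = defaultdict(list)
    for match in matches:
        by_measure[match["policy_response"]].append(match)

    result = {}
    for measure, rows in by_measure.items():
        # Highest similarity score first
        rows = sorted(rows, key=lambda r: r["match_score"], reverse=True)
        seen, kept = set(), []
        for row in rows:
            if row["paper_id"] in seen:
                continue  # a paper is listed only once per measure
            seen.add(row["paper_id"])
            kept.append(row)
            if len(kept) == top_n:
                break
        result[measure] = kept
    return result


# Hypothetical toy records, for illustration only
toy_matches = [
    {"policy_response": 0, "match_score": 0.91, "paper_id": "a"},
    {"policy_response": 0, "match_score": 0.85, "paper_id": "a"},
    {"policy_response": 0, "match_score": 0.40, "paper_id": "b"},
    {"policy_response": 1, "match_score": 0.77, "paper_id": "c"},
]
top = top_matches_per_measure(toy_matches, top_n=2)
```

Here paper "a" appears twice for measure 0, so only its best-scoring match is kept and paper "b" fills the second slot.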

In [None]:
import plotly.express as px
import plotly.graph_objects as go

title = 'Barriers, enablers (RED), and implications (BLUE) of the uptake of public health measures <br>given the variation in policy responses, <i>hover over dots for top matches</i>'
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=list(count_matches_uptake)[::-1],
    y=list(oxford_indicators)[::-1],
    marker=dict(color="crimson", size=12),
    mode="markers",
    hovertemplate = '<b>%{x}: %{y}</b> %{text}',
    text = top_matches_uptake,
    showlegend = False))

fig.add_trace(go.Scatter(
    x=list(count_matches_impl)[::-1],
    y=list(oxford_indicators)[::-1],
    marker=dict(color="blue", size=12),
    mode="markers",
    hovertemplate = '<b>%{x}: %{y}</b> %{text}',
    text = top_matches_impl,
    showlegend = False))



fig.update_layout(title=title,
              hoverlabel_align = 'left',
              hovermode='closest',
              xaxis_title='Counts of references in the corpus',
              yaxis_title='Policy Responses')

fig.show()

In [None]:
# Clean up before transitioning into the next section of the notebook
del full_text_repr

# Next_Steps

## Further annotating the dataset with information about the social context discussed in each publication
When answering each question in the Kaggle task, we think it is critical to provide the context of the answer, for example: (1) the country or continent, (2) the time period, and (3) the severity of the pandemic discussed in the paper. Without this context, many of the answers could be misleading, because they do not easily generalize across a large number of dynamic factors. The answers to these questions are contextual and depend on many sociocultural factors. For example, we need to consider that:
* Cultural differences between countries around the world will influence what public health measures are implemented.
* Many of the papers in our dataset refer to previous pandemics. Comparing across pandemics will be important, as the context has changed immensely.
* The severity of the pandemic will impact the measures being considered during the emergency response efforts.

In future work, we hope to be able to extract this contextual information and make it easier for decision-makers to understand it.
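
As a first, rough idea of what such context extraction could look like, the sketch below pulls country names, years, and outbreak names out of a piece of text with regular expressions. Everything in it is an assumption made for illustration: the vocabularies `COUNTRIES` and `OUTBREAK_RE` are deliberately tiny and hypothetical, the example sentence is made up, and severity is not handled at all.

```python
import re

# Hypothetical, deliberately tiny vocabularies; real use would need much fuller lists
COUNTRIES = ["China", "Italy", "Liberia", "Canada", "United States"]

COUNTRY_RE = re.compile(r"\b(" + "|".join(re.escape(c) for c in COUNTRIES) + r")\b")
YEAR_RE = re.compile(r"\b(19\d{2}|20[0-2]\d)\b")  # years 1900 to 2029
OUTBREAK_RE = re.compile(r"\b(SARS|MERS|Ebola|H1N1|COVID-19)\b")


def extract_context(text):
    """Return the countries, years, and outbreak names mentioned in text."""
    return {
        "countries": sorted(set(COUNTRY_RE.findall(text))),
        "years": sorted({int(y) for y in YEAR_RE.findall(text)}),
        "outbreaks": sorted(set(OUTBREAK_RE.findall(text))),
    }


# Made-up sentence, for illustration only
example = ("During the 2003 SARS outbreak, quarantine in Canada "
           "raised concerns about lost income.")
context = extract_context(example)
```

Keyword matching of this kind is easy to audit but brittle: it would, for instance, count "SARS" inside "SARS-CoV-2" as a mention of the earlier outbreak, so any real annotation step would need additional disambiguation.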

## Improving the grammar
The language grammar approach to natural language understanding (NLU) is powerful and is used by many chatbot systems. We hope to continue refining our grammar so that it captures the complex language structures often used in research articles.

## Topic Modeling and building a knowledge graph
We also aim to explore a topic modeling approach in order to identify social and ethical concerns across the entire corpus.
Ultimately, we want to build a knowledge graph in which concepts of public health measures are associated with concepts relating to social and ethical concerns. A graph-based approach could allow us to extract insights from multiple articles together, rather than extracting grammar matches from individual articles.
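
One simple way to picture such a graph is as a set of weighted edges between public health measures and concerns, where the weight is the number of papers mentioning both. The sketch below assumes that each paper has already been annotated with lists of measures and concerns; the function names, the input format, and the toy annotations are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def build_cooccurrence_graph(paper_annotations):
    """Map each (measure, concern) pair to the set of papers mentioning both."""
    edges = defaultdict(set)
    for paper_id, annotation in paper_annotations.items():
        for measure in annotation["measures"]:
            for concern in annotation["concerns"]:
                edges[(measure, concern)].add(paper_id)
    return edges


def concerns_for(edges, measure):
    """Rank the concerns linked to a measure by number of supporting papers."""
    ranked = [(concern, len(papers))
              for (m, concern), papers in edges.items() if m == measure]
    # Most supporting papers first; ties broken alphabetically
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical toy annotations, for illustration only
toy_annotations = {
    "p1": {"measures": ["school closing"],
           "concerns": ["childcare burden", "lost income"]},
    "p2": {"measures": ["school closing", "contact tracing"],
           "concerns": ["lost income", "privacy"]},
}
graph = build_cooccurrence_graph(toy_annotations)
ranking = concerns_for(graph, "school closing")
```

Plain co-occurrence also links every measure in a paper to every concern in that paper (here "school closing" and "privacy"), so a real graph would likely need sentence-level evidence, such as the grammar matches, to decide which edges to keep.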

**We hope our research findings will meaningfully contribute to the COVID-19 outbreak response efforts by providing decision-makers and policy-makers with examples of how others have navigated the complex social and ethical challenges posed by this Kaggle task.**