# COVID-19 Open Research Dataset Challenge (CORD-19)
# What has been published about ethical and social science considerations?

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=563

Specifically, we want to know what the literature reports about:

* Efforts to articulate and translate existing ethical principles and standards to salient issues in COVID-2019
* Efforts to embed ethics across all thematic areas, engage with novel ethical issues that arise and coordinate to minimize duplication of oversight
* Efforts to support sustained education, access, and capacity building in the area of ethics
* Efforts to establish a team at WHO that will be integrated within multidisciplinary research and operational platforms and that will connect with existing and expanded global networks of social sciences.
* Efforts to develop qualitative assessment frameworks to systematically collect information related to local barriers and enablers for the uptake and adherence to public health measures for prevention and control. This includes the rapid identification of the secondary impacts of these measures. (e.g. use of surgical masks, modification of health seeking behaviors for SRH, school closures)
* Efforts to identify how the burden of responding to the outbreak and implementing public health measures affects the physical and psychological health of those providing care for Covid-19 patients and identify the immediate needs that must be addressed.
* Efforts to identify the underlying drivers of fear, anxiety and stigma that fuel misinformation and rumor, particularly through social media.

# Results and discussion

The following results were manually selected to highlight research and news articles that are relevant to the ethical and social science questions listed above:
* [Panglobalism and pandemics: ecological and ethical concerns.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2259162/)
* [Scientific and ethical basis for social-distancing interventions against COVID-19](https://doi.org/10.1016/s1473-3099%2820%2930190-0)
* [Pandethics](https://doi.org/10.1016/j.puhe.2008.12.005): explains the ethical importance of infectious diseases and reviews four major ethical issues associated with pandemic influenza: the obligation of individuals to avoid infecting others, healthcare workers' ‘duty to treat’, allocation of scarce resources, and coercive social distancing measures.
* [The Role of the Global Health Development/Eastern Mediterranean Public Health Network and the Eastern Mediterranean Field Epidemiology Training Programs in Preparedness for COVID-19](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7104707/)
* [CIBT Education Group Provides Business Update Relating to the Coronavirus Outbreak.](https://www.globalbankingandfinance.com/category/news/cibt-education-group-provides-business-update-relating-to-the-coronavirus-outbreak/)
* [Team Epi-Aid: Graduate Student Assistance with Urgent Public Health Response](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2569985/): Team Epi-Aid provides graduate students with practical public health experience through participation in outbreak investigations and other applied projects with state and local health departments in North Carolina.
* [Chapter 3: Measuring, Monitoring, and Evaluating the Health of a Population](https://doi.org/10.1016/b978-0-12-415766-8.00003-3): public health depends on information derived from monitoring population health status to identify community health problems, and to diagnose and investigate health problems and hazards in the community.
* [A review of instruments assessing public health preparedness.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1497752/)
* [Ethical Obligations of Physicians Participating in Public Health Quarantine and Isolation Measures](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2099320/)
* [Informed public against false rumor in the social media era: Focusing on social media dependency](https://doi.org/10.1016/j.tele.2017.12.017)
* [U.K. fights coronavirus disinformation with rapid response team.](https://venturebeat.com/2020/03/30/u-k-fights-covid-19-disinformation-with-rapid-response-team/)


# Highlights and suggestions for improvements

The advantages of this work include:
- Use of news content for greater information coverage. For example, the "[U.K. fights coronavirus disinformation with rapid response team.](https://venturebeat.com/2020/03/30/u-k-fights-covid-19-disinformation-with-rapid-response-team/)" resource was obtained from the news dataset. Another example includes "[CIBT Education Group Provides Business Update Relating to the Coronavirus Outbreak.](https://www.globalbankingandfinance.com/category/news/cibt-education-group-provides-business-update-relating-to-the-coronavirus-outbreak/)".
- Accurate results
- Data table formatting with clickable URLs
- A data pipeline that is simple to understand and reuse: users enter a query and retrieve relevant documents by calling
> q='equity considerations and problems of inequity'
<br>
> search(q, 'inequity')
- Results are summarized as a WordCloud image
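
A word cloud sizes each word by how often it occurs, so the same information can also be read as plain counts. The sketch below is only an illustration of that idea: `top_terms`, the sample titles, and the tiny stop-word set are hypothetical and are not part of the notebook, which relies on the `wordcloud` package instead.

```python
import re
from collections import Counter

def top_terms(titles, stop_words, n=5):
    """Count lowercase alphabetic tokens across titles, ignoring stop words."""
    counts = Counter(
        word
        for title in titles
        for word in re.findall(r"[a-z]+", title.lower())
        if word not in stop_words
    )
    return counts.most_common(n)

# Hypothetical sample titles, for illustration only
sample_titles = [
    "Ethics of quarantine",
    "Quarantine and public health ethics",
    "Public trust",
]
print(top_terms(sample_titles, stop_words={"of", "and"}, n=3))
```

Looking at the counts alongside the image can make it easier to compare the results of two queries.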

Suggestions for improvements:
- To speed up the process and minimize computation, only the title and part of the abstract / news content are used in the analysis. Analyzing full-text may reveal additional information.
- Results could be fed into a document summarization algorithm to focus them further.
- A more advanced user interface could improve how the data is presented.
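
As a rough idea of what the summarization suggestion could look like, here is a minimal frequency-based extractive summarizer. It is a sketch under simple assumptions (sentences end in `.`, `!` or `?`, and a sentence is important if its words are frequent in the text); the `summarize` function and its small stop-word set are hypothetical and not part of this notebook.

```python
import re
from collections import Counter

DEFAULT_STOP = frozenset({"the", "a", "of", "and", "to", "in", "is", "for"})

def tokenize(text, stop_words):
    """Lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]

def summarize(text, n_sentences=1, stop_words=DEFAULT_STOP):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text, stop_words))

    def score(sentence):
        # Average corpus frequency of the words in the sentence
        tokens = tokenize(sentence, stop_words)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]

sample = (
    "Quarantine raises ethical questions. "
    "Ethical duties of physicians during quarantine are debated. "
    "The weather was mild."
)
print(summarize(sample, n_sentences=1))
```

A real pipeline would probably use a dedicated summarization library; this version only shows where such a step could sit after `search` returns its top documents.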


# Methodology
Data sources:
- [CORD-19 Research Papers](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)
- [COVID-19 Public Media Dataset](https://www.kaggle.com/jannalipenkova/covid19-public-media-dataset)

To build the corpus, the title and the content from both data sources are combined (only the first 3,000 characters are used). The task questions were used as search queries for relevant documents in the corpus. For each document and search query pair, the word vectors are retrieved from the GoogleNews word embeddings and averaged into one vector per text, and the cosine distance between the two averaged vectors is calculated. The top documents that match each query are shown in this notebook together with a summary of results presented as a WordCloud image. Additional results are available as CSV files.
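
The ranking idea can be sketched on a toy scale as follows. The 3-dimensional vectors here are invented for illustration (the notebook uses the 300-dimensional GoogleNews vectors), and returning a distance of 1.0 for texts with no known words is a choice made for this sketch only.

```python
import numpy as np

# Toy embeddings, invented for illustration only
toy_vectors = {
    "ethics": np.array([0.9, 0.1, 0.0]),
    "moral": np.array([0.8, 0.2, 0.1]),
    "virus": np.array([0.0, 0.9, 0.3]),
    "protein": np.array([0.1, 0.8, 0.5]),
}

def average_vector(text, vectors, dim=3):
    """Mean of the vectors of all known words; zeros if none are known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 when either vector is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        return 1.0
    return 1.0 - float(np.dot(a, b) / norm)

query = average_vector("ethics", toy_vectors)
docs = ["virus protein", "moral ethics"]
# Smaller distance means a closer match, so sort ascending
ranked = sorted(
    docs, key=lambda d: cosine_distance(query, average_vector(d, toy_vectors))
)
print(ranked)
```

Because each text collapses to a single vector, comparing a query with a document costs one dot product, which is what keeps this approach fast on a large corpus.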

I experimented with Word Mover's Distance as an alternative similarity measure to cosine distance. However, it proved to be a lot slower, and it was difficult to decide objectively whether its results were better than those from cosine distance. I also tried LDA (Latent Dirichlet Allocation) to identify the major topics in the results, but I decided to use WordClouds instead because they are more visually appealing and highlight more keywords.
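
One way to see where the extra cost comes from: Word Mover's Distance compares every word in one text with every word in the other and then solves a transport problem over those pairs, whereas the averaged cosine approach needs a single dot product. The sketch below implements only a relaxed lower bound of that distance (each word moves entirely to its nearest counterpart), not gensim's `wmdistance`; the 2-dimensional vectors are invented for illustration.

```python
import numpy as np

# Toy embeddings, invented for illustration only
toy_vectors = {
    "fear": np.array([1.0, 0.0]),
    "anxiety": np.array([0.9, 0.1]),
    "rumor": np.array([0.0, 1.0]),
    "gossip": np.array([0.1, 0.9]),
}

def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed word mover's distance between two token lists (uniform word weights)."""
    a = np.array([vectors[w] for w in doc_a if w in vectors])
    b = np.array([vectors[w] for w in doc_b if w in vectors])
    if len(a) == 0 or len(b) == 0:
        return float("inf")
    # Pairwise Euclidean distances, shape (len(a), len(b))
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    a_to_b = pairwise.min(axis=1).mean()  # each word in a moves to its nearest word in b
    b_to_a = pairwise.min(axis=0).mean()  # and the reverse direction
    return float(max(a_to_b, b_to_a))

close = relaxed_wmd(["fear", "rumor"], ["anxiety", "gossip"], toy_vectors)
far = relaxed_wmd(["fear"], ["rumor"], toy_vectors)
print(close, far)
```

Even this relaxed form builds a full word-by-word distance matrix for every query and document pair, which gives a sense of why the exact measure was much slower over the whole corpus.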


I hope that the findings from this notebook can help inform researchers and curious minds about the ongoing COVID-19 research. I want to thank the researchers, competition organizers, Kaggle and the dataset providers for making this work possible.

In [None]:
# imports
import gc
import io
import json
import logging
import re

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import requests
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# from tqdm import tqdm
from tqdm.notebook import tqdm

# pandas and matplotlib display settings
pd.set_option('display.max_colwidth', None)  # None shows full cell contents (-1 is deprecated since pandas 1.0)
pd.set_option('display.max_rows', 1000)
plt.rcParams['figure.figsize'] = [12, 8]

from nltk import download, word_tokenize
from nltk.corpus import stopwords

download('stopwords')  # Download stopwords list.
download('punkt')  # Download data for tokenizer.
stop_words = set(stopwords.words('english'))

# Prepare data

Gather the title and abstract from the COVID-19 research articles, and the title and content from the news articles, up to a maximum number of characters.
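
As a small, self-contained illustration of this step, the sketch below joins a title and a body and trims the result to a character budget. `build_document` is a hypothetical helper, not a function from this notebook, and it assumes that missing fields (for example `None` or `NaN` from a CSV) should be dropped rather than rendered as text.

```python
def build_document(title, content, max_len=3000):
    """Join title and content with '. ' and keep at most max_len characters."""
    # Treat anything that is not a string (None, NaN, ...) as missing
    parts = [p.strip() for p in (title, content) if isinstance(p, str) and p.strip()]
    return ". ".join(parts)[:max_len]

print(build_document("Pandethics", "This paper explains the ethical importance of infectious diseases", max_len=40))
```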


In [None]:
MAX_LEN = 3000   # maximum number of characters kept per document

research = pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv')
# Combine title and abstract, truncated to MAX_LEN characters
research['title_abstract'] = (
    research['title'].astype(str) + ' ' + research['abstract'].astype(str)
).str[:MAX_LEN]
research['source'] = 'research'

news = pd.read_csv('/kaggle/input/covid19-public-media-dataset/covid19_articles.csv')
del news["Unnamed: 0"]
news['source'] = 'news'
# Combine title and the start of the content, truncated to MAX_LEN characters
news['title_abstract'] = (
    news['title'].astype(str) + '. ' + news['content'].astype(str)
).str[:MAX_LEN]

# Stack both sources into one corpus; the combined text is stored in the 'title' column
cols = ['title_abstract', 'source', 'url']
data = (
    pd.concat([research[cols], news[cols]])
    .rename(columns={'title_abstract': 'title'})
    .drop_duplicates()
    .reset_index(drop=True)
)

print('News:', news.shape)
print('Research:', research.shape)
print('Combined data:', data.shape)

# Free memory held by the source frames
del research
del news
gc.collect()

data

# Title similarity search and Topic Modelling

In [None]:
# Gensim word embeddings
# https://www.kaggle.com/raymishra/sentence-similarity-match
# https://radimrehurek.com/gensim/models/fasttext.html
# https://radimrehurek.com/gensim/models/keyedvectors.html

import gensim
from gensim.models import KeyedVectors
from gensim.utils import simple_preprocess

filepath = "../input/gnewsvector/GoogleNews-vectors-negative300.bin"

wv_from_bin = KeyedVectors.load_word2vec_format(filepath, binary=True)

# Extract the word vectors from the GoogleNews model into a plain dict.
# Note: `.vocab` is the gensim 3.x API; gensim 4 replaced it with `.key_to_index`.
embeddings_index = {}
for word, vector in zip(wv_from_bin.vocab, wv_from_bin.vectors):
    embeddings_index[word] = np.asarray(vector, dtype='float32')

In [None]:
# helper functions

def preprocess(doc):
#     doc = re.sub(r'[\W\d]+',' ',doc)  # Remove numbers and punctuation.
    doc = doc.lower()  # Lower the text.
    doc = word_tokenize(doc)  # Split into words.
    doc = [w for w in doc if not w in stop_words]  # Remove stopwords.
    doc = [w for w in doc if w.isalpha()]  # Remove numbers and punctuation.
    return doc

def avg_feature_vector(sentence, model, num_features):
    """Return the mean embedding of the in-vocabulary words in `sentence`."""
    words = simple_preprocess(sentence)
    # Feature vector starts as zeros and accumulates the word vectors
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in model:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if n_words > 0:
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec

from scipy.spatial import distance
def calc_dist_cosine(s1, target, max_dist=0.5):
    """Rank the texts in `target` by cosine distance to the query `s1`, keeping those within `max_dist`."""
    ret = []
    # Embed the query once, outside the loop
    qv = avg_feature_vector(s1, model=embeddings_index, num_features=300)
    for t in tqdm(target):
        tv = avg_feature_vector(t, model=embeddings_index, num_features=300)
        dist = distance.cosine(tv, qv)
        if dist <= max_dist:
            ret.append([dist, t])
    df = pd.DataFrame(ret, columns=['dist', 'title']).reset_index(drop=True)
    return pd.merge(df, data, on='title', how='left').sort_values(by='dist', ascending=True).reset_index(drop=True)


# wv_from_bin.init_sims(replace=True)  # Normalizes the vectors in the word2vec class before calculating wmdistance
def calc_dist_wm(s1, target, max_dist=5.0):
    """
    Word mover distance. Slower than cosine similarity.
    https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html
    """
    ret = []
    for t in tqdm(target):
#         print(t)
        dist = wv_from_bin.wmdistance(preprocess(s1), preprocess(t))
        if dist <= max_dist:
            ret.append([dist, t])
    df = pd.DataFrame(ret,columns=['dist','title']).reset_index(drop=True)
    return pd.merge(df, data, on='title', how='left').sort_values(by='dist', ascending=True).reset_index(drop=True)
       
def calc_dist(s1, target):
    """
    Dist interface
    """
    return calc_dist_cosine(s1, target)
#     return calc_dist_wm(s1, target)

# usage
# s1_afv = avg_feature_vector('Why the second proforma does not coincide with the first, what has changed', model= embeddings_index, num_features=300 )
# s2_afv = avg_feature_vector('Again came the proforma double.In the morning there was already a proforma with the same positions, but under a different number',model= embeddings_index, num_features=300)
# cos = distance.cosine(s1_afv, s2_afv)
# print(cos)
# calc_dist_wm('Why the second proforma does not coincide with the first, what has changed', ['Again came the proforma double.In the morning there was already a proforma with the same positions, but under a different number'])

In [None]:
# LDA
# https://www.kaggle.com/monsterspy/topic-modeling-with-lda
# https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

from gensim.models import ldamodel
import gensim.corpora
from nltk import word_tokenize
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
stop.update(['href','br'])
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

num_topics = 5

def train_lda(data_text):
    # Tokenize, lowercase and remove stopwords for each document
    train_ = [
        [word for word in tokenizer.tokenize(text.lower()) if word not in stop]
        for text in data_text
    ]

    id2word = gensim.corpora.Dictionary(train_)
    corpus = [id2word.doc2bow(text) for text in train_]    # Term Document Frequency
    lda = ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics)
    return lda

def get_lda_topics(model, num_topics, topn=5):
    # One column per topic, listing its `topn` highest-weighted words
    word_dict = {}
    for i in range(num_topics):
        words = model.show_topic(i, topn=topn)
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)


# Results

The topics include virus, cov, rna, protein, antibody, cells, and infection, which are research-oriented terms in our data.

In [None]:
# lda = train_lda(data.title.values.tolist())
# lda_all_titles = get_lda_topics(lda, num_topics)
# lda_all_titles

In [None]:
def make_clickable(link):
    # target _blank to open new window
    return f'<a target="_blank" href="{link}">{link}</a>'
# df.style.format({'url': make_clickable})

In [None]:
def search(q, out_prefix="result"):
    res = calc_dist(q, data.title)
    res.to_csv(f'result_{out_prefix}.csv', index=False)

    # second iteration using word distance
#     res2 = calc_dist_wm(q, res.title)
#     res2.to_csv(f'result_{out_prefix}_wmd.csv', index=False)

#     lda = train_lda(res.title.values.tolist())
#     lda_res = get_lda_topics(lda, num_topics)
#     print(lda_res)
    
    topn = 20
    wc = WordCloud(background_color='white', stopwords=stop_words).generate(' '.join(res.title.values.tolist()[:topn]).lower())
    plt.imshow(wc)
    plt.axis('off')

    return res


# Ethical and social science considerations

In [None]:
q='ethical and social science considerations'
res = search(q, 'ethics_considerations')
res.head(20)[['title','url']].style.format({'url': make_clickable})

# Sustained education, access, and capacity building in the area of ethics

In [None]:
q='sustained education, access, and capacity building in the area of ethics'
res = search(q, 'sustained_ethics')
res.head(20)[['title','url']].style.format({'url': make_clickable})

# Qualitative assessment frameworks

In [None]:
q='qualitative assessment frameworks and secondary impacts of public health measures for prevention and control'
res = search(q, 'assessment')
res.head(20)[['title','url']].style.format({'url': make_clickable})

# Burden of responding to the outbreak and implementing public health measures affects the physical and psychological health of those providing care for Covid-19 patients

In [None]:
q='burden of responding to the outbreak and implementing public health measures affects the physical and psychological health of those providing care for Covid-19 patients'
res = search(q, 'burden')
res.head(20)[['title','url']].style.format({'url': make_clickable})

# Underlying drivers of fear, anxiety and stigma that fuel misinformation and rumor, particularly through social media.

In [None]:
q='underlying drivers of fear, anxiety and stigma that fuel misinformation and rumor, particularly through social media'
res = search(q, 'misinformation')
res.head(20)[['title','url']].style.format({'url': make_clickable})

