# COVID-19 News Matcher

## *Matching news articles with COVID-19 research papers*

This submission demonstrates how newspaper articles can be matched with research papers from the COVID-19 Open Research Dataset Challenge (CORD-19), making it possible to search for research papers related to a given newspaper article. In this way, we want to contribute to the transfer of scientific knowledge: to make it easier to gather research insights connected to topics discussed in the media, and to give the public and decision makers easier access to this expert knowledge.

Our work investigates how far such a matching can be accomplished using only a common, standardised vector embedding: specifically, an embedding that is optimised for scientific articles but is applied to newspaper articles as well. One purpose of this approach is to see whether the scientific classification generalises to news, which would suggest that it may also generalise to other non-science domains.

We use the SPECTER embeddings of the abstracts of COVID-19 related research papers that are available from CORD-19. Then we use the public SPECTER API to get embeddings for newspaper articles from The Guardian. We assume that the writing style of the abstracts is somewhat similar to that of a newspaper article. Based on the embeddings, we can compute cosine similarities and, for a given newspaper article, find the research papers whose abstracts are most similar.

Readers are encouraged to test the notebook with more news articles, or other texts from different domains.

Credits: 
* https://github.com/allenai/paper-embedding-public-apis (https://www.semanticscholar.org/)
* https://medium.com/@Intellica.AI/comparison-of-different-word-embeddings-on-text-similarity-a-use-case-in-nlp-e83e08469c1c
* https://www.kaggle.com/maksimeren/covid-19-literature-clustering 
* https://www.theguardian.com/

# 1. SPECTER embeddings of research papers

Our analysis is based not on the article plaintext, but on vector embeddings of entire article abstracts. While this loses a lot of information, it also allows us to tap into a simplification that has been extensively studied and optimised for other work.

SPECTER is one such embedding format. In the CORD-19 dataset, this embedding is already available in a single CSV file, so for the scientific papers we can simply load it. (We also use this format for newspaper articles, see further below.)
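
As a minimal sketch of the row layout assumed by the loader below (a paper ID in the first column, followed by the vector components), the following parses a tiny in-memory CSV. The IDs, the 3-dimensional vectors and the helper name `parse_embedding_rows` are made up for illustration; the real file holds 768 values per row.

```python
import csv
import io

# Hypothetical two-row sample; real rows carry 768 floats after the ID.
sample_csv = "paper_a,0.1,0.2,0.3\npaper_b,-1.0,0.0,2.5\n"

def parse_embedding_rows(text):
    """Map each row's first column (the ID) to its list of floats."""
    embeddings = {}
    for row in csv.reader(io.StringIO(text)):
        uid, *values = row
        embeddings[uid] = [float(v) for v in values]
    return embeddings

parsed = parse_embedding_rows(sample_csv)
print(parsed["paper_b"])
```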

In [None]:
import json
import requests
import time
import os
import pickle
import csv
import numpy as np

In [None]:
def read_prepared_specter_embeddings(csv_path):
    with open(csv_path, newline='') as csvfile:
        l = list(csv.reader(csvfile, delimiter=','))
    # The entries have as their first value a unique id (hash),
    # followed by the actual vector. Extract this into a dict
    # for better access.
    embeddingsDict = {}
    uidList = []
    for paperE in l:
        cp = list(paperE)
        uid = cp.pop(0)
        uidList.append(uid)
        newList = np.array([float(emb) for emb in cp])
        embeddingsDict[uid] = newList
    return uidList, embeddingsDict

In [None]:
uidList, cord19_dict = read_prepared_specter_embeddings(
    '/kaggle/input/CORD-19-research-challenge/cord19_specter_embeddings_2020-04-10/cord19_specter_embeddings_2020-04-10.csv')

# 2. SPECTER embedding of news articles
For comparison, we use articles from the Guardian newspaper and encode them into the vector embedding using SemanticScholar's API service. First the articles themselves need to be fetched.

### Guardian API
Requesting from the Guardian API requires registering as a developer. After registration, one can use the key for retrieving the articles.

In [None]:
myGuardianAPIkey = None
GuardianArticlesStoragePath = '/kaggle/input/guardian-covid19/Guardian_Covid-19' # TODO: absolute kaggle path

In [None]:
def grabArticles(nArticles, sectionRestr = None):
    for page in range(nArticles//10):
        searchTerm = 'Covid-19'
        requestParams = { 'q': searchTerm 
                        , 'show-fields': 'bodyText'
                        , 'page': page+1
                        , 'api-key': myGuardianAPIkey
                        }
        if sectionRestr is not None:
            requestParams['section'] = sectionRestr
        relevantArticlesResp = requests.get(
                    'https://content.guardianapis.com/search'
                  , requestParams ).json()
            
        for entry in relevantArticlesResp['response']['results']:
            fname = '_'.join(entry['webUrl'].split('/')[3:]) + '.json'
            textContent = entry['fields']['bodyText']
            # Pack the content into a dictionary structure compatible
            # for the SemanticScholar encoder. We use the body text as
            # both the article text and the abstract: arguably, a news
            # article is already more like the abstract of a scientific
            # article, since it is written in a higher-level style
            # suitable for non-experts.
            jsonFormatted = {
                     'url': entry['webUrl']
                   , 'title': entry['webTitle']
                   , 'abstract': textContent
                   , 'body_text': textContent
                   }
            with open(GuardianArticlesStoragePath+'/'+fname, 'w') as fh:
                json.dump(jsonFormatted, fh)
        time.sleep(10) # avoid too fast requests to API

We retrieved a collection of articles as top hits for the `Covid-19` search term. Most of the articles are in the `world` section, but those from the `science` section are of particular interest, so we also request them separately.

In [None]:
# grabArticles(150)            # Uncomment to retrieve articles anew
# grabArticles(50, 'science')

### Vector embedding
To get the news articles into a vector format compatible with that of the scientific papers, we use a SemanticScholar service, which takes the JSON encoding we created in `grabArticles` and returns the desired vector embedding.
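
The payload structure and the batching can be sketched as follows. This is an illustration only: the field names mirror the dictionaries built in `grabArticles` plus the `paper_id` key that `loadJsonPaper` adds, the batch size of 16 is the value used in this notebook, and the article contents are placeholders. No request is sent.

```python
MAX_BATCH_SIZE = 16  # same value as used in this notebook

def chunks(lst, chunk_size=MAX_BATCH_SIZE):
    """Yield successive slices of lst with at most chunk_size items."""
    for i in range(0, len(lst), chunk_size):
        yield lst[i : i + chunk_size]

# Placeholder payloads; the texts are dummies, not real articles.
papers = [
    {"paper_id": f"article_{n}", "title": f"Title {n}", "abstract": "dummy text"}
    for n in range(40)
]

batch_sizes = [len(batch) for batch in chunks(papers)]
print(batch_sizes)  # 40 items split into batches of 16, 16 and 8
```

Each batch would then be posted as one JSON list, so 40 articles need three requests.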

In [None]:
URL = "https://model-apis.semanticscholar.org/specter/v1/invoke"
MAX_BATCH_SIZE = 16

In [None]:
# Source: https://github.com/allenai/paper-embedding-public-apis
def chunks(lst, chunk_size=MAX_BATCH_SIZE):
    """Splits a longer list to respect batch size"""
    for i in range(0, len(lst), chunk_size):
        yield lst[i : i + chunk_size]
        
def embed(papers):
    """Requests SPECTER embeddings for a list of paper dicts, keyed by paper_id"""
    embeddings_by_paper_id = {}

    for chunk in chunks(papers):
        # Allow Python requests to convert the data above to JSON.
        # URL is the SPECTER endpoint defined in the cell above.
        response = requests.post(URL, json=chunk)

        if response.status_code != 200:
            raise RuntimeError("Sorry, something went wrong, please try later!")

        for paper in response.json()["preds"]:
            embeddings_by_paper_id[paper["paper_id"]] = paper["embedding"]

    return embeddings_by_paper_id

def loadJsonPaper(fp):
    paperId = os.path.basename(fp).split('.')[0]
    with open(fp, 'r') as fh:
        content = json.load(fh)
    content['paper_id'] = paperId
    return content

### Embedding of 6 articles
For comparison, we use a manual selection of 6 articles from the retrieved Guardian corpus. The articles were chosen from both science-related and more trivial contexts.

* Ebola article: https://www.theguardian.com/science/2020/feb/20/doctors-hiv-ebola-drugs-coronavirus-cure-covid-19
* Chloroquine article: https://www.theguardian.com/science/2020/mar/25/can-chloroquine-really-help-treat-coronavirus-patients
* Cricket article: https://www.theguardian.com/sport/2020/mar/26/ecb-steve-elworthy-cricket-coronavirus
* Ventilator article: https://www.theguardian.com/world/2020/mar/29/ventilator-challenge-uk-to-start-production-in-covid-19-fight
* Ibuprofen article: https://www.theguardian.com/world/2020/mar/16/health-experts-criticise-nhs-advice-to-take-ibuprofen-for-covid-19
* Testing article: https://www.theguardian.com/world/2020/mar/30/fall-in-covid-19-tests-putting-lives-at-risk-critics-claim

In [None]:
relevantNewsArticleIds = ['world_2020_mar_16_health-experts-criticise-nhs-advice-to-take-ibuprofen-for-covid-19',
                          'science_2020_mar_25_can-chloroquine-really-help-treat-coronavirus-patients',
                          'science_2020_feb_20_doctors-hiv-ebola-drugs-coronavirus-cure-covid-19',
                          'sport_2020_mar_26_ecb-steve-elworthy-cricket-coronavirus',
                          'world_2020_mar_30_fall-in-covid-19-tests-putting-lives-at-risk-critics-claim',
                          'world_2020_mar_29_ventilator-challenge-uk-to-start-production-in-covid-19-fight']

In [None]:
# We have already saved the embeddings previously. Set this to False
# to request them anew.
loadPrecomputedNewsEmbeddings = True

def getRelevantNewsEmbeddings():
    embeddingDir = GuardianArticlesStoragePath+'_specter'

    if loadPrecomputedNewsEmbeddings:
        # Load the pickled embedding of each selected article
        articleDict = {}
        for articleId in relevantNewsArticleIds:
            with open(embeddingDir+'/'+articleId, 'rb') as fh:
                articleDict[articleId] = pickle.load(fh)
        return articleDict
    else:
        os.makedirs(embeddingDir, exist_ok=True)
        # Read the articles from the same directory grabArticles wrote to
        relevantNewsContent = [
         loadJsonPaper(GuardianArticlesStoragePath+'/'+articleId+".json")
            for articleId in relevantNewsArticleIds ]
        embedded = embed(relevantNewsContent)
        # Cache the embeddings so they need not be requested again
        for articleId, embd in embedded.items():
            with open(embeddingDir+'/'+articleId, 'wb') as fh:
                pickle.dump(embd, fh)
        return embedded

def getRelevantNewsEmbedIDarray():
    # Reuse getRelevantNewsEmbeddings so that both the precomputed and
    # the freshly requested case return an (ids, array) pair.
    articleDict = getRelevantNewsEmbeddings()
    IdList = list(relevantNewsArticleIds)
    # One row per article, in the order of relevantNewsArticleIds
    articleArray = np.array([articleDict[articleId] for articleId in IdList])
    return IdList, articleArray

# 3. Dimensionality reduction and visual interpretation

Now that we have the embedding vectors of the research papers and news articles, we are interested in how they relate to each other. One way to investigate this is to reduce the dimensionality far enough that we can create 2D plots. First, we use principal component analysis (PCA) to reduce the dimension from 768 to 87, the number of components needed to retain 80% of the variance. We then use t-SNE (t-distributed Stochastic Neighbor Embedding) to reduce the dimension from 87 to 2. The reason for the simpler PCA step is that running t-SNE directly from 768 to 2 dimensions is computationally slow. Finally, we plot the embeddings.

Credits: https://www.kaggle.com/maksimeren/covid-19-literature-clustering 
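
The target dimension of the PCA step is not chosen by hand: with a fractional `n_components`, as in the `PCA(n_components=0.8)` call used in this notebook, scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The NumPy sketch below reproduces that selection rule on synthetic data. The data, the scales and the helper name `components_for_variance` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def components_for_variance(data, threshold):
    """Return (k, cumulative_ratios), where k is the smallest number of
    principal components whose cumulative explained variance exceeds threshold."""
    centered = data - data.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    cumulative = np.cumsum(variances / variances.sum())
    k = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    return k, cumulative

# Synthetic data: 10 features with strongly decreasing spread.
rng = np.random.default_rng(0)
scales = np.array([10, 5, 2, 1, 1, 0.5, 0.5, 0.2, 0.1, 0.1])
data = rng.normal(size=(500, 10)) * scales

k, cumulative = components_for_variance(data, 0.8)
print(k, cumulative[:k])
```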

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt

### Dimensionality reduction using PCA

In [None]:
def pcadimred(data):
    pca = PCA(n_components=0.8)
    return pca.fit_transform(data)

In [None]:
cord19_full = np.zeros((len(uidList),len(cord19_dict[uidList[0]])))
for i,uid in enumerate(uidList):
    cord19_full[i,:] = cord19_dict[uid]

In [None]:
newsidList, news_embeddings = getRelevantNewsEmbedIDarray()

In [None]:
cord19_news_full = np.vstack((cord19_full,news_embeddings))
print('Original dimensions: ', np.shape(cord19_news_full))
cord19_news_red = pcadimred(cord19_news_full)
print('Reduced dimensions using PCA: ', np.shape(cord19_news_red))

### Visualisation using t-SNE

In [None]:
def tsnedimred(data):
    tsne = TSNE(verbose=1)
    return tsne.fit_transform(data)

In [None]:
# Uncomment to calculate again (takes a few minutes)
#cord19_news_2d = tsnedimred(cord19_news_red)

In [None]:
# Saving and loading cord19_news_2d to disk
cord19_news_2d = np.load('/kaggle/input/covidtemp/cord19_news_2d.npy')
#cord19_news_2d = np.load('/kaggle/working/cord19_news_2d.npy')
np.save('/kaggle/working/cord19_news_2d', cord19_news_2d)

In [None]:
cord19_2d = (cord19_news_2d[:-len(newsidList),0], cord19_news_2d[:-len(newsidList),1])
news_2d = (cord19_news_2d[len(uidList):,0], cord19_news_2d[len(uidList):,1])
data = (cord19_2d, news_2d)
colors = ("blue", "red")
groups = ("CORD-19", "News articles")
size = (1,30)

# Create plot
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(1, 1, 1, facecolor="1.0")

for data, color, group, size in zip(data, colors, groups, size):
    x, y = data
    ax.scatter(x, y, alpha=0.8, c=color, edgecolors='none', s=size, label=group)

plt.title('t-SNE of CORD-19 and 6 news articles')
plt.legend(loc=2)
plt.savefig("/kaggle/working/t-sne_cord19_news.png")
plt.show()

In [None]:
cord19_2d = (cord19_news_2d[:-len(newsidList),0], cord19_news_2d[:-len(newsidList),1])
news_2d = np.asarray((cord19_news_2d[len(uidList):,0], cord19_news_2d[len(uidList):,1]))

data = (cord19_2d, news_2d[:,0], news_2d[:,1], news_2d[:,2], news_2d[:,3], news_2d[:,4], news_2d[:,5])
colors = ("blue", "red", "green", "black", "yellow", "cyan", "orange")
groups = ("CORD-19", newsidList[0], newsidList[1], newsidList[2], newsidList[3], newsidList[4], newsidList[5])
size = (1,30,30,30,30,30,30)

# Create plot
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(1, 1, 1, facecolor="1.0")

for data, color, group, size in zip(data, colors, groups, size):
    x, y = data
    ax.scatter(x, y, alpha=0.8, c=color, edgecolors='none', s=size, label=group)

plt.title('t-SNE of CORD-19 and 6 news articles')
plt.legend(loc='lower right')
plt.savefig("/kaggle/working/t-sne_cord19_news_withlabels.png")
plt.show()

We see that the news articles are all located in the same region of the scatterplot. This suggests that they are similar to each other and do not cover the breadth of terminology and topics found in the research abstracts. How close individual news article embeddings are in the 2D space is hard to interpret. We conclude that the relation between research abstracts and news articles based on embeddings is worth studying in more detail.

# 4. News article matching

We are approaching the central question behind this notebook: *Can we use the vector embedding of a news article to find similar research papers in the CORD-19 database?*

Through the t-SNE analysis in the previous section we saw that the news article embeddings "reside amongst the research papers" and are not clustered too strongly in a news cluster of their own. While this is a good sign, it is more informative to look at, for instance, the 10 research papers most similar to a given news article. We do this here through a purely anecdotal study of a few of the Guardian articles.

## Imports

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import pprint

## Reading metadata.csv

In [None]:
# Hack to force abstract column to be wider
abstract_long_name = '_________________abstract_of_research_paper_________________'

def read_metadata(metadata_path):
    meta_df = pd.read_csv(metadata_path, dtype={
        'pubmed_id': str,
        'Microsoft Academic Paper ID': str, 
        'doi': str
    })
    # Hack to force abstract column to be wider
    meta_df.rename(
        columns = {'abstract': abstract_long_name}, 
        inplace = True)
    return meta_df
metadata = read_metadata('/kaggle/input/CORD-19-research-challenge/metadata.csv')

## Cosine similarity functions

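Cosine similarity is used to measure how close two vector embeddings are: values near 1 mean the vectors point in almost the same direction, and lower values mean they are less alike.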

\begin{equation}
\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }
\end{equation}

https://en.wikipedia.org/wiki/Cosine_similarity
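
As a minimal sketch of the formula, the following computes the cosine similarity between one query vector and every row of a matrix in a single NumPy expression, then ranks the rows. The toy vectors and the helper name `cosine_to_all` are assumptions for illustration; the notebook itself calls scikit-learn's `cosine_similarity` once per paper.

```python
import numpy as np

def cosine_to_all(query, matrix):
    """Cosine similarity between query (shape (d,)) and each row of matrix (shape (n, d))."""
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return dots / norms

# Toy 2-dimensional "embeddings" (illustrative values only).
papers = np.array([
    [1.0, 0.0],   # same direction as the query
    [1.0, 1.0],   # 45 degrees away
    [0.0, 2.0],   # orthogonal
    [-3.0, 0.0],  # opposite direction
])
query = np.array([2.0, 0.0])

similarities = cosine_to_all(query, papers)
ranking = np.argsort(-similarities)  # indices from most to least similar
print(similarities, ranking)
```

Note that the vector lengths (1, about 1.41, 2 and 3) play no role: only the direction relative to the query determines the score.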

In [None]:
def get_cosine_similarity(feature_vec_1, feature_vec_2):
    """https://medium.com/@Intellica.AI/comparison-of-different-word-embeddings-on-text-similarity-a-use-case-in-nlp-e83e08469c1c"""
    return cosine_similarity(feature_vec_1.reshape(1, -1), feature_vec_2.reshape(1, -1))[0][0]

def get_cosine_similarities(cord_papers_embeddings, news_embedding):
    d = {}
    for uid, paper_embedding in cord_papers_embeddings.items():
        d[uid] = get_cosine_similarity(news_embedding, paper_embedding)
    return d

def add_cosine_sim_to_metadata_df(metadata_df, similarities_dict):
    metadata_df_c = metadata_df.copy() # Deep copy
    metadata_df_c['cosine_similarity'] = metadata_df_c['cord_uid'].map(similarities_dict)
    return metadata_df_c.sort_values(by='cosine_similarity', ascending=False)

## Adding similarity scores to the metadata dataframe

In [None]:
import textwrap

def make_clickable(val):
    # target _blank to open new window
    return '<a target="_blank" href="{}">URL</a>'.format(val)

def change_abstract_name(val):
    cell_length = 800
    val = str(val)
    if len(val) > cell_length:
        val = textwrap.wrap(val, cell_length)[0] + '...'
    return val

def url_style(metadata_df):
    return metadata_df.style.format({'url': make_clickable, 
                                     abstract_long_name: change_abstract_name})

cols_to_display = ['cosine_similarity', 'publish_time', 'url', 'title', 
                   abstract_long_name, 'authors', 'journal', 'cord_uid']

def show_k_most_similar_papers(metadata_df, k):
    return url_style(metadata_df.head(k).loc[:, cols_to_display])

def show_k_least_similar_papers(metadata_df, k):
    return url_style(metadata_df.tail(k).loc[:, cols_to_display])

## Anecdotal testing with a few select news articles

* Ebola article: https://www.theguardian.com/science/2020/feb/20/doctors-hiv-ebola-drugs-coronavirus-cure-covid-19
* Cricket article: https://www.theguardian.com/sport/2020/mar/26/ecb-steve-elworthy-cricket-coronavirus
* Ibuprofen article: https://www.theguardian.com/world/2020/mar/16/health-experts-criticise-nhs-advice-to-take-ibuprofen-for-covid-19

In [None]:
news_embeddings = getRelevantNewsEmbeddings()
pp = pprint.PrettyPrinter()
pp.pprint(list(news_embeddings.keys()))

## The Ibuprofen article
Ibuprofen article: https://www.theguardian.com/world/2020/mar/16/health-experts-criticise-nhs-advice-to-take-ibuprofen-for-covid-19

In [None]:
ibuprofen_embedding = np.array(news_embeddings['world_2020_mar_16_health-experts-criticise-nhs-advice-to-take-ibuprofen-for-covid-19'])
ibuprofen_similarity = add_cosine_sim_to_metadata_df(metadata, get_cosine_similarities(cord19_dict, ibuprofen_embedding))

### The Ibuprofen article - 10 most similar research papers

The list below is ranked by the cosine similarity between the Ibuprofen news article and the research papers in question.

What do we see? The Guardian story is about concern in the UK that the NHS was advising people to take ibuprofen. Our most similar research paper seems to nail this topic pretty well: it is an article about NSAIDs and their safety in the treatment of COVID-19 patients. Further down the list, the results are less clear. Some research papers are about the UK health service but do not seem to mention ibuprofen or NSAIDs in particular.

In [None]:
show_k_most_similar_papers(ibuprofen_similarity, 10)

### The Ibuprofen article - 5 least similar

As a sanity check we also show the least similar articles. After some inspection we noticed that the least similar papers were usually research papers without abstracts. That makes sense, since the embedding of such a paper could be considered corrupt or non-existent. In a future revision of the code, papers without abstracts should be dropped from the matching, for example with a `dropna()` call.

In the next article examples we will not show the least similar matches.
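
The `dropna()` idea mentioned above could look roughly like this. It is a sketch on a made-up three-row frame, not the notebook's implementation; the column names follow the CORD-19 metadata (`cord_uid`, `abstract`), and in this notebook the abstract column would have to be addressed through `abstract_long_name`, since it is renamed on loading.

```python
import pandas as pd

# Made-up stand-in for the metadata table; the second paper has no abstract.
toy_metadata = pd.DataFrame({
    "cord_uid": ["uid1", "uid2", "uid3"],
    "abstract": ["Some abstract text.", None, "Another abstract."],
})

# Keep only papers that actually have an abstract before matching.
with_abstract = toy_metadata.dropna(subset=["abstract"])
print(with_abstract["cord_uid"].tolist())
```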

In [None]:
show_k_least_similar_papers(ibuprofen_similarity, 5)

## The cricket article 

Cricket article: https://www.theguardian.com/sport/2020/mar/26/ecb-steve-elworthy-cricket-coronavirus

In [None]:
cricket_embedding = np.array(news_embeddings['sport_2020_mar_26_ecb-steve-elworthy-cricket-coronavirus'])
cricket_similarity = add_cosine_sim_to_metadata_df(metadata, get_cosine_similarities(cord19_dict, cricket_embedding))

### The cricket article - 10 most similar research papers

The list below is ranked by the cosine similarity between the cricket news article and the research papers in question.

What about these matches? The Guardian article is about the cancellation of cricket games and how, or whether, future games can be arranged during the pandemic. Our best match is about events being cancelled, though not exactly sports, as it seems to be related to *chemical events*. Further down the list the results are somewhat mixed: not all of them seem to be related to events.

In [None]:
show_k_most_similar_papers(cricket_similarity, 10)

## The Ebola article 

Ebola article: https://www.theguardian.com/science/2020/feb/20/doctors-hiv-ebola-drugs-coronavirus-cure-covid-19

In [None]:
ebola_embedding = np.array(news_embeddings['science_2020_feb_20_doctors-hiv-ebola-drugs-coronavirus-cure-covid-19'])
ebola_similarity = add_cosine_sim_to_metadata_df(metadata, get_cosine_similarities(cord19_dict, ebola_embedding))

### The Ebola article - 10 most similar research papers

The list below is ranked by the cosine similarity between the Ebola news article and the research papers in question.

The Guardian article is about how medical professionals are investigating whether medications for other diseases, like Ebola and HIV, are effective against the current pandemic. In general, the research papers seem to match this topic very well.

In [None]:
show_k_most_similar_papers(ebola_similarity, 10)

# 5. Further work

In this report we have shown promising, although limited and anecdotal, results of using SPECTER embeddings for news articles and finding similarities between news and research papers. The limitations are: English language only, a single news source (The Guardian), only a few articles, and no real quantitative evaluation of the results.

We want to further optimise the embeddings and possibly adjust for the difference between the two domains (news vs. research). It would also be interesting to investigate whether cosine similarity is the best measure of similarity for this problem.
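
As a small illustration of why the choice of measure matters, the sketch below ranks two toy vectors against a query by cosine similarity and by Euclidean distance. The vectors and names are invented for the example; the two measures disagree here because cosine similarity ignores vector length while Euclidean distance does not.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Euclidean distance of two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = (1.0, 0.0)
candidates = {
    "long_aligned": (10.0, 1.0),  # nearly the same direction, much longer
    "short_offset": (1.0, 1.0),   # close in space, 45 degrees away
}

# Higher cosine similarity is better; lower Euclidean distance is better.
by_cosine = sorted(candidates, key=lambda n: cosine(query, candidates[n]), reverse=True)
by_euclidean = sorted(candidates, key=lambda n: euclidean(query, candidates[n]))
print(by_cosine, by_euclidean)
```

For vectors normalised to unit length both measures produce the same ranking, so the difference only matters when the embedding norms vary.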

Finally, we are also considering creating an online tool to help internet researchers fact-check news articles against published research: the input would be a news article URL, the output a list of the research papers most similar to that article. Any feedback on the need for, as well as the features of, such a tool is much appreciated.