# Our Team

This submission is the result of a collaboration between the following:

* Catherine Doyle ([Catherine Doyle](https://www.kaggle.com/cddata))
* John Doyle ([JohnDoyle](https://www.kaggle.com/johndoyle))
* Keith Finnerty ([KeithFinnerty](https://www.kaggle.com/ketihf))
* Piyush Rumao ([PIYUSH RUMAO](https://www.kaggle.com/piyushrumao))

We are all members of the Data Insights team in Deutsche Bank. Sitting within the Chief Data Office, the Data Insights team is the global Centre of Excellence for data science and data analytics within Deutsche Bank. Mostly based in Dublin, Ireland with some team members sitting in London and other locations, the Data Insights team comprises approximately forty people and includes data scientists, data analysts, data engineers, data architects, data visualization experts, and programme managers. We have expertise in areas such as advanced analytics, artificial intelligence, machine learning, natural language processing, visualization, dashboards, and software development.  We engage with teams across all areas and functions of the Bank, partnering with them to design and deliver analytics solutions that leverage the large amount of available data in order to create value for the Bank and our clients. Engagements vary between time-boxed Proofs of Concept and longer-term projects, and are conducted using the Scaled Agile Framework (SAFe).  As a Centre of Excellence, we also work to uplift data science and analytics across the Bank, for example by fostering Communities of Practice on different topics and rolling out governance support and resources around best practices for model development and analytics delivery.

We decided to collaborate on this Kaggle challenge and pool our knowledge and skills with the aim of making a positive impact in the ongoing struggle against COVID-19.

# CORD: Risk Factor 

This analysis supports the work carried out in an [accompanying notebook](https://www.kaggle.com/cddata/documenting-sub-tasks-dictionaries/) on the risk factors associated with contracting COVID-19. It contributes to the question [**What do we know about COVID-19 risk factors?**](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=558) posed under the [COVID-19 Open Research Dataset Challenge](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

## Approach

Initial ETL is carried out in our notebook [**Load and Process Abstracts**](https://www.kaggle.com/johndoyle/fork-of-load-and-process-data-abstracts), where, for processing speed, only the abstract and reference text bodies are analysed for entity pairs. This can later be extended to the full text, which would yield a more in-depth analysis.



This analysis builds on that work and aims to enrich and extract related material from the CORD corpus through the following process:
1. Load the entity-enriched abstract document objects.
2. Create a directed graph of entities and linkages from the extracted information.
3. Enrich a refined dictionary of terms using a pre-trained word2vec model, resolving similar entities to the target dictionary.
4. Merge nodes using the enriched dictionary of terms to improve publication linkage.
5. Extract and analyse relevant paths between the target risk and the virus, returning a list of publications which correspond to the edges of the graph.

In the analysis step, background research is used to produce lists of related keywords, which are then used to analyse the corpus under four main topics related to the COVID-19 risk factors task.


# Goal 

A set of linked publications relating to the target topic, in this case the risk associated with contracting COVID-19.

In [None]:
import networkx as nx


import sys, os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import json
import pickle
import gc
import difflib
import gensim

import ipywidgets as widgets
from IPython.display import display, HTML, clear_output


import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_lg")


In [None]:
files = []
for dirname, _, filenames in os.walk('/kaggle/input/load-and-process-data-abstracts/'):
    # keep only the pickled document batches produced by the previous notebook
    filenames = [name for name in filenames if '.pickle' in name]
    if filenames:
        files.append({'dirpath': dirname, 'filenames': filenames})

In [None]:
# recreate the schema from "json_schema.txt"
class author():
    """
    This class object is derived from the json schema associated with the CORD
    publications under the Author class.
    
    input_dict: This is the Author object from the publication class
    """
    
    def __init__(self, input_dict=None):
        
        self.first = ""
        self.middle = []
        self.last = ""
        self.suffix = ""
        self.affiliation = {}
        self.email = ""
        
        if input_dict:
            for key in input_dict.keys():
                if "first" in key:
                    self.first = input_dict[key]
                if "middle" in key:
                    self.middle = input_dict[key]
                if "last" in key:
                    self.last = input_dict[key]
                if "suffix" in key:
                    self.suffix = input_dict[key]
                if "affiliation" in key:
                    self.affiliation = input_dict[key]
                if "email" in key:
                    self.email = input_dict[key]    
    
    def print_items(self):
        
        print("first: " + str(self.first) +  
              ", middle: " + str(self.middle) + 
              ", last: " + str(self.last) + 
              ", suffix: " + str(self.suffix) +
              ", email: " + str(self.email) + 
              ", affiliation: " + json.dumps(self.affiliation, indent=4, sort_keys=True)
             )


class inline_ref_span():
    """
    This class object is derived from the json schema associated with the CORD
    publications under the reference location.
    
    input_dict: this is the reference location object from the document class.
    """
    
    def __init__(self, input_dict=None):
        
        self.start = 0
        self.end = 0
        self.text = ""
        self.ref_id = ""
        
        if input_dict:
            for key in input_dict.keys():
                if "start" in key:
                    self.start = input_dict[key]
                if "end" in key:
                    self.end = input_dict[key]
                if "text" in key:
                    self.text = input_dict[key]
                if "ref_id" in key:
                    self.ref_id = input_dict[key]
    
    def print_items(self):
        
        print("Text: " + str(self.text) + ", Start: " + 
              str(self.start) + ", End: " + str(self.end) + 
              ", Ref_id: " + str(self.ref_id))

    def step_index(self, n):
        
        self.start += n
        self.end += n
        
        
class text_block():
    """
    This class object is derived from the json schema associated with the CORD
    publications and is associated with all relevant text blocks.
    
    input_dict: this is the text block object [abstract, text_body, ...] from the document class.
    """
    
    def __init__(self, input_dict=None):
        
        self.text = ""
        self.cite_spans = []
        self.ref_spans = []
        self.eq_spans = []
        self.section = ""
        
        if input_dict:
            for key in input_dict.keys():
                if "text" in key:
                    self.text = input_dict[key]
                if "cite_spans" in key:
                    self.cite_spans = [inline_ref_span(c) for c in input_dict[key]]                
                if "ref_spans" in key:
                    self.ref_spans = [inline_ref_span(r) for r in input_dict[key]] 
                if "eq_spans" in key:
                    self.eq_spans = [inline_ref_span(e) for e in input_dict[key]]
                if "section" in key:
                    self.section = input_dict[key]
        
    def clean(self, swap_dict=None):
            
        self.text = clean(self.text, swap_dict)
    
    def print_items(self):
        
        print("\ntext: " + str(self.text))
        print("\nsection: " + str(self.section))
        print("\ncite_spans: ")
        [c.print_items() for c in self.cite_spans]
        print("\nref_spans: ")
        [r.print_items() for r in self.ref_spans]
        print("\neq_spans: ")
        [e.print_items() for e in self.eq_spans]


def combine_text_block(text_block_list):
    
    if text_block_list:
        
        combined_block = text_block_list[0]
        block_length = len(combined_block.text)
        
        for i in range(1,len(text_block_list)):
            combined_block.text += " " + text_block_list[i].text
            block_length += 1
            
            # update spans start & stop index
            [ref.step_index(block_length) for ref in text_block_list[i].cite_spans]
            [ref.step_index(block_length) for ref in text_block_list[i].ref_spans]
            [ref.step_index(block_length) for ref in text_block_list[i].eq_spans]
            
            # combine spans
            combined_block.cite_spans += text_block_list[i].cite_spans
            combined_block.ref_spans += text_block_list[i].ref_spans
            combined_block.eq_spans += text_block_list[i].eq_spans           
            combined_block.section += ", " + str(text_block_list[i].section)           
            
            block_length += len(text_block_list[i].text)
                       
        return [combined_block]
    else:
        return [text_block()]
      

class bib_item():
    """
    This class object is derived from the json schema associated with the CORD
    publications and is associated with all relevant bibliography objects.
    
    input_dict: this is associated with the bibliography objects in the document class.
    """
    
    def __init__(self, input_dict=None):
        
        # assign the defaults (the original used ":" which only annotates and sets nothing)
        self.ref_id = ""
        self.title = ""
        self.authors = []
        self.year = 0
        self.venue = ""
        self.volume = ""
        self.issn = ""
        self.pages = ""
        self.other_ids = {}
        
        if input_dict:
            for key in input_dict.keys():
                if "ref_id" in key:
                    self.ref_id = input_dict[key]
                if "title" in key:
                    self.title = input_dict[key]
                if "authors" in key:
                    self.authors = [author(a) for a in input_dict[key]]
                if "year" in key:
                    self.year = input_dict[key]
                if "venue" in key:
                    self.venue = input_dict[key]
                if "volume" in key:
                    self.volume = input_dict[key]
                if "issn" in key:
                    self.issn = input_dict[key]
                if "pages" in key:
                    self.pages = input_dict[key]
                if "other_ids" in key:
                    self.other_ids = input_dict[key]
    
    def print_items(self):
        
        print("\nBib Item:")
        print("ref_id: " + str(self.ref_id))
        print("title:" + str(self.title))
        print("Authors:")
        [a.print_items() for a in self.authors]
        print("year: " + str(self.year))
        print("venue:" + str(self.venue))
        print("issn:" + str(self.issn))
        print("pages:" + str(self.pages))
        print("other_ids:" + json.dumps(self.other_ids, indent=4, sort_keys=True))
        
        
class ref_entries():
    
    def __init__(self, ref_id=None, input_dict=None):
        
        self.ref_id = ""
        self.text = ""
        self.latex = None
        self.type = ""
        
        if ref_id:
            self.ref_id = ref_id
            
            if input_dict:
                for key in input_dict.keys():
                    if "text" in key:
                        self.text = input_dict[key]
                    if "latex" in key:
                        self.latex = input_dict[key]
                    if "type" in key:
                        self.type = input_dict[key]
    
    def print_items(self):
        
        print("ref_id: " + str(self.ref_id))
        print("text:" + str(self.text))
        print("latex: " + str(self.latex))
        print("type:" + str(self.type))
        
                    
class back_matter():
    
    def __init__(self, input_dict=None):
        
        self.text = ""
        self.cite_spans = []
        self.ref_spans = []
        self.section = ""
        
        if input_dict:
            for key in input_dict.keys():
                if "text" in key:
                    self.text = input_dict[key]
                if "cite_spans" in key:
                    self.cite_spans = [inline_ref_span(c) for c in input_dict[key]]                
                if "ref_spans" in key:
                    self.ref_spans = [inline_ref_span(r) for r in input_dict[key]] 
                if "section" in key:
                    self.section = input_dict[key]
    
    def print_items(self):
        
        print("text: " + str(self.text))
        print("cite_spans: ")
        [c.print_items() for c in self.cite_spans]
        print("ref_spans: ")
        [r.print_items() for r in self.ref_spans]        
        print("section:" + str(self.section))

        
# The following Class Definition is a useful helper object to store various 
# different covid-19 data types.
class document():
    """
    The following class object is based on the schema for each publication with 
    appropriate sub classes for more complex data types. 
    
    This object aims to make the analysis of mixed data types quicker and more intuitive.
    """
    def __init__(self, file_path=None):
        
        self.doc_filename = ""
        self.doc_language = {}
        self.paper_id = ""
        self.title = ""
        self.authors = []
        self.abstract = []
        self.text = []
        self.bib = []
        self.ref_entries = []
        self.back_matter = []
        self.tripples = {}
        self.key_phrases = {}
        self.entities = {}
        
        # load content from file on obj creation
        self.load_file(file_path)
     
    def _load_paper_id(self, data):
        
        if "paper_id" in data.keys():
            self.paper_id = data['paper_id']
    
    def _load_title(self, data):
        
        if "metadata" in data.keys():
            if "title" in data['metadata'].keys():
                self.title = data['metadata']["title"]
    
    def _load_authors(self, data):
        
        if "metadata" in data.keys():
            if "authors" in data['metadata'].keys():
                self.authors = [author(a) for a in data['metadata']["authors"]]
                
    def _load_abstract(self, data):
        
        if "abstract" in data.keys():
            self.abstract = [text_block(a) for a in data["abstract"]]
    
    def _load_body_text(self, data):
        
        if "body_text" in data.keys():
            self.text = [text_block(t) for t in data["body_text"]]
    
    def _load_bib(self, data):
        
        if "bib_entries" in data.keys():
            self.bib = [bib_item(b) for b in data["bib_entries"].values()]
    
    def _load_ref_entries(self, data):
        
        if "ref_entries" in data.keys():
            self.ref_entries = [ref_entries(r, data["ref_entries"][r]) for r in data["ref_entries"].keys()]
            
    def _load_back_matter(self, data):
        
        if "back_matter" in data.keys():
            self.back_matter = [back_matter(b) for b in data["back_matter"]]
        
    def load_file(self, file_path):
        
        if file_path:
            
            with open(file_path) as file:
                data = json.load(file)
                
                # call inbuilt data loading functions
                self.doc_filename = file_path
                self._load_paper_id(data)
                self._load_title(data)
                self._load_authors(data)
                self._load_abstract(data)
                self._load_body_text(data)
                self._load_bib(data)
                self._load_ref_entries(data)
                self._load_back_matter(data)
    
    def combine_data(self):
        
        self.data = {'doc_filename': self.doc_filename,
                     'doc_language': self.doc_language,
                     'paper_id': self.paper_id,
                     'title': self.title,
                     'authors':self.authors,
                     'abstract': self.abstract,
                     'text': self.text,
                     'bib_entries':self.bib,
                     'ref_entries': self.ref_entries,
                     'back_matter': self.back_matter,
                     'tripples': self.tripples,
                     'key_phrases': self.key_phrases,
                     'entities': self.entities}

    def extract_data(self):
        
        self.doc_filename = self.data['doc_filename']
        self.doc_language = self.data['doc_language']
        self.paper_id = self.data['paper_id']
        self.title = self.data['title']
        self.authors = self.data['authors']
        self.abstract = self.data['abstract']
        self.text = self.data['text']        
        self.bib = self.data['bib_entries']
        self.ref_entries = self.data['ref_entries']
        self.back_matter = self.data['back_matter']
        self.tripples = self.data['tripples']
        self.key_phrases = self.data['key_phrases']
        self.entities = self.data['entities']

    def save(self, dir):
        
        self.combine_data()

        # create the target directory if needed; exist_ok avoids the race
        # condition without relying on the errno module, which is not imported
        out_dir = os.path.dirname(dir)
        if out_dir:
            os.makedirs(out_dir, exist_ok=True)

        with open(dir, 'w') as json_file:
            json_file.write(json.dumps(self.data))

    def load_saved_data(self, dir):
        
        with open(dir) as json_file:
            self.data = json.load(json_file)
        self.extract_data()
    
    def print_items(self):
         
        print("---- Document Content ----") 
        print("doc_filename: " + str(self.doc_filename))
        print("doc_language: " + str(self.doc_language))
        print("paper_id: " + str(self.paper_id))
        print("title: " + str(self.title))
        print("\nAuthors: ")
        [a.print_items() for a in self.authors]
        print("\nAbstract: ")
        [a.print_items() for a in self.abstract]
        print("\nText: ")
        [t.print_items() for t in self.text]
        print("\nBib_entries: ")
        [b.print_items() for b in self.bib]
        print("\nRef_entries: ")
        [r.print_items() for r in self.ref_entries]
        print("\nBack_matter: ")
        [b.print_items() for b in self.back_matter]
        
        print("\nTripples: ")
        print(json.dumps(self.tripples, indent=4, sort_keys=True))
        print("\nKey Phrases: ")
        print(json.dumps(self.key_phrases, indent=4, sort_keys=True))        
        print("\nEntities: ")
        print(json.dumps(self.entities, indent=4, sort_keys=True))

    def clean_text(self, swap_dict=None):
        
        # clean all blocks of text
        [t.clean(swap_dict) for t in self.text]
    
    def clean_abstract(self, swap_dict=None):
        
        [t.clean(swap_dict) for t in self.abstract]
    
    def combine_text(self):
        
        # this function takes all text blocks within document.text and combines them into a single text_block object
        self.text = combine_text_block(self.text)
    
    def combine_abstract(self):
        
        self.abstract = combine_text_block(self.abstract)   
        
    def set_abstract_tripples(self):
                
        abstract_tripples = {}
        for i in range(0, len(self.abstract)):
            #for every block in the abstract, extract entity tripples
            self.abstract[i].clean()                       
            pairs, entities = get_entity_pairs(self.abstract[i].text)
            
            #if any tripples found
            if pairs.shape[0]>0:
                abstract_tripples["abstract_" + str(i)] = pairs.to_json()
                       
        self.tripples.update(abstract_tripples)
        
    def set_text_tripples(self):
        
        text_tripples = {}
        for i in range(0, len(self.text)):
            
            self.text[i].clean()                       
            pairs, entities = get_entity_pairs(self.text[i].text)
            if pairs.shape[0]>0:
                text_tripples["text_" + str(i)] = pairs.to_json()
                       
        self.tripples.update(text_tripples)
        
    def set_ref_tripples(self):
        
        ref_tripples = {}
        for r in self.ref_entries:
            pairs, entities = get_entity_pairs(r.text)
            if pairs.shape[0]>0:
                ref_tripples["ref_" + r.ref_id] = pairs.to_json()
        
        self.tripples.update(ref_tripples)
        
    def set_doc_language(self):
        # set the doc language based on the analysis of the first block of the body text
        self.doc_language = get_text_language(self.text[0].text)
    

# Load Entities

Loading entity information into a pandas dataframe to aid in the creation of a network graph for analysis later. 
This step builds on previous work carried out in our notebook [**Load and Process Abstracts**](https://www.kaggle.com/johndoyle/fork-of-load-and-process-data-abstracts).

Steps:
1. To conserve memory, a corpus look-up table is used: indices replace the titles in data structures during analysis, and the titles are retrieved later for presentation.
2. Triples are read from the document objects produced in our previous work.
3. The entity pairs are extracted and loaded into a dataframe together with the document metadata used during analysis.
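
The look-up idea in step 1 can be sketched on toy data. The titles, indices and triples below are invented for illustration; the real table is read from `corpus_documents_lookup.json`, and we assume it maps titles to integer indices as the code in this notebook suggests.

```python
import json

# Hypothetical look-up table: publication title -> integer index.
toy_lookup_json = json.dumps({"Paper A": 0, "Paper B": 1, "Paper C": 2})
toy_lookup = json.loads(toy_lookup_json)

# During analysis, data is keyed by the compact index rather than the title.
toy_pairs = {
    toy_lookup["Paper B"]: [("smoking", "increases", "risk")],
    toy_lookup["Paper C"]: [("age", "affects", "severity")],
}

# For presentation, invert the table only for the indices that are needed.
wanted = set(toy_pairs)
index_to_title = {
    index: title for title, index in toy_lookup.items() if index in wanted
}
print(index_to_title)
```

Carrying integers instead of long title strings keeps the intermediate structures small; the titles are only materialised for the handful of publications that end up being reported.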


In [None]:
# 1

with open('/kaggle/input/publication-link-analysis/corpus_documents_lookup.json') as f:
    pub_dict = json.load(f)
    
def get_corpus_labels(corpus_dir, index):
    """
    A helper function that will be used to return publication titles from indices
    
    corpus_dir: File associated with the corpus index
    index:  A list of indices to be returned.
    """

    with open(corpus_dir) as file:
        corpus = json.load(file)
    return {value: key for key, value in corpus.items() if value in index}

In [None]:
# 2
entity_pairs_list = {}
for file in files:
    directory = file['dirpath']
    for filename in file['filenames']:
        # use a context manager so each pickle file is closed after loading
        with open(os.path.join(directory, filename), 'rb') as pickle_file:
            list_of_pubs = pickle.load(pickle_file)
        for pub in list_of_pubs:
            # keep only publications for which triples were extracted
            if pub is not None and pub.tripples != {}:
                entity_pairs_list[pub_dict[pub.title]] = {'tripples': pub.tripples, 'file_path': str(pub.doc_filename)}

        del list_of_pubs
del pub_dict

In [None]:
# 3
from io import StringIO

entity_pair_df_list = []
for k, v in entity_pairs_list.items():
    df_list = []
    for k_, v_ in v['tripples'].items():
        # wrap the JSON string in StringIO: recent pandas versions deprecate
        # passing literal JSON to read_json
        df_list.append(pd.read_json(StringIO(v_))[['subject', 'relation', 'object']])
    df_ = pd.concat(df_list, ignore_index=True)
    df_['file_path'] = v['file_path']
    df_['publication'] = k
    entity_pair_df_list.append(df_)
    
del entity_pairs_list

Filter out empty strings and numeric-only entity entries, as they do not provide any understandable information.

This filtering can be expanded to remove any additional unwanted entities before the graph is generated. 
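
As a sketch of how the filtering might be expanded, the example below adds a small exclusion set on top of the empty-string and numeric checks. The toy rows, the `unwanted` set and the `is_informative` helper are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy entity pairs; the rows and the `unwanted` set are illustrative assumptions.
toy_pairs = pd.DataFrame(
    {
        "subject": ["smoking", "", "it", "42"],
        "relation": ["increases", "is", "shows", "has"],
        "object": ["risk", "virus", "results", "patients"],
    }
)
unwanted = {"it", "we", "this"}


def is_informative(entity: pd.Series) -> pd.Series:
    """Flag entities that are non-empty, not purely numeric and not unwanted."""
    text = entity.astype(str).str.strip().str.lower()
    return (text != "") & ~text.str.isnumeric() & ~text.isin(unwanted)


# Keep a row only when both its subject and its object are informative.
filtered = toy_pairs[
    is_informative(toy_pairs.subject) & is_informative(toy_pairs.object)
]
```

Collecting the rules in one helper keeps the subject and object checks identical, so a new exclusion only has to be added in one place.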

In [None]:
entity_pairs_df = pd.concat(entity_pair_df_list, ignore_index=True)
entity_pairs_df = entity_pairs_df[entity_pairs_df.subject != '']
entity_pairs_df = entity_pairs_df[entity_pairs_df.object != '']
entity_pairs_df = entity_pairs_df[entity_pairs_df.object.str.isnumeric() != True]
entity_pairs_df = entity_pairs_df[entity_pairs_df.subject.str.isnumeric() != True]

# Create Entity Pair Graph
A network graph gives us a way to link structured and unstructured data for visualisation, exploration and analysis. Using concepts developed in graph theory, network graphs are popular methods of analysis and have been used in many fields such as mathematics, computer science, physics, chemistry, biology, sociology, and linguistics to investigate a variety of systems including communication networks, social networks, biological nervous systems, and neural networks. [Pachayappan and Venkatesakumar (2018)](https://doi.org/10.4236/tel.2018.85067) apply a social network analysis based on graph theory to investigate influence among literature publications, and also provide a thorough overview of network graphs and their applications.

The nodes or vertices in a graph represent entities, while the links or edges between nodes represent the interconnections between entities. A simple visual introduction to several different types and layouts of network graphs is available [here](https://www.data-to-viz.com/graph/network.html).

In this analysis we leverage the NetworkX Python package to generate a network graph from a pandas dataframe. The entities are the nodes of the graph and the relationships are the edges, which are enriched with publication metadata (title, path and page_rank).

NetworkX: Aric A. Hagberg, Daniel A. Schult and Pieter J. Swart, “Exploring network structure, dynamics, and function using NetworkX”, in Proceedings of the 7th Python in Science Conference (SciPy2008), Gaël Varoquaux, Travis Vaught, and Jarrod Millman (Eds), (Pasadena, CA USA), pp. 11–15, Aug 2008.
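
To make the node and edge structure concrete, here is a minimal sketch on made-up entity pairs. It uses the same `nx.from_pandas_edgelist` call pattern as this notebook; the rows, publication indices and file names are invented.

```python
import networkx as nx
import pandas as pd

# Toy entity pairs; all values are made up for illustration.
toy_pairs = pd.DataFrame(
    {
        "subject": ["smoking", "smoking", "lung disease"],
        "relation": ["damages", "harms", "worsens"],
        "object": ["lungs", "lungs", "covid-19"],
        "publication": [0, 1, 2],
        "file_path": ["a.json", "b.json", "c.json"],
    }
)

# A MultiDiGraph keeps direction and allows several edges per node pair.
toy_graph = nx.from_pandas_edgelist(
    toy_pairs,
    "subject",
    "object",
    edge_attr=["relation", "publication", "file_path"],
    create_using=nx.MultiDiGraph(),
)

# Both publications survive as parallel edges between the same two nodes.
parallel = toy_graph.get_edge_data("smoking", "lungs")
publications = sorted(attrs["publication"] for attrs in parallel.values())
```

A multigraph is used so that two publications stating a relation between the same pair of entities each keep their own edge, rather than one overwriting the other.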

In [None]:
def create_kg(pairs):
    k_graph = nx.from_pandas_edgelist(pairs, 'subject', 'object',edge_attr = ['relation','publication', 'file_path'],
            create_using=nx.MultiDiGraph())
    return k_graph

In [None]:
G = create_kg(entity_pairs_df)
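# nx.info was removed in NetworkX 3.0, so the summary is printed directly
print(f"{type(G).__name__} with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")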

In [None]:
def plot_sub_graph(G, nodes, font = 12):
    """
    visualisation of the entire network graph is often un-helpful.
    This function creates a sub-graph from selected nodes and 
    plots the nodes, relationship and relative importance.

    
    G: networkx graph object.
    nodes: a list of nodes to create a sub-graph object.
    font: a variable to modify the size of the text and nodes in the graph. default at 12 for large sub-graphs.
    """
    
    G_sub = G.subgraph(nodes)
    node_deg = nx.degree(G_sub)
    layout = nx.spring_layout(G_sub, k=0.25, iterations=20)
    plt.figure(num=None, figsize=(120, 90), dpi=80)
    nx.draw_networkx(
        G_sub,
        node_size=[int(deg[1]) * 500*(font/12) for deg in node_deg],
        arrowsize=40,
        linewidths=2.5,
        pos=layout,
        edge_color='red',
        edgecolors='black',
        node_color='white',
        font_size = font, 
        )

    # label each node pair with the relation of its first parallel edge (key 0)
    subject = []
    obj = []
    relation = []
    for element in list(G_sub.edges()):
        subject.append(element[0])
        obj.append(element[1])
        relation.append(G_sub.get_edge_data(element[0], element[1])[0]['relation'])
    labels = dict(zip(list(zip(subject, obj)), relation))
    nx.draw_networkx_edge_labels(G_sub, pos=layout, edge_labels=labels,
                                     font_color='black', font_size=font)
    plt.axis('off')
    plt.show()

The following sub-graph shows a selection of the top network nodes ranked by degree centrality within the network.
[Degree centrality](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.degree_centrality.html#networkx.algorithms.centrality.degree_centrality) is calculated as the fraction of the other nodes in the graph that a single node is connected to.
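
As a small worked example, the sketch below computes degree centrality on a made-up four-node graph and checks it against a manual calculation. The node names are illustrative only.

```python
import networkx as nx

# Toy directed multigraph with four nodes; "virus" touches every other node.
toy_graph = nx.MultiDiGraph()
toy_graph.add_edges_from(
    [("virus", "fever"), ("virus", "cough"), ("age", "virus")]
)

centrality = nx.degree_centrality(toy_graph)

# Manual check: degree (incoming plus outgoing) over the number of other nodes.
manual = {
    node: degree / (toy_graph.number_of_nodes() - 1)
    for node, degree in toy_graph.degree()
}
```

Here "virus" has degree 3 out of 3 other nodes, giving a centrality of 1.0, while each remaining node scores 1/3. In a multigraph every parallel edge adds to the degree, so values above 1 are possible.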

In [None]:
node_rank = nx.degree_centrality(G)
node_rank_sorted = {k: v for k, v in sorted(node_rank.items(), key=lambda item: item[1],reverse=True)}
top_nodes = [k for k in node_rank_sorted.keys()][1:1000]
plot_sub_graph(G, top_nodes)

# Evaluate Graph Connections

The following functions evaluate the paths between two nodes and return the associated metadata for analysis.
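
The idea can be sketched on a toy graph before applying it to the full network. The nodes, relations and publication indices below are invented; only `nx.all_simple_paths`, `nx.shortest_path` and `get_edge_data` are taken from NetworkX.

```python
import networkx as nx

toy_graph = nx.MultiDiGraph()
toy_graph.add_edge("smoking", "lung disease", relation="causes", publication=1)
toy_graph.add_edge("lung disease", "covid-19", relation="worsens", publication=2)
toy_graph.add_edge("smoking", "covid-19", relation="affects", publication=3)

# Simple paths visit each node at most once; cutoff bounds the number of edges.
paths = sorted(
    nx.all_simple_paths(toy_graph, "smoking", "covid-19", cutoff=2), key=len
)

# Collect the publication recorded on every edge along every path.
publications = set()
for path in paths:
    for u, v in zip(path, path[1:]):
        for attrs in toy_graph.get_edge_data(u, v).values():
            publications.add(attrs["publication"])

shortest = nx.shortest_path(toy_graph, "smoking", "covid-19")
```

The shortest path returns only the direct link, whereas the simple paths also surface the indirect route through "lung disease" and the publications that support it.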

In [None]:
def get_all_simple_paths(G, source, target, cutoff):
    """
    Return all simple paths between source and target and enrich 
    the path information with the edge attributes
    
    Input
    G: networkx graph object
    source: node in G
    target: node in G
    cutoff: maximum path depth 
    
    output
    publications: list of publications attributed to each edge.
    file_paths: list of file paths associated with the publications.
    paths: list of paths, each an ordered list of the nodes in that path.
    
    """
    publications = []
    file_paths = []
    paths = [p for p in nx.all_simple_paths(G,source=source,target=target, cutoff = cutoff)]
    # a separate counter n tracks the paths so that the inner edge index i
    # cannot overwrite it
    for n, path in enumerate(paths):
        # assumed intent of the original check: enrich at most the first 100 paths
        if n >= 100:
            break
        s = ""+ path[0]
        for i in range(len(path)-1):
            relation = []
            for k,v in G.get_edge_data(path[i], path[i+1]).items():
                publications.append(v['publication'])
                file_paths.append(v['file_path'])
                relation.append(v['relation'])
            relation = "/".join(relation)
            s = s + "  --->  " + relation + "  --->  " + path[i+1]
        # print the first two paths as a preview
        if n < 2:
            print(s)
        
    return publications, file_paths, paths

In [None]:
def get_shortest_paths(G, source, target):
    
    """
    Return shortest paths between source and target and enrich 
    the path information with the edge attributes
    
    Input
    G: networkx graph object
    source: node in G
    target: node in G
    
    output
    publications: list of publications attributed to each edge.
    file_paths: list of file paths associated with the publications.
    
    """
    
    publications = []
    file_paths = []
    path = nx.shortest_path(G,source=source,target=target)
    s = ""+ path[0]
    for i in range(len(path)-1):
        relation = []
        for k,v in G.get_edge_data(path[i], path[i+1]).items():
            publications.append(v['publication'])
            file_paths.append(v['file_path'])

            relation.append(v['relation'])
        relation = "/".join(relation)
        s = s + "  --->  " + relation + "  --->  " + path[i+1]
    print(s)
        
    return publications, file_paths

# Resolve Nodes Using Pubmed word2vec Model

### Word2Vec - Neural Word Embeddings:

The vectors we use to represent words are called neural word embeddings, and such representations can seem strange: one thing describes another, even though the two are radically different. As Elvis Costello said: “Writing about music is like dancing about architecture.” Word2vec “vectorizes” words, and by doing so it makes natural language computer-readable, so that we can perform mathematical operations on words to detect their similarities.

The model learns word vectors from a training corpus by using a word to predict its surrounding context. When applied to a corpus that is specific to a single discipline, the word embedding learns the contexts in which that discipline's terms are related. In this work we use word vectors trained on medical texts to improve the processing, resolution and analysis of entities extracted from the CORD corpus.

The word vectors below were pre-trained on PubMed data and are sourced from:
 - Distributional Semantics Resources for Biomedical Text Processing. Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski and Sophia Ananiadou. LBM 2013 [{1}](http://bio.nlplab.org/)  [{2}](http://bio.nlplab.org/pdf/pyysalo13literature.pdf).
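
Similarity between word vectors is commonly measured with the cosine of the angle between them. The sketch below illustrates this with made-up three-dimensional vectors. Scoring a multi-word phrase by averaging its word vectors first is an assumption made here for illustration, and the helper names are hypothetical.

```python
import math

# Made-up 3-dimensional embeddings; real PubMed vectors have many more dimensions.
toy_vectors = {
    "coronavirus": [0.9, 0.1, 0.0],
    "sars": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def mean_vector(words):
    """Average the vectors of a multi-word phrase component-wise."""
    rows = [toy_vectors[word] for word in words]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


related = cosine_similarity(mean_vector(["coronavirus"]), mean_vector(["sars"]))
unrelated = cosine_similarity(mean_vector(["coronavirus"]), mean_vector(["banana"]))
```

With these toy values the related pair scores close to 1 and the unrelated pair close to 0, which is the property a similarity threshold relies on when resolving entities.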


In [None]:
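def get_pubmed_word_vector():
    # Download the pre-trained PubMed word2vec model (notebook shell command).
    ! wget http://evexdb.org/pmresources/vec-space-models/PubMed-w2v.bin

try:
    # Prefer the copy of the model attached to the notebook as a Kaggle dataset.
    model = gensim.models.KeyedVectors.load_word2vec_format("/kaggle/input/w2vmodel/PubMed-w2v.bin", binary=True)
except Exception:
    # Fall back to downloading the model if the attached copy cannot be loaded.
    get_pubmed_word_vector()
    model = gensim.models.KeyedVectors.load_word2vec_format("PubMed-w2v.bin", binary=True)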

In [None]:
def resolve_topic_list(G, topic_list, cuttoff = 0.85):
    """
    Resolve a list of topics against the nodes in the network graph G
    using the similarity metric provided by a word2vec model.
    
    Input
    G: networkx graph object
    topic_list: list of topics which will be compared against nodes in the network.
    cuttoff: similarity threshold a node must exceed to be resolved to a topic.
    
    Output
    resolution_dictionary: dictionary mapping each topic to a dictionary of
        {node: similarity} for the nodes scoring above cuttoff, ordered from
        most to least similar.
    """

    resolution_dictionary = {}
    for topic in topic_list:
        node_select={}
        for node in G.nodes():
            try:
                node_select[str(node)] = model.n_similarity(topic.split(' '), str(node).split(' '))
            except Exception:
                # Skip nodes that cannot be scored, e.g. words missing from the model vocabulary.
                pass

        resolution_dictionary[topic] = {k: v for k, v in sorted(node_select.items(), key=lambda item: item[1], reverse=True) if v > cuttoff}
    return resolution_dictionary

In [None]:
Disease = ['covid-19', 'coronavirus','severe acute respiratory syndrome']
resolution_dictionary_disease = resolve_topic_list(G, Disease, cuttoff = 0.85)

In [None]:
w = widgets.Dropdown(
    options=list(resolution_dictionary_disease.keys()),
    description='Task:',
    value = 'covid-19'
)

def plot_weightings(D):
    plt.figure(figsize = [10,len(D)+1])
    plt.barh(range(len(D)), list(D.values()), align='center')
    plt.yticks(range(len(D)), list(D.keys()))
    plt.show()


def on_change(change):
    value = change['new']  
    D = resolution_dictionary_disease[value]
    clear_output()
    display(w)
    plot_weightings(D)
        

w.observe(on_change,'value')
display(w)
plot_weightings(resolution_dictionary_disease[w.value])

# Merge graph nodes
The resolved entities can be used to merge nodes in the graph, which increases connectivity. Increased connectivity enables analysis of similar entities and how they link to our target term. 

In this example we use the resolved Disease terms to improve linkage to the covid-19 and SARS nodes.
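The effect of merging can be illustrated on a toy example. The sketch below assumes a resolution dictionary of the shape `{topic: {similar node: score}}` and a handful of made-up (subject, relation, object) triples. It rewrites every resolved alias to its topic and compares the neighbours of `covid-19` before and after. The data and the helper names (`build_replace_dict`, `neighbours`) are hypothetical and only mirror the approach taken in the cells that follow.

```python
# Hypothetical resolution output: topic -> {similar node: similarity score}
toy_resolution = {
    "covid-19": {"sars-cov-2": 0.91, "2019-ncov": 0.88},
}

# Hypothetical (subject, relation, object) triples extracted from publications
toy_triples = [
    ("sars-cov-2", "causes", "pneumonia"),
    ("2019-ncov", "spreads via", "droplet"),
    ("covid-19", "associated with", "fever"),
]


def build_replace_dict(resolution_dictionary):
    """Map every resolved alias back to the topic it was resolved to."""
    return {alias: topic
            for topic, matches in resolution_dictionary.items()
            for alias in matches}


def neighbours(triples, node):
    """Entities that share at least one triple with the given node."""
    found = set()
    for subject, _, obj in triples:
        if subject == node:
            found.add(obj)
        elif obj == node:
            found.add(subject)
    return found


replace = build_replace_dict(toy_resolution)
resolved = [(replace.get(s, s), r, replace.get(o, o)) for s, r, o in toy_triples]

before = neighbours(toy_triples, "covid-19")
after = neighbours(resolved, "covid-19")
print(sorted(before))
print(sorted(after))
```

After resolution, the triples that previously hung off the alias nodes are attached to `covid-19`, which is the increase in connectivity described above.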


In [None]:
def merge_resolved_nodes(G, resolution_dictionary):
    """
    Contract every node listed in the resolution dictionary into its root key.
    Suitable for a small number of terms only, since contracting nodes
    through the networkx API can be computationally expensive.
    """
    G_ = G.copy(as_view=False)
    for k, v in resolution_dictionary.items():
        for key in v.keys():
            try:
                G_ = nx.contracted_nodes(G_, k, key)
            except Exception:
                # Skip pairs that cannot be contracted, e.g. when a node is not in the graph.
                pass
    return G_

def replace_entities_in_df(resolution_dictionary):
    """
    An alternative to merge_resolved_nodes for larger dictionaries.
    Builds a dictionary mapping each resolved node to its root key,
    for use with the pandas replace API.
    """
    # prepare replace dictionary
    replace_dict = {}
    for k, v in resolution_dictionary.items():
        for key in v.keys():
            replace_dict[key] = k
    return replace_dict

In [None]:
replace_dict = replace_entities_in_df(resolution_dictionary_disease)
entity_pairs_disease_resolved_df = entity_pairs_df.replace(to_replace = replace_dict)

In [None]:
# A list of common aliases for covid-19
covid_19_names = ["2019 novel coronavirus disease", "2019 novel coronavirus infection", 
                  "2019-nCoV disease", "2019-nCoV infection", "COVID-19 pandemic",
                  "COVID-19 virus disease", "COVID-19 virus infection", "COVID19",
                  "SARS-CoV-2 infection", "coronavirus disease 2019", "coronavirus disease-19", "coronavirus"]

covid_19_names = [name.lower() for name in covid_19_names]

replace_dict = {}
for name in covid_19_names:
    replace_dict[name] = 'covid-19'
    
entity_pairs_disease_resolved_df = entity_pairs_disease_resolved_df.replace(to_replace = replace_dict)

G_ = create_kg(entity_pairs_disease_resolved_df)
print(nx.info(G_))

In [None]:
def resolve_entities_graph(G, Risks, entity_pairs_df):
    """
    Resolve the risk terms against the nodes of G, replace the matched
    entities in entity_pairs_df and rebuild the knowledge graph.
    """
    resolution_dictionary_risk = resolve_topic_list(G, Risks, cuttoff = 0.825)
    replace_dict = replace_entities_in_df(resolution_dictionary_risk)
    entity_pairs_resolved_df = entity_pairs_df.replace(to_replace = replace_dict)
    return create_kg(entity_pairs_resolved_df)

# Analysis
## Background Research

There are several different lenses through which to view the risk factors related to COVID-19. We concentrate on the following sub-tasks:

**1. Risk Factors Relating to Contracting COVID-19**

Some demographics and sections of society are at a higher risk of contracting COVID-19. Risk factors include:

* Occupation.

  - The World Health Organization (WHO) notes that "[p]eople most at risk of acquiring the disease are those who are in contact with or care for patients with COVID-19." [1] Front-line personnel and those working in jobs that cannot be performed remotely and that require attendance at a workplace and interaction with others are at increased risk of contracting COVID-19. These occupations include healthcare workers, support staff in hospitals and healthcare facilities, law enforcement and military, essential retail, warehouse and distribution staff.
  
  - There has been an especially high impact on healthcare workers. The European Centre for Disease Prevention and Control (ECDC) estimates that between 9% and 26% of the total number of diagnosed COVID-19 cases in EU and EEA countries are healthcare workers. [2]  In addition, the WHO notes that healthcare workers face hazards such as pathogen exposure, psychological distress, long working hours, fatigue, occupational burnout, stigma, and physical and psychological violence. [3-4] 
         
      
* Living in certain types of residential facility.

  - Residents of Long Term Care Facilities (LTCF) such as nursing homes have been identified as "vulnerable populations who are at a higher risk for adverse outcome and for infection due to living in close proximity to others" by the WHO. [5] Some of the hazards presented by LTCF include large populations in close confinement, leading to difficulties in implementing physical distancing and isolation measures, as well as shortages in Personal Protective Equipment (PPE) and insufficient training of staff members in infection control measures and use of PPE. In some situations healthcare workers may move between multiple LTCF, posing a risk of cross-contamination.  A high number of clusters of infections have been reported in LTCF across Europe and the United States. [6-9]   
  
   
* Travel to regions with high levels of community transmission. [10]


* Close contact with an infected person. [10]


* Belonging to a vulnerable section of society, where there may be reduced ability to practice public health measures such as hand-washing and self-isolating. The ECDC notes that, with respect to communicating the risks around COVID-19, "[v]ulnerable individuals including the elderly, those with underlying health conditions, disabled people, people with mental health problems, homeless people, and undocumented migrants will require extra support..." [11]
    


**List of Risk Factors Relating to Contracting COVID-19 - Occupational Risk Factors - Healthcare Workers**

['pathogen exposure', 'droplet', 'fomites', 'airborne', 'respiratory hygiene', 'PPE', 'personal protective equipment', 'aerosol', 'particulate respirator', 'secretions', 'fluids', 'intubation', 'ventilation', 'IPC', 'infection prevention control', 'gloves', 'mask', 'face shield', 'goggles', 'gowns', 'supply chain', 'disruption']

References: [7, 22, 59, 60]

In [None]:
Risk_Occupation = ['pathogen exposure', 'droplet', 'fomites', 'airborne', 'respiratory hygiene',
 'PPE', 'personal protective equipment', 'aerosol', 'particulate respirator',
 'secretions', 'fluids', 'intubation', 'ventilation', 'IPC', 'infection prevention control',
 'gloves', 'mask', 'face shield', 'goggles', 'gowns', 'supply chain', 'disruption']

In [None]:
G_risk = resolve_entities_graph(G, Risk_Occupation, entity_pairs_disease_resolved_df)
print(nx.info(G_risk))

## Get all simple paths and display a sample
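A simple path is one that visits no node more than once, and the cutoff limits how many edges a path may use. The sketch below enumerates such paths on a small invented graph. The function `simple_paths` and the adjacency dictionary `toy_graph` are hypothetical stand-ins and not the notebook's own `get_all_simple_paths` helper, which additionally collects the publications attached to each edge.

```python
def simple_paths(graph, source, target, cutoff):
    """Yield each path from source to target that repeats no node
    and uses at most `cutoff` edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        # A path with len(path) - 1 edges cannot be extended past the cutoff.
        if len(path) - 1 >= cutoff:
            continue
        for neighbour in graph.get(node, []):
            if neighbour in path:
                continue  # revisiting a node would make the path non-simple
            if neighbour == target:
                yield path + [neighbour]
            else:
                stack.append((neighbour, path + [neighbour]))


# Invented undirected graph stored as an adjacency dictionary.
toy_graph = {
    "covid-19": ["droplet", "mask"],
    "droplet": ["covid-19", "mask", "ppe"],
    "mask": ["covid-19", "droplet", "ppe"],
    "ppe": ["droplet", "mask"],
}

short_paths = sorted(simple_paths(toy_graph, "covid-19", "ppe", cutoff=2))
longer_paths = sorted(simple_paths(toy_graph, "covid-19", "ppe", cutoff=3))
for path in longer_paths:
    print("  --->  ".join(path))
```

Raising the cutoff admits longer, more indirect chains between the virus and a risk term, at the cost of many more paths to inspect.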

In [None]:
Virus = 'covid-19'
node_risk_dict = {}
for risk in Risk_Occupation:
    print(Virus + ' - has relationship with - ' + risk)
    print('\n\n')
    publications, file_paths, nodes = get_all_simple_paths(G_risk, source = Virus, target = risk, cutoff=4)
    
    publications = get_corpus_labels('/kaggle/input/publication-link-analysis/corpus_documents_lookup.json',publications).values()

    print('\n From the following publications: \n')
    print('\n'.join(publications))
    print('\n -----------------------------------------------------------------------------------')
    node_risk_dict[risk] = {'nodes': [node for node_list in nodes for node in node_list], 
                            'publications': dict(zip(publications, file_paths))}

## Display Sub-Graph Associated with Risk 

In [None]:
out = widgets.Output(layout={'border': '1px solid black'})


w1 = widgets.Dropdown(
    options=list(node_risk_dict.keys()),
    description='Task:',
)

def display_entities(text):
    doc = nlp(text)
    html = displacy.render(doc, style="ent")
    display(HTML(html))
    
def on_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        value = change['new']
        dictionary = node_risk_dict[value]
        clear_output()
        display(w1)
        plot_sub_graph(G_, dictionary['nodes'], font = 40)
        try:
            w2.options = dictionary['publications'].keys()
        except:
            pass



w1.observe(on_change)       

display(w1)

dictionary = node_risk_dict[w1.value]
plot_sub_graph(G_, dictionary['nodes'], font = 40)


## Display Named Entity Recognition for Text of Publications

In [None]:
out = widgets.Output(layout={'border': '1px solid black'})


w2 = widgets.Dropdown(
    options=list(node_risk_dict[w1.value]['publications'].keys()),
    description='Task:',
)

def display_entities(text):
    doc = nlp(text)
    html = displacy.render(doc, style="ent",minify=True)
    display(HTML(html))
    
def on_change_print(change):
    if change['type'] == 'change' and change['name'] == 'value':
        value = change['new']
        doc = document(node_risk_dict[w1.value]['publications'][value])
        doc.combine_text()
        clear_output()
        display(w2)
        display_entities(doc.text[0].text)
        
w2.observe(on_change_print)

display(w2)
doc = document(node_risk_dict[w1.value]['publications'][w2.value])
doc.combine_text()
display_entities(doc.text[0].text)

## Output findings for later analysis

In [None]:
with open('/kaggle/working/Risk_Factors_Relating_to_Contracting_results.json', 'w') as f:
    json.dump(node_risk_dict, f)

# Per risk factor, keep the unique publications; also build parallel lists
# pairing every publication with the risk factor it was found under.
refined_results = {}
agg_pub = []
agg_risk = []
for k, v in node_risk_dict.items():
    refined_results[k] = list(set(v['publications']))
    agg_pub.extend(v['publications'])
    agg_risk.extend([k] * len(v['publications']))

with open('/kaggle/working/Risk_Factors_Relating_to_Contracting_results_refined.json', 'w') as f:
    json.dump(refined_results, f)

## Get Top Publications

In [None]:
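def display_top_texts(agg_pub, agg_risk):
    """
    Rank publications by the number of risk factors they are linked to.
    Returns a DataFrame indexed by publication with the occurrence count
    and the set of associated risks, sorted by descending occurrence.
    """
    pd.options.display.max_rows = 25
    df = pd.DataFrame({'Publications':agg_pub, 'Risks':agg_risk})
    df['Occurrence'] = 1
    df = df.groupby('Publications').agg({'Occurrence':['sum'], 'Risks':[lambda x: set(x)]})
    df.columns = df.columns.droplevel(1)
    return df.sort_values('Occurrence', ascending=False)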

df_contracting = display_top_texts(agg_pub,agg_risk)
df_contracting.to_csv('/kaggle/working/Risk_Factors_Relating_to_Contracting.csv')

df_contracting


**2. Risk Factors Relating to the Likelihood of Experiencing a Severe Illness Once Infected**

The WHO advises that 80% of COVID-19 patients will experience a mild illness, approximately 14% will experience a severe illness, and 5% will be critically ill. [12]  Those patients experiencing severe and critical illness will require hospitalisation and in some cases specialised medical care such as admission to Intensive Care Units (ICU) and access to ventilators and other equipment.

Once infected, what are the risk factors associated with a more severe illness?

    
* Smoking - Risk factors include increased hand-to-mouth contact, as well as increased likelihood of pre-existing lung conditions or reduced lung capacity. [13]
* Pre-existing or underlying medical conditions (co-morbidities) such as high blood pressure, cardiovascular disease, chronic respiratory disease, cancer, diabetes, chronic kidney disease, liver disease, as well as conditions causing patients to be immunocompromised. [5, 10, 14-16]
* Age - The WHO notes that COVID-19 "causes higher mortality in people aged ≥60 years". [5] Similarly, the Centers for Disease Control and Prevention (CDC) advise that "8 out of 10 deaths reported in the U.S. have been in adults 65 years old and older" [17], while in China, "the case fatality rate was highest among older persons..." [16]
* Size of 'infective dose' - The number of viral particles received initially may be correlated with the progression of the infection, as "...the outcome of infection...can sometimes be determined by how much virus actually got into your body and started the infection off..." [18]


**List of Risk Factors Relating to the Likelihood of Experiencing a Severe Illness Once Infected**

['smoking', 'reduced lung capacity', 'hand to mouth contact', 'lung disease', 'oxygen', 'chronic', 'cancer', 'high blood pressure', 'diabetes', 'cardiovascular', 'chronic respiratory disease', 'heart disease', 'liver disease', 'kidney disease', 'immunocompromised', 'comorbidities', 'infective dose', 'infectious dose', 'viral load']

References: [5, 10, 13-16, 18]

In [None]:
Risk_Likelyhood = ['smoking', 'reduced lung capacity', 'hand to mouth contact', 'lung disease',
                   'oxygen', 'chronic', 'cancer', 'high blood pressure', 'diabetes', 'cardiovascular',
                   'heart', 'chronic respiratory disease', 'heart disease', 'liver disease', 'kidney disease',
                   'immunocompromised', 'comorbidities', 'infective dose', 'infectious dose', 'viral load']


In [None]:

G_risk = resolve_entities_graph(G, Risk_Likelyhood, entity_pairs_disease_resolved_df)
print(nx.info(G_risk))

## Get all simple paths and display a sample

In [None]:
Virus = 'covid-19'
node_risk_dict = {}
for risk in Risk_Likelyhood:
    print(Virus + ' - has relationship with - ' + risk)
    print('\n\n')
    publications, file_paths, nodes = get_all_simple_paths(G_risk, source = Virus, target = risk, cutoff=4)
    
    publications = get_corpus_labels('/kaggle/input/publication-link-analysis/corpus_documents_lookup.json',publications).values()

    print('\n From the following publications: \n')
    print('\n'.join(publications))
    print('\n -----------------------------------------------------------------------------------')
    node_risk_dict[risk] = {'nodes': [node for node_list in nodes for node in node_list], 
                            'publications': dict(zip(publications, file_paths))}


## Display Sub-Graph Associated with Risk 

In [None]:
out = widgets.Output(layout={'border': '1px solid black'})


w1 = widgets.Dropdown(
    options=list(node_risk_dict.keys()),
    description='Task:',
)

w1.observe(on_change)       

display(w1)

dictionary = node_risk_dict[w1.value]
plot_sub_graph(G_, dictionary['nodes'], font = 40)



## Display Named Entity Recognition for Text of Publications

In [None]:
out = widgets.Output(layout={'border': '1px solid black'})


w2 = widgets.Dropdown(
    options=list(node_risk_dict[w1.value]['publications'].keys()),
    description='Task:',
)
        
w2.observe(on_change_print)

display(w2)
doc = document(node_risk_dict[w1.value]['publications'][w2.value])
doc.combine_text()
display_entities(doc.text[0].text)

## Output findings for later analysis

In [None]:
with open('/kaggle/working/Likelihood_of_Experiencing_a_Severe_Illness_results.json', 'w') as f:
    json.dump(node_risk_dict,f)
    
refined_results = {}
agg_pub = []
agg_risk = []
for k, v in node_risk_dict.items():
    refined_results[k] = list(set(v['publications']))
    agg_pub.append(list(v['publications']))
    for pub in v['publications']:
        agg_risk.append(k)
agg_pub = [pub for pubs in agg_pub for pub in pubs]
    
with open('/kaggle/working/Likelihood_of_Experiencing_a_Severe_Illness_results_refined.json', 'w') as f:
    json.dump(refined_results,f)

## Get Top Publications

In [None]:
df_severe_illness = display_top_texts(agg_pub,agg_risk)
df_severe_illness.to_csv('/kaggle/working/Likelihood_of_Experiencing_a_Severe_Illness.csv')

df_severe_illness


**3. Risk Factors Relating to Community Transmission**

There are a number of aspects that determine the rate of community transmission.

* Incubation periods can be up to 14 days, with an average of 5-6 days. [19] During this time, pre-symptomatic people can transmit the virus. [19-20] In addition, asymptomatic transmission can occur from people who do not experience symptoms. [19, 21]


* Public health measures can be imposed with the aim of "limit[ing] the impact of the pandemic on healthcare systems and vulnerable population groups by delaying the epidemic peak and decreasing the magnitude of the peak". [11]
   
   * There are a number of public health measures which can be rolled out. Contact tracing involves the systematic identification of "all social, familial/household, work, health care, and any other contacts" [22] with a view to limiting pre-symptomatic or asymptomatic spread. Widespread testing is vital to understand the levels of infection, with the WHO identifying "[l]imited testing capacity in many countries globally" as a risk factor for the spread of COVID-19. [23] Frequent and thorough hand-washing and social or physical distancing when in public are recommended by the WHO and ECDC. [11, 24]
   
   * Restrictions on mass gatherings [25] and the closure of schools, universities, creches, workplaces, shops, restaurants and bars can be implemented [26], as well as restrictions on citizens leaving their homes. [27]
   
   * The effectiveness of the public health measures will depend on when they are introduced, and the levels of public compliance with the measures and restrictions. The ECDC notes that "[m]onitoring systems should be put in place to observe public perceptions, opinions and compliance with individual measures." [11] A related topic is how compliance is enforced. There has been a range of different approaches, from a relatively low number of restrictions in South Korea coupled with widespread testing [28-29], to country-wide quarantine in Italy [30], to relatively late introduction of restrictions in the UK which enabled mass gatherings such as the Cheltenham Festival to go ahead [31], to a relatively low testing rate in the US [32], to restrictions being enforced through fines and arrests in France [33-34], to mandatory quarantine for travelers in China [35].  Other considerations are the manner in which restrictions are communicated, with reports of mass travel in Italy arising from planned restrictions being leaked by media organisations. [36-37] The proliferation of misinformation and disinformation on social media can also affect public compliance with health measures, with the ECDC noting that "[p]rocedures for identifying and rapidly addressing misinformation, disinformation and rumours, especially on social media platforms, should be established." [11]

**List of Risk Factors Relating to Community Transmission**

['surfaces', 'contamination', 'pathogens', 'isolation', 'quarantine', 'physical distancing', 'social distancing', 'recognition', 'identification', 'source control', 'fever', 'cough', 'shortness of breath', 'contacts', 'human-to-human transmission', 'triage', 'respiratory hygiene', 'coughing', 'sneezing', 'elbow', 'hand hygiene', 'awareness', 'mitigation', 'delay', 'peak', 'testing', 'contact tracing', 'surveillance', 'early detection', 'risk communication', 'infection control', 'impact on healthcare system', 'preparedness', 'public awareness', 'compliance', 'misinformation', 'disinformation', 'social media', 'surge', 'vulnerable', 'asymptomatic', 'pre-symptomatic', 'droplet', 'mass gatherings', 'restrictions', 'public health', 'incubation']


References:  [5, 11, 19, 22, 60]

In [None]:
Risk_Community = ['surfaces', 'contamination', 'pathogens', 'isolation', 'quarantine',
                  'physical distancing', 'social distancing', 'recognition', 'identification',
                  'source control', 'fever', 'cough', 'shortness of breath', 'contacts',
                  'human-to-human transmission', 'triage', 'respiratory hygiene', 'coughing',
                  'sneezing', 'elbow', 'hand hygiene', 'awareness', 'mitigation', 'delay', 'peak',
                  'testing', 'contact tracing', 'surveillance', 'early detection', 'risk communication',
                  'infection control', 'impact on healthcare system', 'preparedness', 'public awareness',
                  'compliance', 'misinformation', 'disinformation', 'social media', 'surge', 'vulnerable', 
                  'asymptomatic', 'pre-symptomatic', 'droplet', 'mass gatherings', 'restrictions',
                  'public health', 'incubation']

In [None]:
G_risk = resolve_entities_graph(G, Risk_Community, entity_pairs_disease_resolved_df)
print(nx.info(G_risk))

## Get all simple paths and display a sample

In [None]:
Virus = 'covid-19'
node_risk_dict = {}
for risk in Risk_Community:
    print(Virus + ' - has relationship with - ' + risk)
    print('\n\n')
    publications, file_paths, nodes = get_all_simple_paths(G_risk, source = Virus, target = risk, cutoff=4)
    
    publications = get_corpus_labels('/kaggle/input/publication-link-analysis/corpus_documents_lookup.json',publications).values()

    print('\n From the following publications: \n')
    print('\n'.join(publications))
    print('\n -----------------------------------------------------------------------------------')
    node_risk_dict[risk] = {'nodes': [node for node_list in nodes for node in node_list], 
                            'publications': dict(zip(publications, file_paths))}

## Display Sub-Graph Associated with Risk 

In [None]:
out = widgets.Output(layout={'border': '1px solid black'})


w1 = widgets.Dropdown(
    options=list(node_risk_dict.keys()),
    description='Task:',
)

w1.observe(on_change)       

display(w1)

dictionary = node_risk_dict[w1.value]
plot_sub_graph(G_, dictionary['nodes'], font = 40)


## Display Named Entity Recognition for Text of Publications

In [None]:
out = widgets.Output(layout={'border': '1px solid black'})


w2 = widgets.Dropdown(
    options=list(node_risk_dict[w1.value]['publications'].keys()),
    description='Task:',
)
        
w2.observe(on_change_print)

display(w2)
doc = document(node_risk_dict[w1.value]['publications'][w2.value])
doc.combine_text()
display_entities(doc.text[0].text)

## Output findings for later analysis

In [None]:
with open('/kaggle/working/Risk_Factors_Relating_to_Community_Transmission_results.json', 'w') as f:
    json.dump(node_risk_dict,f)
    
refined_results = {}
agg_pub = []
agg_risk = []
for k, v in node_risk_dict.items():
    refined_results[k] = list(set(v['publications']))
    agg_pub.append(list(v['publications']))
    for pub in v['publications']:
        agg_risk.append(k)
agg_pub = [pub for pubs in agg_pub for pub in pubs]
    
with open('/kaggle/working/Risk_Factors_Relating_to_Community_Transmission_results_refined.json', 'w') as f:
    json.dump(refined_results,f)

## Get Top Publications

In [None]:
df_community = display_top_texts(agg_pub,agg_risk)
df_community.to_csv('/kaggle/working/Risk_Factors_Relating_to_Community_Transmission.csv')

df_community


**4.  Risk Factors Relating to Adverse Socio-economic Impacts from the Virus**

Adverse socio-economic impacts include loss of income through loss of employment, reduction in hours, etc.; loss of access to educational opportunities, e.g. closure of schools and universities, and uncertainty around exams; and inability or reluctance to access medical care for non-COVID-19 issues.

Risk factors are present at both macro and micro levels, i.e. the risk of a particular country experiencing adverse socio-economic impacts, and the risk of a particular citizen experiencing adverse socio-economic impacts.

Risk factors include:

* The severity and duration of the lockdown - The world is now in recession, with the International Monetary Fund (IMF) noting that the "economic damage is mounting across all countries". [38] The IMF expects this recession to be worse than the global financial crisis [38] and possibly "the worst economic fallout since the Great Depression". [39]


* Public compliance with lockdown measures over an extended period of time - There are concerns that 'isolation fatigue' may lead to increased flouting of restrictions. [40] In addition, there have been examples of growing social unrest in regions which have experienced severe restrictions for several weeks [41], as well as concerns that measures such as hotlines for informing on neighbours breaking restrictions may lead to social division. [42]


* The exit strategy from lockdown - The lifting of restrictions must be balanced against the risk of a possible second wave of infections. [43-45] There is a possibility of a cycle of lockdowns and easing of restrictions being implemented over several months. [46]


* Sector of employment - Sectors particularly impacted include aviation [47], tourism [48], hospitality [49], non-essential retail [49], manufacturing [49], and food production and agriculture [50].


* Type of employment - Temporary, part-time, seasonal and casual workers, and those on lower pay, minimum wage, or hourly contracts will be severely impacted.  The self-employed and those working in small to medium size enterprises are also at risk of adverse impacts. In the US, the initial jobless claims for the last two weeks in March exceeded 10 million. [51]


* Age - School and university students will be impacted by closures, loss of educational opportunities, and mental health issues relating to uncertainty around exams and future plans. Over 91% of the global student population are affected, with vulnerable and disadvantaged communities most severely impacted. [52] The United Nations Educational, Scientific and Cultural Organization (UNESCO) advises that school closures may lead to "increased drop-out rates which will disproportionately affect adolescent girls, further entrench gender gaps in education and lead to increased risk of sexual exploitation, early pregnancy and early and forced marriage." [53]


* Ethnicity - There is evidence from a number of countries including the US [54] and the UK [55] that ethnic minorities are disproportionately affected by COVID-19. Proposed reasons include unequal access to healthcare and a greater proportion of ethnic minorities working in essential industries where remote working is not feasible. [54]


* Risk of preventable non-COVID-19 deaths - Hospitals globally are reporting significantly lower than normal admissions for non-COVID-19 conditions, leading to fears that people are ignoring warning signs as they are afraid of contracting COVID-19 while in hospital. [56-57]  Similar drops in emergency room admissions were seen in Canada during the SARS outbreak of 2002 [56], while in West Africa during the Ebola outbreak of 2014-2016, "more people died from lack of health-care access for non-Ebola needs than Ebola itself". [56]


* Impact of long-term isolation on mental health - People living under long-term restrictions may experience mental health issues. In addition, the uncertainty coupled with the constant stream of news about the progression of the pandemic can lead to worry and anxiety. [58]

**List of Risk Factors Relating to Adverse Socio-Economic Impact**

['economy', 'unemployment', 'job loss', 'redundancy', 'layoffs', 'jobless claim', 'unemployment benefit', 'sector', 'aviation', 'tourism', 'hospitality', 'retail', 'agriculture', 'manufacturing', 'restrictions', 'lockdown', 'shutdown', 'exit strategy', 'social welfare', 'school closure', 'ethnic minority', 'disproportionate', 'secondary deaths', 'social unrest', 'isolation fatigue', 'recession', 'depression', 'mental health', 'anxiety', 'worry', 'stress']


References:  [38, 41, 47-52, 54-55, 58]

In [None]:
Risk_Socio_Economic = ['economy', 'unemployment', 'job loss', 'redundancy', 'layoffs',
 'jobless claim', 'unemployment benefit', 'sector', 'aviation',
 'tourism', 'hospitality', 'retail', 'agriculture', 'manufacturing',
 'restrictions', 'lockdown', 'shutdown', 'exit strategy', 'social welfare',
 'school closure', 'ethnic minority', 'disproportionate', 'secondary deaths',
 'social unrest', 'isolation fatigue', 'recession', 'depression', 'mental health',
 'anxiety', 'worry', 'stress']

In [None]:
G_risk = resolve_entities_graph(G, Risk_Socio_Economic, entity_pairs_disease_resolved_df)
print(nx.info(G_risk))

## Get all simple paths and display a sample

In [None]:
Virus = 'covid-19'
node_risk_dict = {}
for risk in Risk_Socio_Economic:
    print(Virus + ' - has relationship with - ' + risk)
    print('\n\n')
    publications, file_paths, nodes = get_all_simple_paths(G_risk, source = Virus, target = risk, cutoff=4)
    
    publications = get_corpus_labels('/kaggle/input/publication-link-analysis/corpus_documents_lookup.json',publications).values()

    print('\n From the following publications: \n')
    print('\n'.join(publications))
    print('\n -----------------------------------------------------------------------------------')
    node_risk_dict[risk] = {'nodes': [node for node_list in nodes for node in node_list], 
                            'publications': dict(zip(publications, file_paths))}

## Display Sub-Graph Associated with Risk 

In [None]:
out = widgets.Output(layout={'border': '1px solid black'})


w1 = widgets.Dropdown(
    options=list(node_risk_dict.keys()),
    description='Task:',
)



w1.observe(on_change)       

display(w1)

dictionary = node_risk_dict[w1.value]
plot_sub_graph(G_, dictionary['nodes'], font = 40)


## Display Named Entity Recognition for Text of Publications

In [None]:
out = widgets.Output(layout={'border': '1px solid black'})


w2 = widgets.Dropdown(
    options=list(node_risk_dict[w1.value]['publications'].keys()),
    description='Task:',
)
        
w2.observe(on_change_print)

display(w2)
doc = document(node_risk_dict[w1.value]['publications'][w2.value])
doc.combine_text()
display_entities(doc.text[0].text)

## Output findings for later analysis

In [None]:
with open('/kaggle/working/Risk_Factors_Relating_to_Socio_Economic_results.json', 'w') as f:
    json.dump(node_risk_dict,f)
    
refined_results = {}
agg_pub = []
agg_risk = []
for k, v in node_risk_dict.items():
    refined_results[k] = list(set(v['publications']))
    agg_pub.append(list(v['publications']))
    for pub in v['publications']:
        agg_risk.append(k)
agg_pub = [pub for pubs in agg_pub for pub in pubs]

with open('/kaggle/working/Risk_Factors_Relating_to_Socio_Economic_results_refined.json', 'w') as f:
    json.dump(refined_results,f)

## Get Top Publications

In [None]:
df_socio_economic = display_top_texts(agg_pub,agg_risk)
df_socio_economic.to_csv('/kaggle/working/Risk_Factors_Relating_to_Socio_Economic.csv')

df_socio_economic

# Conclusions

The linkages and connections between the publications in the corpus have been displayed using a network graph, where the nodes of the graph represent entities extracted from the publications and the edges represent the relations linking those entities, each tagged with its source publication. Entity resolution has been applied to the graph to enhance connectivity. A word2vec model pre-trained on a PubMed corpus was used to resolve the entities.

Several topics relating to COVID-19 risk factors were identified. To research each topic and construct a list of risk factors, we reviewed documents and resources from the World Health Organisation, the European Centre for Disease Prevention and Control, the Centers for Disease Control and Prevention, and other reputable organisations and news outlets.

These lists of risk factors were then used to assess the linkage between each of the terms and 'covid-19' in the network graph. The shorter the path connecting a term with 'covid-19', the stronger the association.
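The path-length idea can be sketched as follows. This is an illustrative example under stated assumptions, not the notebook's implementation: the toy graph, its node names and the helper functions `path_length` and `rank_terms` are hypothetical, and a plain breadth-first search stands in for whatever graph library is used elsewhere.

```python
from collections import deque

# Hypothetical toy graph as an adjacency mapping (undirected edges listed both ways).
toy_graph = {
    "covid-19": {"pneumonia", "health worker"},
    "pneumonia": {"covid-19", "smoking"},
    "health worker": {"covid-19"},
    "smoking": {"pneumonia", "copd"},
    "copd": {"smoking"},
    "overcrowding": set(),  # isolated node: no path to 'covid-19'
}


def path_length(graph, source, target):
    """Return the number of edges on the shortest path, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def rank_terms(graph, terms, target="covid-19"):
    """Sort reachable terms by path length to the target, shortest first."""
    distances = {term: path_length(graph, term, target) for term in terms}
    reachable = {t: d for t, d in distances.items() if d is not None}
    return sorted(reachable.items(), key=lambda item: item[1])


ranking = rank_terms(toy_graph, ["copd", "health worker", "smoking", "overcrowding"])
print(ranking)
```

In this toy graph 'health worker' is one edge from 'covid-19' and so ranks as the strongest association, while 'overcrowding' has no path and is left out of the ranking.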

For each of the risk factors, the list of related publications has been determined, along with the linked nodes in the network graph, and these results have been output as a JSON file. The publications have also been ranked by their frequency of occurrence across the different risk factors, i.e. by how many risk factors each publication is linked to. This identifies the most relevant publications across a range of risk factors.
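The frequency ranking can be illustrated with a short sketch. It is not the notebook's `display_top_texts` function; the data, the publication names and the helper `rank_publications` are hypothetical, and the example only shows the counting idea.

```python
from collections import Counter

# Hypothetical mapping of risk factor -> linked publications.
toy_results = {
    "smoking": ["paper_a", "paper_b"],
    "diabetes": ["paper_b", "paper_c"],
    "age": ["paper_b", "paper_a"],
}


def rank_publications(results):
    """Count how many risk factors each publication is linked to.

    Returns (publication, count) pairs, most frequent first; ties are
    broken alphabetically so the order is deterministic.
    """
    counts = Counter()
    for publications in results.values():
        # set() ensures a publication counts once per risk factor
        counts.update(set(publications))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


top = rank_publications(toy_results)
for publication, n_risks in top:
    print(f"{publication}: linked to {n_risks} risk factor(s)")
```

Here `paper_b` is linked to all three toy risk factors and therefore ranks first.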

# Next Steps

This notebook presents the analysis and results of a body of work that includes:
1. [Background Research](https://www.kaggle.com/cddata/documenting-sub-tasks-dictionaries)
2. [Data Loading and Processing](https://www.kaggle.com/johndoyle/load-and-process-data-abstracts)
3. [Publications Analysis](https://www.kaggle.com/johndoyle/publication-link-analysis)

Each of these pieces of work can be continued and its approach expanded. The tasks that would be targeted first are:
1. Custom entity tagging using a manually created or automatically generated medical ontology.
2. Graph enrichment and linkage improvement using the text body of the publications.
3. Enrichment of risks using the custom [word embedding](https://www.kaggle.com/piyushrumao/word-embedding-analysis-on-covid-19-dataset), [document embedding](https://www.kaggle.com/piyushrumao/doc2vec-analysis-on-covid-19-datas) and [topic analysis](https://www.kaggle.com/piyushrumao/topic-modelling-analysis-on-covid-19-dataset) approaches that were explored during this project.
4. Targeted content extraction from publications to further reduce the burden of knowledge extraction.





# References


[1] https://apps.who.int/iris/bitstream/handle/10665/331496/WHO-2019-nCov-HCW_risk_assessment-2020.2-eng.pdf

[2] https://www.ecdc.europa.eu/sites/default/files/documents/covid-19-rapid-risk-assessment-coronavirus-disease-2019-eighth-update-8-april-2020.pdf

[3] https://iris.wpro.who.int/bitstream/handle/10665.1/14482/COVID-19-022020.pdf

[4] https://www.who.int/publications-detail/coronavirus-disease-(covid-19)-outbreak-rights-roles-and-responsibilities-of-health-workers-including-key-considerations-for-occupational-safety-and-health

[5] https://apps.who.int/iris/bitstream/handle/10665/331508/WHO-2019-nCoV-IPC_long_term_care-2020.1-eng.pdf

[6] https://www.theguardian.com/world/2020/apr/02/coronavirus-outbreaks-us-nursing-homes-lockdowns

[7] https://apnews.com/6b9b54cb3626906277232b8e2bdcf69a

[8] https://www.euronews.com/2020/03/24/coronavirus-elderly-found-dead-and-abandoned-in-spanish-nursing-homes

[9] https://www.theguardian.com/world/2020/apr/09/covid-19-hundreds-of-uk-care-home-deaths-not-added-to-official-toll

[10] https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200407-sitrep-78-covid-19.pdf?sfvrsn=bc43e1b_2

[11] https://www.ecdc.europa.eu/sites/default/files/documents/RRA-sixth-update-Outbreak-of-novel-coronavirus-disease-2019-COVID-19.pdf 

[12] https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200301-sitrep-41-covid-19.pdf?sfvrsn=6768306d_2    

[13] https://www.who.int/news-room/q-a-detail/q-a-on-smoking-and-covid-19

[14] https://www.who.int/news-room/q-a-detail/q-a-coronaviruses

[15] https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/groups-at-higher-risk.html

[16] https://www.cdc.gov/coronavirus/2019-ncov/hcp/clinical-guidance-management-patients.html

[17] https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/older-adults.html

[18] https://www.sciencemediacentre.org/expert-reaction-to-questions-about-covid-19-and-viral-load/

[19] https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200402-sitrep-73-covid-19.pdf?sfvrsn=5ae25bc7_2

[20] https://www.ecdc.europa.eu/sites/default/files/documents/covid-19-guidance-discharge-and-ending-isolation-first%20update.pdf

[21] https://www.weforum.org/agenda/2020/03/people-with-mild-or-no-symptoms-could-be-spreading-covid-19/

[22] https://www.who.int/publications-detail/considerations-in-the-investigation-of-cases-and-clusters-of-covid-19

[23] https://www.who.int/publications-detail/strategic-preparedness-and-response-plan-for-the-new-coronavirus

[24] https://www.who.int/emergencies/diseases/novel-coronavirus-2019/advice-for-public

[25] https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/points-of-entry-and-mass-gatherings

[26] https://www.ecdc.europa.eu/sites/default/files/documents/covid-19-social-distancing-measuresg-guide-second-update.pdf

[27] https://www.ecdc.europa.eu/en/publications-data/video-covid-19-stay-home-importance-social-distancing

[28] https://www.weforum.org/agenda/2020/03/south-korea-covid-19-containment-testing/

[29] https://www.sciencemag.org/news/2020/03/coronavirus-cases-have-dropped-sharply-south-korea-whats-secret-its-success

[30] https://www.bbc.com/news/world-europe-51810673

[31] https://www.irishtimes.com/sport/racing/cheltenham-faces-criticism-after-racegoers-suffer-covid-19-symptoms-1.4219458

[32] https://www.worldometers.info/coronavirus/covid-19-testing/

[33] https://www.france24.com/en/20200318-france-coronavirus-lockdown-violation-attestation-epidemic-christophe-castaner-public-health

[34] https://www.connexionfrance.com/French-news/France-arrests-people-for-flouting-confinement-rules-as-Macron-calls-for-confinement-rules-to-be-taken-more-seriously

[35] https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30421-9/fulltext

[36] https://www.cnbc.com/2020/03/09/italys-quarantine-provokes-panic-italian-stocks-plunge.html

[37] https://www.theguardian.com/world/2020/mar/08/leaked-coronavirus-plan-to-quarantine-16m-sparks-chaos-in-italy

[38] https://blogs.imf.org/2020/04/06/an-early-view-of-the-economic-impact-of-the-pandemic-in-5-charts/

[39] https://economictimes.indiatimes.com/news/economy/indicators/covid-19-imf-anticipates-sharply-negative-economic-growth-fallout-since-the-great-depression/articleshow/75067158.cms

[40] https://www.theguardian.com/world/2020/apr/04/uks-covid-19-lockdown-could-crumble-as-frustration-grows-police-warn

[41] https://www.theguardian.com/world/2020/mar/29/italy-sets-aside-400m-for-food-vouchers-as-social-unrest-mounts

[42] https://www.theguardian.com/uk-news/2020/apr/09/uk-police-tool-report-covid-19-rule-breakers-risks-fuelling-social-division

[43] https://www.irishtimes.com/news/world/us/us-immunologist-warns-against-hasty-return-to-business-as-usual-1.4227390

[44] https://edition.cnn.com/2020/04/10/asia/china-korea-singapore-coronavirus-second-wave-intl-hnk/index.html

[45] https://www.controlrisks.com/covid-19/covid-19-no-sign-of-a-second-wave-in-asia

[46] https://www.businessinsider.com/countries-may-need-more-lockdowns-coronavirus-2020-3?r=US&IR=T

[47] https://www.ilo.org/sector/Resources/publications/WCMS_741466/lang--en/index.htm

[48] https://www.ilo.org/sector/Resources/publications/WCMS_741468/lang--en/index.htm

[49] https://news.un.org/en/story/2020/04/1061322

[50] http://www.fao.org/2019-ncov/q-and-a/impact-on-food-and-agriculture/en/

[51] https://www.telesurenglish.net/news/US-Jobless-Claims-Surge-Rises-to-10-Million-Due-to-COVID-19-20200402-0013.html

[52] https://en.unesco.org/covid19/educationresponse

[53] https://en.unesco.org/news/covid-19-school-closures-around-world-will-hit-girls-hardest

[54] https://www.theguardian.com/world/2020/apr/08/its-a-racial-justice-issue-black-americans-are-dying-in-greater-numbers-from-covid-19

[55] https://www.icnarc.org/Our-Audit/Audits/Cmp/Reports

[56] https://www.cbc.ca/news/health/covid-19-emergency-departments-canada-1.5510778

[57] https://www.thejournal.ie/tony-holohan-hospital-symptoms-5065098-Apr2020/

[58] https://www.mayoclinic.org/diseases-conditions/coronavirus/in-depth/mental-health-covid-19/art-20482731

[59] https://www.who.int/publications-detail/infection-prevention-and-control-during-health-care-when-novel-coronavirus-(ncov)-infection-is-suspected-20200125

[60] https://apps.who.int/iris/bitstream/handle/10665/331498/WHO-2019-nCoV-IPCPPE_use-2020.2-eng.pdf