# Introduction
The collection of documents provided for this challenge covers a wide breadth of topic areas and research objectives.  To address any of the specific research questions posed in the 10 tasks for this challenge, it is useful to organize the corpus into meaningful topical areas and focus on the subset of documents most closely related to the question at hand.  There are many approaches to unsupervised document clustering and topic modeling.  One such approach is to organize the documents into a citation network and cluster them using an appropriate community detection algorithm.  This approach has two important advantages:
* Researchers self-organize into communities focused around granular topic areas, and a citation network model naturally leverages this information to cluster documents.
* Using citations to documents outside of the corpus enriches the dataset with additional external information.

In this submission, we provide simple tools to construct a citation network and demonstrate how it can be used to quickly identify a cluster of documents related to a particular task, specifically the following question:  
* What do we know about co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities?

Authored by the Regeneron Machine Learning Club:
* [Leland Barnard](https://www.kaggle.com/lbvigilantdata) (leland.barnard@gmail.com)
* [ytisserant](https://www.kaggle.com/ytisserant) (ytisserant@gmail.com)
* [cfmleig](https://www.kaggle.com/cfmleig) (cf.mleig@gmail.com)
* [Aly Hafez](https://www.kaggle.com/alyhafez95) (alyhafez95@gmail.com)
* [Jingggggggg](https://www.kaggle.com/jingggggggg) (Violet.hj@gmail.com)
* [OZhuYuan](https://www.kaggle.com/ozhuyuan) (yuanzhu26@gmail.com)
* [Katie Gayvert](https://www.kaggle.com/katiegayvert) (katiegayvert@gmail.com)
* [Daren Liu](https://www.kaggle.com/darenliu94) (darenliu@berkeley.edu)
* [Rickcopin](https://www.kaggle.com/rickcopin) (rickcopin@gmail.com)
* [jfkaggles](https://www.kaggle.com/jfpostbox) (jan.freudenberg@gmail.com)
* [Kevia Q](https://www.kaggle.com/keviaq) (kevia.q@gmail.com)

In [None]:
from IPython.display import Image
Image(filename='/kaggle/input/citationnetworkfigure/citationNetwork.png', width=400)

Documents 1, 2, and 3 from the corpus are likely to be related because they all cite a common external article.

For readability and portability, the bulk of the code for this submission is organized in an external repository, which can be installed via pip:

In [None]:
!pip install git+https://github.com/ReReMLclub/cord19utils.git/ --quiet

Documentation for cord19utils is available in the repository README: https://github.com/ReReMLclub/cord19utils

# The Citation Network Model
Using the bibliographic information provided for each document in the corpus, a citation network model is constructed by creating a node for each document in the corpus as well as for each external article that it cites.  Directed edges are drawn from each corpus document node to all of the external document nodes that it cites.  Once the network is constructed, documents are assigned to clusters, or communities, using the asynchronous fluid communities algorithm [1].
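
The structure described above can be sketched in plain Python.  This is a minimal illustration on toy data, not the cord19utils implementation: the `bibliography` dictionary and the helper names `build_citation_edges` and `cited_by` are hypothetical.  It shows how directed document-to-article edges expose corpus documents that share a cited article.

```python
from collections import defaultdict

# Hypothetical toy bibliography: corpus document id -> titles of articles it cites.
bibliography = {
    "doc1": ["External A", "External B"],
    "doc2": ["External A"],
    "doc3": ["External A", "External C"],
    "doc4": ["External C"],
}

def build_citation_edges(bibliography):
    """Return directed (corpus_document, cited_article) edges."""
    return [(doc, cited) for doc, refs in bibliography.items() for cited in refs]

def cited_by(edges):
    """Map each cited article to the set of corpus documents that cite it."""
    index = defaultdict(set)
    for doc, cited in edges:
        index[cited].add(doc)
    return index

edges = build_citation_edges(bibliography)
index = cited_by(edges)

# Documents that share a cited article are candidates for the same community.
print(sorted(index["External A"]))
```

Here doc1, doc2, and doc3 are linked through "External A", mirroring the figure above.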

In [None]:
import numpy as np 
import pandas as pd 
import glob
import json
from collections import defaultdict
import cord19utils
import holoviews as hv
import networkx as nx

In [None]:
hv.extension('bokeh')
hv.output(size=300)

# Reading the documents
The first step is to prepare the documents to be read into the graph builder.  The CorpusReader object from cord19utils is used to ingest and format the json files for the corpus documents.  Elements of the document loading procedure are based on [2].

In [None]:
root_path = '/kaggle/input/CORD-19-research-challenge'
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})

In [None]:
filePaths = glob.glob(f'{root_path}/**/*.json', recursive=True)

In [None]:
reader = cord19utils.CorpusReader(filePaths, meta_df)

# Constructing the graph
The document reader is then used by the GraphBuilder object to iterate through the corpus documents and populate the citation network.  Without any filtering, the graph can become quite dense, large, and noisy.  We therefore recommend setting cutoff values for both outward and inward citations.  See the documentation at https://github.com/ReReMLclub/cord19utils for a detailed explanation of the parameters below.
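
To make the idea of citation cutoffs concrete, here is one plausible reading sketched in plain Python.  It is an assumption, not the library's documented behavior: the function `filter_edges` and its parameters `min_out` and `min_in` are hypothetical, and the actual semantics of `citeOutCutoff`, `citeInCutoff`, and `weightBound` are defined in the cord19utils README.  The sketch keeps an edge only if its source document cites enough articles and its target article is cited often enough.

```python
from collections import Counter

# Hypothetical toy edges: (corpus_document, cited_article).
edges = [
    ("doc1", "A"), ("doc1", "B"),
    ("doc2", "A"), ("doc2", "B"),
    ("doc3", "A"),
    ("doc4", "C"),
]

def filter_edges(edges, min_out=2, min_in=2):
    """Keep edges whose source has >= min_out outgoing citations and whose
    target has >= min_in incoming citations (assumed meaning of the cutoffs).

    Degrees are computed once on the unfiltered edge list, so this is a
    single-pass filter rather than an iterative pruning.
    """
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    return [
        (src, dst)
        for src, dst in edges
        if out_degree[src] >= min_out and in_degree[dst] >= min_in
    ]

filtered = filter_edges(edges, min_out=2, min_in=2)
print(filtered)
```

Sparsely connected documents (doc3, doc4) and rarely cited articles (C) drop out, which is the kind of noise reduction the cutoffs are meant to provide.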

In [None]:
builder = cord19utils.GraphBuilder(reader)

In [None]:
graph = builder.buildGraph(citeOutCutoff = 10, citeInCutoff = 50, weightBound = 3)
print(f'Number of nodes: {graph.number_of_nodes()}\nNumber of edges: {graph.number_of_edges()}')

# Assigning communities
After the graph has been constructed, the nodes in the graph are assigned to communities.  The number of communities is specified by the user, and this parameter is set empirically.  A higher number results in smaller, more numerous communities with specific and sometimes overlapping topic areas.  A lower number results in fewer, larger communities with broader topic areas that convey less meaning.  In practice we find that 20-30 communities leads to high-quality, interpretable communities without significant redundancy.
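
For readers who want to see the algorithm in isolation, the following sketch runs networkx's `asyn_fluidc` on a toy citation graph.  The toy edges are hypothetical, and the preprocessing shown here (converting to an undirected graph and keeping the largest connected component, both of which `asyn_fluidc` requires) is an assumption about what a wrapper would need to do, not a description of how cord19utils does it.

```python
import networkx as nx
from networkx.algorithms.community import asyn_fluidc

# Hypothetical toy citation graph: d* are corpus documents, upper-case
# letters are external articles.  d3 -> C bridges the two groups.
citations = nx.DiGraph()
citations.add_edges_from([
    ("d1", "A"), ("d1", "B"), ("d2", "A"), ("d2", "B"), ("d3", "A"), ("d3", "B"),
    ("d4", "C"), ("d4", "D"), ("d5", "C"), ("d5", "D"), ("d6", "C"), ("d6", "D"),
    ("d3", "C"),
])

# asyn_fluidc is defined for undirected, connected graphs only.
undirected = citations.to_undirected()
largest_component = max(nx.connected_components(undirected), key=len)
core = undirected.subgraph(largest_component).copy()

# k is the user-chosen number of communities; seed fixes the random order.
communities = [set(c) for c in asyn_fluidc(core, k=2, seed=0)]
for i, members in enumerate(communities):
    print(i, sorted(members))
```

The parameter `k` plays the role of `nCommunities`: raising it splits the graph into smaller groups, lowering it merges them.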

In [None]:
communities = builder.assignCommunities(graph, nCommunities = 25)

# Topic Modeling
Once communities have been discovered, topics are extracted from the title and abstract sections of their constituent documents via Latent Dirichlet Allocation (LDA).  This process is abstracted using the cord19utils TextProcessor object.
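
LDA operates on bag-of-words counts, so the text of each community has to be tokenized and counted first.  The sketch below illustrates that preparation step with the standard library only.  It is an assumption about the general shape of the preprocessing, not the TextProcessor implementation: the stopword list, the `tokenize` helper, and the toy titles are all hypothetical, and the most frequent term is used here only as a crude stand-in for a fitted topic.

```python
import re
from collections import Counter

# Hypothetical minimal stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "of", "and", "in", "a", "with", "for", "to"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords and very short tokens."""
    return [
        tok for tok in re.findall(r"[a-z]+", text.lower())
        if tok not in STOPWORDS and len(tok) > 2
    ]

# Hypothetical toy data: community id -> titles of its documents.
community_docs = {
    0: ["Asthma and viral infection in children",
        "Asthma exacerbation with respiratory virus"],
    1: ["Bat coronavirus genome sequencing",
        "Genome analysis of coronavirus in bats"],
}

# One bag-of-words per community: the input a topic model would consume.
community_bow = {
    cid: Counter(tok for doc in docs for tok in tokenize(doc))
    for cid, docs in community_docs.items()
}

# Crude label: the single most frequent term in each community.
top_terms = {cid: [term for term, _ in bow.most_common(1)]
             for cid, bow in community_bow.items()}
print(top_terms[0])
```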

In [None]:
proc = cord19utils.TextProcessor(graph)
proc.buildDictionary()
proc.assignCommunityTopics(communities, verbose = True)

# Visualizing Inter-community Relationships
The relationships between communities can be visualized as a supergraph, or a graph of graphs.  In this graph model, each node represents a community from the citation network, and edges represent citations from documents in one community to external articles in another.  These edges are assigned a weight equal to the total number of such citations.  Edges can be pruned by a minimum edge weight.  cord19utils can then be used to draw a chord graph to represent the relationships between communities.
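
The aggregation described above can be sketched in a few lines of plain Python.  This is an illustration on toy data, not the `buildSupergraph` implementation: the helper `build_supergraph`, the `community_of` mapping, and the assumption that an edge is kept when its weight is at least the cutoff are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: directed citation edges and a node -> community map.
edges = [
    ("doc1", "extA"), ("doc1", "extC"),
    ("doc2", "extA"), ("doc2", "extC"),
    ("doc3", "extC"), ("doc3", "extA"),
    ("doc4", "extC"),
]
community_of = {
    "doc1": 0, "doc2": 0, "extA": 0,
    "doc3": 1, "doc4": 1, "extC": 1,
}

def build_supergraph(edges, community_of, weight_cutoff=1):
    """Count citations that cross community boundaries and prune light edges.

    Returns {(source_community, target_community): weight} for every pair
    whose weight is at least weight_cutoff (assumed pruning rule).
    """
    weights = Counter()
    for src, dst in edges:
        c_src, c_dst = community_of[src], community_of[dst]
        if c_src != c_dst:  # citations within a community are not super-edges
            weights[(c_src, c_dst)] += 1
    return {pair: w for pair, w in weights.items() if w >= weight_cutoff}

super_edges = build_supergraph(edges, community_of, weight_cutoff=2)
print(super_edges)
```

With a cutoff of 2, only the heavier link from community 0 to community 1 survives, which is how pruning keeps the chord graph readable.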

In [None]:
sgraph = builder.buildSupergraph(graph, communities, weightCutoff = 200)

In [None]:
cord19utils.drawChordGraph(sgraph, proc.id2label)

A more focused chord graph, consisting of a single node and its neighbors, can be created using the nodeOfInterest parameter.  In this case, we are interested in articles pertaining to co-infections and co-morbidities, and we see that one of the communities is related to asthma, which may be an important factor in raising the risk of severe COVID-19 complications.

In [None]:
# Select the community whose topic label mentions asthma
for node, label in proc.id2label.items():
    if 'asthma' in label:
        nodeOfInterest = node

In [None]:
cord19utils.drawChordGraph(sgraph, proc.id2label, nodeOfInterest = nodeOfInterest)

Acknowledgements:

[1] Parés F., Garcia-Gasulla D. et al. [“Fluid Communities: A Competitive and Highly Scalable Community Detection Algorithm”](https://arxiv.org/pdf/1703.09307.pdf)

[2] [COVID-19 Literature Clustering](https://www.kaggle.com/maksimeren/covid-19-literature-clustering)