# COVID-19 literature network analysis using knowledge graphs and graph_tool
**This notebook showcases how network science can be used for the analysis of COVID-19 knowledge graphs.**   
     
**The data used to construct the present knowledge graph can be found [here](https://github.com/GillesVandewiele/COVID-KG/). We also take into account the data from the [CORD-19 dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).**
       
**For a sophisticated and fast network analysis, we use the *graph_tool* module.    
[*graph_tool* homepage](https://graph-tool.skewed.de/) [(Peixoto, 2014)](#cite-peixoto)**  


<a id="section-toc"></a>
# Table of Contents

* [0. Getting Started](#section-getting-started)
    - [0.1 Motivation](#subsection-motivation)
    - [0.2 Package installation](#subsection-installation)
    - [0.3 Import packages and data](#subsection-import)
    - [0.4 Generate a network from the knowledge graph using *graph_tool*](#subsection-generate)
    - [0.5 Understanding and analyzing knowledge graphs (from the network science perspective)](#subsection-understanding)
* [1. Centrality](#section-centrality)
    - [1.1 Degree centrality](#subsection-degree)
    - [1.2 Closeness centrality](#subsection-closeness)
    - [1.3 Betweenness centrality](#subsection-betweenness)
    - [1.4 PageRank](#subsection-pagerank)
    - [1.5 HITS](#subsection-hits)
    - [1.6 Other centrality measures](#subsection-other)
* [2. Clustering](#section-clustering)
    - [2.1 Local clustering](#subsection-local)
    - [2.2 Global clustering](#subsection-global)
* [3. Assortativity](#section-assortativity)
* [4. Combining insights from several measures for more in-depth analysis](#section-combining)
* [5. Outlook](#section-outlook)
* [References](#section-references)

<a id="section-getting-started"></a>
# 0. Getting started

<a id="subsection-motivation"></a>
## 0.1 Motivation
Network analysis is a very powerful tool that can reveal many relevant features hidden in various large datasets.     
Particularly, if the data represents literature related to the research on COVID-19, such as in the present notebook, network analysis allows you to quickly find *influential* scientific contributions. You will see that the meaning of the word "influential" is broad, allowing you to perform the data analysis from many different angles. This flexibility is useful for exploring existing works and for finding tools or approaches that may be promising for some procedures but not for others. Exploring and analysing the literature in this way can be not only more informative but also more efficient, as the researcher gains a high level of control to tailor the queries in a very detailed way.    
      
This notebook provides some examples for the tools, borrowed from network science, that can be applied to obtain detailed insights about the COVID-19 literature dataset.

I hope that you discover that the combination of the powerful *graph_tool* library with knowledge graphs can lead to exciting insights and is worth your while.

<a id="subsection-installation"></a>
## 0.2 Package installation
First, let's install all the necessary packages using conda (unfortunately, *graph_tool* cannot be installed using pip).   
The installation can take a while (possibly > 10min). So take a deep breath and go get a cup of coffee.

In [None]:
!conda install -y -c conda-forge graph-tool matplotlib

In [None]:
# This line fixes an issue with graph_tool installation on the current Kaggle kernel
!apt-get install -y libsigc++-2.0-0v5

In [None]:
!conda install -y -c conda-forge ipython jupyter numpy

In [None]:
!conda install -y -c conda-forge rdflib

<a id="subsection-import"></a>
## 0.3 Import packages and data

In [None]:
# Import all necessary modules
import rdflib
import graph_tool.all as gt
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


To create the knowledge graph, we'll adopt some lines of code from [this useful kernel.](https://www.kaggle.com/group16/covid-19-knowledge-graph-embeddings)

In [None]:
# Import the graph data
# This takes a while...
g = rdflib.Graph()
g.parse('/kaggle/input/covid19-literature-knowledge-graph/kg.nt', format='nt')

In [None]:
# In this kernel the analysis results are interpreted from the perspective of citation networks
# Therefore, it is important to reduce the data only to citation networks
for p1, _, p2 in g.triples((None, rdflib.URIRef("http://purl.org/spar/cito/isCitedBy"), None)):
    tr = (p2, rdflib.URIRef("http://purl.org/spar/cito/cites"), p1)
    if tr not in g:
        g.add(tr)
    g.remove((p1, rdflib.URIRef("http://purl.org/spar/cito/isCitedBy"), p2))

We can filter out the COVID-19 papers from the knowledge graph using the CORD-19 dataset.

In [None]:
import urllib
import pandas as pd
from tqdm.notebook import tqdm as ntqdm
metadata = pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv')
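# Normalize every DOI to the 'http://dx.doi.org/<doi>' form used in the knowledge graph.
# Note: str.strip() removes a *set of characters*, not a prefix, so prefixes are removed explicitly.
def to_doi_uri(doi):
    doi = doi.strip()
    for prefix in ('http://dx.doi.org/', 'https://dx.doi.org/', 'https://doi.org/', 'http://doi.org/', 'doi.org/'):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return 'http://dx.doi.org/' + doi

dois = metadata['doi'].dropna().apply(to_doi_uri).values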
dois = list(set(dois))

papers = []
for doi in ntqdm(dois):
    if len(list(g.triples((rdflib.URIRef(doi), rdflib.URIRef("http://purl.org/spar/cito/cites"), None)))) > 0:
        papers.append(doi)
print(len(papers))

However, note that currently (May 2020) most COVID-19 papers are very recent and, thus, not yet cited by any other works (or at least such citations are not yet included in the dataset). This situation can be expected to change drastically in the coming months. For the time being, we continue our analysis on the original, unfiltered knowledge graph.

In [None]:
# OPTIONAL: Display all the predicates to find out what kind of links we can analyse between the nodes
#for pr in set(g.predicates()):
#    print(pr)

In [None]:
# OPTIONAL: For example, here we can see a list of all the publishers that are included in the dataset
#all_publishers = []
#for s, p, o in g.triples((None, rdflib.URIRef("https://www.ica.org/standards/RiC/ontology#publishedBy"), None)):
#    if o not in all_publishers:
#        all_publishers.append(o)
#        print(o)

In [None]:
# OPTIONAL:
# Print a couple of papers that we could use as root to generate the network below
#glist=list(g.triples((None, rdflib.URIRef("http://purl.org/spar/cito/cites"), None)))[:5]
#for p in glist:
#    print(p)

<a id="subsection-generate"></a>
## 0.4 Generate a network from the knowledge graph using *graph_tool*

In [None]:
# This function converts the imported graph data into a graph_tool network
def create_sub_graph_gt(root, depth):
    it = 0
    vnames_inv = {"??INIT??": -1}
    # The vnames dictionary is important! It is used to get the node label (a string)
    # from the node id (an integer, stored as a string key).
    vnames = {"-1": "??INIT??"}
    
    objects = set()
    
    gt_graph = gt.Graph()
    gt_graph.set_directed(True)
    
    to_explore = {root}
    for _ in range(depth):
        new_explore = set()
        for node in to_explore:
            for s, p, o in g.triples((node, rdflib.URIRef("http://purl.org/spar/cito/cites"), None)):

                s_name=str(s)
                o_name=str(o)
                
                if s_name != o_name:
                
                    if s_name not in vnames_inv:
                        vnames_inv[s_name] = it
                        vnames[str(it)] = s_name
                        gt_graph.add_vertex()
                        it=it+1
                    if o_name not in vnames_inv:
                        vnames_inv[o_name] = it
                        vnames[str(it)] = o_name
                        gt_graph.add_vertex()
                        it=it+1
                
                    v1 = gt_graph.vertex( vnames_inv[s_name] )
                    v2 = gt_graph.vertex( vnames_inv[o_name] )
                    gt_graph.edge(v1,v2,add_missing=True)
                
                new_explore.add(o)
        to_explore = new_explore
    return gt_graph, vnames, vnames_inv
 
# Get all the triples that are maximally n_h hops away from our randomly picked paper rand_paper 
rand_paper = rdflib.URIRef('http://dx.doi.org/10.1186/s12879-015-1251-y')
n_h = 100
# Generate the network based on the rdflib.Graph()
gt_graph, vnames, vnames_inv = create_sub_graph_gt(rand_paper, n_h)
# If, by any chance, we have parallel edges (i.e. a paper citing the same paper twice) 
# or self-loops (i.e. a paper citing itself), remove them
gt.remove_parallel_edges(gt_graph)
gt.remove_self_loops(gt_graph)
# Create an array of the node ids (or vertex ids) of the network
v_array=gt_graph.get_vertices()
print(v_array)

**Finally, let's define a couple of functions for good-looking plots**

In [None]:
# Scatter plot    
def plot_scatter(x, y, ylabel, xlabel="node id", scale=None, inset=None, figsz=None, xlim=None, ylim=None):
    if not figsz:
        figsz = (8, 5.5)
    fig = plt.figure(figsize=figsz)
    
    plt.xticks(fontsize=14)  
    plt.yticks(fontsize=14) 
    
    plt.xlabel(xlabel, fontsize=16)  
    plt.ylabel(ylabel, fontsize=16)
    
    ax = plt.gca()  
    ax.get_xaxis().tick_bottom()  
    ax.get_yaxis().tick_left() 
      
    if scale == "log":
        ax.set_xscale('log')
        ax.set_yscale('log') 
        ax.set_xlim([min(x)+1,max(x)+1])
        ax.set_ylim([min(y)+1,max(y)+1])
        
    if xlim:
        plt.xlim(xlim)
    if ylim:
        plt.ylim(ylim)
    
    plt.scatter(x, y, color="#004080", edgecolors='black', alpha=0.75) 
        
    plt.show()
    
#Histogram
def plot_hist(data, xlabel, ylabel="node count", bin_count=None, scale=None, inset=False, figsz=None, xlim=None):
    if not figsz:
        figsz = (8, 5.5)
    plt.figure(figsize=figsz)
    
    ax = plt.subplot(111)  
    ax.get_xaxis().tick_bottom()  
    ax.get_yaxis().tick_left()  
    
    plt.xticks(fontsize=14)  
    plt.yticks(fontsize=14) 
    
    plt.xlabel(xlabel, fontsize=16)  
    plt.ylabel(ylabel, fontsize=16)
      
    if scale == "log":
        plt.xscale("log")
        plt.yscale("log")
        if not bin_count:
            # Use the data passed to the function, not a global array
            bin_count = 2 * int( len(data)**.5  )
    elif not bin_count:
        bin_count = max(1, int( ( max(data) - min(data) ) / 10 ))
        
    if inset:
        inset = plt.axes([.35, .3, .5, .5])
        inset.get_xaxis().tick_bottom()  
        inset.get_yaxis().tick_left()  
        inset.set_xscale("log")
        inset.set_yscale("log")
        bin_count = 2 * int( len(data)**.5  )
        inset.hist(data, color="#004080", edgecolor='black', alpha=0.75, bins=bin_count)
        
    if xlim:
        plt.xlim(xlim)
    
    ax.hist(data, color="#004080", edgecolor='black', alpha=0.75, bins=bin_count) 
        
    plt.show()
    
# Network
def plot_graph(g, pos, vmap=None):
    if vmap is None:
        vmap = g.degree_property_map("total")
        vmap.a = 4 * ( np.sqrt(vmap.a) * 0.5 + 0.4)
    gt.graph_draw(g, pos=pos, vertex_fill_color='#004080', vertex_size=vmap, \
                  vertex_halo=True, vertex_halo_color='black', vertex_halo_size=1.1, \
                  edge_color='gray')

**Now, let's visualize our network**

In [None]:
plot_graph(gt_graph, gt.arf_layout(gt_graph, max_iter=100, dt=1e-4))

In [None]:
# Another good-looking visualization
plot_graph(gt_graph, gt.sfdp_layout(gt_graph))

<a id="subsection-understanding"></a>
## 0.5 Understanding and analyzing knowledge graphs (from the network science perspective)

A few final notes before we get our hands dirty with some network analysis:
- The language in this notebook is intentionally chosen to be accessible to laypeople. Terms are introduced and results are discussed in a way that is as easy to follow as possible while remaining reasonably complete. Please keep in mind that there is almost always a trade-off between simplicity and accuracy
- For an introduction to knowledge graphs, please see [this kernel](https://www.kaggle.com/group16/covid-19-knowledge-graph-starter/notebook) and [wikipedia](https://en.wikipedia.org/wiki/Knowledge_Graph)
- A great introduction to networks and their properties can be found [here](https://global.oup.com/academic/product/networks-9780198805090?cc=be&lang=en&) [(Newman, 2018)](#cite-newman-book)
- In the context of this notebook it is mainly important to understand that the network nodes represent scientific publications, i.e. a network node = a (journal) paper. In the network, each node has an integer id that uniquely identifies the node, as well as a label that corresponds to the paper as given in the dataset. While *graph_tool* works with the integer node ids, you can get the label from the `vnames` dictionary like this: `label = vnames[str(node_id)]`
- Moreover, a network link (or edge or connection) represents a citation. For example, if paper *i* points to paper *j*, it means that *i* cites *j*. 
- The insights gained from the analysis below are only as complete as the data used to generate the network. Due to the data size, the generated network might represent only a part of the data. 
- Moreover, the considered dataset is not isolated from the world but is by itself only a sample (even though a very large one)
- The analysis results mainly focus on particular properties of the network and are neither necessarily mutually exclusive nor strictly inclusive. That is, a paper that has the highest in-degree may (or may not) also have the highest or lowest PageRank centrality (and vice versa).
- As always, **correlation != causation**!   

Finally, below is an example of how one can obtain information about a paper from its network node id.

In [None]:
# This is how we can get different info about a paper from vnames, 
# provided the data set contains that info
node_id = 0
# DOI
paper_doi = vnames[str(node_id)]
print("DOI: ", paper_doi)

# Paper reference in the knowledge graph based on the DOI
paper_ref = rdflib.URIRef(paper_doi)

# Authors
pred_firstname = rdflib.term.URIRef('http://xmlns.com/foaf/0.1/firstName')
paper_author_fn = list(g.triples((paper_ref, pred_firstname, None)))
if paper_author_fn:
    paper_author_fn = paper_author_fn[0][2]
    
pred_surname = rdflib.term.URIRef('http://xmlns.com/foaf/0.1/surname')
paper_author_sn = list(g.triples((paper_ref, pred_surname, None)))
if paper_author_sn:
    paper_author_sn = paper_author_sn[0][2]

if paper_author_fn or paper_author_sn:
    print("Author(s): %s, %s" % (paper_author_fn, paper_author_sn))
    
# Creator
pred_creator = rdflib.term.URIRef('http://purl.org/spar/pro/creator')
paper_creator = list(g.triples((paper_ref, pred_creator, None)))
if paper_creator:
    paper_creator = paper_creator[0][2]
    print("Creator: ", paper_creator)
    
# Publisher
pred_publisher = rdflib.term.URIRef('https://www.ica.org/standards/RiC/ontology#publishedBy')
paper_publisher = list(g.triples((paper_ref, pred_publisher, None)))
if paper_publisher:
    paper_publisher = paper_publisher[0][2]
    print("Publisher: ", paper_publisher)
    
# Title
# Here we define a function that we will use later in this notebook
def get_title(paper_ref):
    pred_title = rdflib.term.URIRef('http://purl.org/dc/terms/title')
    paper_title = list(g.triples((paper_ref, pred_title, None)))
    if paper_title:
        paper_title = paper_title[0][2]
        return paper_title
    return None

print("Paper title: '%s'" % get_title(paper_ref))


<a id="section-centrality"></a>
# 1. Centrality
*([back to ToC](#section-toc))*

Loosely speaking, **centrality** can be interpreted as a measure that helps identify the most influential nodes inside the network. [(Bonacich, 1987)](#cite-bonacich)

There is also a good introductory [video](https://www.youtube.com/watch?v=DfV-pjRTlLg) by Leonid Zhukov.

<a id="subsection-degree"></a>
## 1.1 Degree centrality
Degree centrality can be broken down into two measures:     
(1) The number of articles citing a specific article *a*. This is given by the **in-degree** of *a*.    
(2) The number of articles that *a* cites. This is given by the **out-degree** of *a*.   
     
Point (1) could also be interpreted as an indication of the **prestige** of *a*.
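
To make the two degree measures concrete, here is a minimal, library-free sketch on a hypothetical toy citation list (the paper names `A` to `E` and the helper `in_out_degrees` are illustrative assumptions, not part of the dataset or of *graph_tool*). Each pair reads "citing paper, cited paper".

```python
from collections import Counter

# Hypothetical toy citation list: (citing paper, cited paper)
toy_citations = [("A", "C"), ("B", "C"), ("D", "C"), ("C", "E"), ("A", "E")]

def in_out_degrees(edges):
    """Count incoming and outgoing citations for every paper."""
    in_deg, out_deg = Counter(), Counter()
    for citing, cited in edges:
        out_deg[citing] += 1   # the citing paper gains one outgoing link
        in_deg[cited] += 1     # the cited paper gains one incoming link
    return in_deg, out_deg

toy_in, toy_out = in_out_degrees(toy_citations)
for paper in sorted(set(toy_in) | set(toy_out)):
    print(paper, "in-degree:", toy_in[paper], "out-degree:", toy_out[paper])
```

In this toy example, paper `C` is cited by three papers and cites one, so it would be the most "prestigious" paper by in-degree.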

In [None]:
# Let's first have a look at the in-degree distribution
in_degs_arr = gt_graph.get_in_degrees(gt_graph.get_vertices())
# This shows us the node with the highest in-degree
highest_in_deg_v = np.where(in_degs_arr == in_degs_arr.max())[0][0]
paper_ref = rdflib.URIRef(vnames[str(highest_in_deg_v)]) 
print("Article with top prestige: '%s'" % (get_title(paper_ref)))
# Plot the distribution
plot_scatter(v_array, in_degs_arr, 'in-degree')

Note that the in-degree value does not necessarily correspond to the total number of articles citing the current paper. Instead, it reflects only the number of articles **within the considered partition of the knowledge graph**, which cite the current paper.         
For example, according to [Google Scholar](https://scholar.google.de/scholar?cites=9551782244184053582&as_sdt=2005&sciodt=0,5&hl=en) the paper with the highest in-degree (\*) has so far been cited over 2300 times (last checked on April 14, 2020), while in our network it is cited "only" 26 times. The difference arises because our network was constructed from the COVID-19 literature dataset: its citations are very recent, and only citing articles that are part of the same dataset are counted. Please keep this in mind while interpreting this and the following data analysis.  
          
Nevertheless, it is exciting to see that with only a few lines of code we were able to distill lots of considerably impactful papers. At this point, we could easily proceed to revealing and analysing many properties of these particular papers. For instance, we could, very easily, view a list of all articles within our dataset that cite or are cited by ( \* ). Furthermore, using the knowledge graph predicates we can print a list of features associated with this paper, such as publisher, author or the concept it has within the COVID-19 context.

(\*): Van den Hoogen, B. G., de Jong, J. C., Groen, J., Kuiken, T., de Groot, R., Fouchier, R. A., & Osterhaus, A. D. (2001). *A newly discovered human pneumovirus isolated from young children with respiratory tract disease*. Nature medicine, 7(6), 719-724.

In [None]:
# Now, let's do the same for the out-degree
out_degs_arr = gt_graph.get_out_degrees(gt_graph.get_vertices())
highest_out_deg_v = np.where(out_degs_arr == out_degs_arr.max())[0][0]
paper_ref = rdflib.URIRef(vnames[str(highest_out_deg_v)]) 
print("Article citing the most: '%s'" %  (get_title(paper_ref)))
# Plot the distribution
plot_scatter(v_array, out_degs_arr, 'out-degree')

Although it looks unusual, the extensive reference list of the paper with the highest out-degree (\*) has indeed 236 entries.      
(\*): Mackay, I. M. (2008). *Human rhinoviruses: the cold wars resume*. Journal of Clinical Virology, 42(4), 297-320.

For completeness, we can have a look at the total degree distribution, i.e. in-degree + out-degree. However, this time we will plot the results as histograms.

In [None]:
tot_degs_arr = gt_graph.get_total_degrees(gt_graph.get_vertices())
high_deg_v=np.where(tot_degs_arr == tot_degs_arr.max())[0][0]
paper_ref = rdflib.URIRef(vnames[str(high_deg_v)]) 
print("Article with the highest number of links: '%s'" % (get_title(paper_ref)))
# Plot the distribution 
# including an inset that shows the same distribution but on a log-log scale
plot_hist(tot_degs_arr, "total degree", inset=True)

In [None]:
# An alternative way of plotting the histogram in log-log scale
total_hist = gt.vertex_hist(gt_graph, "total")
plot_scatter(total_hist[1][:-1], total_hist[0], "node count", \
             xlabel="total degree", scale="log", xlim=[0.8, max(total_hist[1][:-1])+1e+2], ylim=[0.8, max(total_hist[0])+1e+3])

This is a skewed distribution, i.e. the vast majority of nodes have a total degree of 1, while there are a few nodes that have an extremely high degree.

Such a strongly skewed degree distribution hints at a potentially hierarchical network structure. To examine whether this is indeed the case here, it helps to visualize the network with node placement based on their distance from the most connected node, as in the plot below.

In [None]:
# The node with the highest degree is chosen as root
root_vertex=gt_graph.vertex(high_deg_v)
plot_graph(gt_graph, gt.radial_tree_layout(gt_graph, root_vertex))

We can see that there is indeed a strong tendency towards a hierarchical structure. Nevertheless, many nodes from different levels seem strongly interconnected.

<a id="subsection-closeness"></a>
## 1.2 Closeness centrality
In simple terms, nodes with highest closeness centrality are the best connected nodes *within their own clusters*. A cluster is a region within the network in which the nodes are more densely linked to each other than in the rest of the network.   

Thus, closeness centrality can indicate the **popularity** of a paper **within a certain group**. The more actively the papers reference each other within this group, the higher the closeness centrality of the nodes compared to the rest of the network. [(Bavelas, 1950)](#cite-bavelas)   
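
As a rough illustration of the idea, the sketch below computes a closeness value with a breadth-first search on a hypothetical toy graph. It assumes the common definition "number of reachable nodes divided by the sum of shortest-path distances to them"; the exact normalization used by `gt.closeness` may differ, and the names `closeness` and `toy_adj` are made up for this example.

```python
from collections import deque

def closeness(adj, source):
    """Closeness of `source`: reachable nodes / sum of shortest-path distances."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:          # first visit gives the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    # A node that reaches nothing gets 0 instead of a division by zero
    return (len(dist) - 1) / total if total > 0 else 0.0

# Hypothetical toy citation graph: key cites every paper in its list
toy_adj = {"A": ["B"], "B": ["C"], "C": [], "D": ["A"]}
for paper in toy_adj:
    print(paper, round(closeness(toy_adj, paper), 3))
```

Note how a paper without outgoing citations (`C`) ends up with a closeness of zero, while a paper whose only reachable neighbor is one hop away (`B`) reaches the maximum value of 1.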

In [None]:
close_map_v=gt.closeness(gt_graph)
close_array_v=close_map_v.a
# The following line replaces 'nan' values with '0'
close_array_v = np.nan_to_num(close_array_v)
# Plot the distribution
plot_scatter(v_array, close_array_v, 'closeness')
# Which one is among the highest
high_close_v=np.where(close_array_v == close_array_v.max())[0][0]
paper_ref = rdflib.URIRef(vnames[str(high_close_v)]) 
print("Article with the highest closeness centrality is: '%s'" % (get_title(paper_ref)))
# Compare directly: node id 0 is a valid id, so a truthiness check would wrongly skip it
if high_close_v == high_deg_v:
    print("Note that this is the same paper as the one with the highest total degree determined above.")

To keep matters simple, we mentioned above only the first paper from the list with the highest closeness centrality. However, as the plot indicates, there are many papers with (equally) high closeness centrality.

Moreover, we have `close_array_v` at our disposal, which is the array of all the closeness values of the `gt_graph`. We can use these values for more detailed analysis.   
For example, we can plot the results in ascending order to get an intuition for their distribution.

In [None]:
close_array_v_sorted=np.sort(close_array_v)
# Plotting in ascending order, we can see that the vast majority of nodes have closeness centrality values close to zero
plot_scatter(v_array, close_array_v_sorted, 'closeness', xlabel='ids of sorted array')
# Uncomment the line below to zoom into the last 200 data points and see that the closeness increases partially super-linearly
#plot_scatter(v_array, close_array_v_sorted, 'closeness', xlabel='ids of sorted array', xlim=[len(close_array_v_sorted)-200,len(close_array_v_sorted)])

As the plot shows, the vast majority of nodes have zero closeness centrality, and a fairly small fraction of nodes have a closeness centrality close to 1. Hence, in the current network most nodes are rather loosely connected to a few highly connected ones.

<a id="subsection-betweenness"></a>
## 1.3 Betweenness centrality
Nodes with highest betweenness centrality act like connectors or joints. In the context of citation networks, one could consider them as *interdisciplinary* works bridging several disciplines. Furthermore, review articles often have high betweenness centrality.

Generally, the betweenness of a node *v* is a measure of how many shortest paths between pairs of other nodes go through *v*, divided by the number of all shortest paths between those pairs. [(Freeman, 1977)](#cite-freeman)  
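
For readers who want to see the counting of shortest paths spelled out, here is a compact sketch of Brandes' algorithm for unweighted directed graphs. It returns unnormalized values, whereas `gt.betweenness` normalizes by default, so only the ranking is comparable. The toy graph and the function name `betweenness` are assumptions made for this example.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness via Brandes' algorithm (unweighted, directed)."""
    nodes = set(adj) | {w for nbrs in adj.values() for w in nbrs}
    bc = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma)
        sigma = dict.fromkeys(nodes, 0)
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path dependencies, farthest nodes first
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Hypothetical toy graph: C bridges {A, B, D} and E
toy_adj = {"A": ["B", "D"], "B": ["C"], "D": ["C"], "C": ["E"], "E": []}
toy_bc = betweenness(toy_adj)
print(sorted(toy_bc.items()))
```

Paper `C` lies on every shortest path towards `E` and therefore gets the highest value, while `B` and `D` share the two equally short paths from `A` to `C`.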

In [None]:
betw_map_v, betw_map_e=gt.betweenness(gt_graph)
betw_array_v=betw_map_v.a

plot_scatter(v_array, betw_array_v, 'betweenness', ylim=[min(betw_array_v)-1e-4,max(betw_array_v)+1e-4])

# Get one among the highest
high_betw_v=np.where(betw_array_v == betw_array_v.max())[0][0]
paper_ref = rdflib.URIRef(vnames[str(high_betw_v)]) 
print("Article with the highest betweenness centrality is: '%s'" % (get_title(paper_ref)))

All papers in our network seem to have a low betweenness centrality. This is not very surprising because in generating our network we sampled those works from the knowledge graph that were close (cited or were cited by) to a particular root paper. Therefore, our network contains works that are close to each other with respect to their scientific discipline.

Low betweenness across all papers could also suggest that there is only one scientific community represented in the network. Luckily, we can test this hypothesis by simply plotting the network in a different way. The graph below is drawn by placing the nodes according to how strongly they are connected. Here, the strength of a connection is represented by virtual forces, inspired by springs [(Fruchterman, 1991)](#cite-fruchterman).

More importantly, we can scale the node sizes in relation to the betweenness.

In [None]:
# Get the betweenness values
vmap=betw_map_v.copy()
# Scale them for a more insightful visualization
vmap.a = 500 * (np.sqrt(vmap.a) + 0.005)
# Draw the graph
plot_graph(gt_graph, gt.fruchterman_reingold_layout(gt_graph), vmap=vmap)

It seems that there may indeed be two communities. However, we also see that there are many so-called leaves, i.e. nodes that have only one connection. Such nodes reduce the betweenness values through the normalization. Nevertheless, as the network plot shows, if we compare the betweenness values relative to each other, the measure is reasonably successful in identifying nodes that are close to the intersection between the two communities.

<a id="subsection-pagerank"></a>
## 1.4 PageRank
PageRank (an algorithm developed by the founders of Google) estimates the importance of a node *a* based on *a*'s *Katz centrality* while also taking into account the centrality of the nodes pointing to *a*. In other words, if a very central or important paper cites your paper, then your paper receives a centrality bonus. [(Page, 1999)](#cite-page)  
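
The "centrality bonus" can be illustrated with a small power-iteration sketch. The damping factor of 0.85 and the even redistribution of rank from papers without references are assumptions of this sketch and not necessarily what `gt.pagerank` does internally; the toy graph and the function name `pagerank` are likewise made up.

```python
def pagerank(adj, damping=0.85, iterations=100):
    """Plain power iteration for PageRank on a dict of adjacency lists."""
    nodes = sorted(set(adj) | {w for nbrs in adj.values() for w in nbrs})
    n = len(nodes)
    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iterations):
        # Assumption: rank held by papers that cite nothing is spread evenly
        dangling = sum(rank[v] for v in nodes if not adj.get(v))
        new_rank = dict.fromkeys(nodes, (1 - damping) / n + damping * dangling / n)
        for v in nodes:
            out = adj.get(v, ())
            for w in out:
                # Each paper passes an equal share of its rank to the papers it cites
                new_rank[w] += damping * rank[v] / len(out)
        rank = new_rank
    return rank

# Hypothetical toy graph: A and B cite C, and C cites D
toy_adj = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
toy_rank = pagerank(toy_adj)
for paper, value in sorted(toy_rank.items(), key=lambda kv: -kv[1]):
    print(paper, round(value, 4))
```

Although `D` is cited only once and `C` twice, `D` ends up with the higher PageRank because its single citation comes from the well-cited paper `C`.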

In [None]:
pgrank_map=gt.pagerank(gt_graph)
pgrank_array=pgrank_map.a

plot_scatter(v_array, pgrank_array, 'pagerank', ylim=[min(pgrank_array)-1e-5,max(pgrank_array)+1e-5])

# Get one among the highest
high_pgrank_v=np.where(pgrank_array == pgrank_array.max())[0][0]
paper_ref = rdflib.URIRef(vnames[str(high_pgrank_v)]) 
print("Article with the highest PageRank centrality is: '%s'" % (get_title(paper_ref)))
# Compare directly: node id 0 is a valid id, so a truthiness check would wrongly skip it
if high_pgrank_v == highest_in_deg_v:
    print("Note that, coincidentally, this is the same paper as the one with the highest prestige as determined by the in-degree centrality above.")

The plot shows that the PageRank values are very low. Similar to the betweenness analysis above, it would be useful to plot the network and compare the values in relation to each other. 

<a id="subsection-hits"></a>
## 1.5 HITS
The Hyperlink-Induced Topic Search (HITS) is a measure for the authority and hub centralities of the nodes. This measure is often used in the field of citation networks. It is useful to estimate the authorities (papers that are cited a lot, often some influential, ground-breaking papers, written by famous experts) and hubs (papers that cite many other papers, for example reviews).   
This measure is conceptually somewhat similar to the degree centrality. The big difference is that HITS assigns a higher ranking to authorities that are cited by high-ranked hubs and vice versa. In this sense, it is a bit like a "smart degree centrality" [(Kleinberg, 1999)](#cite-kleinberg).   
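
The mutual reinforcement between hubs and authorities can be sketched with a few lines of plain Python. This is an illustrative iteration on a hypothetical edge list (reviews `R1`, `R2`, a single-reference paper `X`, and cited papers `P1` to `P3`); the function name `hits` and the fixed number of iterations are assumptions, and the scaling of the scores may differ from `gt.hits`.

```python
import math

def hits(edges, iterations=50):
    """Iterate the HITS update rules on a list of (citing, cited) pairs."""
    nodes = sorted({v for edge in edges for v in edge})
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 0.0)
    for _ in range(iterations):
        # Authority score: sum of the hub scores of all citing papers
        auth = dict.fromkeys(nodes, 0.0)
        for citing, cited in edges:
            auth[cited] += hub[citing]
        # Hub score: sum of the authority scores of all cited papers
        hub = dict.fromkeys(nodes, 0.0)
        for citing, cited in edges:
            hub[citing] += auth[cited]
        # Rescale both score vectors to unit length to keep numbers bounded
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return auth, hub

toy_edges = [("R1", "P1"), ("R1", "P2"), ("R1", "P3"),
             ("R2", "P1"), ("R2", "P2"), ("X", "P3")]
toy_auth, toy_hub = hits(toy_edges)
print({k: round(v, 3) for k, v in toy_auth.items()})
print({k: round(v, 3) for k, v in toy_hub.items()})
```

Here `P1` and `P3` both have an in-degree of 2, yet `P1` receives the higher authority score because it is cited by the two strongest hubs.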

In [None]:
hits_eig, auth_map_v, hubs_map_v=gt.hits(gt_graph)

auth_array_v=auth_map_v.a
hubs_array_v=hubs_map_v.a

fig = plt.figure(figsize=(8, 5.5))
    
# Note: each label must match the plotted array
plt.scatter(v_array, auth_array_v, color="#004080", edgecolors='black', alpha=0.75, label='authorities') 
plt.scatter(v_array, hubs_array_v, color="gray", edgecolors='black', alpha=0.75, label='hubs') 

# Limit the y-axis to values between 0.0 and 0.1 (comment out the line below to see the full range)
plt.ylim([0.0,0.1])

# Apply a few other plot settings
plt.xticks(fontsize=14)  
plt.yticks(fontsize=14)  
plt.xlabel('node ids', fontsize=16)  
plt.ylabel('authorities and hubs', fontsize=16)   
ax = plt.gca()  
ax.get_xaxis().tick_bottom()  
ax.get_yaxis().tick_left() 
plt.legend()
plt.show()

# Look at highest values
high_auth_v=np.where(auth_array_v == auth_array_v.max())[0][0]
paper_ref = rdflib.URIRef(vnames[str(high_auth_v)]) 
print("Article with the highest HITS-authority centrality is: '%s'" % (get_title(paper_ref)))
# Compare directly: node id 0 is a valid id, so a truthiness check would wrongly skip it
if high_auth_v == highest_in_deg_v:
    print("Note that, coincidentally, this is the same paper as the one with the highest prestige as determined by the in-degree centrality above.")
    
high_hubs_v=np.where(hubs_array_v == hubs_array_v.max())[0][0]
paper_ref = rdflib.URIRef(vnames[str(high_hubs_v)]) 
print("Article with the highest HITS-hubs centrality is: '%s'" % (get_title(paper_ref)))
# Compare directly: node id 0 is a valid id, so a truthiness check would wrongly skip it
if high_hubs_v == highest_out_deg_v:
    print("Note that, coincidentally, this is the same paper as the one with the highest out-degree determined above.")

<a id="subsection-other"></a>
## 1.6 Other centrality measures
There are numerous other centrality measures that might reveal some interesting information about the relevance of certain nodes in the knowledge graph. The interested reader is referred to [(Landherr, 2010)](#cite-landherr).

<a id="section-clustering"></a>
# 2. Clustering
*([back to ToC](#section-toc))*

Clustering (or clustering *coefficient*) is an important measure in the present context as it reveals some information on how tightly some groups of articles or publications are interconnected through citations. Such groups often create *cliques*, where every paper cites every other paper within that group.

<a id="subsection-local"></a>
## 2.1 Local clustering
The local clustering coefficient is high when the neighbors of a node are close to forming a clique. It's called "local" because it focuses on the nearest neighbors of each node. [(Watts, 1998)](#cite-watts)     
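
A minimal sketch of this idea, assuming an undirected view of the network (citation direction is ignored) and a hypothetical toy graph stored as neighbor sets; the function name `local_clustering` mirrors the *graph_tool* call only for readability and is a separate, simplified implementation.

```python
from itertools import combinations

def local_clustering(neighbors, v):
    """Fraction of pairs of v's neighbors that are themselves linked."""
    nbrs = sorted(neighbors[v])
    k = len(nbrs)
    if k < 2:
        return 0.0   # fewer than two neighbors cannot form a triangle
    links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
    return 2.0 * links / (k * (k - 1))

# Hypothetical undirected toy graph: a triangle A-B-C plus a leaf D attached to A
toy_neighbors = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
for paper in toy_neighbors:
    print(paper, round(local_clustering(toy_neighbors, paper), 3))
```

The neighbors of `B` (namely `A` and `C`) are linked to each other, so `B` has a coefficient of 1, whereas only one of the three neighbor pairs of `A` is linked.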

In [None]:
loc_clust_map=gt.local_clustering(gt_graph)
loc_clust_array_v=loc_clust_map.a

plot_hist(loc_clust_array_v, 'local clustering coefficient', scale="log", xlim=[0.01,None])

# Get one among the highest
high_loc_clust_v=np.where(loc_clust_array_v == loc_clust_array_v.max())[0][0]
paper_ref = rdflib.URIRef(vnames[str(high_loc_clust_v)]) 
print("Article with the highest local clustering is: '%s'" % (get_title(paper_ref)))

In [None]:
# Again, we can plot a neat visualization to see where the local clustering is high
# Get the clustering values
vmap=loc_clust_map.copy()
# Scale them slightly for a more insightful visualization
vmap.a = 10 * (np.sqrt(vmap.a) + 0.25)
# Draw the graph
plot_graph(gt_graph, gt.sfdp_layout(gt_graph), vmap=vmap)

This looks intuitive: the clustering is high where nodes are tightly connected.

<a id="subsection-global"></a>
## 2.2 Global clustering
In simple terms, the global clustering coefficient is high when the entire network forms a large tightly knit group. It's called "global" because it focuses on the entire network. Thus, instead of obtaining values for each node separately, as in the measures above, here we get one *global* value that corresponds to the whole network. [(Newman, 2003a)](#cite-newman-2003-a)   
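
One common way to compute such a single value is the transitivity: the share of connected triples (a node with two neighbors) that are closed into a triangle. The sketch below assumes this definition and an undirected toy graph; the names `global_clustering` and `toy_neighbors` are illustrative, and the sketch does not compute the error estimate that `gt.global_clustering` returns as its second value.

```python
from itertools import combinations

def global_clustering(neighbors):
    """Transitivity: closed connected triples / all connected triples."""
    closed = 0    # triples whose two outer nodes are also linked
    triples = 0   # all pairs of neighbors around a center node
    for center, nbrs in neighbors.items():
        for a, b in combinations(sorted(nbrs), 2):
            triples += 1
            if b in neighbors[a]:
                closed += 1   # each triangle is counted once per corner
    return closed / triples if triples else 0.0

# Hypothetical undirected toy graph: a triangle A-B-C plus a leaf D attached to A
toy_neighbors = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
toy_global = global_clustering(toy_neighbors)
print(toy_global)
```

The toy graph contains one triangle (counted three times, once per corner) and five connected triples, so three out of five triples are closed.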

In [None]:
glob_clust_mean=gt.global_clustering(gt_graph)
print("Global clustering coefficient: %f +- %f" % glob_clust_mean)

<a id="section-assortativity"></a>
# 3. Assortativity
*([back to ToC](#section-toc))*

This measure allows us to estimate whether there are preferential citing patterns, i.e. whether a paper with a certain degree configuration (i.e. number of citing and cited papers) is more likely to cite a paper with a similar degree configuration.   

Like the global clustering coefficient, assortativity is a *global* measure. It can be computed for in-degrees, out-degrees and total degrees. [(Newman, 2003b)](#cite-newman-2003-b) 

In [None]:
glob_assort_out_mean=gt.assortativity(gt_graph, "out")
print("Assortativity of out-degrees: %.5f +- %.5f" % glob_assort_out_mean)
glob_assort_in_mean=gt.assortativity(gt_graph, "in")
print("Assortativity of in-degrees: %.5f +- %.5f" % glob_assort_in_mean)
glob_assort_total_mean=gt.assortativity(gt_graph, "total")
print("Assortativity of total degrees: %.5f +- %.5f" % glob_assort_total_mean)

The values are relatively low, which suggests that papers do not preferentially cite papers with a similar degree configuration; rather, authors are more likely to cite a paper that has a higher number of citations than their own paper. In the context of citation networks, this measure mainly plays the role of a sanity check: a new paper is very unlikely to have been cited already at the time of its publication.
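
The following pure-Python sketch illustrates the assortativity coefficient of [(Newman, 2003b)](#cite-newman-2003-b) in its categorical form, $r = (\sum_i e_{ii} - \sum_i a_i b_i) / (1 - \sum_i a_i b_i)$, where $e_{ij}$ is the fraction of edges pointing from category $i$ to category $j$, $a_i = \sum_j e_{ij}$ and $b_j = \sum_i e_{ij}$. We assume here that each distinct degree value is treated as one category and that the same degree type is used at both ends of an edge, which is our reading of `gt.assortativity`; the helper name, the toy citation edges and the choice of total degree are hypothetical, and no error estimate is computed.

```python
from collections import Counter

def categorical_assortativity(edges, labels):
    """Assortativity coefficient for directed edges and categorical node labels."""
    n_edges = len(edges)
    if n_edges == 0:
        raise ValueError("at least one edge is required")
    # e[(i, j)]: number of edges from a node labeled i to a node labeled j
    e = Counter((labels[u], labels[v]) for u, v in edges)
    # a[i] / b[i]: number of edges that start / end at a node labeled i
    a = Counter(labels[u] for u, _ in edges)
    b = Counter(labels[v] for _, v in edges)
    trace = sum(count for (i, j), count in e.items() if i == j) / n_edges
    expected = sum(a[i] * b[i] for i in a) / n_edges ** 2
    if expected == 1.0:
        return float("nan")  # all edges join one single category: undefined
    return (trace - expected) / (1.0 - expected)

# Hypothetical toy citation edges (citing paper -> cited paper)
toy_edges = [(1, 0), (2, 0), (3, 0), (3, 2)]

# Use the total degree (in-degree + out-degree) of each node as its category
toy_total_degree = Counter()
for source, target in toy_edges:
    toy_total_degree[source] += 1
    toy_total_degree[target] += 1

toy_assortativity = categorical_assortativity(toy_edges, toy_total_degree)
print("Toy assortativity of total degrees: %.5f" % toy_assortativity)
```

A value of 1 would mean that every edge connects two nodes of the same category, while values near 0 indicate no preference beyond what the category frequencies alone would produce.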

<a id="section-combining"></a>
# 4. Combining insights from several measures for more in-depth analysis
*([back to ToC](#section-toc))*

Often, it is insightful to combine and/or compare the results from multiple measures.      
                  
For example, consider the paper *'Human rhinoviruses: the cold wars resume'* (DOI: http://dx.doi.org/10.1016/j.jcv.2008.04.002), which cites 236 other works, as we learned in Sec. 1.1. Given this information, it might be interesting to find out how close these other works are to each other within their respective areas of expertise. In other words, how broadly are these works distributed across research disciplines? As we mentioned in Sec. 1.3, a good indication of this is given by the *betweenness* centrality. We can easily get this information from the array obtained in Sec. 1.3.

In [None]:
# First, we query the index of the node that corresponds to the paper of interest
paper="http://dx.doi.org/10.1016/j.jcv.2008.04.002"
node_id=vnames_inv[paper]
# Now we simply draw the betweenness value at the array index = node_id
paper_betw=betw_array_v[node_id]
# And evaluate the value of this betweenness with respect to the mean of the entire betweenness array
print("Evaluated with respect to the mean: %s" % (paper_betw/(np.mean(betw_array_v))))
# However, as can be seen from the plot in Sec. 1.3, the betweenness values are considerably skewed
# (i.e. most are close to zero while a few have high values).
# Thus, it would be more appropriate to compare this betweenness with the median of the entire
# betweenness array. However, the median is zero.
print(np.median(betw_array_v))
# Therefore, we compare it to the highest betweenness value
print("Betweenness centrality of the considered paper: %s" % paper_betw)
print("Highest betweenness centrality value within the network: %s" % betw_array_v.max())
# Thus, the betweenness centrality of the considered paper is higher than that of most papers in the network,
# but one order of magnitude lower than the highest betweenness value
print("Their quotient: %s" % (paper_betw/betw_array_v.max()))

So we can conclude: the above paper (\*) does a respectable job in referencing works from relatively diverse areas, although this diversity could be higher.    

(\*): Mackay, I. M. (2008). *Human rhinoviruses: the cold wars resume*. Journal of Clinical Virology, 42(4), 297-320.
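
When the median of a skewed distribution is zero, another way to place a single value is its rank: the fraction of nodes with a strictly lower value. The sketch below illustrates this next to the quotient with the maximum. The helper `fraction_below` and all numbers are hypothetical and are not taken from the network analyzed above.

```python
def fraction_below(values, value):
    """Fraction of entries in `values` that are strictly smaller than `value`."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for x in values if x < value) / len(values)

# Hypothetical, heavily skewed betweenness values (NOT the notebook's results)
toy_betweenness = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.001, 0.002, 0.01, 0.1]
toy_paper_value = 0.01

# Rank of the paper within the distribution and its quotient with the maximum
toy_rank = fraction_below(toy_betweenness, toy_paper_value)
toy_ratio_to_max = toy_paper_value / max(toy_betweenness)
print("Fraction of nodes with lower betweenness: %.2f" % toy_rank)
print("Quotient with the highest value: %.2f" % toy_ratio_to_max)
```

With the arrays from this notebook, the same helper could be called as `fraction_below(betw_array_v, paper_betw)`.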

<a id="section-outlook"></a>
# 5. Outlook
Network science offers a number of powerful tools for data mining and data analysis. Here, we focused on citation networks generated from the knowledge graph data set. However, the knowledge graph contains much more information, and network links can be created from other associations such as publishers, authors, locations or even titles.
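
As a minimal sketch of how links could be derived from one such association, the example below builds weighted co-authorship edges from a mapping of papers to author lists. The mapping, the placeholder names and the helper `coauthor_edges` are hypothetical; in practice the author lists would first have to be queried from the knowledge graph.

```python
from collections import Counter
from itertools import combinations

def coauthor_edges(paper_authors):
    """Count how often each unordered pair of authors appears on the same paper."""
    weights = Counter()
    for authors in paper_authors.values():
        # sorted(set(...)) removes duplicate names and fixes the order within each pair
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical paper -> authors mapping (placeholder names, not from the knowledge graph)
toy_paper_authors = {
    "paper_A": ["Author 1", "Author 2", "Author 3"],
    "paper_B": ["Author 2", "Author 3"],
    "paper_C": ["Author 4"],
}

toy_coauthor_weights = coauthor_edges(toy_paper_authors)
for (a, b), weight in sorted(toy_coauthor_weights.items()):
    print(a, "--", b, "weight:", weight)
```

The resulting weighted edge list could then be loaded into a graph in the same way as the citation edges.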

Moreover, there are other very interesting ways in which we can apply the *graph_tool* library, e.g. for Bayesian inference of network structure or for edge prediction, among others.

Finally, there are also other great network analysis tools, such as [NetworkX](https://networkx.github.io/) for Python, [tidygraph](https://www.rdocumentation.org/packages/tidygraph) and [ggraph](https://www.rdocumentation.org/packages/ggraph) for R, and [Gephi](https://gephi.org/) for beautiful visualizations.

Hopefully, this kernel provided you with some insightful introductory examples of how knowledge graphs in combination with network science can shed some light on the complex world of scientific literature.

<a id="section-references"></a>
# References
*([back to ToC](#section-toc))*
<a id="cite-bavelas"></a>
* Bavelas, A. (1950). [*Communication patterns in task‐oriented groups.*](https://asa.scitation.org/doi/abs/10.1121/1.1906679?journalCode=jas) The Journal of the Acoustical Society of America, 22(6), 725-730.
<a id="cite-bonacich"></a>
* Bonacich, P. (1987). [*Power and centrality: A family of measures.*](https://www.journals.uchicago.edu/doi/abs/10.1086/228631) American journal of sociology, 92(5), 1170-1182.  
<a id="cite-freeman"></a>
* Freeman, L. C. (1977). [*A set of measures of centrality based on betweenness.*](https://www.jstor.org/stable/3033543?seq=1) Sociometry, 35-41. 
<a id="cite-fruchterman"></a>
* Fruchterman, T. M. J., & Reingold, E. M. (1991). [*Graph drawing by force-directed placement.*](https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.4380211102) Software: Practice and Experience, 21(11), 1129-1164.
<a id="cite-kleinberg"></a>
* Kleinberg, J. M. (1999). [*Authoritative sources in a hyperlinked environment.*](https://dl.acm.org/doi/abs/10.1145/324133.324140) Journal of the ACM (JACM), 46(5), 604-632.
<a id="cite-landherr"></a>
* Landherr, A., Friedl, B., & Heidemann, J. (2010). [*A critical review of centrality measures in social networks.*](https://link.springer.com/article/10.1007/s12599-010-0127-3) Business & Information Systems Engineering, 2(6), 371-385. 
<a id="cite-newman-2003-a"></a>
* Newman, M. E. (2003). [*The structure and function of complex networks.*](https://epubs.siam.org/doi/abs/10.1137/s003614450342480) SIAM review, 45(2), 167-256. 
<a id="cite-newman-2003-b"></a>
* Newman, M. E. (2003). [*Mixing patterns in networks.*](https://journals.aps.org/pre/abstract/10.1103/PhysRevE.67.026126) Physical Review E, 67(2), 026126. 
<a id="cite-newman-book"></a>
* Newman, M. (2018). [*Networks.*](http://dx.doi.org/10.1093/oso/9780198805090.001.0001) Oxford university press. 
<a id="cite-page"></a>
* Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). [*The pagerank citation ranking: Bringing order to the web.*](http://ilpubs.stanford.edu:8090/422/) Stanford InfoLab. 
<a id="cite-peixoto"></a>
* Peixoto, T. P. (2014). [*The graph-tool python library.*](https://figshare.com/articles/graph_tool/1164194) figshare. 
<a id="cite-watts"></a>
* Watts, D. J., & Strogatz, S. H. (1998). [*Collective dynamics of ‘small-world’ networks.*](https://www.nature.com/articles/30918) Nature, 393(6684), 440. 