# Centrality

In this notebook, we will compute vertex centrality scores using the various cuGraph algorithms.  We will then compare the similarities and differences.

| Author Credit |    Date    |  Update          | cuGraph Version |  Test Hardware      |
| --------------|------------|------------------|-----------------|---------------------|
| Brad Rees     | 04/16/2021 | created          | 0.19            | GV100, CUDA 11.0    |
|               | 08/05/2021 | tested / updated | 21.10 nightly   | RTX 3090, CUDA 11.4 |
| Ralph Liu     | 06/22/2022 | test/update      | 22.08           | T100, CUDA 11.5     |

 

Centrality is a measure of how important, or central, a node or edge is within a graph.  It is useful for identifying influencers in social networks and key routing nodes in communication/computer network infrastructures.

The seminal paper on centrality is:  Freeman, L. C. (1978). Centrality in social networks conceptual clarification. Social networks, 1(3), 215-239.


__Degree centrality__ – _done but needs an API_ <br>
Degree centrality is based on the notion that whoever has the most connections must be important.   

<center>
    Cd(v) = degree(v)
</center>

cuGraph currently does not have a Degree Centrality function call. However, since Degree Centrality is just the degree of a node, we can use the _G.degree()_ function.
Degree Centrality for a directed graph can be further divided into _indegree centrality_ and _outdegree centrality_, which can be obtained using _G.degrees()_.
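
As a rough CPU-side illustration of the definition (plain Python, not cuGraph's implementation; the `degree_counts` helper and the toy edge list are made up for this sketch), in-degree, out-degree, and total degree can be counted directly from a directed edge list:

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a list of directed (src, dst) edges."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg

# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
in_deg, out_deg = degree_counts(toy_edges)

# Total degree of a vertex is the sum of its in-degree and out-degree
total_deg = {v: in_deg[v] + out_deg[v] for v in set(in_deg) | set(out_deg)}
print(total_deg)
```

Here vertices 0 and 2 tie on total degree, but vertex 2 has the higher in-degree and vertex 0 the higher out-degree, which is why the directed variants can rank nodes differently.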


___Closeness centrality – coming soon___ <br>
Closeness is a measure of the shortest path from a node to every other node in the graph.  A node that is close to every other node, meaning it can reach every other node in the fewest number of hops, has greater influence on the network than a node that is not close.
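
A minimal sketch of one common formulation, assuming an unweighted graph stored as a plain Python adjacency dict: closeness is taken as the number of reachable vertices divided by the sum of hop counts found by a breadth-first search. The `closeness` helper and the path graph are illustrative only, and normalization conventions differ between libraries.

```python
from collections import deque

def closeness(adj, source):
    """Closeness of `source` in an unweighted graph given as {vertex: [neighbors]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    # (reachable vertices - 1) / sum of hop counts; 0.0 for an isolated vertex
    return (len(dist) - 1) / total if total > 0 else 0.0

# Hypothetical path graph: 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = {v: closeness(path, v) for v in path}
print(scores)
```

The two interior vertices score higher than the endpoints because they need fewer hops in total to reach everyone else.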

__Betweenness Centrality__ <br>
Betweenness is a measure of the number of shortest paths that cross through a node, or over an edge.  A node with high betweenness has a greater influence on the flow of information.  

Betweenness centrality of a node 𝑣 is the sum of the fraction of all-pairs shortest paths that pass through 𝑣.

<center>
    <img src="https://latex.codecogs.com/png.latex?c_B(v)&space;=\sum_{s,t&space;\in&space;V}&space;\frac{\sigma(s,&space;t|v)}{\sigma(s,&space;t)}" title="c_B(v) =\sum_{s,t \in V} \frac{\sigma(s, t|v)}{\sigma(s, t)}" />
</center>

To speed up the runtime of betweenness centrality, the metric can be computed on a limited number of nodes (randomly selected) and then used to estimate the other scores.  For this example, the graphs are relatively small (under 5,000 nodes) so betweenness on every node will be computed.
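
The sum above can be sketched in plain Python with a Brandes-style accumulation, assuming an unweighted graph stored as an adjacency dict. This is an illustrative CPU version, not cuGraph's implementation; the `betweenness` function and its `sources` parameter are names invented for this sketch. Scores are unnormalized and count ordered (s, t) pairs, so each undirected pair contributes twice.

```python
from collections import deque

def betweenness(adj, sources=None):
    """Unnormalized betweenness for an unweighted graph {vertex: [neighbors]}.

    Accumulates over `sources` only (all vertices when `sources` is None).
    """
    scores = {v: 0.0 for v in adj}
    for s in (adj if sources is None else sources):
        # BFS from s, counting shortest paths (sigma) and recording predecessors
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, accumulating path dependencies
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                scores[w] += delta[w]
    return scores

# Hypothetical path graph: 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
bc_scores = betweenness(path)
print(bc_scores)
```

Passing a random subset of vertices as `sources` is one common way to approximate the full scores at lower cost, at the price of some estimation error.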

___Eigenvector Centrality - coming soon___ <br>
Eigenvectors can be thought of as the balancing points of a graph, or center of gravity of a 3D object.  High centrality means that more of the graph is balanced around that node.
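
One way to make this concrete is power iteration, sketched below in plain Python under the assumption of a connected, undirected graph stored as an adjacency dict. The helper name is hypothetical, and iterating with (A + I) instead of A is a convergence aid chosen for this sketch, not something taken from cuGraph.

```python
def eigenvector_centrality(adj, iterations=100):
    """Power iteration on (A + I) for an undirected graph {vertex: [neighbors]}."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # Each vertex keeps its own score and adds the scores of its neighbors
        nxt = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        # Rescale to unit Euclidean length so the values stay bounded
        norm = sum(val * val for val in nxt.values()) ** 0.5
        x = {v: val / norm for v, val in nxt.items()}
    return x

# Hypothetical graph: triangle 0-1-2 with a pendant vertex 3 attached to 2
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
ev = eigenvector_centrality(g)
print(ev)
```

Vertex 2 ends up with the highest score because it is connected to every other vertex, while the pendant vertex 3 scores lowest.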

__Katz Centrality__ <br>
Katz is a variant of degree centrality and of eigenvector centrality. 
Katz centrality is a measure of the relative importance of a node within the graph based on measuring the influence across the total number of walks between vertex pairs. 

<center>
    <img src="https://latex.codecogs.com/gif.latex?C_{katz}(i)&space;=&space;\sum_{k=1}^{\infty}&space;\sum_{j=1}^{n}&space;\alpha&space;^k(A^k)_{ji}" title="C_{katz}(i) = \sum_{k=1}^{\infty} \sum_{j=1}^{n} \alpha ^k(A^k)_{ji}" />
</center>

See:
* [Katz on Wikipedia](https://en.wikipedia.org/wiki/Katz_centrality) for more details on the algorithm.
* https://www.sci.unich.it/~francesc/teaching/network/katz.html
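
The double sum in the formula above can be evaluated iteratively, because it satisfies x = α Aᵀ(x + 1). The plain-Python sketch below assumes a graph stored as a dict of in-neighbors and a value of α below the reciprocal of the largest eigenvalue of A, which is needed for the series to converge. The function name is hypothetical, and the sketch leaves out the bias term and normalization that library implementations commonly add.

```python
def katz_centrality(in_nbrs, alpha=0.1, iterations=100):
    """Iterate x <- alpha * A^T (x + 1), i.e. sum over k >= 1 of alpha^k * (A^k)_{ji}."""
    x = {v: 0.0 for v in in_nbrs}
    for _ in range(iterations):
        # Every in-neighbor u contributes one direct walk plus all walks ending at u
        x = {v: alpha * sum(x[u] + 1.0 for u in in_nbrs[v]) for v in in_nbrs}
    return x

# Hypothetical undirected star: center 0 with leaves 1, 2, 3
# (for an undirected graph the in-neighbors are simply the neighbors)
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
katz_scores = katz_centrality(star)
print(katz_scores)
```

With a small α the result stays close to degree centrality scaled by α; larger values of α give more weight to longer walks.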

__PageRank__ <br>
PageRank is classified as both a Link Analysis tool and a centrality measure.  PageRank is based on the assumption that important nodes point (directed edge) to other important nodes.  From a social network perspective, the question is who do you seek for an answer and then who does that person seek.  PageRank is good when there is implied importance in the data, for example a citation network, web page linkages, or trust networks.    
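
The "important nodes point to important nodes" idea can be sketched as a power iteration in plain Python. This is an illustrative version only: the `pagerank` helper, the damping value of 0.85, and the toy graph are assumptions of the sketch, not cuGraph's implementation or defaults.

```python
def pagerank(out_nbrs, damping=0.85, iterations=100):
    """Power-iteration PageRank for a directed graph {vertex: [out-neighbors]}."""
    n = len(out_nbrs)
    rank = {v: 1.0 / n for v in out_nbrs}
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is spread evenly over all vertices
        dangling = sum(rank[v] for v in out_nbrs if not out_nbrs[v])
        nxt = {v: (1.0 - damping) / n + damping * dangling / n for v in out_nbrs}
        # Each vertex splits its damped rank equally among the vertices it points to
        for v, targets in out_nbrs.items():
            for w in targets:
                nxt[w] += damping * rank[v] / len(targets)
        rank = nxt
    return rank

# Hypothetical citation-style graph: 1, 2 and 3 all point to 0; 0 points to 1
cites = {0: [1], 1: [0], 2: [0], 3: [0]}
pr_scores = pagerank(cites)
print(pr_scores)
```

Vertex 1 receives only one incoming edge, yet it ranks far above vertices 2 and 3 because that edge comes from the most important vertex.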


### Test Data
We will be using the Zachary Karate club dataset:
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*


![Karate Club](../img/zachary_black_lines.png)


Because the test data has vertex IDs starting at 1, the auto-renumber feature of cuGraph will be used so that the starting vertex ID is zero, for maximum efficiency. The results will then be automatically un-renumbered, making the entire renumbering process transparent to users.
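
Conceptually, renumbering maps the external vertex IDs onto a contiguous zero-based range and keeps a reverse map so results can be translated back. The plain-Python sketch below illustrates only the idea; the `renumber` helper is hypothetical and does not reflect how cuGraph performs this internally.

```python
def renumber(edges):
    """Map arbitrary vertex IDs to contiguous 0-based IDs.

    Returns the renumbered edge list and the internal -> external ID map.
    """
    to_internal = {}
    for src, dst in edges:
        for v in (src, dst):
            if v not in to_internal:
                to_internal[v] = len(to_internal)
    internal_edges = [(to_internal[s], to_internal[d]) for s, d in edges]
    to_external = {i: v for v, i in to_internal.items()}
    return internal_edges, to_external

# Hypothetical edge list whose vertex IDs start at 1
ext_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
int_edges, to_external = renumber(ext_edges)

# "Un-renumbering" translates internal IDs back to the original ones
restored = [(to_external[s], to_external[d]) for s, d in int_edges]
print(int_edges, restored)
```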

In [None]:
#  Import the modules
import cugraph
import cudf

# Import a built-in dataset
from cugraph.experimental.datasets import karate

In [None]:
import numpy as np
import pandas as pd   
from IPython.display import display_html 

### Functions
Underscore-prefixed variable names are used inside the functions to avoid collisions.  
Names without a leading underscore are expected to be global names.

In [None]:
# Compute Centrality
# the centrality calls are very straightforward with the graph being the primary argument
# we are using the default argument values for all centrality functions

def compute_centrality(_graph):
    # Compute Degree Centrality
    _d = _graph.degree()
        
    # Compute the Betweenness Centrality
    _b = cugraph.betweenness_centrality(_graph)

    # Compute Katz Centrality
    _k = cugraph.katz_centrality(_graph)
    
    # Compute PageRank Centrality
    _p = cugraph.pagerank(_graph)
    
    return _d, _b, _k, _p

In [None]:
# Print function
# being lazy and requiring that the dataframe names are not changed versus passing them in
def print_centrality(_n):
    dc_top = dc.sort_values(by='degree', ascending=False).head(_n).to_pandas()
    bc_top = bc.sort_values(by='betweenness_centrality', ascending=False).head(_n).to_pandas()
    katz_top = katz.sort_values(by='katz_centrality', ascending=False).head(_n).to_pandas()
    pr_top = pr.sort_values(by='pagerank', ascending=False).head(_n).to_pandas()
    
    def _styled(_df, _caption):
        _s = _df.style.set_table_attributes("style='display:inline'").set_caption(_caption)
        # Styler.hide_index() was deprecated in pandas 1.4 and removed in 2.0;
        # use Styler.hide(axis='index') when it is available
        return _s.hide(axis='index') if hasattr(_s, 'hide') else _s.hide_index()

    _stylers = [_styled(dc_top, 'Degree'), _styled(bc_top, 'Betweenness'),
                _styled(katz_top, 'Katz'), _styled(pr_top, 'PageRank')]

    # render the four tables side by side
    display_html(''.join(_s._repr_html_() for _s in _stylers), raw=True)

## Create a Graph

In [None]:
# Create a graph using the imported Dataset object
G = karate.get_graph(fetch=True)

## Compute Centrality

In [None]:
dc, bc, katz, pr = compute_centrality(G)

### Results
Typically, analysts look only at the top 10% of results, that is, the vertices that are the most central or important.  
The karate data has 34 vertices, so let's round up a little and look at the top 5 vertices.

In [None]:
print_centrality(5)

### A Different Dataset
The Karate dataset is not that large or complex, which makes it a perfect test dataset since it is easy to visually verify results.  Let's look at a larger dataset with a lot more edges.

In [None]:
# Import a different dataset object
from cugraph.experimental.datasets import netscience

In [None]:
G = netscience.get_graph(fetch=True)
(G.number_of_nodes(), G.number_of_edges())

In [None]:
dc, bc, katz, pr = compute_centrality(G)

In [None]:
print_centrality(5)

We can now see a larger discrepancy between the centrality scores and which nodes rank highest.
Which centrality measure to use is left to the analyst to decide, and the choice requires insight into the different algorithms and the graph structure.
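
One simple way to quantify how much two rankings disagree is the Jaccard overlap of their top-k vertex sets. The sketch below is plain Python with made-up scores; `top_k`, `jaccard`, and the two score dictionaries are hypothetical and are not results from the datasets in this notebook.

```python
def top_k(scores, k):
    """Return the set of the k highest-scoring vertices from a {vertex: score} mapping."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def jaccard(a, b):
    """Jaccard similarity of two sets: size of the intersection over size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Hypothetical scores for five vertices under two different measures
measure_a = {"v0": 0.9, "v1": 0.7, "v2": 0.5, "v3": 0.2, "v4": 0.1}
measure_b = {"v0": 0.8, "v1": 0.1, "v2": 0.6, "v3": 0.7, "v4": 0.2}

overlap = jaccard(top_k(measure_a, 3), top_k(measure_b, 3))
print(overlap)
```

An overlap of 1.0 means both measures pick the same top-k vertices, while values near 0 indicate that they highlight largely different parts of the graph.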

### And One More Dataset
Let's look at a Cyber dataset.  The vertex IDs are IP addresses.

In [None]:
# Import a different dataset object
from cugraph.experimental.datasets import cyber

In [None]:
# Get the edgelist
gdf = cyber.get_edgelist(fetch=True)

# Create a Graph
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

In [None]:
(G.number_of_nodes(), G.number_of_edges())

In [None]:
dc, bc, katz, pr = compute_centrality(G)

In [None]:
print_centrality(5)

There are differences in how each centrality measure ranks the nodes. In some cases, every algorithm returns similar results, and in others, the results are different. Understanding how each centrality measure is computed and what the edges represent is key to selecting the right centrality metric.

----
Copyright (c) 2019-2021, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.