# Centrality

In this notebook, we will compute several centrality measures for both vertices and edges using cuGraph and NetworkX, and then compare the similarities and differences. The NetworkX and cuGraph processes will be interleaved so that each step can be compared.

| Author Credit |    Date    |  Update      | cuGraph Version |  Test Hardware |
| --------------|------------|--------------|-----------------|----------------|
| Brad Rees     | 04/16/2021 | created      | 0.19            | GV100, CUDA 11.0 |



Centrality is a measure of how important, or central, a node or edge is within a graph. It is useful for identifying influencers in social networks and key routing nodes in communication/computer network infrastructures.

The seminal paper on centrality is:  Freeman, L. C. (1978). Centrality in social networks conceptual clarification. Social networks, 1(3), 215-239.


__Degree centrality – coming soon but there is a workaround__ <br>
Degree centrality is based on the notion that whoever has the most connections must be important.
We can use the degree function until a centrality wrapper is added.
<center>
    Cd(v) = degree(v)
</center>

Since degree centrality is just the degree of a vertex, we can use the _G.degree()_ function.
Degree centrality for a directed graph can be further divided into _indegree centrality_ and _outdegree centrality_, which can be obtained using _G.degrees()_.
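
As a rough illustration of the degree counts described above, here is a pure-Python sketch that does not use cuGraph. The function name `degree_counts` and the toy edge list are made up for this example.

```python
from collections import defaultdict

def degree_counts(edges, directed=False):
    """Return (in_degree, out_degree) dicts for a list of (src, dst) pairs."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        if not directed:
            # an undirected edge counts in both directions
            out_deg[dst] += 1
            in_deg[src] += 1
    return dict(in_deg), dict(out_deg)

# hypothetical toy edge list
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
in_deg, out_deg = degree_counts(edges, directed=True)
und_in, und_out = degree_counts(edges)
```

For the undirected case the two dictionaries are identical and each holds the plain degree of every vertex.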


__Closeness centrality – coming soon__ <br>
Closeness is based on the shortest-path distance from a node to every other node in the graph. A node that is close to every other node receives a high centrality score, since it can influence the network more readily than a node that is far from the others.
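
One common way to turn this idea into a number is to divide the count of reachable nodes by the sum of the shortest-path distances to them. The sketch below assumes an unweighted graph stored as an adjacency dictionary; the name `closeness` and the toy path graph are illustrative and are not a cuGraph API.

```python
from collections import deque

def closeness(adj, v):
    """Closeness of v: (reachable nodes) / (sum of shortest-path distances)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:                      # breadth-first search from v
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

# toy path graph 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = {v: closeness(adj, v) for v in adj}
```

The two interior nodes of the path score higher than the two end nodes, which matches the intuition that they are closer to everything else.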

__Betweenness Centrality__ <br>
Betweenness is a measure of the number of shortest paths that pass through a vertex, or over an edge. A node with high betweenness controls more of the flow of information.

Betweenness centrality of a node 𝑣 is the sum of the fraction of all-pairs shortest paths that pass through 𝑣:

<center>
    <img src="https://latex.codecogs.com/png.latex?c_B(v)&space;=\sum_{s,t&space;\in&space;V}&space;\frac{\sigma(s,&space;t|v)}{\sigma(s,&space;t)}" title="c_B(v) =\sum_{s,t \in V} \frac{\sigma(s, t|v)}{\sigma(s, t)}" />
</center>

To speed up the runtime of betweenness centrality, the metric can be computed for a limited number of seeds and then used to estimate the other scores. For this example, the graphs are relatively small (under 5,000 vertices), so betweenness will be computed on every vertex.
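
To make the formula and the seed idea concrete, here is a minimal pure-Python sketch of Brandes' algorithm for an unweighted graph. It is not how cuGraph implements betweenness; the function name `betweenness` and the `sources` parameter are assumptions made for this example. Scores are summed over ordered (s, t) pairs, as in the formula above, so an undirected graph counts each pair twice.

```python
from collections import deque

def betweenness(adj, sources=None):
    """Brandes' algorithm on an adjacency dict; `sources` limits the seeds."""
    bc = {v: 0.0 for v in adj}
    for s in (adj if sources is None else sources):
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        order = []                    # vertices in order of discovery
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # accumulate dependencies from the farthest vertices back to s
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# toy path graph 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
exact = betweenness(adj)
from_one_seed = betweenness(adj, sources=[0])
```

Passing a subset of vertices as `sources` gives the partial sums that a sampling-based estimate would then scale up.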

__Eigenvector Centrality - coming soon__ <br>
I like to think of Eigenvectors as the balancing points of a graph, or center of gravity of a 3D object.  High centrality means that more of the graph is balanced around that node.
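
A standard way to compute eigenvector centrality is power iteration on the adjacency matrix. The following is only a sketch under that assumption, using an adjacency dictionary for an undirected graph; the function name and the toy graph are made up. The iteration uses `x + A x` rather than `A x`, which has the same eigenvectors but avoids oscillation on bipartite graphs.

```python
import math

def eigenvector_centrality(adj, iterations=200):
    """Power iteration on (A + I); returns a unit-length score vector."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # each vertex keeps its own score and adds its neighbors' scores
        nxt = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in nxt.values()))
        x = {v: val / norm for v, val in nxt.items()}
    return x

# toy graph: triangle 0-1-2 with a tail 2-3
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
ev = eigenvector_centrality(adj)
```

In this toy graph vertex 2 scores highest, since it is connected to every other vertex.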

__Katz Centrality__ <br>
Katz is a variant of degree centrality and of eigenvector centrality. 
Katz centrality is a measure of the relative importance of a vertex within the graph based on measuring the influence across the total number of walks between vertex pairs. 

<center>
    <img src="https://latex.codecogs.com/gif.latex?C_{katz}(i)&space;=&space;\sum_{k=1}^{\infty}&space;\sum_{j=1}^{n}&space;\alpha&space;^k(A^k)_{ji}" title="C_{katz}(i) = \sum_{k=1}^{\infty} \sum_{j=1}^{n} \alpha ^k(A^k)_{ji}" />
</center>

See:
* [Katz on Wikipedia](https://en.wikipedia.org/wiki/Katz_centrality) for more details on the algorithm.
* https://www.sci.unich.it/~francesc/teaching/network/katz.html
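
The sum in the formula above can be approximated by truncating it after a fixed number of terms. The sketch below does this in pure Python for an undirected graph stored as an adjacency dictionary (for a directed graph, `adj[v]` would need to list the in-neighbors of `v`). The function name, the default `alpha`, and the cutoff `max_k` are assumptions made for this example, and the series only converges when `alpha` is smaller than the reciprocal of the largest eigenvalue of the adjacency matrix.

```python
def katz_centrality(adj, alpha=0.1, max_k=100):
    """Truncated Katz sum: C(i) = sum_{k=1..max_k} sum_j alpha^k (A^k)_{ji}."""
    walks = {v: 1.0 for v in adj}   # k = 0 term: one zero-length walk per vertex
    score = {v: 0.0 for v in adj}
    for _ in range(max_k):
        # alpha^k * (number of length-k walks ending at v), from the k-1 term
        walks = {v: alpha * sum(walks[u] for u in adj[v]) for v in adj}
        for v in adj:
            score[v] += walks[v]
    return score

# toy star graph: center 0 connected to leaves 1, 2, 3
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
katz_scores = katz_centrality(adj)
```

The center of the star receives the highest score because every walk of length two or more passes through it.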

__PageRank Centrality__ <br>
PageRank is classified as both a link analysis tool and a centrality measure. PageRank is based on the assumption that important nodes point (via directed edges) to other important nodes. From a social network perspective, the question is whom you seek out for an answer, and then whom that person seeks out.
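
The idea can be sketched as a simple power iteration in which each vertex passes its rank along its out-edges. This is an illustrative pure-Python version, not the cuGraph implementation; the function name, the damping value of 0.85, and the toy graph are assumptions made for this example.

```python
def pagerank(out_adj, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict of vertex -> list of out-neighbors."""
    n = len(out_adj)
    rank = {v: 1.0 / n for v in out_adj}
    for _ in range(iterations):
        # rank held by vertices with no out-edges is spread over all vertices
        dangling = sum(rank[v] for v in out_adj if not out_adj[v])
        nxt = {v: (1.0 - damping) / n + damping * dangling / n for v in out_adj}
        for v, targets in out_adj.items():
            for t in targets:
                nxt[t] += damping * rank[v] / len(targets)
        rank = nxt
    return rank

# toy directed graph: 0 -> 2, 1 -> 2, 2 -> 0
out_adj = {0: [2], 1: [2], 2: [0]}
pr_scores = pagerank(out_adj)
```

Vertex 2 ends up with the highest rank because both other vertices point to it, while vertex 1, which nothing points to, keeps only the baseline share.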


### Test Data
We will be using the Zachary Karate club dataset:
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*


![Karate Club](../img/zachary_black_lines.png)


Because the test data has vertex IDs starting at 1, the auto-renumber feature of cuGraph will be used so that the internal vertex IDs start at zero for maximum efficiency. The results are then automatically un-renumbered, making the entire renumbering process transparent to users.
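
Conceptually, renumbering maps the external vertex IDs onto a contiguous zero-based range and keeps the mapping so that results can be translated back. The sketch below only illustrates that idea in plain Python; it is not cuGraph's renumbering code, and the helper name `renumber` is hypothetical.

```python
def renumber(edges):
    """Map external vertex IDs to 0..n-1 and return the reverse lookup."""
    ids = sorted({v for edge in edges for v in edge})
    to_internal = {ext: i for i, ext in enumerate(ids)}
    internal_edges = [(to_internal[s], to_internal[d]) for s, d in edges]
    return internal_edges, ids      # ids[i] is the external ID of internal i

# toy edge list with IDs starting at 1
edges = [(1, 2), (1, 3), (3, 4)]
internal_edges, ids = renumber(edges)

# translate back to the original IDs ("un-renumber")
restored = [(ids[s], ids[d]) for s, d in internal_edges]
```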

In [None]:
#  Import the modules
import cugraph
import cudf

In [None]:
import numpy as np
import pandas as pd   
from IPython.display import display_html 

### Functions

In [None]:
# Compute Centrality
# the centrality calls are very straightforward, with the graph being the primary argument
def compute_centrality(_graph) :
    # Compute Degree Centrality
    dc = _graph.degree()
        
    # Compute the Betweenness Centrality
    bc = cugraph.betweenness_centrality(_graph)

    # Compute Katz Centrality
    katz = cugraph.katz_centrality(_graph)
    
    # Compute PageRank Centrality
    pr = cugraph.pagerank(_graph)
    
    return dc, bc, katz, pr

In [None]:
# Print function
# for simplicity, this relies on the global dataframes keeping the names dc, bc, katz, and pr
def print_centrality(n):
    dc_top = dc.sort_values(by='degree', ascending=False).head(n).to_pandas()
    bc_top = bc.sort_values(by='betweenness_centrality', ascending=False).head(n).to_pandas()
    katz_top = katz.sort_values(by='katz_centrality', ascending=False).head(n).to_pandas()
    pr_top = pr.sort_values(by='pagerank', ascending=False).head(n).to_pandas()
    
    df1_styler = dc_top.style.set_table_attributes("style='display:inline'").set_caption('Degree').hide_index()
    df2_styler = bc_top.style.set_table_attributes("style='display:inline'").set_caption('Betweenness').hide_index()
    df3_styler = katz_top.style.set_table_attributes("style='display:inline'").set_caption('Katz').hide_index()
    df4_styler = pr_top.style.set_table_attributes("style='display:inline'").set_caption('PageRank').hide_index()

    display_html(df1_styler._repr_html_()+df2_styler._repr_html_()+df3_styler._repr_html_()+df4_styler._repr_html_(), raw=True)

## Read the data

In [None]:
# Define the path to the test data  
datafile='../data/karate-data.csv'

cuGraph does not do any data reading or writing and depends on other tools for that, with cuDF being the preferred solution.

The data file contains an edge list, where each row represents the connection of one vertex to another. The `source` to `destination` pairs are in what is known as Coordinate Format (COO). In this test case, the data is just two columns. However, a third column, `weight`, is also possible.
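
To show what a COO edge list looks like, the following sketch parses a small tab-separated snippet into three parallel columns using only the Python standard library. The sample rows and weights are made up for illustration and are not taken from the karate data file.

```python
import csv
import io

# hypothetical tab-separated edge list: source, destination, weight
text = "1\t2\t1.0\n1\t3\t0.5\n2\t3\t2.0\n"

src, dst, wt = [], [], []
for row in csv.reader(io.StringIO(text), delimiter="\t"):
    src.append(int(row[0]))
    dst.append(int(row[1]))
    wt.append(float(row[2]))

# entry i of each column together describes edge i
coo_edges = list(zip(src, dst, wt))
```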

In [None]:
gdf = cudf.read_csv(datafile, delimiter='\t', names=['src', 'dst'], dtype=['int32', 'int32'] )

It was that easy to load the data.

## Create a Graph

In [None]:
# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe 
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

## Compute Centrality

In [None]:
dc, bc, katz, pr = compute_centrality(G)

### Results
Typically, analysts look at just the top 10% of results, that is, only those vertices that are the most central or important.
The karate data has 34 vertices, so let's round a little and look at the top 5 vertices.

In [None]:
print_centrality(5)

### A Different Dataset
The Karate dataset is not that large or complex, which makes it a perfect test dataset since it is easy to visually verify results. Let's look at a larger dataset with a lot more edges.

In [None]:
# Define the path to the test data  
datafile='../data/netscience.csv'

gdf = cudf.read_csv(datafile, delimiter=' ', names=['src', 'dst', 'wt'], dtype=['int32', 'int32', 'float'] )

In [None]:
# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe 
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

In [None]:
(G.number_of_nodes(), G.number_of_edges())

In [None]:
dc, bc, katz, pr = compute_centrality(G)

In [None]:
print_centrality(5)

We can now see a large discrepancy in which vertex each algorithm ranks as the most important.
Which one is really the most central is left to the analyst to decide, and that decision requires insight into the differences between the centrality algorithms.
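
One simple way to quantify how much two measures disagree is to compare their top-n vertex sets, for example with the Jaccard overlap. The sketch below uses made-up scores purely for illustration; the helper names `top_n` and `jaccard` are hypothetical and are not part of cuGraph.

```python
def top_n(scores, n):
    """Return the set of the n highest-scoring vertices."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

# made-up toy scores for two centrality measures
degree_scores = {"a": 5, "b": 4, "c": 3, "d": 1}
pagerank_scores = {"a": 0.4, "b": 0.1, "c": 0.3, "d": 0.2}

overlap = jaccard(top_n(degree_scores, 2), top_n(pagerank_scores, 2))
```

An overlap of 1.0 means both measures agree on the top-n set, while a value near 0 means they highlight almost entirely different vertices.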

Copyright (c) 2021, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.