# Chapter 5 | Centrality Algorithms
Centrality algorithms are used to understand the roles of particular nodes in a graph and their impact on that network. They’re useful because they identify the most important nodes and help us understand group dynamics such as credibility, accessibility, the speed at which things spread, and bridges between groups.

| Algorithm type | What it does | Example use |
|:---------------|:-------------|:------------|
| Degree Centrality | Measures the number of relationships a node has | Estimating a person’s popularity by looking at their in-degree and using their out-degree to estimate gregariousness |
| Closeness Centrality (variations: Wasserman and Faust, Harmonic Centrality) | Calculates which nodes have the shortest paths to all other nodes | Finding the optimal location of new public services for maximum accessibility |
| Betweenness Centrality (variation: Randomized-Approximate Brandes) | Measures the number of shortest paths that pass through a node | Improving drug targeting by finding the control genes for specific diseases |
| PageRank (variation: Personalized PageRank) | Estimates a current node’s importance from its linked neighbors and their neighbors (popularized by Google) | Finding the most influential features for extraction in machine learning and ranking text for entity relevance in natural language processing |

## The Social Graph: Importing the Data into Apache Spark

In [1]:
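from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName('centrality').getOrCreate()

# Project helper module; the star import is expected to supply GraphFrame,
# op (presumably os.path), data_path, F, AM, and the helper functions used below
from code.script.centrality import *

# Build the graph from the node and relationship CSV files
g = GraphFrame(
    spark.read.csv(op.join(data_path, 'social-nodes.csv'), header=True),
    spark.read.csv(op.join(data_path, 'social-relationships.csv'), header=True)
)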

## Degree Centrality
Degree Centrality counts the number of incoming and outgoing relationships of a node, and is used to find popular nodes in a graph.

### Reach
Understanding the reach of a node is a fair measure of importance. How many other nodes can it touch right now? The degree of a node is the number of direct relationships it has, calculated for in-degree and out-degree. You can think of this as the immediate reach of a node. For example, a person with a high degree in an active social network would have a lot of immediate contacts and be more likely to catch a cold circulating in their network.

### When Should I Use Degree Centrality?
Use Degree Centrality if you’re attempting to analyze influence by looking at the number of incoming and outgoing relationships, or find the “popularity” of individual nodes. It works well when you’re concerned with immediate connectedness or near-term probabilities. However, Degree Centrality is also applied to global analysis when you want to evaluate the minimum degree, maximum degree, mean degree, and standard deviation across the entire graph.
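
The global statistics mentioned above can be sketched in plain Python. This is a minimal illustration rather than the chapter's Spark code: the `degree_summary` function and the edge list are hypothetical, and nodes without any relationship would not appear because the input is an edge list only.

```python
from collections import Counter
from statistics import mean, pstdev

def degree_summary(edges):
    """Return per-node total degree and graph-wide statistics for directed edges."""
    in_degree = Counter(dst for _, dst in edges)
    out_degree = Counter(src for src, _ in edges)
    nodes = sorted(set(in_degree) | set(out_degree))
    # A Counter returns 0 for missing keys, so nodes with no in- or out-edges are handled
    degree = {v: in_degree[v] + out_degree[v] for v in nodes}
    values = list(degree.values())
    stats = {"min": min(values), "max": max(values),
             "mean": mean(values), "std": pstdev(values)}
    return degree, stats

# Hypothetical (src, dst) pairs, not the chapter's social graph
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]
degree, stats = degree_summary(edges)
print(degree)
print(stats)
```
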
### Degree Centrality with Apache Spark

In [2]:
total_degree = g.degrees
in_degree = g.inDegrees
out_degree = g.outDegrees

(total_degree.join(in_degree, "id", how="left")
 .join(out_degree, "id", how="left")
 .fillna(0)
 .sort("inDegree", ascending=False)
 .toPandas())

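| id | degree | inDegree | outDegree |
|:---|-------:|---------:|----------:|
| Doug | 6 | 5 | 1 |
| Alice | 7 | 3 | 4 |
| Michael | 5 | 2 | 3 |
| Bridget | 5 | 2 | 3 |
| Amy | 1 | 1 | 0 |
| Charles | 2 | 1 | 1 |
| Mark | 3 | 1 | 2 |
| David | 2 | 1 | 1 |
| James | 1 | 0 | 1 |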


## Closeness Centrality
Closeness Centrality is a way of detecting nodes that are able to spread information efficiently through a subgraph.
A node’s closeness is the inverse of its farness, where farness is the sum of its shortest-path distances to all other nodes. Nodes with a high closeness score have the shortest distances to all other nodes.

\begin{equation*}
C(u) = {\frac{1}{\sum_{v=1}^{n-1} d(u,v)}}
\end{equation*}

\begin{equation*}
C_{norm}(u) = {\frac{n-1}{\sum_{v=1}^{n-1} d(u,v)}}
\end{equation*}

- *u* is a node
- *N* is the total node count
- *n* is the number of nodes in the same component as *u*
- *d(u,v)* is the shortest-path distance between another node *v* and *u*
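
As a rough illustration of the two formulas, the sketch below computes closeness with a breadth-first search over an unweighted, undirected adjacency list. The function names and the small two-component graph are assumptions made for this example; they are not the helper functions used in the Spark code.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    del dist[source]  # keep only the other n-1 nodes of the component
    return dist

def closeness(adj, u, normalized=True):
    """C(u) or C_norm(u), computed within the component that contains u."""
    dist = bfs_distances(adj, u)
    total = sum(dist.values())
    if total == 0:  # isolated node: no other node is reachable
        return 0.0
    return (len(dist) if normalized else 1) / total

# Hypothetical graph with two components: A-B-C and D-E
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": ["E"], "E": ["D"]}
for node in adj:
    print(node, closeness(adj, node))
```

Here `B` and `D` both score 1.0 even though `D` can reach only one node, which is the per-component behavior that the variations below try to correct.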

### When Should I Use Closeness Centrality?
Apply Closeness Centrality when you need to know which nodes disseminate things the fastest. Using weighted relationships can be especially helpful in evaluating interaction speeds in communication and behavioral analyses.

### Closeness Centrality with Apache Spark

In [3]:
# Start every vertex with an empty list of known (id, distance) paths
vertices = g.vertices.withColumn("ids", F.array())
g2 = GraphFrame(AM.getCachedDataFrame(vertices), g.edges)

# Every path entry is an (id, distance) struct; wrap the helpers as UDFs once
path_type = ArrayType(StructType([
    StructField("id", StringType()),
    StructField("distance", IntegerType())]))
new_paths_udf = F.udf(new_paths, path_type)
flatten_udf = F.udf(flatten, path_type)
merge_paths_udf = F.udf(merge_paths, path_type)

for i in range(0, g2.vertices.count()):
    # Send each vertex's known paths along every relationship, in both directions
    msg_dst = new_paths_udf(AM.src["ids"], AM.src["id"])
    msg_src = new_paths_udf(AM.dst["ids"], AM.dst["id"])
    agg = g2.aggregateMessages(F.collect_set(AM.msg).alias("agg"),
                               sendToSrc=msg_src, sendToDst=msg_dst)
    res = agg.withColumn("newIds", flatten_udf("agg")).drop("agg")

    # Merge the newly received paths into each vertex's list
    new_vertices = (g2.vertices.join(res, on="id", how="left_outer")
                    .withColumn("mergedIds", merge_paths_udf("ids", "newIds", "id"))
                    .drop("ids", "newIds")
                    .withColumnRenamed("mergedIds", "ids"))
    cached_new_vertices = AM.getCachedDataFrame(new_vertices)
    g2 = GraphFrame(cached_new_vertices, g2.edges)


In [4]:
(g2.vertices
 .withColumn("closeness", F.udf(calculate_closeness, DoubleType())("ids"))
 .sort("closeness", ascending=False)
 .toPandas())

| id | ids | closeness |
|:---|:----|----------:|
| Doug | [(Charles, 1), (Mark, 1), (Alice, 1), (Bridget, 1), (Michael, 1)] | 1.0 |
| Alice | [(Charles, 1), (Mark, 1), (Bridget, 1), (Doug, 1), (Michael, 1)] | 1.0 |
| David | [(James, 1), (Amy, 1)] | 1.0 |
| Michael | [(Charles, 2), (Mark, 2), (Alice, 1), (Doug, 1), (Bridget, 1)] | 0.714286 |
| Bridget | [(Charles, 2), (Mark, 2), (Alice, 1), (Doug, 1), (Michael, 1)] | 0.714286 |
| James | [(Amy, 2), (David, 1)] | 0.666667 |
| Amy | [(James, 2), (David, 1)] | 0.666667 |
| Charles | [(Bridget, 2), (Mark, 2), (Michael, 2), (Doug, 1), (Alice, 1)] | 0.625 |
| Mark | [(Bridget, 2), (Charles, 2), (Michael, 2), (Doug, 1), (Alice, 1)] | 0.625 |


_This score represents the closeness of each user to others within their subgraph but not the entire graph._

### Closeness Centrality Variation: Wasserman and Faust
Stanley Wasserman and Katherine Faust came up with an improved formula for calculating closeness for graphs with multiple subgraphs without connections between those groups. Details on their formula are in their book, *Social Network Analysis: Methods and Applications*. The result of this formula is a ratio of the fraction of nodes in the group that are reachable to the average distance from the reachable nodes.


\begin{equation*}
C_{WF}(u) = \frac{n-1}{N-1}\left(\frac{n-1}{\sum_{v=1}^{n-1} d(u,v)}\right)
\end{equation*}
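
As a worked example, the formula can be applied to two rows of the closeness output shown earlier, reading *N* = 9 from the number of rows and *n* from the length of each `ids` list plus one. David reaches 2 other nodes (*n* = 3) with distances 1 and 1; Michael reaches 5 other nodes (*n* = 6) with distances 2, 2, 1, 1, and 1.

\begin{equation*}
C_{WF}(\text{David}) = \frac{3-1}{9-1}\left(\frac{3-1}{1+1}\right) = \frac{2}{8}\cdot 1 = 0.25
\end{equation*}

\begin{equation*}
C_{WF}(\text{Michael}) = \frac{6-1}{9-1}\left(\frac{6-1}{2+2+1+1+1}\right) = \frac{5}{8}\cdot\frac{5}{7} \approx 0.446
\end{equation*}

David's normalized closeness of 1.0 drops below Michael's once the size of each subgraph is taken into account.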

### Closeness Centrality Variation: Harmonic Centrality
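Harmonic Centrality (also known as Valued Centrality) is a variant of Closeness Centrality, invented to solve the original problem with unconnected graphs. Rather than summing the distances from a node to all other nodes and inverting the total, it sums the inverse of each distance, so an unreachable node (infinite distance) simply contributes zero.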

\begin{equation*}
H(u) = {\sum_{v=1}^{n-1}}{\frac{1}{d(u,v)}}
\end{equation*}

\begin{equation*}
H_{norm}(u) = \frac{{\sum_{v=1}^{n-1}}{\frac{1}{d(u,v)}}}{n-1}
\end{equation*}
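
A minimal sketch of the harmonic formulas follows, again using breadth-first search on an unweighted, undirected adjacency list. The function names and the graph are hypothetical. The normalization divides by the total node count minus one, which is an assumption about how *n* should be read for a disconnected graph.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    del dist[source]
    return dist

def harmonic(adj, u, normalized=True):
    """Sum of inverse distances; unreachable nodes are absent and so add nothing."""
    score = sum(1 / d for d in bfs_distances(adj, u).values())
    if normalized:
        # Assumption: normalize by all other nodes in the graph, reachable or not
        score /= len(adj) - 1
    return score

# Hypothetical graph with two components: A-B-C and D-E
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": ["E"], "E": ["D"]}
for node in adj:
    print(node, harmonic(adj, node))
```

With this normalization `D` scores lower than `B`, because `D` cannot reach three of the four other nodes.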

## Betweenness Centrality
Betweenness Centrality is a way of detecting the amount of influence a node has over the flow of information or resources in a graph. It is typically used to find nodes that serve as a bridge from one part of a graph to another.
The Betweenness Centrality algorithm first calculates the shortest (weighted) path between every pair of nodes in a connected graph. Each node receives a score, based on the number of these shortest paths that pass through the node. The more shortest paths that a node lies on, the higher its score.

#### Bridges and control points
A bridge in a network can be a node or a relationship. In a very simple graph, you can find them by looking for the node or relationship that, if removed, would cause a section of the graph to become disconnected.

A node is considered *pivotal* for two other nodes if it lies on every shortest path between those nodes. If you remove a pivotal node, the new shortest path for the original node pairs will be longer or more costly. This can be a consideration for evaluating single points of vulnerability.

#### Calculating betweenness centrality

\begin{equation*}
B(u) = {\sum_{s\neq{u}\neq{t}}}{\frac{p(u)}{p}}
\end{equation*}

- *u* is a node
- *p* is the total number of shortest paths between nodes *s* and *t*
- *p(u)* is the number of shortest paths between nodes *s* and *t* that pass through node *u*
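
The sum above can be computed without enumerating every path. The sketch below uses Brandes' exact algorithm, restricted here to unweighted, undirected graphs as a simplifying assumption; the function name and the example graph are hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Exact betweenness via Brandes' algorithm (unweighted, undirected graph)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Forward pass: BFS from s, counting shortest paths and recording predecessors
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward pass: accumulate each node's share of the paths that start at s
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair (s, t) was visited from both ends
    return {v: b / 2 for v, b in score.items()}

# Hypothetical graph: a square A-B-D-C-A with a tail D-E
adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
       "D": ["B", "C", "E"], "E": ["D"]}
scores = betweenness(adj)
print(scores)
```

In this graph `D` is pivotal for every pair that involves `E`, so it receives the highest score, while `B` and `C` split the two shortest paths between `A` and `D`.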

### When Should I Use Betweenness Centrality?
Betweenness Centrality applies to a wide range of problems in real-world networks.
We use it to find bottlenecks, control points, and vulnerabilities.

### Betweenness Centrality Variation: Randomized-Approximate Brandes

The Randomized-Approximate Brandes (RA-Brandes for short) algorithm is the best-known algorithm for calculating an approximate score for betweenness centrality. Rather than calculating the shortest path between every pair of nodes, the RA-Brandes algorithm considers only a subset of nodes. Two common strategies for selecting the subset of nodes are:

#### Random

Nodes are selected uniformly, at random, with a defined probability of selection. The default probability is $\frac{\log_{10}(N)}{e^2}$.

#### Degree
Nodes are selected randomly, but those whose degree is lower than the mean are automatically excluded (i.e., only nodes with a lot of relationships have a chance of being visited).
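
The two selection strategies can be sketched as follows. This is only an illustration of the sampling step, not an RA-Brandes implementation: the function names are hypothetical, and the selection probability used for the degree strategy is an assumption, since the text does not specify it.

```python
import math
import random

def default_probability(n_nodes):
    """Default selection probability log10(N) / e^2 for the random strategy."""
    return math.log10(n_nodes) / math.e ** 2

def select_random(nodes, probability, seed=None):
    """Keep each node independently with the given probability."""
    rng = random.Random(seed)
    return [v for v in nodes if rng.random() < probability]

def select_by_degree(degree, probability, seed=None):
    """Exclude nodes whose degree is below the mean, then sample the rest."""
    mean_degree = sum(degree.values()) / len(degree)
    candidates = [v for v, k in degree.items() if k >= mean_degree]
    return select_random(candidates, probability, seed)

# Hypothetical degrees; the mean is 2, so A and B can never be selected
degree = {"A": 1, "B": 1, "C": 4, "D": 2}
print(default_probability(100))
print(select_by_degree(degree, probability=1.0))
```
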

## PageRank
PageRank measures the transitive (or directional) influence of nodes. All the other centrality algorithms discussed measure the direct influence of a node, whereas PageRank considers the influence of a node’s neighbors, and their neighbors.

The basic assumption is that a page with more incoming links, and more influential incoming links, is more likely to be a credible source. PageRank measures the number and quality of incoming relationships to a node to estimate how important that node is. Nodes with more sway over a network are presumed to have more incoming relationships from other influential nodes.

### Influence
Relationships to more important nodes contribute more to the influence of the node in question than equivalent connections to less important nodes. Measuring influence usually involves scoring nodes, often with weighted relationships, and then updating the scores over many iterations. Sometimes all nodes are scored, and sometimes a random selection is used as a representative distribution.

### The PageRank Formula

\begin{equation*}
PR(u) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)}\right)
\end{equation*}

- Assume a page *u* has citations from pages *T1* to *Tn*
- *d* is a damping factor set between 0 and 1 (usually 0.85)
- *1-d* is the probability that a node is reached directly without following any relationships
- *C(Tn)* is the out-degree of node *Tn*
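
The formula can be applied repeatedly until the scores settle. The sketch below is a plain-Python illustration under stated assumptions: the `pagerank` function, its parameters, and the three-node graph are hypothetical, every score starts at 1.0, and a node without outgoing relationships simply passes nothing on.

```python
def pagerank(out_links, d=0.85, max_iter=20, tol=None):
    """Iterate PR(u) = (1-d) + d * sum(PR(T)/C(T)) over all nodes T linking to u."""
    nodes = set(out_links) | {t for targets in out_links.values() for t in targets}
    rank = {u: 1.0 for u in nodes}
    for _ in range(max_iter):
        new_rank = {u: 1 - d for u in nodes}
        for src, targets in out_links.items():
            for t in targets:
                # src shares its current score equally among its out-links
                new_rank[t] += d * rank[src] / len(targets)
        change = max(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if tol is not None and change < tol:
            break  # scores have converged
    return rank

# Hypothetical graph: A and B both link to C, and C links back to A
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links, max_iter=100)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

`B` has no incoming relationships, so its score is just *1-d*, while `C` ranks highest because it collects the scores of both `A` and `B`.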

### Iteration, Random Surfers, and Rank Sinks
PageRank is an iterative algorithm that runs either until scores converge or until a set number of iterations is reached.

A node, or group of nodes, without outgoing relationships (also called a dangling node) can monopolize the PageRank score by refusing to share. This is known as a *rank sink*. *Teleportation* is used to overcome dead ends.

Circular references inflate the ranks of the nodes in the cycle as the surfer bounces back and forth among them. A damping factor is used to introduce random node visits.

If you see unexpected results from PageRank, it is worth doing some exploratory analysis of the graph to see if any of these problems are the cause.

### When Should I Use PageRank?
Use this algorithm whenever you’re looking for broad influence over a network. For instance, if you’re looking to target a gene that has the highest overall impact on a biological function, it may not be the most connected one. It may, in fact, be the gene with the most relationships with other, more significant functions.

### PageRank with Apache Spark
#### PageRank with a fixed number of iterations

In [5]:
results = g.pageRank(resetProbability=0.15, maxIter=20)
results.vertices.sort("pagerank", ascending=False).toPandas()

| id | pagerank |
|:---|---------:|
| Doug | 2.286537 |
| Mark | 2.142448 |
| Alice | 1.520331 |
| Michael | 0.727443 |
| Bridget | 0.727443 |
| Charles | 0.521385 |
| Amy | 0.509714 |
| David | 0.366558 |
| James | 0.19814 |


#### PageRank until convergence

In [6]:
results = g.pageRank(resetProbability=0.15, tol=0.01)
results.vertices.sort("pagerank", ascending=False).toPandas()

| id | pagerank |
|:---|---------:|
| Doug | 2.223319 |
| Mark | 2.090451 |
| Alice | 1.505629 |
| Michael | 0.733739 |
| Bridget | 0.733739 |
| Amy | 0.559447 |
| Charles | 0.533881 |
| David | 0.402323 |
| James | 0.217472 |


#### PageRank Variation: Personalized PageRank
Personalized PageRank (PPR) is a variant of the PageRank algorithm that calculates the importance of nodes in a graph from the perspective of a specific node. For PPR, random jumps refer back to a given set of starting nodes.

This creates bias and localization towards the start nodes, making PPR useful for highly targeted recommendations.
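
To make the effect of the restricted random jump concrete, here is a minimal plain-Python sketch with a single start node. The function name, the starting scores, and the graph are assumptions for illustration, and library implementations may differ in their details.

```python
def personalized_pagerank(out_links, source, d=0.85, max_iter=20):
    """PageRank in which every random jump returns to the single source node."""
    nodes = set(out_links) | {t for targets in out_links.values() for t in targets}
    nodes.add(source)
    # All of the score starts at the source
    rank = {u: 1.0 if u == source else 0.0 for u in nodes}
    for _ in range(max_iter):
        new_rank = {u: 0.0 for u in nodes}
        new_rank[source] = 1 - d  # only the source receives the jump term
        for src, targets in out_links.items():
            for t in targets:
                new_rank[t] += d * rank[src] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: S -> A -> B -> S, plus X -> S (X cannot be reached from S)
links = {"S": ["A"], "A": ["B"], "B": ["S"], "X": ["S"]}
ppr = personalized_pagerank(links, source="S", max_iter=100)
for node in sorted(ppr, key=ppr.get, reverse=True):
    print(node, round(ppr[node], 4))
```

Nodes that cannot be reached from the start node, such as `X` here, never receive any score, which matches the zero scores that can appear in personalized results.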

#### Personalized PageRank with Apache Spark

In [7]:
me = "Doug"
results = g.pageRank(resetProbability=0.15, maxIter=20, sourceId=me)
people_to_follow = results.vertices.sort("pagerank", ascending=False)

already_follows = list(g.edges.filter(f"src = '{me}'").toPandas()["dst"])
people_to_exclude = already_follows + [me]

people_to_follow[~people_to_follow.id.isin(people_to_exclude)].toPandas()

| id | pagerank |
|:---|---------:|
| Alice | 0.165018 |
| Michael | 0.048842 |
| Bridget | 0.048842 |
| Charles | 0.034978 |
| David | 0.0 |
| James | 0.0 |
| Amy | 0.0 |
