# Link Analysis. PageRank Algorithm

## Setup

In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Run Spark locally with a single worker thread
conf = SparkConf().setAppName('pagerank').setMaster('local')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)



## Algorithm

PageRank estimates how important a page is by counting the number and quality of the links that point to it. The underlying assumption is that more important pages are likely to receive more links from other pages.

The algorithm operates on a directed network. For example, the World Wide Web can be represented as a huge directed network in which websites are the nodes and the hyperlinks between them are the directed edges.

#### Example: 

Suppose there are 5 websites, A, B, C, D and E, where A links to B; B links to D and E; C links to B; D links to A, C and E; and E links to A. The resulting directed network is shown below.

![](https://miro.medium.com/max/1400/1*WePw-05wGpkamKVI3cfqew.png)

According to Google, the importance of a page is determined by the number and quality of its inbound links.

#### Steps to calculate PageRank centrality of each node

**Step 1:** Assign each node an initial value of 1/n, where n is the number of nodes.

**Step 2:** For each node `n(i)` in the network, find the list `r = [n(1), n(2), ..]` of nodes that `n(i)` links to, and send each of them a contribution of `(previous PageRank value of n(i)) / (size of r)`. The new PageRank value of a node is the sum of the contributions it receives.

**Step 3:** Repeat Step 2 until convergence 
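
The three steps can be sketched in plain Python, without Spark, on the five-site example. The names `example_links`, `pagerank_step` and `tolerance` are illustrative and not part of the notebook, and the sketch assumes every node has at least one outgoing link.

```python
# Out-links of the five-site example:
# A -> B; B -> D, E; C -> B; D -> A, C, E; E -> A
example_links = {
    "A": ["B"],
    "B": ["D", "E"],
    "C": ["B"],
    "D": ["A", "C", "E"],
    "E": ["A"],
}


def pagerank_step(links, ranks):
    """Apply Step 2 once: each node splits its rank evenly among its out-links."""
    new_ranks = {node: 0.0 for node in links}
    for node, targets in links.items():
        share = ranks[node] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# Step 1: every node starts at 1/n
n = len(example_links)
example_ranks = {node: 1.0 / n for node in example_links}

# Step 3: repeat Step 2 until the ranks stop changing (up to a small tolerance)
tolerance = 1e-10
for round_number in range(1, 1001):
    updated = pagerank_step(example_links, example_ranks)
    change = sum(abs(updated[node] - example_ranks[node]) for node in example_links)
    example_ranks = updated
    if change < tolerance:
        break

print("rounds used:", round_number)
for node, rank in sorted(example_ranks.items()):
    print(node, round(rank, 4))
```

Measuring the total absolute change between rounds is one simple stopping rule; other norms or a fixed number of rounds would work as well.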

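The starting values do not have to be 1/n. When the network is strongly connected and aperiodic, the iteration converges to the same ranks (up to a scale set by the sum of the starting values) regardless of the initialization, because the limit is the principal eigenvector of the PageRank matrix, which does not depend on the initial values. Without those conditions the simple update in Step 2 can oscillate, or lose rank at nodes that have no out-links, which is why practical PageRank implementations add a damping factor.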
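
One way to make the "PageRank matrix" concrete is to write Step 2 as a matrix-vector product. The symbols below ($r$ for the vector of ranks, $M$ for the matrix, $d_i$ for the number of out-links of node $i$) are notation introduced here for illustration and do not appear in the notebook.

$$
r^{(t+1)} = M \, r^{(t)}, \qquad
M_{ji} =
\begin{cases}
1 / d_i & \text{if node } i \text{ links to node } j \\
0 & \text{otherwise}
\end{cases}
$$

A converged ranking satisfies $r = M r$, so it is an eigenvector of $M$ with eigenvalue 1. When every node has at least one out-link, each column of $M$ sums to 1, so the total rank is preserved from one round to the next.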

Next, let's work with the `links.txt` file.

In [18]:
# Adjacency list
links = sc.textFile('links.txt')
links.collect()

['A B', 'B A C', 'C B D', 'D C']

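Each line is an adjacency-list entry: the first token is the source node and the remaining tokens are the nodes it links to. For example, `'A B'` means that A links to B, and `'B A C'` means that B links to both A and C.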
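
The same line format can be parsed in plain Python, which is handy for checking the input before handing it to Spark. This is a small sketch: `lines` repeats the file contents shown above, and `parse_line`, `adjacency` and `all_nodes` are illustrative names, not part of the notebook.

```python
lines = ['A B', 'B A C', 'C B D', 'D C']


def parse_line(line):
    """Split 'source target1 target2 ...' into (source, [targets])."""
    source, *targets = line.split()
    return source, targets


adjacency = dict(parse_line(line) for line in lines)

# Collect every node that appears as a source or as a target
all_nodes = set(adjacency) | {t for targets in adjacency.values() for t in targets}

print(adjacency)
print(sorted(all_nodes))
```

In this file every node has a line of its own, so the number of lines equals the number of nodes. A node that appeared only as a target would be missed by a plain line count, which is why the sketch also gathers the targets.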

In [19]:
# Key/value pairs
links = links.map(lambda x: (x.split(' ')[0], x.split(' ')[1:]))
print('links:', links.collect())
 
# Find node count
N = links.count()
print('# of nodes:', N)

# Create and initialize the ranks
ranks = links.map(lambda node: (node[0], 1.0 / N))
print('ranks:', ranks.collect())

links: [('A', ['B']), ('B', ['A', 'C']), ('C', ['B', 'D']), ('D', ['C'])]
# of nodes: 4
ranks: [('A', 0.25), ('B', 0.25), ('C', 0.25), ('D', 0.25)]


In [21]:
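num_iterations = 10
for _ in range(num_iterations):
    # Each node sends rank / (number of out-links) to every node it links to
    contributions = links.join(ranks).flatMap(
        lambda pair: [(target, pair[1][1] / len(pair[1][0])) for target in pair[1][0]]
    )
    # A node's new rank is the sum of the contributions it receives
    ranks = contributions.reduceByKey(lambda x, y: x + y)
    print(ranks.sortByKey().collect())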

[('A', 0.125), ('B', 0.375), ('C', 0.375), ('D', 0.125)]
[('A', 0.1875), ('B', 0.3125), ('C', 0.3125), ('D', 0.1875)]
[('A', 0.15625), ('B', 0.34375), ('C', 0.34375), ('D', 0.15625)]
[('A', 0.171875), ('B', 0.328125), ('C', 0.328125), ('D', 0.171875)]
[('A', 0.1640625), ('B', 0.3359375), ('C', 0.3359375), ('D', 0.1640625)]
[('A', 0.16796875), ('B', 0.33203125), ('C', 0.33203125), ('D', 0.16796875)]
[('A', 0.166015625), ('B', 0.333984375), ('C', 0.333984375), ('D', 0.166015625)]
[('A', 0.1669921875), ('B', 0.3330078125), ('C', 0.3330078125), ('D', 0.1669921875)]
[('A', 0.16650390625), ('B', 0.33349609375), ('C', 0.33349609375), ('D', 0.16650390625)]
[('A', 0.166748046875), ('B', 0.333251953125), ('C', 0.333251953125), ('D', 0.166748046875)]
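
The run above starts from uniform ranks. The graph in `links.txt` is a chain in which every link is reciprocated, so rank moves back and forth between the groups {A, C} and {B, D}. The plain-Python sketch below (the names `update`, `run` and `damping` are illustrative, and the damping value 0.85 is a commonly used choice assumed here, not something the notebook sets) suggests that the undamped update does not settle when all rank starts on A, while a damped update reaches the same ranks from both starting points.

```python
chain_links = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}


def update(links, ranks, damping=1.0):
    """One PageRank round; damping=1.0 gives the undamped update."""
    n = len(links)
    # Every node receives a base share of (1 - damping) / n
    new_ranks = {node: (1.0 - damping) / n for node in links}
    for node, targets in links.items():
        share = damping * ranks[node] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


def run(links, ranks, rounds, damping=1.0):
    """Apply `update` a fixed number of times and return the final ranks."""
    for _ in range(rounds):
        ranks = update(links, ranks, damping)
    return ranks


uniform = {node: 0.25 for node in chain_links}
all_on_a = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}

# Undamped, all rank on A: even rounds leave B and D empty, odd rounds A and C
undamped_even = run(chain_links, all_on_a, 100)
undamped_odd = run(chain_links, all_on_a, 101)
print("undamped, round 100:", undamped_even)
print("undamped, round 101:", undamped_odd)

# Damped: both starting points end up at the same ranks
damped_uniform = run(chain_links, uniform, 200, damping=0.85)
damped_skewed = run(chain_links, all_on_a, 200, damping=0.85)
gap = max(abs(damped_uniform[k] - damped_skewed[k]) for k in chain_links)
print("largest difference between damped runs:", gap)
```

The uniform start used in the notebook happens to put equal rank on both groups, which is why the undamped loop above still converges.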


In [27]:
# Sort the nodes by rank, highest first
ranks.sortBy(lambda x: x[1], ascending=False).collect()

[('B', 0.333251953125),
 ('C', 0.333251953125),
 ('A', 0.166748046875),
 ('D', 0.166748046875)]
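
These values are close to 1/3 and 1/6. Because every link in `links.txt` is reciprocated, ranks proportional to each node's number of links are left unchanged by the update, and that can be checked exactly with fractions. The sketch below is plain Python with illustrative names (`chain_links`, `candidate`, `updated`) that are not part of the notebook.

```python
from fractions import Fraction

chain_links = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

# Candidate ranks: each node's out-link count divided by the total link count
total_links = sum(len(targets) for targets in chain_links.values())
candidate = {
    node: Fraction(len(targets), total_links)
    for node, targets in chain_links.items()
}

# Apply one round of the update with exact arithmetic
updated = {node: Fraction(0) for node in chain_links}
for node, targets in chain_links.items():
    for target in targets:
        updated[target] += candidate[node] / len(targets)

print({node: str(rank) for node, rank in candidate.items()})
print(updated == candidate)  # True: the candidate ranks are a fixed point
```

This shortcut relies on the links being reciprocated; for a general directed graph the ranks have to be found by iteration or by solving the eigenvector problem.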