# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
## Case Study : PageRank on MapReduce (ungraded)

## Learning Objectives


At the end of the case study, you will be able to

* Obtain a brief overview of MapReduce
* Understand the concept of Page Rank used by Google
* Implement the Page Rank algorithm using the MapReduce approach:
  * Perform transformation operations using MapReduce
  * Calculate the rank of a webpage using PySpark and MapReduce

## Information

**Parallel Processing:-** A technique for dealing with massive data, popularly known as Big Data. The idea is to reduce the run time, cost, and memory constraints of working with such large volumes of data.

In this process, a tool breaks the task down into multiple parts, and each part is assigned to a different computational unit for processing.
Once the units have finished their computations, the partial results are reassembled into a final solution.
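
The split, process, and reassemble pattern described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `split_into_chunks` and `parallel_sum_of_squares` are made up for this example, and threads are used simply to keep the sketch self-contained (for CPU-heavy work a real setup would more likely use separate processes or a cluster).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Break the task into roughly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_sum_of_squares(numbers, n_workers=4):
    """Sum x*x over numbers by letting each worker handle one chunk."""
    chunks = split_into_chunks(numbers, n_workers)
    # Each chunk is handed to a different worker for processing.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(lambda chunk: sum(x * x for x in chunk), chunks))
    # The partial results are reassembled into the final solution.
    return sum(partial_sums)


numbers = list(range(1, 1001))
total = parallel_sum_of_squares(numbers)
print(total)
```

The result is the same as a plain sequential sum; only the way the work is divided changes.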

**Map Reduce:-** This is a programming model that allows the user to perform parallel processing on Big data.

To read more about MapReduce [Click Here](https://taylankabbani96.medium.com/mapreduce-programming-model-a7534aca599b)
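
To make the model concrete, here is a minimal single-machine sketch of the three MapReduce phases, using word counting as the task. The function names `map_phase`, `shuffle_phase`, and `reduce_phase` are illustrative and not part of any library. On a real cluster the map and reduce calls would run on different machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


lines = ["big data needs parallel processing", "MapReduce processes big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(word_counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over many workers.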

### **Page Rank**

**A brief history:-** Early search engines crawled the web and built an inverted index of all the terms found on each page. When queried, such an engine returned the pages containing the term the user entered, ranked by how often that term occurred on each page.
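
A rough sketch of such a term-frequency search, assuming three tiny made-up pages, might look like this. The page names, their text, and the helpers `build_inverted_index` and `search` are all invented for illustration.

```python
from collections import Counter, defaultdict


def build_inverted_index(pages):
    """Map each term to {page name: number of occurrences on that page}."""
    index = defaultdict(dict)
    for name, text in pages.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][name] = count
    return index


def search(index, term):
    """Return the pages containing the term, most occurrences first."""
    hits = index.get(term.lower(), {})
    return sorted(hits, key=lambda name: (-hits[name], name))


pages = {
    "reviews": "movie reviews and movie news",
    "cinema": "local cinema movie times",
    "shop": "movie movie movie movie cheap clothes",  # repeats one term many times
}
index = build_inverted_index(pages)
print(search(index, "movie"))
```

With this ranking rule the page that repeats the term most often comes first, regardless of what the page is actually about.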

The naivety of this simple approach gave spammers an opportunity to exploit search engines and lead people to their (spam) pages.

  - **For example:-** A spammer would repeat an irrelevant term, say "movie", many times on a page that actually sells clothes.
To keep the term "movie" from being visible on the spam page, they could give it the same color as the background.
Using this technique, spammers managed to push their pages to the top of the search results. This paralyzed web search engines and made them almost useless.

  To prevent those spam pages from having so much influence, Google proposed **Page Rank** as a way to define the importance of a page. It is fair to say that this algorithm is what made **Google** such a powerful search engine at the time.

### What is Page Rank?

**Page rank or PR** is a recursive algorithm developed by Google's founders, Larry Page and Sergey Brin, that assigns a real number to each page on the web so that pages can be ranked by this score.
The higher the score of a page, the more important the page is considered to be.

**Importance:-** The importance score of a page depends directly on the number of other pages pointing to it, and on how important those pages are themselves.

To understand the concept of pointing: if a link to a movie page appears on many other pages, those pages are said to be pointing to, or voting for, that movie page, which raises its importance.

**Understanding "Importance" with the help of an example:-**


How would you decide whether a person on Twitter is important?
  1. First, check the number of followers: the more followers a person has, the more likely that person is to be famous.
  2. Second, check the importance of those followers: if, for example, the president follows that person, their importance is higher.

### **Page Rank Algorithm**

Consider a tiny version of the web with only five pages, which we want to rank by importance using the Page Rank algorithm. In the graph below, nodes represent pages and arrows represent links between pages.


To calculate the importance score of page C, let $R$ denote the importance score of a page. The importance score of page C is then:
$R_{C} = R_{B}/3 + R_{A}/4$

Page C’s own importance is the sum of the votes on its in-links: if page $A$ with importance $R_{A}$ has $n$ out-links, each link carries $R_{A}/n$ votes. The denominators 3 and 4 above are therefore the out-link counts of pages B and A.

![img](https://miro.medium.com/max/526/1*Mgnh6M7DUhJIuO1_btIU1A.png)
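
As a worked illustration of the formula for page C, assume (purely for the arithmetic) that pages A and B both still hold the starting rank of 1/5, that is, 1/(number of pages) for the five-page graph:

$$R_C = \frac{R_B}{3} + \frac{R_A}{4} = \frac{1/5}{3} + \frac{1/5}{4} = \frac{1}{15} + \frac{1}{20} = \frac{7}{60} \approx 0.117$$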

**Steps Involved**

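1. Initialize each page’s importance, or rank, to 1/(number of pages).
2. Update each page’s rank according to the following formula, where the sum runs over every page $i$ that links to page $j$, $R_i$ is the current rank of page $i$, and $d_i$ is the number of out-links of page $i$:
   $$R_j = \sum_{i \to j} \frac{R_i}{d_i}$$

3. Repeat step 2 until the page ranks stabilize.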
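
The three steps can be sketched in plain Python before bringing in Spark. This is a minimal illustration rather than a production implementation: the function name `pagerank`, the tolerance `tol`, and the cap `max_iter` are choices made for this example, and the sketch assumes every page has at least one out-link. The graph used is the four-page one that appears later in this case study.

```python
def pagerank(out_links, tol=1e-10, max_iter=1000):
    """Iterate the basic PageRank update until the ranks stop changing.

    out_links maps each page to the list of pages it links to.
    Assumes every page appears as a key and has at least one out-link.
    """
    pages = list(out_links)
    # Step 1: every page starts with rank 1 / (number of pages).
    ranks = {page: 1.0 / len(pages) for page in pages}
    for _ in range(max_iter):
        # Step 2: page i sends R_i / d_i along each of its d_i out-links.
        new_ranks = {page: 0.0 for page in pages}
        for page, targets in out_links.items():
            share = ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        # Step 3: stop once no rank has moved by more than tol.
        change = max(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if change < tol:
            break
    return ranks


graph = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"], "D": ["A"]}
ranks = pagerank(graph)
for page, rank in sorted(ranks.items()):
    print(page, round(rank, 4))
```

For this graph the ranks settle near A = D ≈ 0.308, C ≈ 0.231, and B ≈ 0.154, which are the values 4/13, 3/13, and 2/13 that satisfy the update formula exactly.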

### Implementation of Page Rank using MapReduce


The number of pages on the web is enormous, and using a simple approach to repeatedly update the ranks of millions of pages would be very expensive and time-consuming. MapReduce tackles the problem by taking advantage of running on a cluster (parallelization).

### Installing the pyspark package

In [None]:
!pip install pyspark

### Creating a spark session

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf  # User Defined Functions
from pyspark.sql.types import StringType
spark = SparkSession.builder.appName('MapReduce').getOrCreate()
spark

In [None]:
# Accessing sparkContext from sparkSession instance.
sc = spark.sparkContext

### Creating links.txt file

The file uses an adjacency list representation, where a line ‘A B C’ means that node (page) A has out-links to B and C.

Consider this a mini internet in which A, B, C, and D are web pages.

![img](https://miro.medium.com/max/344/1*7C9YQPLxjVk_oGlnOq185w.png)
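
Before handing the file to Spark, it may help to see how one adjacency-list line breaks down. The sketch below uses a hypothetical helper, `parse_adjacency_line`, and assumes every line is non-empty. It also derives the in-links of each page, which the file does not store directly.

```python
def parse_adjacency_line(line):
    """Split 'A B C' into ('A', ['B', 'C']): the page, then its out-links."""
    page, *out_links = line.split()
    return page, out_links


lines = ["A B C", "B C D", "C D", "D A"]
graph = dict(parse_adjacency_line(line) for line in lines)
print(graph)

# In-links can be recovered by reversing every edge of the graph.
in_links = {page: [] for page in graph}
for page, targets in graph.items():
    for target in targets:
        in_links.setdefault(target, []).append(page)
print(in_links)
```

The in-link view shows who votes for whom: D, for instance, receives votes from B and C.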

In [None]:
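# Adjacency list: one line per page, the page name followed by its out-links
lst = ["A B C", "B C D", "C D", "D A"]          # links

# Create (or overwrite) the text file and write one line per page.
# The with-block closes the file automatically when it ends.
with open("links.txt", "w") as f:
    for line in lst:
        f.write(line + "\n")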

### Reopen the links.txt file and read the contents

In [None]:
# Read the file back to check its contents; the with-block closes it afterwards
with open("links.txt", "r") as f:
    contents = f.read()
    print(contents)

In [None]:
# Adjacency list
links = sc.textFile('links.txt')
links.collect()

In [None]:
# Key/value pairs: the key is the name of the page and the value consists of out-links from the page
links = links.map(lambda x: (x.split(' ')[0], x.split(' ')[1:]))
print(links.collect())

# Find node count
N = links.count()
print(N)

# Initialize PageRank values (ranks) as 1/(number of pages).
ranks = links.map(lambda node: (node[0], 1.0 / N))
print(ranks.collect())

### Perform Map and Reduce steps

In [None]:
# Map: for each node i, compute the vote for each of its out-links and propagate it to the adjacent nodes
# Reduce: for each node, sum the incoming votes and update its rank value
# Repeat this map-reduce step a fixed number of times so that the rank values can converge

ITERATIONS = 20
for iteration in range(ITERATIONS):
    # Join graph info with rank info: each element is (page, (out_links, rank)).
    # Every out-link receives rank / (number of out-links) as its vote.
    votes = links.join(ranks).flatMap(
        lambda x: [(dest, x[1][1] / len(x[1][0])) for dest in x[1][0]]
    )
    # Add up the votes arriving on all incoming edges to get the new ranks
    ranks = votes.reduceByKey(lambda x, y: x + y)
    print(ranks.sortByKey().collect())
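
To see what a single pass of the loop computes, the same map and reduce logic can be traced in plain Python, without Spark, for the first iteration. This is only a sketch that mirrors the Spark code with dictionaries. It redefines `links` and `ranks` locally so that it runs on its own.

```python
from collections import defaultdict

links = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"], "D": ["A"]}
ranks = {page: 1.0 / len(links) for page in links}  # 0.25 for each page

# Map: every page emits (out-link, rank / number of out-links) pairs,
# which is what the flatMap over the joined (links, ranks) data produces.
votes = [
    (dest, ranks[page] / len(out_links))
    for page, out_links in links.items()
    for dest in out_links
]

# Shuffle + reduce: group the votes by destination page and add them up,
# mirroring reduceByKey(lambda x, y: x + y).
new_ranks = defaultdict(float)
for dest, vote in votes:
    new_ranks[dest] += vote

print(sorted(new_ranks.items()))
# [('A', 0.25), ('B', 0.125), ('C', 0.25), ('D', 0.375)]
```

After this first step D is already ahead, because it collects half of B's rank plus all of C's rank, and the total rank across all pages still sums to 1.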

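* Of the four dummy web pages under consideration, D and A end up with the highest ranks: D collects votes from both B and C and passes its whole rank on to A, so the two values settle toward the same limit of 4/13 ≈ 0.31.
* Web page B has the lowest rank (about 2/13 ≈ 0.15), because its only in-link carries half of A's rank.
* By this measure, web page D is a more trustworthy source of information than web page B.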

**References**:
1. Taylan Kabbani, 2020. PageRank on MapReduce. *Medium*.