An implementation of the PageRank algorithm in Hadoop MapReduce


An implementation of the PageRank algorithm using Hadoop (Java). It was tested using the following inputs (available in the input folder):

  • pagerank_data.txt - A testing example
  • - A dataset that can be found here.

Input Format

The input used in this implementation is formatted as follows. Note: each line consists of two values separated by a space.

  • First line: NodesCount EdgesCount (e.g. "5 9")
  • NC next lines: NodeID NodeURL (e.g. "1 url1")
  • EC next lines: NodeID OutlinkToNodeID (e.g. "1 2")
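
For example, a hypothetical input file for a graph of 3 nodes and 4 edges (the URLs are made up for illustration) would look like this:

```
3 4
1 url1
2 url2
3 url3
1 2
1 3
2 3
3 1
```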

Step 1

  • Mapper: Reads the initial input file, ignores the first line and the NC next lines, and outputs the EC edges as <IntWritable, IntWritable> pairs (e.g. <1, 2>).
  • Reducer: Receives the EC edges (each node with an Iterable of its outlinks) and outputs, for each node, a comma-separated list of its outlinks prefixed with an initial page rank and a tab: <IntWritable, Text> (e.g. <1, "1.0 3,2">).
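
The effect of Step 1 can be sketched in plain Java without the Hadoop framework (class and method names here are illustrative, not taken from the repository):

```java
import java.util.*;

// A plain-Java sketch of what Step 1 computes: parse the edge lines and
// emit, for each node, "1.0<TAB>outlink1,outlink2,...".
class Step1Sketch {

    static Map<Integer, String> step1(List<String> lines) {
        String[] header = lines.get(0).trim().split("\\s+");
        int nodeCount = Integer.parseInt(header[0]);
        int edgeCount = Integer.parseInt(header[1]);

        // "Mapper": skip the header and the NC node lines, group <src, dst> per edge.
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < edgeCount; i++) {
            String[] edge = lines.get(1 + nodeCount + i).trim().split("\\s+");
            grouped.computeIfAbsent(Integer.parseInt(edge[0]), k -> new ArrayList<>())
                   .add(Integer.parseInt(edge[1]));
        }

        // "Reducer": prefix each node's outlink list with the initial rank 1.0.
        Map<Integer, String> out = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : grouped.entrySet()) {
            StringBuilder sb = new StringBuilder("1.0\t");
            for (int j = 0; j < e.getValue().size(); j++) {
                if (j > 0) sb.append(',');
                sb.append(e.getValue().get(j));
            }
            out.put(e.getKey(), sb.toString());
        }
        return out;
    }
}
```

In the real job the grouping of edges by source node is done by the shuffle & sort phase; the sketch only shows the resulting key/value shape.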

Step 2

This is the most important step in the process. It is run multiple times, since its output has the same format as the output of the first step.

  • Mapper: Reads the last output generated (whether from the first step or from a previous iteration) and, for each [node, rank, "outlink1,outlink2,..."], outputs:
    • For each outlink, the outlink and rank / len(outlinks): <Text, Text> (e.g. <"3", "0.5">).
    • The node and its outlinks prefixed with an open bracket: <Text, Text> (e.g. <"1", "[3,2">).
  • Reducer: Receives for each node a list of values (each value is either a received page rank or the node's outlink list), calculates the new page rank using the formula (1 - d) + (d * sum(pageRanks)) and outputs the same structure as the first step: <Text, Text> (e.g. <"1", "0.7875 3,2">).
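
As a worked check of the formula (the class and parameter names are mine, not the repository's): with d = 0.85, a node whose incoming contributions sum to 0.75 gets rank (1 - 0.85) + 0.85 * 0.75 = 0.7875, matching the example above.

```java
// Sketch of the rank update formula used by the Step 2 reducer.
class RankFormulaSketch {
    // (1 - d) + d * (sum of incoming rank contributions)
    static double newRank(double dampingFactor, double contributionSum) {
        return (1 - dampingFactor) + dampingFactor * contributionSum;
    }
}
```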

This can be hard to follow in prose, so here is the same logic in pseudo-code:

Map(Offset, Text):
	node, rank, outlinks = Parse(Text)
	if outlinks == None:
		return

	for (outlink in outlinks):
		Write(outlink, rank / len(outlinks))

	Write(node, '[' + outlinks)

Reduce(Node, Text[]):
	outlinks = []
	totalRank = 0

	for (text in Text):
		if text.startswith('['):
			outlinks = text[1:]
		else:
			totalRank += float(text)

	totalRank = (1 - 0.85) + (0.85 * totalRank)
	Write(Node, totalRank + '\t' + outlinks)

This step is run X times (X must be given in the arguments) or until a minimum score is met. The score is calculated as ∑ᵢ |lastRanks[i] - newRanks[i]|.
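
The stopping criterion can be sketched as follows (a minimal plain-Java sketch; the method name is illustrative): the iterations stop once this sum of absolute differences between successive rank vectors falls below the given minimum (e.g. 0.01).

```java
// Sketch of the convergence score: L1 distance between successive rank vectors.
class ConvergenceSketch {
    static double rankDifference(double[] lastRanks, double[] newRanks) {
        double sum = 0.0;
        for (int i = 0; i < lastRanks.length; i++) {
            sum += Math.abs(lastRanks[i] - newRanks[i]);
        }
        return sum;
    }
}
```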

Step 3

This step only needs a mapper and the shuffle & sort phase to print the rankings, but a Reducer can be added to, for example, print only the top N ranked pages.

  • Setup: Before mapping, we read the input data (the NC node lines) and fill a HashMap mapping each node to its corresponding URL. This is simply so the ranking can print the URLs of the pages instead of just the node IDs.
  • Mapper: Receives the last ranks output (node, rank, outlinks) and outputs, for each node: <FloatWritable, Text> (e.g. <1.6375, "">). The rank is written as the key so that the shuffle & sort phase (using a custom SortComparator) sorts the entries in descending order (thus no Reducer is needed).
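
The effect of the descending sort can be sketched in plain Java (names are illustrative; the real job gets this ordering for free from shuffle & sort with a custom descending comparator):

```java
import java.util.*;

// Sketch of Step 3's output: (rank, url) lines sorted by descending rank,
// mirroring what a custom SortComparator achieves during shuffle & sort.
class Step3Sketch {
    static List<String> ranking(Map<String, Double> rankByUrl) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(rankByUrl.entrySet());
        // Descending by rank, like the custom SortComparator in the real job.
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries) {
            lines.add(e.getValue() + "\t" + e.getKey());
        }
        return lines;
    }
}
```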


Usage

  1. Download the sources.
  2. Compile to a .jar file.
  3. Create an input folder and put pagerank_data.txt in there.
  4. Use the following command hadoop jar File.jar PageRank /input /output /input/pagerank_data.txt 0.85 5 0.01 true true:
    • /input: The input folder
    • /output: The output folder
    • /input/pagerank_data.txt: The data file (used to get the urls in step 3)
    • 0.85: The damping factor.
    • 5: The maximum number of iterations.
    • 0.01: The criterion to break from the iterations (minimum difference).
    • true: Delete the output folder before starting (if found).
    • true: Show the results at the end (/ranking/part-r-00000 file).