PageRank

An implementation of the PageRank algorithm using Hadoop (Java). It was tested using the following inputs (available in the input folder):

  • pagerank_data.txt - A testing example
  • hollins.data - A dataset that can be found here.

Input Format

The input format used in this implementation is as follows. Note: each line contains two values separated by a space.

  • First line: NodesCount EdgesCount (e.g. "5 9")
  • Next NC (NodesCount) lines: NodeID NodeURL (e.g. "1 http://example.com")
  • Next EC (EdgesCount) lines: NodeID OutlinkToNodeID (e.g. "1 2")
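
For illustration, a purely hypothetical input file for a graph with 3 nodes and 4 edges (the URLs and edges below are made up, not taken from the provided data files) would look like this:

3 4
1 http://example.com/a
2 http://example.com/b
3 http://example.com/c
1 2
1 3
2 3
3 1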

Step 1

  • Mapper: Reads the initial input file, ignores the first line and the next NC lines, and outputs the EC edges as <IntWritable, IntWritable> pairs (e.g. <1, 2>).
  • Reducer: Receives the edges grouped by node (each node and an Iterable of its outlinks) and outputs, for each node, a comma-separated list of its outlinks prefixed with an initial page rank: <IntWritable, Text> (e.g. <1, "1.0 3,2">), with the key and value separated by a tab.
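
A minimal Hadoop (Java) sketch of this step could look like the following; the class and method names are illustrative and not necessarily those used in this repository:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GraphBuilder {

    // Emits <sourceNode, targetNode> for every edge line.
    public static class EdgeMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Skip the "NodesCount EdgesCount" header line, which sits at byte offset 0 of the file.
            if (offset.get() == 0) {
                return;
            }
            String[] parts = line.toString().trim().split("\\s+");
            if (parts.length < 2) {
                return;
            }
            try {
                int from = Integer.parseInt(parts[0]);
                int to = Integer.parseInt(parts[1]);
                context.write(new IntWritable(from), new IntWritable(to));
            } catch (NumberFormatException e) {
                // A "NodeID NodeURL" line: the second token is not an integer, so ignore it.
            }
        }
    }

    // Joins each node with its outlinks and prefixes the initial rank,
    // producing lines like "1<TAB>1.0 3,2".
    public static class OutlinkReducer extends Reducer<IntWritable, IntWritable, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable node, Iterable<IntWritable> outlinks, Context context)
                throws IOException, InterruptedException {
            StringBuilder value = new StringBuilder("1.0 ");
            String separator = "";
            for (IntWritable outlink : outlinks) {
                value.append(separator).append(outlink.get());
                separator = ",";
            }
            context.write(node, new Text(value.toString()));
        }
    }
}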

Step 2

This is the most important step in the process. It is run multiple times: its output has the same format as the output of the first step, so each iteration can read the previous iteration's output as its input.

  • Mapper: Reads the last output generated (whether from the first step or from a previous iteration) and, for each line [node, rank, "outlink1,outlink2,..."], outputs:
    • For each outlink, the outlink and rank / len(outlinks): <Text, Text> (e.g. <"3", "0.5">).
    • The node and its outlinks prefixed with an opening bracket: <Text, Text> (e.g. <"1", "[3,2">).
  • Reducer: Receives, for each node, a list of values (each value is either a partial page rank contribution or the node's outlink list), calculates the page rank using the formula (1 - d) + (d * sum(pageRank)), and outputs the same structure as the first step: <Text, Text> (e.g. <"1", "0.7875 3,2">).

This can be hard to follow in prose, so here is the pseudo-code:

Map(Offset, Text):
    node, rank, outlinks = Parse(Text)

    if outlinks == None:
        return

    for outlink in outlinks:
        Write(outlink, rank / len(outlinks))

    Write(node, '[' + outlinks)

Reduce(Node, Values[]):
    outlinks = []
    totalRank = 0

    for value in Values:
        if value.startswith('['):
            outlinks = value[1:]
        else:
            totalRank += ParseFloat(value)

    totalRank = (1 - 0.85) + (0.85 * totalRank)
    Write(Node, totalRank + '\t' + outlinks)
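
Translated into Hadoop (Java), a sketch of this step might look like this (names are illustrative; the damping factor is hard-coded to 0.85 here, whereas the actual implementation takes it as an argument):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RankIteration {

    public static class RankMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Each input line is "node rank outlink1,outlink2,..." (tab- or space-separated).
            String[] parts = line.toString().trim().split("\\s+");
            if (parts.length < 3) {
                return; // Node without outlinks: nothing to distribute.
            }
            String node = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String[] outlinks = parts[2].split(",");

            // Give each outlink an equal share of this node's current rank.
            for (String outlink : outlinks) {
                context.write(new Text(outlink), new Text(String.valueOf(rank / outlinks.length)));
            }
            // Pass the adjacency list through, marked with an opening bracket.
            context.write(new Text(node), new Text("[" + parts[2]));
        }
    }

    public static class RankReducer extends Reducer<Text, Text, Text, Text> {
        private static final double DAMPING = 0.85;

        @Override
        protected void reduce(Text node, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String outlinks = "";
            double totalRank = 0.0;
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("[")) {
                    outlinks = v.substring(1); // The adjacency list.
                } else {
                    totalRank += Double.parseDouble(v); // A rank contribution.
                }
            }
            totalRank = (1 - DAMPING) + (DAMPING * totalRank);
            context.write(node, new Text(totalRank + " " + outlinks));
        }
    }
}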

This step is run X times (X must be given in the arguments) or until a minimum score is met. The score is calculated as ∑ᵢ |lastRanks[i] - newRanks[i]|.
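
One way the driver could evaluate this criterion between iterations (a sketch under the assumption that the driver simply re-reads both outputs from HDFS; the repository may compute it differently, e.g. with counters) is:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Convergence {

    // Reads "node<TAB>rank outlinks" lines from every part file in a folder
    // and returns a node -> rank map.
    static Map<String, Double> readRanks(FileSystem fs, Path folder) throws IOException {
        Map<String, Double> ranks = new HashMap<>();
        for (FileStatus status : fs.listStatus(folder)) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue;
            }
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.trim().split("\\s+");
                    if (parts.length >= 2) {
                        ranks.put(parts[0], Double.parseDouble(parts[1]));
                    }
                }
            }
        }
        return ranks;
    }

    // Returns sum_i |lastRanks[i] - newRanks[i]|.
    static double score(Map<String, Double> lastRanks, Map<String, Double> newRanks) {
        double score = 0.0;
        for (Map.Entry<String, Double> entry : newRanks.entrySet()) {
            double last = lastRanks.getOrDefault(entry.getKey(), 0.0);
            score += Math.abs(last - entry.getValue());
        }
        return score;
    }
}

The driver would then stop iterating once score(lastRanks, newRanks) drops below the minimum difference given as an argument.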

Step 3

This step basically only needs a mapper and the shuffle & sort phase to print the rankings, but a Reducer can be added to, for example, print only the top N ranked pages.

  • Setup: Before mapping, the input data file (the NC node lines) is read to fill a HashMap mapping each node to its URL; this is simply so that the ranking can print the pages' URLs instead of just their node IDs.
  • Mapper: Receives the last ranks output (node, rank, outlinks) and outputs, for each node, <FloatWritable, Text> (e.g. <1.6375, "http://www.hollins.edu/">). The rank is written as the key so that the shuffle & sort phase (using a custom SortComparator, sketched below) sorts the entries in descending order, removing the need for a Reducer.
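
The descending order can be achieved with a comparator that inverts FloatWritable's natural ordering; a sketch (the class name is made up):

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Inverts FloatWritable's natural ordering so that the highest ranks come first.
public class DescendingFloatComparator extends WritableComparator {

    public DescendingFloatComparator() {
        super(FloatWritable.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -((FloatWritable) a).compareTo((FloatWritable) b);
    }
}

It would be enabled on the job with job.setSortComparatorClass(DescendingFloatComparator.class).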

Testing

  1. Download the sources.
  2. Compile to a .jar file.
  3. Create an input folder and put pagerank_data.txt in there.
  4. Use the following command: hadoop jar File.jar PageRank /input /output /input/pagerank_data.txt 0.85 5 0.01 true true, where:
    • /input: The input folder.
    • /output: The output folder.
    • /input/pagerank_data.txt: The data file (used to get the URLs in step 3).
    • 0.85: The damping factor.
    • 5: The maximum number of iterations.
    • 0.01: The criterion to break from the iterations (minimum difference).
    • true: Delete the output folder before starting (if found).
    • true: Show the results at the end (/ranking/part-r-00000 file).