The program implements an iterative PageRank calculation on a graph with Hadoop. The input data is a directed network (not HTML documents). The computation has two MapReduce phases; the second phase is iterative.
A bash script `pr` is provided. Set `$HADOOP_INSTALL` to the Hadoop installation directory.

$ ./pr 1
# run phase 1; input folder: graph_input, output folder: phase1_output
$ ./pr 2 [iteration]
# run phase 2 for [iteration] iterations; input folder: phase2_output_[i], output folder: phase2_output_[i+1]
Overall: the ranking of a page is determined by its estimated importance, derived from the link structure rather than from the page's content. PageRank is a variation of eigenvector centrality.
In each iteration, a page's PageRank is divided by its number of outbound links L(j), so each linked page receives a share PR(j)/L(j). The resulting values form a probability distribution: the bigger the PageRank, the more important the page. The damping factor is d = 0.85 (see any introduction to PageRank for background).
PageRank convergence: on a graph with loops the values may never converge exactly; usually 10 iterations are enough.
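For example, on the four-node graph used in the walkthrough below, with all ranks initialized to 1.0, D is linked from A (1 outbound link) and C (2 outbound links), so its first damped update is:

[new pr of D] = d * (1.0/1 + 1.0/2) + (1 - d) = 0.85 * 1.5 + 0.15 = 1.425

(The phase 2 walkthrough below lists the raw sum, 1.5 for D, i.e. the value before the damping formula is applied.)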
Phase 1 is simplified by replacing web pages with a network (edge list) file.
Each input line is a tab-separated edge, which the phase 1 map turns into a (source, target) pair:

A\tD  ->  (A, D)
B\tC  ->  (B, C)
C\tA  ->  (C, A)
D\tB  ->  (D, B)
C\tD  ->  (C, D)
The phase 1 reduce groups the pairs by source node and attaches the initial PageRank 1.0:

(A, D)          ->  (A, (1.0:D))
(B, C)          ->  (B, (1.0:C))
(C, A), (C, D)  ->  (C, (1.0:A,D))
(D, B)          ->  (D, (1.0:B))
PageRank results are usually presented as a probability distribution. In a MapReduce program, however, there is no global knowledge of the number of nodes in the graph, so all PageRank values are initialized to 1.0 instead of 1/N.
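A minimal sketch of what such a phase 1 job could look like in Java (the class names and the textual record format `node\t(pr:links)` are assumptions for illustration, not necessarily this repository's actual code):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Phase1 {
    // Map: split a tab-separated edge "A\tD" into the pair (A, D).
    public static class EdgeMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] edge = value.toString().split("\t");
            ctx.write(new Text(edge[0]), new Text(edge[1]));
        }
    }

    // Reduce: collect a node's out-links and attach the initial PageRank 1.0,
    // e.g. (C, A) and (C, D) become (C, (1.0:A,D)).
    public static class InitReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text node, Iterable<Text> targets, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder links = new StringBuilder();
            for (Text t : targets) {
                if (links.length() > 0) links.append(",");
                links.append(t);
            }
            ctx.write(node, new Text("(1.0:" + links + ")"));
        }
    }
}
```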
The phase 2 map splits each node's PageRank evenly over its outbound links; a value pr:L stands for a contribution of pr/L to the target node:

(A, (1.0:D))    ->  (D, 1:1)
(B, (1.0:C))    ->  (C, 1:1)
(C, (1.0:A,D))  ->  (A, 1:2), (D, 1:2)
(D, (1.0:B))    ->  (B, 1:1)

The map also re-emits the graph structure itself:

**(A, D)**
**(B, C)**
**(C, AD)**
**(D, B)**
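A sketch of the phase 2 map under the same assumed record format; the leading `!` marker for structure records is likewise an assumption:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: read "C\t(1.0:A,D)", send each target a contribution "pr:L"
// (i.e. pr / L), and re-emit the out-link list so the structure survives.
public class RankMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        String body = parts[1].substring(1, parts[1].length() - 1); // "1.0:A,D"
        String pr = body.split(":")[0];
        String[] targets = body.split(":")[1].split(",");
        for (String t : targets) {
            ctx.write(new Text(t), new Text(pr + ":" + targets.length));
        }
        // structure record, marked with "!" (an assumed convention)
        ctx.write(new Text(parts[0]), new Text("!" + body.split(":")[1]));
    }
}
```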
The phase 2 reduce sums the contributions for each node (e.g. D receives 1/1 from A and 1/2 from C, giving 1.5) and reattaches the node's out-link list from the structure records (A, D), (B, C), (C, AD), (D, B):

(A, 1:2)            ->  (A, (0.5:D))
(B, 1:1)            ->  (B, (1.0:C))
(C, 1:1)            ->  (C, (1.0:A,D))
(D, 1:1), (D, 1:2)  ->  (D, (1.5:B))
In the map step the graph structure is preserved, so the output of one iteration can serve directly as the input of the next. Updating PageRank:

[new pr] = d * [sum of incoming pr/L contributions] + (1 - d), where d is the damping value (0.85)
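Continuing the sketch, the phase 2 reduce could apply this formula as follows (the `!` marker matches the mapper sketch above):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce: sum the incoming pr/L contributions, apply the damping formula,
// and reattach the node's out-link list from the structure record.
public class RankReducer extends Reducer<Text, Text, Text, Text> {
    private static final double D = 0.85; // damping value

    @Override
    protected void reduce(Text node, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        double sum = 0.0;
        String links = "";
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("!")) {
                links = s.substring(1);            // structure record
            } else {
                String[] c = s.split(":");         // contribution "pr:L"
                sum += Double.parseDouble(c[0]) / Integer.parseInt(c[1]);
            }
        }
        double newPr = D * sum + (1 - D);          // [new pr] = d * sum + (1 - d)
        ctx.write(node, new Text("(" + newPr + ":" + links + ")"));
    }
}
```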
Sort by PageRank value.
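One possible way to do this with another small job (again only a sketch, not this repository's actual code): swap each record to (PageRank, node) and let Hadoop's shuffle sort by the numeric key:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: turn "D\t(1.5:B)" into (-1.5, D); negating the key makes the
// default ascending shuffle sort produce descending PageRank order
// (an identity reducer can flip the sign back when writing output).
public class SortMapper extends Mapper<Object, Text, DoubleWritable, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        double pr = Double.parseDouble(
                parts[1].substring(1, parts[1].indexOf(':')));
        ctx.write(new DoubleWritable(-pr), new Text(parts[0]));
    }
}
```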
The program uses a simple bash script to drive the iterative MapReduce: a new Hadoop job has to be submitted for each iteration (see the driver sketch after the list below). Related systems:
Apache Mahout: restarting needed
PEGASUS: restarting needed
HaLoop: a modified version of Hadoop that supports efficient iterative data processing on large commodity clusters
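For reference, a minimal sketch of such a per-iteration driver loop in Java, using the folder naming from the `pr` script and the hypothetical RankMapper/RankReducer classes sketched above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: one fresh Job per iteration -- this per-job startup cost is
// exactly what systems like HaLoop try to avoid.
public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        int iterations = Integer.parseInt(args[0]);
        for (int i = 0; i < iterations; i++) {
            Job job = Job.getInstance(new Configuration(), "pagerank-" + i);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(RankMapper.class);
            job.setReducerClass(RankReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // phase2_output_0 would be seeded from phase1_output
            FileInputFormat.addInputPath(job, new Path("phase2_output_" + i));
            FileOutputFormat.setOutputPath(job, new Path("phase2_output_" + (i + 1)));
            if (!job.waitForCompletion(true)) System.exit(1);
        }
    }
}
```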