The program implements an iterative PageRank calculation on a graph with Hadoop. The input data is a directed network (not HTML documents). The computation has two MapReduce phases; the second phase is iterative.
A bash script `pr` is provided. Set `$HADOOP_INSTALL` to the Hadoop installation directory.

$ ./pr 1
# run phase 1; input folder: graph_input, output folder: phase1_output
$ ./pr 2 [iteration]
# run phase 2 for [iteration] iterations; input folder: phase2_output_[i], output folder: phase2_output_[i+1]
Overall: the ranking of a page is determined by its estimated importance, derived from the link structure rather than from the page's content. PageRank is a variation of eigenvector centrality.
In each iteration, a page's PageRank is divided by its number of outbound links L(j), so each linked page receives a share PR(j)/L(j). The resulting values form a probability distribution: the bigger the PageRank, the more important the page. The damping factor is d = 0.85 (see any introduction to PageRank for background).
PageRank convergence: on a graph with loops the values may never converge exactly; usually 10 iterations are enough.
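For example, on the four-node graph used in the walkthrough below, with all ranks initialized to 1.0, D is linked from A (1 outbound link) and C (2 outbound links), so its first damped update is:

[new pr of D] = d * (1.0/1 + 1.0/2) + (1 - d) = 0.85 * 1.5 + 0.15 = 1.425

(The phase 2 walkthrough below lists the raw sum, 1.5 for D, i.e. the value before the damping formula is applied.)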
Phase 1 is simplified by replacing web pages with a network (edge list) file.
Each input line is a tab-separated edge, which the phase 1 map turns into a (source, target) pair:

A\tD  ->  (A, D)
B\tC  ->  (B, C)
C\tA  ->  (C, A)
D\tB  ->  (D, B)
C\tD  ->  (C, D)
The phase 1 reduce groups the pairs by source node and attaches the initial PageRank 1.0:

(A, D)          ->  (A, (1.0:D))
(B, C)          ->  (B, (1.0:C))
(C, A), (C, D)  ->  (C, (1.0:A,D))
(D, B)          ->  (D, (1.0:B))
PageRank results are usually presented as a probability distribution. In a MapReduce program, however, there is no global knowledge of the number of nodes in the graph, so all PageRank values are initialized to 1.0 instead of 1/N.
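A minimal sketch of what such a phase 1 job could look like in Java (the class names and the textual record format `node\t(pr:links)` are assumptions for illustration, not necessarily this repository's actual code):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Phase1 {
    // Map: split a tab-separated edge "A\tD" into the pair (A, D).
    public static class EdgeMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] edge = value.toString().split("\t");
            ctx.write(new Text(edge[0]), new Text(edge[1]));
        }
    }

    // Reduce: collect a node's out-links and attach the initial PageRank 1.0,
    // e.g. (C, A) and (C, D) become (C, (1.0:A,D)).
    public static class InitReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text node, Iterable<Text> targets, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder links = new StringBuilder();
            for (Text t : targets) {
                if (links.length() > 0) links.append(",");
                links.append(t);
            }
            ctx.write(node, new Text("(1.0:" + links + ")"));
        }
    }
}
```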
The phase 2 map splits each node's PageRank evenly over its outbound links; a value pr:L stands for a contribution of pr/L to the target node:

(A, (1.0:D))    ->  (D, 1:1)
(B, (1.0:C))    ->  (C, 1:1)
(C, (1.0:A,D))  ->  (A, 1:2), (D, 1:2)
(D, (1.0:B))    ->  (B, 1:1)

The map also re-emits the graph structure itself:

**(A, D)**
**(B, C)**
**(C, AD)**
**(D, B)**
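A sketch of the phase 2 map under the same assumed record format; the leading `!` marker for structure records is likewise an assumption:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: read "C\t(1.0:A,D)", send each target a contribution "pr:L"
// (i.e. pr / L), and re-emit the out-link list so the structure survives.
public class RankMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        String body = parts[1].substring(1, parts[1].length() - 1); // "1.0:A,D"
        String pr = body.split(":")[0];
        String[] targets = body.split(":")[1].split(",");
        for (String t : targets) {
            ctx.write(new Text(t), new Text(pr + ":" + targets.length));
        }
        // structure record, marked with "!" (an assumed convention)
        ctx.write(new Text(parts[0]), new Text("!" + body.split(":")[1]));
    }
}
```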
The phase 2 reduce sums the contributions for each node (e.g. D receives 1/1 from A and 1/2 from C, giving 1.5) and reattaches the node's out-link list from the structure records (A, D), (B, C), (C, AD), (D, B):

(A, 1:2)            ->  (A, (0.5:D))
(B, 1:1)            ->  (B, (1.0:C))
(C, 1:1)            ->  (C, (1.0:A,D))
(D, 1:1), (D, 1:2)  ->  (D, (1.5:B))
In the map step the graph structure is preserved, so the output of one iteration can serve directly as the input of the next. Updating PageRank:

[new pr] = d * [sum of incoming pr/L contributions] + (1 - d), where d is the damping value (0.85)
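Continuing the sketch, the phase 2 reduce could apply this formula as follows (the `!` marker matches the mapper sketch above):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce: sum the incoming pr/L contributions, apply the damping formula,
// and reattach the node's out-link list from the structure record.
public class RankReducer extends Reducer<Text, Text, Text, Text> {
    private static final double D = 0.85; // damping value

    @Override
    protected void reduce(Text node, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        double sum = 0.0;
        String links = "";
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("!")) {
                links = s.substring(1);            // structure record
            } else {
                String[] c = s.split(":");         // contribution "pr:L"
                sum += Double.parseDouble(c[0]) / Integer.parseInt(c[1]);
            }
        }
        double newPr = D * sum + (1 - D);          // [new pr] = d * sum + (1 - d)
        ctx.write(node, new Text("(" + newPr + ":" + links + ")"));
    }
}
```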
Sort by PageRank value.
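One possible way to do this with another small job (again only a sketch, not this repository's actual code): swap each record to (PageRank, node) and let Hadoop's shuffle sort by the numeric key:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: turn "D\t(1.5:B)" into (-1.5, D); negating the key makes the
// default ascending shuffle sort produce descending PageRank order
// (an identity reducer can flip the sign back when writing output).
public class SortMapper extends Mapper<Object, Text, DoubleWritable, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        double pr = Double.parseDouble(
                parts[1].substring(1, parts[1].indexOf(':')));
        ctx.write(new DoubleWritable(-pr), new Text(parts[0]));
    }
}
```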
The program uses a simple bash script to drive the iterative MapReduce: a new Hadoop job has to be submitted for each iteration (see the driver sketch after the list below). Related systems:
Apache Mahout: restarting needed
PEGASUS: restarting needed
HaLoop: a modified version of Hadoop that supports efficient iterative data processing on large commodity clusters
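For reference, a minimal sketch of such a per-iteration driver loop in Java, using the folder naming from the `pr` script and the hypothetical RankMapper/RankReducer classes sketched above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: one fresh Job per iteration -- this per-job startup cost is
// exactly what systems like HaLoop try to avoid.
public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        int iterations = Integer.parseInt(args[0]);
        for (int i = 0; i < iterations; i++) {
            Job job = Job.getInstance(new Configuration(), "pagerank-" + i);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(RankMapper.class);
            job.setReducerClass(RankReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // phase2_output_0 would be seeded from phase1_output
            FileInputFormat.addInputPath(job, new Path("phase2_output_" + i));
            FileOutputFormat.setOutputPath(job, new Path("phase2_output_" + (i + 1)));
            if (!job.waitForCompletion(true)) System.exit(1);
        }
    }
}
```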