GitHub - silvan-software/PageRanker: Implementation of the Page Rank algorithm - from a Job Interview many years ago

Page Rank implementation by Joseph P. Silva

Based on "Searching the web" in ACM Transactions on Internet Technology, Vol. 1, No. 1, August 2001

Also based on the Wikipedia article on the PageRank algorithm at http://en.wikipedia.org/wiki/PageRank

Duplicate links are ignored after the first one seen.
Links from a page to itself are ignored.
Rank sinks are handled via the decay factor, per the documents listed above.
Leak nodes (pages with no outgoing links) are treated as if they link out to all other pages.

1: Use the standard *nix makefile to build the executable "PageRanker"

2: Usage: PageRanker filename num_iterations decay_factor

3: Run it against the test data files:

PageRanker testBadInput 1 1

PageRanker test13a 100 1

PageRanker test13b 100 .8

PageRanker testWikiExample 100 .85

testBadInput should report:

line 2 page links to itself...ignored
line 3 missing second page...ignored
line 4 no pages...ignored
INITIAL RANKS 
C 0.333333
B 0.333333
A 0.333333
*** ITERATION 1***
C 0.333333
B 0.5
A 0.166667
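
The warnings above suggest a line-by-line reader along the following lines. This is only a sketch under assumptions: it presumes each input line holds two whitespace-separated page names (source, then target), and the names `readLinks` and `LinkMap` are hypothetical. Whether the real program still registers pages that appear on ignored lines is not stated here; this sketch does not.

```cpp
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Page name -> set of pages it links to (hypothetical layout).
using LinkMap = std::map<std::string, std::set<std::string>>;

// Reads "source target" pairs, collecting a warning for each ignored line.
LinkMap readLinks(std::istream& in, std::vector<std::string>& warnings) {
    LinkMap links;
    std::string line;
    for (int lineNo = 1; std::getline(in, line); ++lineNo) {
        std::istringstream fields(line);
        std::string from, to;
        const std::string prefix = "line " + std::to_string(lineNo) + " ";
        if (!(fields >> from)) {
            warnings.push_back(prefix + "no pages...ignored");
        } else if (!(fields >> to)) {
            warnings.push_back(prefix + "missing second page...ignored");
        } else if (from == to) {
            warnings.push_back(prefix + "page links to itself...ignored");
        } else {
            links[from].insert(to);  // std::set silently drops duplicate links
            links[to];               // make sure the target page is known too
        }
    }
    return links;
}
```

Using `std::set` for the targets gives the "duplicates ignored after the first one seen" behaviour without an explicit check.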

test13a and test13b are based on the similarly named diagrams in the ACM paper

Expected final results for 13a:

*** ITERATION 100***
5 0.142857
4 0.142857
3 0.142857
2 0.285714
1 0.285714

These values exactly match the paper.

Expected final results for 13b:

*** ITERATION 100***
5 0.290177
4 0.312721
3 0.101449
2 0.153623
1 0.142029

These values match the paper, except that the paper clearly has a typo where r(1) and r(2) are swapped.

testWikiExample expected results:

*** ITERATION 100***
J 0.0162111
I 0.0162111
H 0.0162111
F 0.0391877
K 0.0162111
G 0.0162111
E 0.081094
A 0.0302911
D 0.0391877
C 0.343793
B 0.385391

testWikiExample is based on http://en.wikipedia.org/wiki/File:PageRanks-Example.svg. Note that the values differ slightly from the diagram; perhaps the diagram's authors rounded differently or handled leak pages slightly differently.

The diagram has:

A  3.3%
B 38.4%
C 34.3%
D  3.9%
E  8.1%
F  3.9%
The other 5 unnamed nodes (G-K in test file): 1.6% each

Code design thoughts:

1: The code is probably recomputing string hashes for the URLs each time a link is looked up and then stored in the unordered maps

If the input were a numeric page ID coming from the Page Repository instead of a URL, the hash computation could be much quicker, for instance just a cast to size_t

2: Iterating through the unordered map contents during ranking may be slower than iterating through a simple array

It might be worth building a simple array for the page list, and maybe for the link lists, just before page ranking, and using those rather than an iterator over the unordered maps
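
One way to act on points 1 and 2 is to assign each page a dense numeric index once, then rank over plain vectors. The sketch below is an assumption about how that could look, not a description of the existing classes: `FlatGraph`, `flatten`, and the shape of the input map are all hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Array-based view of the link graph (hypothetical layout).
struct FlatGraph {
    std::vector<std::string> names;                  // index -> page name
    std::vector<std::vector<std::size_t>> outLinks;  // index -> target indices
};

// Converts a name-keyed link map into index-based arrays. Each page name is
// hashed only while building; the ranking loop can then use indices alone.
FlatGraph flatten(
    const std::unordered_map<std::string, std::vector<std::string>>& links) {
    FlatGraph g;
    std::unordered_map<std::string, std::size_t> indexOf;
    // Returns the index for a name, creating a new slot on first sight.
    auto intern = [&](const std::string& name) {
        auto res = indexOf.emplace(name, g.names.size());
        if (res.second) {
            g.names.push_back(name);
            g.outLinks.emplace_back();
        }
        return res.first->second;
    };
    for (const auto& entry : links) {
        const std::size_t from = intern(entry.first);
        for (const auto& target : entry.second) {
            // Intern first: it may grow g.outLinks and invalidate references.
            const std::size_t to = intern(target);
            g.outLinks[from].push_back(to);
        }
    }
    return g;
}
```

The index order follows the unordered map's iteration order, so it is not stable between runs; `names` is kept so results can still be printed by page name.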

3: It might be worth writing unit tests, for instance with GoogleTest, based on the test files above

This could be worthwhile if the code were to change over time

4: Per the ACM paper and the Wikipedia page, 100 iterations is probably more than needed to converge close to the eigenvector
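
Instead of a fixed iteration count, the loop could stop once the ranks stop changing. The following is a sketch of that idea only; the program itself takes num_iterations on the command line, and `iterateUntilStable`, `tolerance`, and the `step` callback are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Ranks = std::vector<double>;

// Repeats `step` until the summed absolute change in rank between two
// consecutive iterations drops below `tolerance`, or until `maxIterations`
// is reached. Returns the number of iterations actually run.
std::size_t iterateUntilStable(Ranks& ranks,
                               const std::function<Ranks(const Ranks&)>& step,
                               double tolerance,
                               std::size_t maxIterations) {
    for (std::size_t i = 1; i <= maxIterations; ++i) {
        Ranks next = step(ranks);
        double change = 0.0;
        for (std::size_t p = 0; p < ranks.size(); ++p) {
            change += std::fabs(next[p] - ranks[p]);
        }
        ranks = std::move(next);
        if (change < tolerance) return i;
    }
    return maxIterations;
}
```

Passing the update as a callback keeps the stopping rule separate from the rank formula, so the same loop works for any decay factor.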

