## What is PageRank?
The PageRank algorithm was created by the founders of Google and serves as the backbone (underlying algorithm) of the Google search engine. It assigns a weight (a.k.a. pagerank) to every page that is crawled. This weight is directly proportional to the relevance Google search attributes to the page - a page with a higher pagerank value is considered more relevant and is ranked higher by the search engine during search recall. The weights are computed iteratively, and the time each iteration takes depends on the number of pages involved - this becomes more apparent when running the provided code. Because an exceptionally large amount of data needs to be crunched for this algorithm to work as desired, a system that enables large-scale data processing in a time-efficient manner is necessary - hence a Hadoop-based implementation.
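For intuition, here is a minimal single-machine sketch of the iterative update. It is illustrative only, not the mrjob implementation in this repository, and it assumes a damping factor of 0.85 and a toy adjacency-dict graph:

```python
# Minimal, single-machine sketch of the PageRank iteration (illustrative only;
# the repo's pageRank.py distributes this computation over map/reduce steps).
def pagerank(graph, damping=0.85, iterations=10):
    """graph: dict mapping each node to the list of nodes it links to."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Every node keeps a base amount of rank, plus whatever its in-links send it.
        new_ranks = {node: (1.0 - damping) / n for node in graph}
        for node, outlinks in graph.items():
            if not outlinks:
                continue
            share = ranks[node] / len(outlinks)  # rank is split evenly over outbound links
            for target in outlinks:
                new_ranks[target] = new_ranks.get(target, 0.0) + damping * share
        ranks = new_ranks
    return ranks

if __name__ == '__main__':
    toy_graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
    print(pagerank(toy_graph))
```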
## What is mrjob?
mrjob is a Python package (created by Yelp) that can be used for writing Hadoop streaming jobs. It is a really neat, simple tool for writing map/reduce jobs in Python, and it requires Python 2.5 or above. You can run jobs locally (i.e. on your machine), on Hadoop (in pseudo-distributed or clustered mode), or on Amazon's Elastic MapReduce very conveniently - by changing just a couple of command-line arguments!
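As a taste of the API, here is a minimal word-count job. It is a generic mrjob example (not part of this repository), but the same pattern applies to the PageRank job: the script runs unchanged locally, on Hadoop, or on EMR depending on the -r flag.

```python
# word_count.py - a minimal mrjob streaming job (generic example, not part of this repo).
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word on the input line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum up the counts emitted for each word.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
```

Running python word_count.py input.txt executes the job locally; adding -r hadoop or -r emr sends the same job to a Hadoop cluster or to Elastic MapReduce.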
## Dataset
The data used for this experiment has been derived from DBpedia. Wikipedia pages are a rich source for modeling a network structure: each page has multiple outbound links (e.g. in its See Also, References and Notes sections).
## Setup
- Python 2.7 (Get Python)
- pip (Get pip)
- Open Terminal. Run
sudo pip install mrjob
- If you intend for the outbound links (of pages) to have a Dirichlet distribution: Get numpy
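To quickly check that the installation worked (an optional sanity check; it simply imports mrjob and prints its version), from Terminal, run
python -c "import mrjob; print(mrjob.__version__)"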
## Experiment
### Running the PageRank algorithm on the Wikipedia dataset
- Move the extracted dataset and the *.py files to a new folder.
- Open Terminal and change the current working directory to the newly created folder. Run
python getEncodedNodes.py <dataset_name>
If you intend the outbound links to have a Dirichlet distribution, run instead
python getEncodedNodes_Dirichlet.py <dataset_name>
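The exact format that this encoding step writes out is defined by getEncodedNodes.py itself. Purely as an illustration of the idea - mapping page titles to integer node ids and listing each node's outbound links - a hypothetical version might look like this (names and output format are made up, not the repo's actual code):

```python
# Hypothetical sketch of "node encoding": assign each page title an integer id and
# emit one line per page with its id and the ids of its outbound links.
# This is NOT the repo's getEncodedNodes.py; it only illustrates the idea.
def encode_nodes(pages):
    """pages: dict mapping a page title to the list of titles it links to."""
    ids = {}
    for title in pages:
        ids.setdefault(title, len(ids))
    for outlinks in pages.values():
        for title in outlinks:
            ids.setdefault(title, len(ids))

    lines = []
    for title, outlinks in pages.items():
        encoded = [str(ids[t]) for t in outlinks]
        lines.append('%d\t%s' % (ids[title], ','.join(encoded)))
    return ids, lines

if __name__ == '__main__':
    pages = {'PageRank': ['Google', 'Graph_theory'], 'Google': ['PageRank']}
    ids, lines = encode_nodes(pages)
    print('\n'.join(lines))
```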
To run the experiment locally -
- From Terminal, run
python pageRank.py path_of_'result.txt'
To run the experiment on Hadoop -
- Ensure that Hadoop (pseudo-distributed or clustered mode) is set up on your system. If not:
  - I'd suggest you download Hadoop 1.2.1.
  - Set up Hadoop in the desired mode - Pseudo-distributed setup, Cluster setup
- Before running the code, you have to transfer the extracted dataset to HDFS. For a pseudo-distributed setup, from Terminal, run
~/hadoop-1.2.1/bin/hadoop dfs -copyFromLocal path_of_'result.txt' /input
- From Terminal, run
python pageRank.py -r hadoop hdfs:////input >out
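To confirm that result.txt actually landed in HDFS after the copyFromLocal step above (an optional sanity check using a standard Hadoop 1.x shell command), from Terminal, run
~/hadoop-1.2.1/bin/hadoop dfs -ls /input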
To run the experiment on Elastic MapReduce -
- Ensure that you have the following: an AWS account, a sign-up for the EMR service, and an EMR (mrjob) configuration file on your system (typically ~/.mrjob.conf).
- If your input file is:
  - in HDFS - from Terminal, run
python pageRank.py -r emr hdfs:////input >out
  - stored locally - from Terminal, run
python pageRank.py -r emr path_of_'result.txt' >out
The output of the experiment will be stored in a file named out.
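For the EMR runner, mrjob reads its settings (including AWS credentials) from the configuration file mentioned above, typically ~/.mrjob.conf. A minimal sketch of such a file is shown below; the exact options you may want to add (region, instance type, number of instances, etc.) depend on your mrjob version, so treat this as a starting point rather than a definitive configuration:

```yaml
# ~/.mrjob.conf - minimal sketch; the credential keys are standard, everything else varies by mrjob version.
runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY_ID
    aws_secret_access_key: YOUR_SECRET_ACCESS_KEY
```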
#### To find out the page with the highest pageRank score:
From Terminal, run python getMaxNode.py out
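If you want to do this by hand instead, note that mrjob's default output protocol writes one JSON-encoded key and value per line, separated by a tab. Assuming the job emits (node, score) pairs in that format - an assumption about the output, and this is not the repo's getMaxNode.py - a small scan like the following would find the top-ranked node:

```python
# find_max_node.py - hedged sketch: assumes mrjob's default output of one
# "<json key>\t<json value>" pair per line, where the value is the pagerank score.
# Not the repo's getMaxNode.py; it only illustrates the idea.
import json
import sys

def max_node(path):
    best_node, best_score = None, float('-inf')
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            key, value = line.rstrip('\n').split('\t', 1)
            node, score = json.loads(key), float(json.loads(value))
            if score > best_score:
                best_node, best_score = node, score
    return best_node, best_score

if __name__ == '__main__':
    node, score = max_node(sys.argv[1] if len(sys.argv) > 1 else 'out')
    print('%s\t%s' % (node, score))
```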
#### To change the number of iterations / mappers-reducers:
- To change the number of mappers/reducers, run
python pageRank.py -r hadoop hdfs:////input --jobconf mapred.map.tasks=numberOfMappers --jobconf mapred.reduce.tasks=numberOfReducers > out
e.g. python pageRank.py -r hadoop hdfs:////input --jobconf mapred.map.tasks=5 --jobconf mapred.reduce.tasks=10 > out
- To change the number of iterations, run
python pageRank.py -r hadoop hdfs:////input --iterations=numberOfIterations > out
e.g. python pageRank.py -r hadoop hdfs:////input --iterations=5 > out
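The --jobconf flags above are passed straight through to Hadoop, whereas a custom option like --iterations has to be declared by the job itself. As a rough sketch of how such an option is typically wired into an iterative mrjob job (using the pre-0.6 option API that matches the Python 2 era of this project; the class and the mapper/reducer bodies are illustrative placeholders, not the repo's actual pageRank.py):

```python
# Sketch of an iterative mrjob job with a user-settable --iterations option.
# Illustrative only: the repo's pageRank.py defines its own mapper/reducer logic.
# Uses the pre-0.6 mrjob option API (configure_options/add_passthrough_option);
# newer mrjob versions use configure_args()/add_passthru_arg() instead.
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRIterativeJob(MRJob):

    def configure_options(self):
        super(MRIterativeJob, self).configure_options()
        self.add_passthrough_option('--iterations', type='int', default=10,
                                    help='number of PageRank iterations to run')

    def mapper(self, node, value):
        # A real PageRank pass would distribute the node's rank over its outlinks here.
        yield node, value

    def reducer(self, node, values):
        # ...and sum the incoming rank contributions for each node here.
        for value in values:
            yield node, value

    def steps(self):
        # Repeat the same map/reduce step once per requested iteration.
        return [MRStep(mapper=self.mapper, reducer=self.reducer)] * self.options.iterations

if __name__ == '__main__':
    MRIterativeJob.run()
```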