The pagerank.c program reads data from a given collection of pages listed in the file collection.txt and builds a graph structure using an adjacency matrix. Using the algorithm described below, it calculates the Weighted PageRank for every url in collection.txt. In this file, urls are separated by one or more spaces and/or newline characters. Add the suffix ".txt" to a url to obtain the file name of the corresponding "web page"; for example, the file url24.txt contains the required information for url24. The algorithm to update PageRank values is given below.
Definitions
- M(p_i): set containing nodes (urls) with outgoing links to p_i (ignore self-loops and parallel edges)
- W^in_(v,u): weight of link (v,u), calculated based on the number of inlinks of page u and the number of inlinks of all reference pages of page v: W^in_(v,u) = I_u / Σ_{p ∈ R(v)} I_p
- I_u and I_p: the number of inlinks of page u and page p, respectively
- W^out_(v,u): weight of link (v,u), calculated based on the number of outlinks of page u and the number of outlinks of all reference pages of page v: W^out_(v,u) = O_u / Σ_{p ∈ R(v)} O_p
- O_u and O_p: the number of outlinks of page u and page p, respectively
- R(v): the reference page list of page v (the pages v links out to)
- t and t+1: iteration values
Refer to Weighted PageRank Algorithm for more information.
Assumptions
For calculating W^out_(v,u), if a node k has out-degree 0 (zero outlinks), O_k should be 0.5 and not 0. This will avoid issues related to division by zero.
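The zero-outlink rule above can be sketched as a small helper, assuming an adjacency-matrix graph where g[v][u] != 0 means v links to u (self-loops ignored). The names outDegreeOrHalf and MAXV are illustrative, not part of any required interface:

```c
#include <assert.h>

#define MAXV 64

// Out-degree of node k, with the spec's assumption applied:
// a node with zero outlinks is treated as having out-degree 0.5,
// so later divisions by sums of out-degrees never divide by zero.
static double outDegreeOrHalf(int k, int n, int g[][MAXV]) {
    double d = 0;
    for (int u = 0; u < n; u++)
        if (u != k && g[k][u]) d++;   // count outlinks, skipping self-loops
    return d == 0 ? 0.5 : d;          // zero outlinks: use 0.5, not 0
}
```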
Algorithm for PageRank Calculation
Start PageRankW(d, diffPR, maxIterations)
Read "web pages" from the collection in file "collection.txt" and build a graph structure using Adjacency List Representation
N = number of urls in the collection
For each url p_i in the collection
    PR(p_i, 0) = 1/N
End For
iteration = 0;
diff = diffPR; // to enter the following loop
While (iteration < maxIterations AND diff >= diffPR)
    t = iteration
    For each url p_i in the collection
        PR(p_i, t+1) = (1 - d)/N + d * Σ_{p_j ∈ M(p_i)} PR(p_j, t) * W^in_(p_j, p_i) * W^out_(p_j, p_i)
    End For
    diff = Σ_{i=1..N} |PR(p_i, t+1) - PR(p_i, t)|
    iteration++;
End While
End PageRankW(d, diffPR, maxIterations)
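The loop above can be sketched in C as follows, assuming an adjacency-matrix graph g[v][u] (v links to u) and the W^in/W^out definitions given earlier, with the 0.5 rule for zero out-degrees. All identifiers (pageRankW, inDeg, outDeg, MAXV) are illustrative, not a required interface:

```c
#include <assert.h>
#include <math.h>

#define MAXV 64

static double inDeg(int u, int n, int g[][MAXV]) {
    double d = 0;
    for (int v = 0; v < n; v++) if (v != u && g[v][u]) d++;
    return d;
}

static double outDeg(int u, int n, int g[][MAXV]) {
    double d = 0;
    for (int w = 0; w < n; w++) if (w != u && g[u][w]) d++;
    return d == 0 ? 0.5 : d;   // zero-outlink assumption from the spec
}

// Iterates until diff < diffPR or maxIterations is reached;
// writes the final ranks into pr[] and returns the iteration count.
static int pageRankW(int n, int g[][MAXV], double d,
                     double diffPR, int maxIterations, double pr[]) {
    double prev[MAXV];
    for (int i = 0; i < n; i++) pr[i] = 1.0 / n;   // PR(p_i, 0) = 1/N

    int iter = 0;
    double diff = diffPR;                 // to enter the loop at least once
    while (iter < maxIterations && diff >= diffPR) {
        for (int i = 0; i < n; i++) prev[i] = pr[i];
        diff = 0;
        for (int u = 0; u < n; u++) {
            double sum = 0;
            for (int v = 0; v < n; v++) {
                if (v == u || !g[v][u]) continue;     // v must link to u
                double inSum = 0, outSum = 0;
                for (int p = 0; p < n; p++) {
                    if (p == v || !g[v][p]) continue; // reference pages of v
                    inSum  += inDeg(p, n, g);
                    outSum += outDeg(p, n, g);
                }
                // PR(v, t) * W^in(v,u) * W^out(v,u)
                sum += prev[v] * (inDeg(u, n, g) / inSum)
                               * (outDeg(u, n, g) / outSum);
            }
            pr[u] = (1 - d) / n + d * sum;
            diff += fabs(pr[u] - prev[u]);            // Σ |PR(t+1) - PR(t)|
        }
        iter++;
    }
    return iter;
}
```

Note that all updates in one iteration read the previous iteration's values (prev[]), matching the PR(p_j, t) term in the pseudocode.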
The file searchTfIdf.c receives search terms (words) as command-line arguments and outputs (to stdout) the top 30 pages, in descending order of the number of search terms found; within each such group, pages are arranged in descending order of the summation of their tf-idf values. The program also outputs the corresponding summation of tf-idf values along with each page, separated by a space, using the format "%.6f". See the example below.
Example
% searchTfIdf mars design
url25 1.902350
url31 0.434000
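The two-level ordering above (term count first, then tf-idf sum) can be sketched as a qsort comparator; the Result struct and all field names here are illustrative, not part of the required program:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// One page's search result: how many distinct search terms it contains,
// and the summed tf-idf over those terms.
typedef struct {
    char   url[100];
    int    matches;   // number of distinct search terms found in the page
    double tfidfSum;  // summation of tf-idf values for the matched terms
} Result;

// Descending by matches, then descending by tfidfSum;
// url order as a final tie-break (an assumption, for determinism).
static int cmpResult(const void *a, const void *b) {
    const Result *x = a, *y = b;
    if (x->matches != y->matches)
        return y->matches - x->matches;
    if (x->tfidfSum != y->tfidfSum)
        return (y->tfidfSum > x->tfidfSum) ? 1 : -1;
    return strcmp(x->url, y->url);
}
```

After sorting with qsort(results, n, sizeof(Result), cmpResult), printing the first 30 entries with printf("%s %.6f\n", ...) produces output in the format shown above.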
Term Frequency Calculation
The term frequency is given by

tf(t, d) = f(t, d) / T(d)

where f(t, d) is the frequency of term t in document d and T(d) is the total number of words in d.
Inverse Document Frequency Calculation
The inverse document frequency is given by

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

where N = |D| is the total number of documents and D is the set of all documents.
Refer to tf-idf for more information on these calculations.
Here, we combine search results (ranks) from Part 1 and Part 2 using the "Scaled Footrule Rank Aggregation" method. Let T1 and T2 be the search results (ranks) obtained using Part 1 and Part 2 respectively.
Then, a weighted bipartite graph for scaled footrule optimization (C,P,W) is defined as follows:
- Let 'C' be the set of nodes to be ranked (union of T1 and T2).
- Let P be the set of positions available (say {1, 2, 3, 4, 5}).
Then, W(c,p) is the scaled-footrule distance (from T1 and T2) of a ranking that places element 'c' at position 'p', given by:

W(c, p) = Σ_{i=1..k} | τ_i(c) / |T_i| - p / n |

where:
- |T1| is the cardinality (size) of T1
- |T2| is the cardinality (size) of T2
- τ_i(c) is the position of 'c' in T_i
- k is the number of rank lists (here, k = 2)
- n = |C| is the number of nodes to be ranked
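The weight W(c,p) can be sketched as below, assuming each rank list is an array of url strings and skipping lists that do not contain c (a common convention for partial lists; the names posIn and footruleW are illustrative):

```c
#include <assert.h>
#include <math.h>
#include <string.h>

// 1-based position of url c in list t[0..size-1]; 0 if c is absent.
static int posIn(const char *c, const char *t[], int size) {
    for (int i = 0; i < size; i++)
        if (strcmp(t[i], c) == 0) return i + 1;
    return 0;
}

// W(c, p) over k rank lists ts[] with sizes sizes[]; n = |C| is the
// number of nodes being ranked. Lists not containing c contribute 0.
static double footruleW(const char *c, int p, int n,
                        const char **ts[], const int sizes[], int k) {
    double w = 0;
    for (int i = 0; i < k; i++) {
        int pos = posIn(c, ts[i], sizes[i]);
        if (pos == 0) continue;   // c not in this rank list
        w += fabs((double)pos / sizes[i] - (double)p / n);
    }
    return w;
}
```

Minimising the total W(c,p) over all assignments of nodes to positions then gives the aggregated ranking.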