sakshamyadav/Search-Engine

Part 1: Graph structure-based Search Engine

The pagerank.c file reads data from a given collection of pages in the file collection.txt and builds a graph structure using an adjacency matrix. Using the algorithm described below, it calculates the Weighted PageRank for every url in collection.txt. In this file, urls are separated by one or more spaces and/or newline characters. The suffix '.txt' is added to a url to obtain the file name of the corresponding "web page"; for example, the file url24.txt contains the required information for url24. The algorithm to update PageRank values is given below.

Definitions

M(u)

Set containing nodes (urls) with outgoing links to u (ignore self-loops and parallel edges).

W_in(v, u) = I_u / Σ_{p ∈ R(v)} I_p

Weight of link (v, u), calculated based on the number of inlinks of page u and the number of inlinks of all reference pages of page v.

I_u and I_p

Represent the number of inlinks of page u and page p, respectively.

W_out(v, u) = O_u / Σ_{p ∈ R(v)} O_p

Weight of link (v, u), calculated based on the number of outlinks of page u and the number of outlinks of all reference pages of page v.

O_u and O_p

Represent the number of outlinks of page u and page p, respectively.

R(v)

Denotes the reference page list of page v.

PR(p_i, t) and PR(p_i, t+1)

Denote the PageRank values of page p_i at iterations t and t+1, respectively.

Refer to Weighted PageRank Algorithm for more information.

Assumptions

For calculating W_out(v, u), if a node k has out-degree 0 (zero outlinks), O_k should be 0.5 and not 0. This avoids issues related to division by 0.

Algorithm for PageRank Calculation

Start PageRankW(d, diffPR, maxIterations)

	Read "web pages" from the collection in file "collection.txt" and build a graph structure using an adjacency matrix representation

	N = number of urls in the collection

	For each url p_i in the collection

	PR(p_i, 0) = 1/N

	End For 

	iteration = 0;
   	diff = diffPR;   // to enter the following loop

    While (iteration < maxIterations AND diff >= diffPR)
    	iteration++;

    	For each url p_i in the collection
    		PR(p_i, t+1) = (1 - d)/N + d * Σ_{p_j ∈ M(p_i)} PR(p_j, t) * W_in(p_j, p_i) * W_out(p_j, p_i)
    	End For

    	diff = Σ_i | PR(p_i, t+1) − PR(p_i, t) |

	End While
	
End PageRankW(d, diffPR, maxIterations)

Part-2: Content-based Search Engine

The file searchTfIdf.c receives search terms (words) as command-line arguments and outputs (to stdout) the top 30 pages in descending order of the number of search terms found; within each group, pages are arranged in descending order of the summation of their tf-idf values. The program also outputs the corresponding summation of tf-idf along with each page, separated by a space and using the format "%.6f". See the example below.

Example

% searchTfIdf  mars  design
    url25  1.902350
    url31  0.434000

Term Frequency Calculation

The term frequency is given by

tf(t, d) = f_{t,d} / W_d, where f_{t,d} is the frequency of term t in document d and W_d is the total number of words in d.

Inverse Document Frequency Calculation

The inverse document frequency is given by

idf(t, D) = log10( N / |{d ∈ D : t ∈ d}| ), where N is the total number of documents and D is the set of all documents.

Refer to tf-idf for more information on these calculations.


Part-3: Hybrid/Meta Search Engine using Rank Aggregation

Here, we combine search results (ranks) from Part 1 and Part 2 using the "Scaled Footrule Rank Aggregation" method. Let T1 and T2 be the search results (ranks) obtained using Part 1 and Part 2 respectively.

Then, a weighted bipartite graph for scaled footrule optimization (C,P,W) is defined as follows:

  • Let 'C' be the set of nodes to be ranked (union of T1 and T2).
  • Let P be the set of positions available (say {1, 2, 3, 4, 5}).

Then, W(c,p) is the scaled-footrule distance (from T1 and T2) of a ranking that places element 'c' at position 'p', given by:

W(c, p) = Σ_{i=1}^{k} | T_i(c) / |T_i| − p / n |, where

  • n = |C|
  • |T1| is the cardinality (size) of T1
  • |T2| is the cardinality (size) of T2
  • T_i(c) is the position of c in T_i
  • k is the number of rank lists

About

C-based Search Engine
