The pagerank.c program reads data from a given collection of pages listed in the file collection.txt and builds a graph structure using an adjacency matrix. Using the algorithm described below, it calculates the Weighted PageRank for every url in collection.txt. In this file, urls are separated by one or more spaces and/or newline characters. Add the suffix ".txt" to a url to obtain the file name of the corresponding "web page"; for example, the file url24.txt contains the required information for url24. The algorithm to update PageRank values is given below.
Definitions
- M(p_i): set containing nodes (urls) with outgoing links to p_i (ignore self-loops and parallel edges)
- W^in_(v,u): weight of link (v,u), calculated based on the number of inlinks of page u and the number of inlinks of all reference pages of page v: W^in_(v,u) = I_u / Σ_{p ∈ R(v)} I_p
- I_u and I_p: the number of inlinks of page u and page p, respectively
- W^out_(v,u): weight of link (v,u), calculated based on the number of outlinks of page u and the number of outlinks of all reference pages of page v: W^out_(v,u) = O_u / Σ_{p ∈ R(v)} O_p
- O_u and O_p: the number of outlinks of page u and page p, respectively
- R(v): the reference page list of page v (the pages v links out to)
- t and t+1: iteration values
Refer to Weighted PageRank Algorithm for more information.
Assumptions
For calculating W^out_(v,u), if a node k has out-degree 0 (zero outlinks), O_k should be 0.5 and not 0. This will avoid issues related to division by zero.
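The zero-outlink rule above can be sketched as a small helper, assuming an adjacency-matrix graph where g[v][u] != 0 means v links to u (self-loops ignored). The names outDegreeOrHalf and MAXV are illustrative, not part of any required interface:

```c
#include <assert.h>

#define MAXV 64

// Out-degree of node k, with the spec's assumption applied:
// a node with zero outlinks is treated as having out-degree 0.5,
// so later divisions by sums of out-degrees never divide by zero.
static double outDegreeOrHalf(int k, int n, int g[][MAXV]) {
    double d = 0;
    for (int u = 0; u < n; u++)
        if (u != k && g[k][u]) d++;   // count outlinks, skipping self-loops
    return d == 0 ? 0.5 : d;          // zero outlinks: use 0.5, not 0
}
```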
Algorithm for PageRank Calculation
Start PageRankW(d, diffPR, maxIterations)
Read "web pages" from the collection in file "collection.txt" and build a graph structure using Adjacency List Representation
N = number of urls in the collection
For each url p_i in the collection
    PR(p_i, 0) = 1/N
End For
iteration = 0;
diff = diffPR; // to enter the following loop
While (iteration < maxIterations AND diff >= diffPR)
    t = iteration
    For each url p_i in the collection
        PR(p_i, t+1) = (1 - d)/N + d * Σ_{p_j ∈ M(p_i)} PR(p_j, t) * W^in_(p_j, p_i) * W^out_(p_j, p_i)
    End For
    diff = Σ_{i=1..N} |PR(p_i, t+1) - PR(p_i, t)|
    iteration++;
End While
End PageRankW(d, diffPR, maxIterations)
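The loop above can be sketched in C as follows, assuming an adjacency-matrix graph g[v][u] (v links to u) and the W^in/W^out definitions given earlier, with the 0.5 rule for zero out-degrees. All identifiers (pageRankW, inDeg, outDeg, MAXV) are illustrative, not a required interface:

```c
#include <assert.h>
#include <math.h>

#define MAXV 64

static double inDeg(int u, int n, int g[][MAXV]) {
    double d = 0;
    for (int v = 0; v < n; v++) if (v != u && g[v][u]) d++;
    return d;
}

static double outDeg(int u, int n, int g[][MAXV]) {
    double d = 0;
    for (int w = 0; w < n; w++) if (w != u && g[u][w]) d++;
    return d == 0 ? 0.5 : d;   // zero-outlink assumption from the spec
}

// Iterates until diff < diffPR or maxIterations is reached;
// writes the final ranks into pr[] and returns the iteration count.
static int pageRankW(int n, int g[][MAXV], double d,
                     double diffPR, int maxIterations, double pr[]) {
    double prev[MAXV];
    for (int i = 0; i < n; i++) pr[i] = 1.0 / n;   // PR(p_i, 0) = 1/N

    int iter = 0;
    double diff = diffPR;                 // to enter the loop at least once
    while (iter < maxIterations && diff >= diffPR) {
        for (int i = 0; i < n; i++) prev[i] = pr[i];
        diff = 0;
        for (int u = 0; u < n; u++) {
            double sum = 0;
            for (int v = 0; v < n; v++) {
                if (v == u || !g[v][u]) continue;     // v must link to u
                double inSum = 0, outSum = 0;
                for (int p = 0; p < n; p++) {
                    if (p == v || !g[v][p]) continue; // reference pages of v
                    inSum  += inDeg(p, n, g);
                    outSum += outDeg(p, n, g);
                }
                // PR(v, t) * W^in(v,u) * W^out(v,u)
                sum += prev[v] * (inDeg(u, n, g) / inSum)
                               * (outDeg(u, n, g) / outSum);
            }
            pr[u] = (1 - d) / n + d * sum;
            diff += fabs(pr[u] - prev[u]);            // Σ |PR(t+1) - PR(t)|
        }
        iter++;
    }
    return iter;
}
```

Note that all updates in one iteration read the previous iteration's values (prev[]), matching the PR(p_j, t) term in the pseudocode.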
The file searchTfIdf.c receives search terms (words) as command-line arguments and outputs (to stdout) the top 30 pages, in descending order of the number of search terms found; within each such group, pages are arranged in descending order of the summation of their tf-idf values. The program also outputs the corresponding summation of tf-idf values along with each page, separated by a space, using the format "%.6f". See the example below.
Example
% searchTfIdf mars design
url25 1.902350
url31 0.434000
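The two-level ordering above (term count first, then tf-idf sum) can be sketched as a qsort comparator; the Result struct and all field names here are illustrative, not part of the required program:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// One page's search result: how many distinct search terms it contains,
// and the summed tf-idf over those terms.
typedef struct {
    char   url[100];
    int    matches;   // number of distinct search terms found in the page
    double tfidfSum;  // summation of tf-idf values for the matched terms
} Result;

// Descending by matches, then descending by tfidfSum;
// url order as a final tie-break (an assumption, for determinism).
static int cmpResult(const void *a, const void *b) {
    const Result *x = a, *y = b;
    if (x->matches != y->matches)
        return y->matches - x->matches;
    if (x->tfidfSum != y->tfidfSum)
        return (y->tfidfSum > x->tfidfSum) ? 1 : -1;
    return strcmp(x->url, y->url);
}
```

After sorting with qsort(results, n, sizeof(Result), cmpResult), printing the first 30 entries with printf("%s %.6f\n", ...) produces output in the format shown above.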
Term Frequency Calculation
The term frequency is given by

tf(t, d) = f(t, d) / T(d)

where f(t, d) is the frequency of term t in document d and T(d) is the total number of words in d.
Inverse Document Frequency Calculation
The inverse document frequency is given by

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

where N = |D| is the total number of documents and D is the set of all documents.
Refer to tf-idf for more information on these calculations.
Here, we combine search results (ranks) from Part 1 and Part 2 using the "Scaled Footrule Rank Aggregation" method. Let T1 and T2 be the search results (ranks) obtained using Part 1 and Part 2 respectively.
Then, a weighted bipartite graph for scaled footrule optimization (C,P,W) is defined as follows:
- Let 'C' be the set of nodes to be ranked (union of T1 and T2).
- Let P be the set of positions available (say {1, 2, 3, 4, 5}).
Then, W(c,p) is the scaled-footrule distance (from T1 and T2) of a ranking that places element 'c' at position 'p', given by:

W(c, p) = Σ_{i=1..k} | τ_i(c) / |T_i| - p / n |

where:
- |T1| is the cardinality (size) of T1
- |T2| is the cardinality (size) of T2
- τ_i(c) is the position of 'c' in T_i
- k is the number of rank lists (here, k = 2)
- n = |C| is the number of nodes to be ranked
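The weight W(c,p) can be sketched as below, assuming each rank list is an array of url strings and skipping lists that do not contain c (a common convention for partial lists; the names posIn and footruleW are illustrative):

```c
#include <assert.h>
#include <math.h>
#include <string.h>

// 1-based position of url c in list t[0..size-1]; 0 if c is absent.
static int posIn(const char *c, const char *t[], int size) {
    for (int i = 0; i < size; i++)
        if (strcmp(t[i], c) == 0) return i + 1;
    return 0;
}

// W(c, p) over k rank lists ts[] with sizes sizes[]; n = |C| is the
// number of nodes being ranked. Lists not containing c contribute 0.
static double footruleW(const char *c, int p, int n,
                        const char **ts[], const int sizes[], int k) {
    double w = 0;
    for (int i = 0; i < k; i++) {
        int pos = posIn(c, ts[i], sizes[i]);
        if (pos == 0) continue;   // c not in this rank list
        w += fabs((double)pos / sizes[i] - (double)p / n);
    }
    return w;
}
```

Minimising the total W(c,p) over all assignments of nodes to positions then gives the aggregated ranking.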