WikiSearch

This project implements a Wikipedia-based search engine that indexes a 46.7 GB Wikipedia XML dump, with the following features:

Index compression techniques for faster retrieval (about 150 ms per query on average).
Document ranking algorithms for relevant results.
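
This README does not say which compression technique the index uses. As an illustration only, one common option for sorted document ID lists is gap encoding combined with variable-byte codes. The class name `VByteSketch` and its methods are hypothetical and do not come from the project sources.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Assumption: variable-byte gap encoding, shown as one possible technique,
// not necessarily the one this project uses.
final class VByteSketch {
    // Encodes ascending, non-negative document IDs as gaps, 7 bits per byte,
    // low-order group first. A set high bit marks the last byte of a number.
    static byte[] encode(List<Integer> sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : sortedDocIds) {
            int gap = id - prev;
            prev = id;
            while (gap >= 128) {
                out.write(gap & 0x7F);
                gap >>>= 7;
            }
            out.write(gap | 0x80);
        }
        return out.toByteArray();
    }

    // Reverses encode(): rebuilds each gap and adds it to the running total.
    static List<Integer> decode(byte[] bytes) {
        List<Integer> ids = new ArrayList<>();
        int value = 0, shift = 0, prev = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                prev += value;
                ids.add(prev);
                value = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        return ids;
    }
}
```

Small gaps between neighbouring document IDs fit in a single byte, which is why this kind of encoding tends to shrink postings lists that are stored in sorted order.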

Functional Overview:

A. Preprocessing and Indexing of the Wiki dump:

The key preprocessing steps are:

a. Identifying each tagName in the documents.
b. Tokenizing the text in the sample corpus.
c. Stop word removal and case folding.
d. Stemming.
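
Steps b to d above can be sketched as follows. This is a minimal illustration, not the project's code: the stop-word list is a tiny sample, the `stem` method is a crude suffix stripper standing in for a real stemmer such as Porter's, and the class name `PreprocessSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class PreprocessSketch {
    // Tiny illustrative stop-word list; the project's actual list is not shown in this README.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "in", "is", "of", "on", "the", "to");

    // Crude suffix stripper used as a stand-in for a real stemming algorithm.
    static String stem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    static List<String> preprocess(String text) {
        List<String> terms = new ArrayList<>();
        // Case folding, then tokenizing on anything that is not an ASCII letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // skip empty fragments and stop words
            }
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [index, wikipedia, page]
        System.out.println(preprocess("The indexing of Wikipedia pages"));
    }
}
```

The same function would be applied to both documents and queries, so that a stemmed query term matches the stemmed form stored in the index.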

The other main component is indexing, handled in Indexing.java, which uses the following data structures:

a. A HashMap as the outer data structure, keyed by the words in the corpus.
b. The HashMap keys are sorted so that the data is stored lexicographically in the index file. This is necessary to optimise the search operation, which uses binary search.
c. Each HashMap value references the inner data structure, a TreeMap, which stores document IDs in sorted order.
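
A minimal sketch of this layout is shown below. It is not taken from Indexing.java: the class name `IndexSketch`, the choice of term frequency as the TreeMap value, and the `term:docId#tf;` line format are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IndexSketch {
    // term -> (document ID -> term frequency); the TreeMap keeps document IDs sorted.
    // Storing the term frequency as the value is an assumption.
    private final Map<String, TreeMap<Integer, Integer>> postings = new HashMap<>();

    // Records every occurrence of each term for the given document.
    void add(int docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Produces one line per term in lexicographic order, ready to be written to an index file.
    // The "term:docId#tf;docId#tf;" format is hypothetical.
    List<String> toSortedLines() {
        List<String> keys = new ArrayList<>(postings.keySet());
        Collections.sort(keys);
        List<String> lines = new ArrayList<>();
        for (String term : keys) {
            StringBuilder sb = new StringBuilder(term).append(':');
            for (Map.Entry<Integer, Integer> e : postings.get(term).entrySet()) {
                sb.append(e.getKey()).append('#').append(e.getValue()).append(';');
            }
            lines.add(sb.toString());
        }
        return lines;
    }
}
```

A HashMap gives constant-time updates while the dump is being read, and the keys only need to be sorted once, when the index is written out.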

B. Searching:

Preprocess the keyword to be searched (stemming, etc.).
Look up the preprocessed keyword in the index file using binary search.
Apply document ranking algorithms (such as TF-IDF) and return the most relevant results, in about 150 ms on average.
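
The lookup and ranking steps can be sketched as follows, assuming the index has been loaded as a list of lines sorted by term, each in a hypothetical `term:docId#tf;docId#tf;` format. The class name `SearchSketch`, the in-memory list, and the exact weighting (term frequency times the natural log of N over document frequency) are assumptions; the project may search the file directly and weight terms differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SearchSketch {
    // Binary search over index lines sorted by term.
    // Returns docId -> term frequency, or an empty map when the term is absent.
    static Map<Integer, Integer> lookup(List<String> sortedLines, String term) {
        int lo = 0, hi = sortedLines.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String line = sortedLines.get(mid);
            int sep = line.indexOf(':');
            int cmp = line.substring(0, sep).compareTo(term);
            if (cmp == 0) return parsePostings(line.substring(sep + 1));
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return Map.of();
    }

    // Parses the hypothetical "docId#tf;docId#tf;" postings format.
    private static Map<Integer, Integer> parsePostings(String postings) {
        Map<Integer, Integer> docs = new LinkedHashMap<>();
        for (String entry : postings.split(";")) {
            if (entry.isEmpty()) continue;
            String[] parts = entry.split("#");
            docs.put(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
        }
        return docs;
    }

    // Scores each document as the sum of tf * ln(N / df) over the query terms
    // and returns document IDs with the highest score first.
    static List<Integer> rank(List<String> sortedLines, List<String> queryTerms, int totalDocs) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : queryTerms) {
            Map<Integer, Integer> docs = lookup(sortedLines, term);
            if (docs.isEmpty()) continue;
            double idf = Math.log((double) totalDocs / docs.size());
            for (Map.Entry<Integer, Integer> e : docs.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ranked;
    }
}
```

The query terms passed to `rank` are expected to be preprocessed already, in the same way as the indexed text. Binary search needs only about log2(n) comparisons for n terms, which is what makes the lexicographic ordering of the index worthwhile.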

skyllar/WikiSearch

Folders and files

Latest commit

History

Repository files navigation

WikiSearch

About

Resources

Stars

Watchers

Forks

Languages