Parses the Wikipedia dump and creates an inverted index. The parsing is done using regular expressions. The indexer writes out multiple files: partial index files, token offset files, and titles. Check out Nikit's repository for the original parser code.
The dump used here is the English Wikipedia dump, which is around 80 GB uncompressed (the compressed XML bz2 file is around 18 GB). The resulting inverted index is around 17 GB, and index creation took around 5 hours.
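As a rough illustration of the parsing step (not the repository's actual parser), the sketch below streams the compressed dump and pulls out each page's title and text with regular expressions. The dump file name and the exact patterns are assumptions:

```python
# Minimal sketch of regex-based dump parsing; not the repository's parser.
# The dump path and the exact regexes are assumptions for illustration.
import bz2
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>")
TEXT_RE = re.compile(r"<text[^>]*>(.*?)</text>", re.DOTALL)
TOKEN_RE = re.compile(r"[a-z0-9]+")

def iter_pages(dump_path):
    """Yield (title, text) pairs, buffering one <page> element at a time."""
    buf, in_page = [], False
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                in_page = True
            if in_page:
                buf.append(line)
            if "</page>" in line:
                page = "".join(buf)
                buf, in_page = [], False
                title = TITLE_RE.search(page)
                text = TEXT_RE.search(page)
                if title and text:
                    yield title.group(1), text.group(1)

for title, text in iter_pages("enwiki-pages-articles.xml.bz2"):
    tokens = TOKEN_RE.findall(text.lower())
    # ...add (token, doc_id) pairs to an in-memory index and flush
    # partial index files to disk periodically.
```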
This is followed by merging and secondary inverted index creation, which took around 1.5 hours. `merge.py` creates a single index file and a single token offset file. `secinv.py` creates the secondary inverted indexes over the vocabulary (token) offsets and the title offsets.
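As a hedged sketch of what a merge step like `merge.py` typically does (the `token:postings` line format and file names are assumptions, not the repository's exact layout), sorted partial index files can be combined with a k-way merge while recording each token's byte offset:

```python
# Hedged sketch: k-way merge of sorted partial index files into a single
# index plus a token offset file. Line format "token:postings" is assumed.
import heapq

def merge_partials(partial_paths, index_path, offsets_path):
    files = [open(p, encoding="utf-8") for p in partial_paths]
    with open(index_path, "w", encoding="utf-8") as out, \
         open(offsets_path, "w", encoding="utf-8") as offs:
        # Each partial file is sorted by token, so merge on the token key.
        merged = heapq.merge(*files, key=lambda ln: ln.split(":", 1)[0])
        prev, postings, offset = None, [], 0

        def flush():
            nonlocal offset
            entry = prev + ":" + ";".join(postings) + "\n"
            offs.write(f"{prev} {offset}\n")  # byte offset of this token
            out.write(entry)
            offset += len(entry.encode("utf-8"))

        for line in merged:
            token, plist = line.rstrip("\n").split(":", 1)
            if token != prev and prev is not None:
                flush()
                postings = []
            prev = token
            postings.append(plist)
        if prev is not None:
            flush()
    for f in files:
        f.close()

# Hypothetical usage with assumed file names:
merge_partials(["index_0", "index_1"], "invind", "indoff")
```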
The final index consists of
- `invind`: the main inverted index
- `indoff`: the vocabulary (token offsets into `invind`)
- `titles`: the document titles
- `titles_off`: the title offsets
- `doc_count`: the document count
- `indoff_secinv.json`: the secondary index of the vocabulary offsets
- `titles_off_secinv.json`: the secondary index of the title offsets
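To show how these files fit together at query time, here is a hedged lookup sketch. The formats assumed below (`indoff_secinv.json` mapping every k-th token to its byte offset in `indoff`, `indoff` lines as `token offset`, `invind` lines as `token:postings`) are illustrative guesses, not the repository's exact layout:

```python
# Hedged sketch of a query-time lookup through the secondary index.
# All file formats here are assumptions for illustration.
import bisect
import json
import os

def lookup(token, index_dir):
    # The secondary index is small enough to load fully into memory.
    path = os.path.join(index_dir, "indoff_secinv.json")
    with open(path, encoding="utf-8") as f:
        secinv = json.load(f)                 # assumed: {sampled_token: byte_offset}
    keys = sorted(secinv)
    i = bisect.bisect_right(keys, token) - 1  # last sampled token <= query token
    if i < 0:
        return None
    # Scan the vocabulary block starting at that byte offset.
    with open(os.path.join(index_dir, "indoff"), "rb") as offs:
        offs.seek(secinv[keys[i]])
        for raw in offs:
            tok, off = raw.decode("utf-8").split()
            if tok == token:
                with open(os.path.join(index_dir, "invind"), "rb") as inv:
                    inv.seek(int(off))
                    return inv.readline().decode("utf-8").rstrip("\n")
            if tok > token:
                break                         # tokens are sorted; not found
    return None
```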
You can download the dumps from the following links:
To understand the concepts and build your own version, follow this link: Stanford Information Retrieval Course.
To install the Python dependencies, run

```bash
pip install -r requirements.txt
```
You can start the indexer using

```bash
bash index.sh <compressed bz2 dump file path> <target index path>
```
You can run multiple queries, both plain queries and field queries, by adding them to a file and passing that file to `search.py`; the output is stored in `queries_op.txt`. For each query, the output is the titles of the top 10 results, followed by the time taken for the search.
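As a hypothetical illustration of a query file (the field-prefix syntax shown here, such as `t:` for title and `b:` for body, is an assumption; check `search.py` for the exact syntax it accepts):

```
sachin tendulkar
t:world cup b:cricket
```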
You can run the search using

```bash
bash search.sh <index path> <query file path>
```