tehmas/document-retrieval-system

Document Retrieval System

Program for retrieving documents on the basis of a query

Author: Asad Raheem

Licensed under the GNU General Public License version 3.0

Installation

  • Python 2.7
  • nltk 3.2.1
  • beautifulsoup 4.3.2
  • numpy 1.9.2

Description

The system consists of three parts:

  • Tokenizer
  • Indexer
  • Index Reader

1. Tokenizer

Reads a document collection and produces documents containing indexable tokens. The tokenizer extracts text from HTML files and splits it into tokens. Stop words are removed, all tokens are converted to lower case (this is not always ideal and should be changed as needed before using the code), and Porter stemming is then applied.
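The token pipeline described above can be sketched as follows. This is a minimal illustration, not the code in tokenize_doc.py: the function name tokenize, the regular expression, and the tiny stop list are all assumptions, and HTML extraction is left out. The stemmer is passed in as a function so the sketch runs without NLTK.

```python
import re

# Tiny illustrative stop list (an assumption); a real run would load a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "to", "in"}

def tokenize(text, stem=lambda token: token):
    """Lowercase, split on non-alphanumerics, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The Cat and the Hat")
```

With NLTK installed, Porter stemming can be plugged in by passing stem=PorterStemmer().stem, where PorterStemmer is imported from nltk.stem.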

Outputs

  • docids.txt: Maps a document's file name to its document ID (DOCID).
  • termids.txt: Maps a token to its term ID (TERMID).
  • doc_index.txt: Forward index containing the positions of each term in each file.
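The three outputs above correspond to three in-memory mappings. The sketch below shows one way they could be built. The function name build_forward_index, the 1-based IDs, and the 1-based positions are assumptions, and the on-disk layout of each file is not reproduced here.

```python
def build_forward_index(docs):
    """docs: list of (file_name, token_list) pairs, in processing order."""
    doc_ids, term_ids, forward = {}, {}, {}
    for name, tokens in docs:
        # Assign the next free DOCID the first time a file name is seen.
        doc_id = doc_ids.setdefault(name, len(doc_ids) + 1)
        positions = forward.setdefault(doc_id, {})
        for pos, token in enumerate(tokens, start=1):
            # Assign the next free TERMID the first time a token is seen.
            term_id = term_ids.setdefault(token, len(term_ids) + 1)
            positions.setdefault(term_id, []).append(pos)
    return doc_ids, term_ids, forward

doc_ids, term_ids, forward = build_forward_index(
    [("a.html", ["cat", "hat", "cat"]), ("b.html", ["hat"])]
)
```

Here forward maps each DOCID to a dictionary from TERMID to the list of positions at which that term occurs in the document.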

Usage

In command prompt or terminal type: python tokenize_doc.py <directory_name>

2. Indexer

Reads a collection of tokenized documents and constructs an inverted index.

Outputs

  • term_index.txt: Inverted index containing the file position of each occurrence of each term in the collection. Each line contains the complete inverted list for a single term, i.e. a TERMID followed by a list of DOCID:POSITION values. Delta encoding is applied to each list.
  • term_info.txt: Gives the index reader fast access to the index. Each line contains a TERMID followed by the byte offset of its list in term_index.txt, the number of occurrences in the entire corpus, and the number of documents in which the term appears.
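Delta encoding stores gaps between successive values rather than the values themselves, which keeps the numbers in a sorted list small. The sketch below assumes one common convention: DOCIDs are encoded as the gap from the previous DOCID, and positions as the gap from the previous position in the same document, restarting at each new document. The convention used by invert_index.py may differ, and the function names are hypothetical.

```python
def delta_encode(postings):
    """postings: sorted list of (doc_id, position) pairs for one term."""
    encoded = []
    prev_doc, prev_pos = 0, 0
    for doc_id, pos in postings:
        if doc_id != prev_doc:
            prev_pos = 0  # positions restart in each new document
        encoded.append((doc_id - prev_doc, pos - prev_pos))
        prev_doc, prev_pos = doc_id, pos
    return encoded

def delta_decode(encoded):
    """Invert delta_encode by accumulating the gaps (DOCIDs assumed >= 1)."""
    postings = []
    doc_id, pos = 0, 0
    for doc_gap, pos_gap in encoded:
        if doc_gap != 0:
            pos = 0  # a non-zero document gap starts a new document
        doc_id += doc_gap
        pos += pos_gap
        postings.append((doc_id, pos))
    return postings

postings = [(1, 3), (1, 7), (4, 2), (4, 10)]
encoded = delta_encode(postings)
```

Under this convention, the list 1:3 1:7 4:2 4:10 would be written as 1:3 0:4 3:2 0:8.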

Usage

In command prompt or terminal type: python invert_index.py

3. Index Reader

Looks up a term's offset in term_info.txt and jumps straight to its list in term_index.txt.
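The offset lookup can be sketched as below: byte offsets are recorded while the index is written, and a reader later seeks directly to one line instead of scanning the file. The function names write_index and read_list, the sample lines, and the in-memory offsets dictionary (standing in for term_info.txt) are assumptions for illustration.

```python
import os
import tempfile

def write_index(lines, index_path):
    """Write one inverted list per line; return {term_id: byte offset}."""
    offsets = {}
    with open(index_path, "wb") as f:
        for term_id, line in lines:
            offsets[term_id] = f.tell()  # offset of the start of this line
            f.write((line + "\n").encode("ascii"))
    return offsets

def read_list(index_path, offset):
    """Seek directly to a term's list and read just that line."""
    with open(index_path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("ascii").rstrip("\n")

path = os.path.join(tempfile.mkdtemp(), "term_index.txt")
offsets = write_index([(1, "1 1:3 0:4"), (2, "2 2:5")], path)
second_list = read_list(path, offsets[2])
```

The file is opened in binary mode so that the recorded offsets are exact byte counts on every platform.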

Usage

In command prompt or terminal type:
  • python read_index.py --doc DOCNAME
  • python read_index.py --doc DOCNAME --term TERM
  • python read_index.py --term TERM --doc DOCNAME

About

A program to construct and read an inverted index.
