- Install Python 3.5+
- Install NLTK 3
- Open a terminal / command prompt and enter the following commands:
$ python
>>> import nltk
>>> nltk.download('stopwords')
To index data, run the index.py script and pass it the directory containing the documents and the directory for storing the indexed data:
$ python index.py --help
usage: index.py [-h] docs_path data_path
Index data for boolean retrieval
positional arguments:
docs_path Directory for documents to be indexed
data_path Directory for storing indexed data
optional arguments:
-h, --help show this help message and exit
$ python index.py ./docs ./data
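As a rough illustration of what indexing for boolean retrieval involves, the sketch below builds an inverted index that maps each term to the set of documents containing it. It is not the code in index.py: the names `tokenize` and `build_index`, the in-memory `docs` dictionary, and the tiny hard-coded stopword set (standing in for the NLTK stopwords corpus) are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Stand-in stopword set for illustration only; the project itself
# downloads the NLTK 'stopwords' corpus during installation.
STOPWORDS = {"a", "an", "and", "the", "of", "is", "are", "in", "to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_index(docs):
    """Map each term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return dict(index)


# Hypothetical documents, held in memory instead of being read from ./docs.
docs = {
    "a.txt": "Books are popular and available in the library.",
    "b.txt": "The festival is popular.",
}
index = build_index(docs)
```

Storing a set of documents per term keeps the later AND query a plain set intersection.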
After the data has been indexed successfully, run the query.py script to perform a query:
$ python query.py --help
usage: query.py [-h] query
Boolean query
positional arguments:
query words seperated by space
optional arguments:
-h, --help show this help message and exit
$ python query.py "popular available"
{'D:\\workspace\\boolean-retrieval-engine\\docs\\A Festival of Books.txt'}
When providing input to the query script, words must be separated by spaces. For example, the input "popular available" finds all documents that contain both popular AND available. The returned result is the set of documents that satisfy the query. Numbers, punctuation, and words that are not in the dictionary are ignored.
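The AND semantics described above can be sketched as a set intersection over an inverted index. This is a minimal illustration under stated assumptions, not the code in query.py: the function name `boolean_and` and the small hand-written `index` are hypothetical, and the sketch returns an empty set when no query word is known.

```python
import re


def boolean_and(index, query):
    """Return the documents that contain every known word in the query."""
    # Keep alphabetic tokens only, so numbers and punctuation are dropped.
    terms = re.findall(r"[a-z]+", query.lower())
    # Words that are not in the dictionary (the index) are ignored.
    known = [t for t in terms if t in index]
    if not known:
        return set()
    result = set(index[known[0]])
    for term in known[1:]:
        result &= index[term]  # AND corresponds to set intersection
    return result


# Hypothetical inverted index: term -> set of document names.
index = {
    "popular": {"a.txt", "b.txt"},
    "available": {"a.txt"},
}

matches = boolean_and(index, "popular available")
```

Here `matches` contains only `a.txt`, since it is the only document that holds both words.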