This notebook walks through using Whoosh to index and search the Stanford Large Movie Review Dataset (IMDb reviews).


Hat tip to Abhijeet Kumar for https://appliedmachinelearning.blog/2018/07/31/developing-a-fast-indexing-and-full-text-search-engine-with-whoosh-a-pure-python-library/

In [None]:
# install Whoosh if needed
# !pip install Whoosh

In [1]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2019-01-22 14:58:19--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2019-01-22 14:58:25 (13.5 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [2]:
!tar xzf aclImdb_v1.tar.gz

In [20]:
# remove meta files
!rm aclImdb/imdb.vocab
!rm aclImdb/imdbEr.txt
!rm aclImdb/README
!rm aclImdb/train/labeledBow.feat
!rm aclImdb/train/unsupBow.feat
!rm aclImdb/train/urls_neg.txt
!rm aclImdb/train/urls_pos.txt
!rm aclImdb/train/urls_unsup.txt
!rm aclImdb/test/labeledBow.feat
!rm aclImdb/test/urls_neg.txt
!rm aclImdb/test/urls_pos.txt

With the data ready to go, let's build an index.

In [32]:
# scrub the files on the fly
def get_cleaned_string(in_string):
    safechars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890 -./'
    cleaned_list = []
    for s in in_string:
        if s in safechars:
            cleaned_list.append(s)
        else:
            cleaned_list.append(' ')
    return ''.join(cleaned_list)
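
The cleaner above swaps every unsafe character for a space, so an HTML line break such as `<br />` survives in the indexed text as a stray `br /` token. The following is a minimal sketch, assuming you would rather drop those tags before cleaning. `strip_br_tags` and `clean_review` are hypothetical helper names that are not part of the original notebook, and the regular expression is an assumption about how line breaks appear in the review files.

```python
import re

SAFE_CHARS = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890 -./')

# Matches <br>, <br/> and <br /> in any letter case (assumed tag forms).
BR_TAG = re.compile(r'<br\s*/?>', re.IGNORECASE)

def strip_br_tags(in_string):
    """Replace HTML line-break tags with a single space."""
    return BR_TAG.sub(' ', in_string)

def clean_review(in_string):
    """Drop <br> tags, then apply the same safe-character filter as above."""
    without_tags = strip_br_tags(in_string)
    return ''.join(c if c in SAFE_CHARS else ' ' for c in without_tags)

sample = "Loved it!<br /><br />A must-see."
cleaned = clean_review(sample)
print(cleaned)
```

Whether to strip the tags is a judgment call: leaving them in means `br` becomes an indexed term that appears in many reviews.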


In [42]:
import glob
import os

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID

def createSearchableData(root):
    '''
    Schema definition: title (name of file), path (as ID), content (indexed
    but not stored), textdata (stored text content)
    '''
    schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
                    content=TEXT, textdata=TEXT(stored=True))
    os.makedirs("indexdir", exist_ok=True)

    # Create an index writer to add documents as per the schema
    ix = create_in("indexdir", schema)
    writer = ix.writer()

    for filename in glob.iglob(root + '/**/*.txt', recursive=True):
        # The review files are assumed to be UTF-8; newlines become spaces
        # so that words on adjacent lines are not fused together
        with open(filename, encoding='utf-8') as f:
            data = f.read().replace('\n', ' ')
            text = get_cleaned_string(data)
            writer.add_document(title=os.path.basename(f.name),
                                path=os.path.realpath(f.name),
                                content=text, textdata=text)
    writer.commit()
 

In [43]:
# remove the old index and rebuild for this data domain
!rm -rf indexdir
root = "aclImdb"
createSearchableData(root)
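
Because the glob pattern `root + '/**/*.txt'` picks up every `.txt` file under the root, it can be worth checking what it matches. This is also why the `urls_*.txt` meta files were removed earlier: they would otherwise be indexed as if they were reviews. Below is a small sketch using only the standard library; `count_text_files` is a hypothetical helper, and grouping by subdirectory is just one way to summarize the result.

```python
import glob
import os
from collections import Counter

def count_text_files(root):
    """Count .txt files under root, grouped by directory relative to root."""
    counts = Counter()
    pattern = os.path.join(root, '**', '*.txt')
    for filename in glob.iglob(pattern, recursive=True):
        rel_dir = os.path.relpath(os.path.dirname(filename), root)
        counts[rel_dir] += 1
    return counts

# Only report when the dataset directory from this notebook is present.
if os.path.isdir("aclImdb"):
    for folder, n in sorted(count_text_files("aclImdb").items()):
        print(folder, n)
```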

Now that we have built the index, let's query it.

In [50]:
from whoosh.qparser import QueryParser
from whoosh import scoring
from whoosh.index import open_dir
 
ix = open_dir("indexdir")
 
# Query string to search for
query_str = "James Bond"
# Maximum number of results to return
topN = 1

with ix.searcher(weighting=scoring.Frequency) as searcher:
    query = QueryParser("content", ix.schema).parse(query_str)
    results = searcher.search(query, limit=topN)

    # Iterate over the hits actually returned, which may be fewer than topN
    for hit in results:
        # hit['path'] is also stored if the full file path is needed
        print(hit['title'], str(hit.score), hit['textdata'])

42455_0.txt 36.0 If you consider yourself a James Bond fan and yet enjoyed this film  there is a problem. br /  br / Just like everyone else  when I first saw that Daniel Craig was to replace Pierce Brosnan in the role  I was a bit confused. His ice cold looks seemed to be quite a stretch from the image we have of James Bond. Maybe  they  know some things I don t about 007  maybe I ve been missing something about the character. Plus the hype around the production was excellent the rumor was that the filmmakers have decided to be more daring in many aspects. Nothing wrong with that  as a long as you know what you re doing. br /  br / But at the very first frame of the film my original skepticism re-emerged  br /  br / The opening scene happens in a sombre black and white cold war setting in which Bond makes no spectacular entrance  chatting with his enemy and finishing the mission with his fists inside a...dirty public restroom. Then Bond spins around  aiming his gun at the camera  taki
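
The query above uses `scoring.Frequency` instead of Whoosh's default BM25F weighting. As I understand it, that model scores a document by how often the query terms occur in it, and `QueryParser` requires all terms to be present by default. The sketch below is a simplified standard-library model of that idea, not Whoosh's implementation: `frequency_score` is a hypothetical function, and its tokenizer (lowercase, split on non-alphanumerics) only approximates Whoosh's analyzer.

```python
import re
from collections import Counter

def frequency_score(text, query_str):
    """Sum the occurrences of each query term in text (case-insensitive).

    Returns 0 unless every query term appears, mimicking an AND query.
    """
    tokens = Counter(re.findall(r'[a-z0-9]+', text.lower()))
    terms = re.findall(r'[a-z0-9]+', query_str.lower())
    if not terms or any(tokens[t] == 0 for t in terms):
        return 0
    return sum(tokens[t] for t in terms)

doc = "James Bond fans expect Bond to make a spectacular entrance."
print(frequency_score(doc, "James Bond"))
```

Under this model, long reviews that repeat the query terms rank highest, which is one reason a length-normalizing model such as BM25F is usually the default.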