# Whoosh

## Introduction

Whoosh is a fast search library implemented in pure Python. It is an alternative to PyLucene that does not require Java to be installed. It lets you index structured or free-form text and then run simple or complex queries to find matching documents, so it can be used to add search functionality to applications and websites. Whoosh is also highly extensible: indexing, query parsing, and the field types stored in the index can each be extended or replaced to meet your needs.


## Indexing

### Schema

To create an index with Whoosh, the first step is to define the schema of the index. The schema lists the fields in the index. A field is a piece of information for each document, such as its title or text content. A field can be indexed, meaning its value can be searched, and/or stored, meaning its value is returned with the results. A field that is stored but not indexed is returned with results but cannot be searched. Whoosh has several predefined field types, including TEXT (indexed), STORED (stored but not indexed), KEYWORD, ID, NUMERIC, DATETIME and BOOLEAN.

In [14]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED, NUMERIC
from whoosh.analysis import StemmingAnalyzer

schema = Schema(source=STORED,
                rating=NUMERIC,
                year=NUMERIC,
                title=ID(stored=True),
                moviereview=TEXT(stored=True, analyzer=StemmingAnalyzer())
               )

### Creating an index
Once the schema has been defined, the next step is to use it to create the index. This can be done with the index.create_in function or, as in the cell below, with the create_index method of a FileStorage object. The schema used to create the index is also stored with the index. Once the index is created, an IndexWriter can be used to add documents to it. The IndexWriter is obtained by calling the writer method on the index.

In [None]:
import os
from whoosh import index

if not os.path.exists("indexdirectory"):
    os.mkdir("indexdirectory")

from whoosh.filedb.filestore import FileStorage
storage = FileStorage("indexdirectory")

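# Create an index in the storage directory
ix = storage.create_index(schema)

# Reopen the index from disk (optional here, since create_index already
# returned it) and obtain a writer for it
ix = index.open_dir("indexdirectory")
writer = ix.writer()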

### Adding documents to the index
Once the writer for the index is obtained, new documents can be added with its add_document method. A document may use all of the fields in the schema or only a subset of them. The changes become visible to searchers after commit is called on the writer.

In [16]:
# Loading the data
import json
import urllib.request  # "import urllib" alone does not load the request submodule

with urllib.request.urlopen('https://raw.githubusercontent.com/lukhnos/lucenestudy/master/sample/acl-imdb-subset.json') as url:
    data = json.loads(url.read().decode())
    for review in data:
        writer.add_document(source=review["source"], rating=int(review["rating"]), year=int(review["year"]),
                            title=review["title"], moviereview=review["review"])
    writer.commit()

## Searching

### QueryParser
The query parser is used to parse query strings into query objects (whoosh.query module), using a query language similar to that of Lucene. The parser takes the name of the default field to search and the schema of the index. After that, the parse method can be used for parsing queries.

If the user specifies multiple words without an "AND" or "OR" between them, the parser treats the words as if they were connected with AND. The user can explicitly specify "OR" to match documents where either of the words is present.

In [32]:
from whoosh.qparser import QueryParser

parser = QueryParser("moviereview", schema=ix.schema)

q = parser.parse("war")
print("Single word query - ", q)

# Query with multiple words
q2 = parser.parse("crime action")
print("Multiple word query - ", q2)

# Query with multiple words, explicit AND
q3 = parser.parse("crime AND action")
print("Multiple word query explicitly connected with AND - ", q3)

# Query with multiple words, with OR 
q4 = parser.parse("crime OR action")
print("Multiple word query connected with OR - ", q4)

Single word query -  moviereview:war
Multiple word query -  (moviereview:crime AND moviereview:action)
Multiple word query explicitly connected with AND -  (moviereview:crime AND moviereview:action)
Multiple word query connected with OR -  (moviereview:crime OR moviereview:action)


The whoosh.qparser module provides additional classes that can be used to modify query parsing or the query language syntax. Some of the features available are:
- **whoosh.qparser.MultifieldParser()** allows users to search across multiple fields in the index instead of a single field.
- **whoosh.qparser.FuzzyTermPlugin** can be added to a parser to support fuzzy terms (written with a trailing ~), which can be used for catching misspellings and similar words. The fuzzy term will match any similar term within a certain number of "edits".
- **Phrase Querying** - The default query parser also supports phrase queries, written as multiple words inside double quotes. The parser tokenizes the text between the quotes and searches for these terms in close proximity.
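
To make the notion of "edits" concrete, here is a small pure-Python sketch of an edit distance that counts insertions, deletions, substitutions and swaps of adjacent characters. It is an illustration of the idea only and is not Whoosh's implementation; the function names and the tiny vocabulary are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of single-character edits (insert, delete, substitute,
    or swap two adjacent characters) needed to turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            # swap of two adjacent characters
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def fuzzy_matches(term, vocabulary, max_edits=1):
    """Words from the vocabulary within max_edits of the term."""
    return sorted(w for w in vocabulary if edit_distance(term, w) <= max_edits)


vocabulary = ["action", "apocalypse", "apocalyptic", "epilepsy"]
matches = fuzzy_matches("apocalyps", vocabulary)
print(matches)
```

A fuzzy term plays the role of `term` here, and the indexed terms of the field play the role of `vocabulary`.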


In [59]:
#Multiple field query
from whoosh.qparser import MultifieldParser

mparser = MultifieldParser(["title", "moviereview"], schema=ix.schema)
multifieldQuery = mparser.parse("war")
print("Multiple field query - ", multifieldQuery, '\n')

#Fuzzy queries
from whoosh.qparser import QueryParser, FuzzyTermPlugin

fuzzyparser = QueryParser("moviereview", schema=ix.schema)
fuzzyparser.add_plugin(FuzzyTermPlugin()) # Add fuzzy term plugin to parser
fuzzyquery = fuzzyparser.parse("apocalyps~")
print("Fuzzy Term Query ", fuzzyquery)

with ix.searcher() as searcher:
    results = searcher.search(fuzzyquery)
    print(results[0], '\n') # Searching for apocalyps~
    
#Phrase Query (the double quotes inside the string make this a phrase query)
phraseQuery = parser.parse('"Attack of the Fifty Foot Woman"')
print("Phrase Query ", phraseQuery)
with ix.searcher() as searcher:
    results = searcher.search(phraseQuery)
    print(results[0], '\n')


Multiple field query -  (title:war OR moviereview:war) 

Fuzzy Term Query  moviereview:apocalyp~
<Hit {'moviereview': "Breathtaking at it's best, intriguing at it's worst, Francis Ford Coppala's groundbreaking epic 'Apocalypse Now' is one of the most iconic and celebrated motion pictures of the 20th century, and in my opinion, the greatest ever film depiction centered around America's involvement in Vietnam.\n\nWhat I like most about 'Apocalypse Now' is that it is uniquely different from any other films of the same genre. Growing up as movie buff, and with a particular interest in war films, I've seen many films, which have attempted to portray the 'images' and 'feelings' of Vietnam but have been unsuccessful in doing so. Films such as 'Hamburger Hill' and 'We were soldiers' fall into the category of trying to capture the atmosphere of Vietnam by depicting 'heroic battles' which are, more often than not, tainted by the zeal of Hollywood film production.\n\nIn 'Apocalypse now' there are

### Searcher
Once the index is created, we can search for documents by issuing queries to it. The library provides a searcher object for issuing queries to the index and obtaining results. The searcher object also provides additional information about the index, such as the number of documents in the index, the document numbers that can be used later for deleting or updating documents, and the postings lists.

The main method for searching is the search() method, which takes a query object and returns a results object. By default at most 10 hits are returned; passing a limit parameter to the search method changes this, and a smaller limit can speed up the query. The searcher also includes a **search_page** method that allows getting the results page by page. The default page length is 10 hits. You can use the pagelen keyword argument to set a different page length.

In [73]:
with ix.searcher() as searcher:
    q = parser.parse("action")
    docCount = searcher.doc_count()
    print("Documents in index - ", docCount, '\n')
    results = searcher.search(q)
    print('Number of matches found - ', len(results))
    print(results[0], '\n')
    
    # Searching by page 
    resultsPage1 = searcher.search_page(q,1)
    print("Page 1 top result")
    print(resultsPage1[0], '\n')
    resultsPage2 = searcher.search_page(q,2)
    print("Page 2 top result")
    print(resultsPage2[0], '\n')
    

Documents in index -  1000 

Number of matches found -  109
<Hit {'moviereview': "I have so much hope for the sequel to Gen-X. Luckily, my hopes have came true. You got a whole bunch of action, comedy...silly comedy, and surprises. I think the newcomer Edison, is really a hit in the movie, but I really find Sam's 'Alien' stupidly annoying with English. Although the movie had some flaws with the robot graphics and the silly dialogue, the action always keeps it strong. The action set-up is much stronger than the 1st.\n\nThis movie is getting more of an American feel since 60% of the movie is in English from the Cantonese. This movie will not disappoint you. I recommended this for young 'uns that care about pure action-packed fun.\n\n", 'source': 'http://www.imdb.com/title/tt0251094/usercomments', 'title': 'Te jing xin ren lei 2'}> 

Page 1 top result
<Hit {'moviereview': "I have so much hope for the sequel to Gen-X. Luckily, my hopes have came true. You got a whole bunch of action, comed

### Results
The results object returned by the search method can be used to access the stored fields of each matching document for display to the user. The number of scored results is capped by the limit, so it can be smaller than the number of matching documents. However, calling len(results) runs a fast, unscored version of the query again to find the total number of matching documents.

In [85]:
with ix.searcher() as searcher:
    q = mparser.parse("apocalypse now")
    print(q,  '\n')
    results = searcher.search(q)
    topResult = results[0]
    print("Movie title - ", topResult["title"], '\n')
    print("Movie Review - ", topResult["moviereview"], '\n')


((title:apocalypse OR moviereview:apocalyps) AND (title:now OR moviereview:now)) 

Movie title -  Apocalypse Now 

Movie Review -  Breathtaking at it's best, intriguing at it's worst, Francis Ford Coppala's groundbreaking epic 'Apocalypse Now' is one of the most iconic and celebrated motion pictures of the 20th century, and in my opinion, the greatest ever film depiction centered around America's involvement in Vietnam.

What I like most about 'Apocalypse Now' is that it is uniquely different from any other films of the same genre. Growing up as movie buff, and with a particular interest in war films, I've seen many films, which have attempted to portray the 'images' and 'feelings' of Vietnam but have been unsuccessful in doing so. Films such as 'Hamburger Hill' and 'We were soldiers' fall into the category of trying to capture the atmosphere of Vietnam by depicting 'heroic battles' which are, more often than not, tainted by the zeal of Hollywood film production.

In 'Apocalypse now' t

### Scoring
The list of returned results is sorted by the score of each document against the query. The default scoring algorithm is **BM25F**. The **whoosh.scoring** module contains various scoring algorithms, which can be passed as the weighting argument when creating the searcher object to change the scoring behavior.
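
For intuition about how the BM25 family of scoring functions behaves, here is the textbook per-term BM25 score as a plain Python function. It is a sketch under stated assumptions: the parameter values k1=1.2 and b=0.75 are common textbook defaults, and Whoosh's own implementation may differ in details such as the exact IDF formula or per-field weighting.

```python
import math


def idf(doc_freq, doc_count):
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, doc_count,
                    k1=1.2, b=0.75):
    """Score contribution of one query term for one document.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document in terms
    avg_doc_len -- average document length in the collection
    """
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf(doc_freq, doc_count) * tf * (k1 + 1) / (tf + k1 * length_norm)


# A term found in 10 of 1000 documents, in an average-length document
once = bm25_term_score(1, 100, 100, 10, 1000)
five_times = bm25_term_score(5, 100, 100, 10, 1000)
fifty_times = bm25_term_score(50, 100, 100, 10, 1000)
# Same single occurrence, but in a document four times longer than average
once_in_long_doc = bm25_term_score(1, 400, 100, 10, 1000)

print(once, five_times, fifty_times, once_in_long_doc)
```

Repeated occurrences raise the score with diminishing returns, and a match in a long document counts for less than the same match in a short one.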

In [87]:
from whoosh import scoring

# Change scoring to TF_IDF
with ix.searcher(weighting=scoring.TF_IDF()) as searcher:
    q = mparser.parse("apocalypse now")
    results = searcher.search(q)
    print(results[0])

<Hit {'moviereview': "Breathtaking at it's best, intriguing at it's worst, Francis Ford Coppala's groundbreaking epic 'Apocalypse Now' is one of the most iconic and celebrated motion pictures of the 20th century, and in my opinion, the greatest ever film depiction centered around America's involvement in Vietnam.\n\nWhat I like most about 'Apocalypse Now' is that it is uniquely different from any other films of the same genre. Growing up as movie buff, and with a particular interest in war films, I've seen many films, which have attempted to portray the 'images' and 'feelings' of Vietnam but have been unsuccessful in doing so. Films such as 'Hamburger Hill' and 'We were soldiers' fall into the category of trying to capture the atmosphere of Vietnam by depicting 'heroic battles' which are, more often than not, tainted by the zeal of Hollywood film production.\n\nIn 'Apocalypse now' there are no battles, no heroes or villains, there is nothing in the film that suggests that it is intende

### Highlighting
The **highlights** method on a whoosh.searching.Hit object returns excerpts from the document with the matched search terms highlighted. It takes one required argument, the name of the field to excerpt. The field must be stored for its text to be available.

In [118]:
from IPython.display import display, HTML
with ix.searcher() as searcher:
    results = searcher.search(q)
    display(HTML(results[0].highlights("moviereview")))
        
    actionQuery = mparser.parse("action")    
    results = searcher.search(actionQuery)
    display(HTML(results[0].highlights("moviereview")))


### Correction
Whoosh can also suggest corrections for mistyped words by returning words that are close to the mistyped word. The **whoosh.spelling.Corrector.suggest()** method is used for this. Suggestions can be drawn from the indexed terms of a field, which tailors them to the content of the documents, or from a fixed word list.

In [126]:
# Create a corrector from field moviereview
with ix.searcher() as searcher:
    corrector = searcher.corrector("moviereview")
    mistyped_words = ["docter", "bugdet", "posiedon"]
    for word in mistyped_words:
        print(word, " -> ",corrector.suggest(word, limit=1))

docter  ->  ['doctor']
bugdet  ->  ['budget']
posiedon  ->  ['poseidon']
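
The fixed-word-list variant can be illustrated without an index. The sketch below deliberately uses Python's standard `difflib` module as a stand-in, not Whoosh's corrector classes, so its similarity measure differs from the edit-distance approach above; the word list and the `suggest` helper are made up for the example.

```python
import difflib

# Hypothetical fixed word list to draw suggestions from
word_list = ["action", "budget", "director", "doctor", "poseidon"]


def suggest(word, words=word_list, limit=1):
    """Return up to `limit` words from the list that resemble `word`."""
    return difflib.get_close_matches(word, words, n=limit, cutoff=0.6)


suggestions = {w: suggest(w) for w in ["docter", "bugdet", "posiedon"]}
for mistyped, candidates in suggestions.items():
    print(mistyped, " -> ", candidates)
```

A fixed list gives the same suggestions regardless of what has been indexed, whereas a corrector built from a field only suggests words that actually occur in the documents.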


## Summary
- Whoosh is a pure-Python search library that can be used to add search functionality to applications.
- Whoosh can be used as an alternative to PyLucene where possible: it integrates more easily with Python code, avoids the external dependency on Java and the JVM, and is quicker to customize if code changes are needed to add functionality.
- Whoosh offers a wide range of searching and indexing features, including word corrections, fuzzy-term queries, phrase queries, and highlighting of matched text.
- Whoosh is highly customizable: query parsing, scoring, and indexing can each be extended or replaced with custom implementations.
- On the downside, being pure Python adds a performance overhead compared with other libraries. PyLucene is also built on Lucene, which has a large developer community and a richer feature list than Whoosh.

## References
- [Whoosh Docs](https://whoosh.readthedocs.io/en/latest)
- [Data for indexing](https://github.com/lukhnos/lucenestudy)
- [Whoosh example](https://appliedmachinelearning.blog/2018/07/31/developing-a-fast-indexing-and-full-text-search-engine-with-whoosh-a-pure-python-library/)