## Getting Started
Let's put what we've learned about Lucene thus far into practice. We're going to code our very own full-text search engine by developing our own analyzer, inverted index, queries and relevance scorer.

## Installing Pre-Requisites
During the analysis phase we will need to stem our tokens so that different variations of a word are reduced to a common root, e.g. `brewery, breweries, brewing --> brew`. Stemming strips the suffix from each word, so only the root of the token is stored in our inverted index.

In [1]:
!pip install PyStemmer



## Building the Analyzer
Every sequence of text that will be indexed first needs to be analyzed. If you recall from the GitHub repository, an analyzer is just a combination of character filter(s), tokenizer(s) and token filter(s).

In [2]:
import Stemmer
import re
import string

In [3]:
# the tokenize function takes our sequence of text and splits it on whitespace to give us tokens
def tokenize(text):
    return text.split()

# the lowercase filter is responsible for converting all of our tokens into lowercase
def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

# the punctuation filter is responsible for stripping any punctuation from our tokens
def punctuation_filter(tokens):
    PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))
    return [PUNCTUATION.sub('', token) for token in tokens]

# the stem filter is responsible for stemming our tokens (as described in the Installing Pre-Requisites section)
def stem_filter(tokens):
    STEMMER = Stemmer.Stemmer('english')
    return STEMMER.stemWords(tokens)

# the stopword filter removes common stopwords that can skew our search scoring and bloat the index
# note: the stopwords must be lowercase because this filter runs after the lowercase filter
def stopword_filter(tokens):
    STOPWORDS = set(['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
                     'i', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
                     'do', 'at', 'this', 'but', 'his', 'by', 'from', 'wikipedia'])
    return [token for token in tokens if token not in STOPWORDS]

In [4]:
# The analyze function chains the tokenizer and the token filters together and runs them in order.
def analyze(text):
    tokens = tokenize(text)
    tokens = lowercase_filter(tokens)
    tokens = punctuation_filter(tokens)
    tokens = stopword_filter(tokens)
    tokens = stem_filter(tokens)

    # drop any tokens left empty by the filters (e.g. a token that was only punctuation)
    return [token for token in tokens if token]
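
The analyzer above chains a tokenizer and token filters, but it does not include a character filter. As a minimal sketch, assuming the raw text might contain HTML tags and entities (an assumption; the movie titles may not), a character filter could clean the string before it is tokenized. The names `html_strip_filter` and `simple_analyze` are hypothetical, and `simple_analyze` is a simplified stand-in for `analyze` so that the block runs on its own.

```python
import html
import re

# Hypothetical character filter: it runs on the raw string, before tokenization.
TAG_PATTERN = re.compile(r'<[^>]+>')

def html_strip_filter(text):
    # Replace tags with a space so adjacent words are not glued together,
    # then decode entities such as &amp; into the characters they stand for.
    return html.unescape(TAG_PATTERN.sub(' ', text))

def simple_analyze(text):
    # Simplified stand-in for analyze(): character filter, tokenizer, lowercase filter.
    text = html_strip_filter(text)
    return [token.lower() for token in text.split()]

print(simple_analyze('<b>Forrest</b> Gump &amp; Friends'))
```

In the full analyzer, the punctuation filter would then remove the leftover `&` token.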

## Testing the Analyzer we've built
Let's run a sample sequence of text through the analyzer we've built.

In [5]:
analyze("The quick brown fox jumps over the lazy dog.")

['quick', 'brown', 'fox', 'jump', 'over', 'lazi', 'dog']

## Indexing a dataset
Now that we have a working analyzer, the next step is to analyze some real data. Let's import a movies dataset from a JSON file and then write the index function needed to analyze and index the movie titles.

In order to work with JSON files, we'll first need to import the `json` module from the Python standard library.

In [6]:
import json

In [7]:
# load the movies collection as a list of dictionaries (one dictionary per movie)
filename = 'data/movies.json'
with open(filename, 'r') as f:
    documents = json.load(f)

# The index function builds the inverted index: a dictionary that maps each analyzed token (the key)
# to the set of object_ids of the movies whose title contains that token (the value).

def index():
    inverted_index = {}
    # for each movie, analyze the title and add the movie's ID to the set stored under each token
    for document in documents:
        for token in analyze(document['title']):
            if token not in inverted_index:
                inverted_index[token] = set()
            inverted_index[token].add(document['_id']['$oid'])

    return inverted_index
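
To make the shape of the inverted index concrete, here is a minimal sketch on a made-up three-movie collection. The documents, the integer ids and the `build_index` helper are assumptions for illustration only, and the analysis is reduced to lowercasing and splitting on whitespace so the block runs without the stemmer.

```python
from collections import defaultdict

# Toy documents (made up for illustration; not taken from movies.json).
toy_documents = [
    {'id': 1, 'title': 'Finding Nemo'},
    {'id': 2, 'title': 'Finding Dory'},
    {'id': 3, 'title': 'Forrest Gump'},
]

def build_index(docs):
    # Map each lowercased title token to the set of document ids containing it.
    inverted = defaultdict(set)
    for doc in docs:
        for token in doc['title'].lower().split():
            inverted[token].add(doc['id'])
    return dict(inverted)

toy_index = build_index(toy_documents)
print(toy_index['finding'])  # ids of both "Finding ..." titles
print(toy_index['gump'])     # id of the single title containing "gump"
```

Each key is a token and each value is a set of ids, so looking up a query token is a single dictionary access.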

## Search
Now we need to define how to search our dataset using the analyzer and inverted index we've built.

In [8]:
# The search function takes a query, analyzes it with the analyzer defined above,
# and retrieves the object_ids in our index that match the query tokens. From there we
# look up all the movies in our dataset that we've matched.
def search(query):
    # analyze the query
    analyzed_query = analyze(query)
    # build the inverted index once per search instead of once per token
    inverted_index = index()
    # grab the sets of movie IDs from the index that match the tokens from the query
    results = [inverted_index.get(token, set()) for token in analyzed_query]

    resulting_documents = []

    # collect the IDs of every movie that matches at least one query token
    ids = set()
    for result in results:
        for single_id in result:
            ids.add(single_id)

    # return all movies where an analyzed query token matches an analyzed title token
    for single_id in ids:
        for document in documents:
            if document['_id']['$oid'] == single_id:
                resulting_documents.append(document)
    return resulting_documents
    
search("forrest gump")

[{'_id': {'$oid': '573a1399f29313caabcee607'},
  'fullplot': "Forrest Gump is a simple man with a low I.Q. but good intentions. He is running through childhood with his best and only friend Jenny. His 'mama' teaches him the ways of life and leaves him to choose his destiny. Forrest joins the army for service in Vietnam, finding new friends called Dan and Bubba, he wins medals, creates a famous shrimp fishing fleet, inspires people to jog, starts a ping-pong craze, create the smiley, write bumper stickers and songs, donating to people and meeting the president several times. However, this is all irrelevant to Forrest who can only think of his childhood sweetheart Jenny Curran. Who has messed up her life. Although in the end all he wants to prove is that anyone can love anyone.",
  'imdb': {'rating': 8.8, 'votes': 1087227, 'id': 109830},
  'year': 1994,
  'plot': 'Forrest Gump, while not intelligent, has accidentally been present at many historic moments, but his true love, Jenny Curran,
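
The search function above takes the union of the ID sets, so a movie matches if its title contains any of the query tokens (OR semantics). As a minimal sketch of the alternative, the block below contrasts that with AND semantics using a set intersection. The toy index, its ids and the helper names `match_any` and `match_all` are hypothetical.

```python
# Toy inverted index (hypothetical ids, for illustration only).
toy_index = {
    'forrest': {'a', 'b'},
    'gump': {'a'},
}

def match_any(tokens, inverted):
    # OR semantics: union of the ID sets, as in search() above.
    postings = [inverted.get(token, set()) for token in tokens]
    return set().union(*postings)

def match_all(tokens, inverted):
    # AND semantics: intersection of the ID sets.
    postings = [inverted.get(token, set()) for token in tokens]
    return set.intersection(*postings) if postings else set()

print(match_any(['forrest', 'gump'], toy_index))
print(match_all(['forrest', 'gump'], toy_index))
```

With AND semantics, a single query token that is missing from the index empties the result, which is one reason OR matching combined with scoring is a common default.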

## Scoring 
Recall that in order to score documents we need to calculate the term frequency and the inverse document frequency (TF-IDF). Typically these live as separate functions within an index class; for simplicity, however, we're going to include the calculations within our search function itself.
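
As a worked sketch of the calculation, the block below uses the same formulation as the search function in this section (a raw substring count for `tf` and `log10(N / df)` for `idf`) on a made-up four-title corpus. The titles and the helper names `tf`, `idf` and `score` are assumptions for illustration; here the document frequency is also computed by substring match rather than from the inverted index.

```python
import math

# Made-up corpus for illustration; titles are already lowercased.
titles = ['forrest gump', 'finding forrester', 'the matrix', 'toy story']

def tf(term, title):
    # Term frequency: how many times the term occurs in the title
    # (a substring count, mirroring the use of str.count in the notebook).
    return title.count(term)

def idf(term, corpus):
    # Inverse document frequency: log10(N / number of titles containing the term).
    df = sum(1 for title in corpus if term in title)
    return math.log10(len(corpus) / df) if df else 0.0

def score(query_terms, title, corpus):
    # Sum tf * idf over every term in the query.
    return sum(tf(t, title) * idf(t, corpus) for t in query_terms)

for title in titles:
    print(title, round(score(['forrest', 'gump'], title, titles), 3))
```

The rarer term (`gump`, found in one title) contributes more to the score than the more common one (`forrest`, found in two), which is the point of the inverse document frequency.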

Note the `tf`, `idf` and `score` variables in the updated search function below.

Note: be sure to check the comments in the code as well.

In [9]:
# we will need math to do the idf calculation
import math

In [10]:
# This version of the search function analyzes the query and looks up the matching object_ids
# as before, but it also computes a TF-IDF score for every matching movie
# and returns the matches sorted from highest to lowest score.
def search(query):
    # analyze the query
    analyzed_query = analyze(query)
    # build the inverted index once per search instead of once per lookup
    inverted_index = index()
    # grab the sets of movie IDs from the index that match the tokens from the query
    results = [inverted_index.get(token, set()) for token in analyzed_query]

    resulting_documents = []

    ids = set()
    for result in results:
        for single_id in result:
            ids.add(single_id)

    # score every movie where an analyzed query token matches an analyzed title token
    for single_id in ids:
        for document in documents:
            if document['_id']['$oid'] == single_id:
                score = 0.0
                for token in analyzed_query:
                    postings = inverted_index.get(token)
                    # a query token that is not in the index cannot contribute to the score
                    if not postings:
                        continue
                    # normally you would want to analyze the title, but for simplicity we just lowercase it.
                    # the stemmed token is usually a substring of the title when there's a match,
                    # though not always (e.g. 'lazy' stems to 'lazi')
                    tf = document['title'].lower().count(token)
                    idf = math.log10(len(documents) / len(postings))
                    score += tf * idf
                resulting_documents.append((document, score))

    return sorted(resulting_documents, key=lambda doc: doc[1], reverse=True)
    
search("forrest gump")

[({'_id': {'$oid': '573a1399f29313caabcee607'},
   'fullplot': "Forrest Gump is a simple man with a low I.Q. but good intentions. He is running through childhood with his best and only friend Jenny. His 'mama' teaches him the ways of life and leaves him to choose his destiny. Forrest joins the army for service in Vietnam, finding new friends called Dan and Bubba, he wins medals, creates a famous shrimp fishing fleet, inspires people to jog, starts a ping-pong craze, create the smiley, write bumper stickers and songs, donating to people and meeting the president several times. However, this is all irrelevant to Forrest who can only think of his childhood sweetheart Jenny Curran. Who has messed up her life. Although in the end all he wants to prove is that anyone can love anyone.",
   'imdb': {'rating': 8.8, 'votes': 1087227, 'id': 109830},
   'year': 1994,
   'plot': 'Forrest Gump, while not intelligent, has accidentally been present at many historic moments, but his true love, Jenny Cu

## Output
The output will be a list of tuples where the first element of each tuple is the full document and the second element is its TF-IDF score.

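As you can see, Forrest Gump returns the highest TF-IDF score and Finding Forrester returns the second highest. Forrest Gump matches both query tokens, while Finding Forrester matches only `forrest`, so its score is that single token's contribution.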

"Forrest Gump" : 8.343408593803858
"Finding Forrester" :  4.021189299069938
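
Because each full document is long, it can help to print only the title and score of each hit. The sketch below assumes the list-of-tuples shape described above; the `results` list is a stand-in built from the two titles and scores reported here (all other document fields are omitted), and the `summarize` helper is hypothetical.

```python
# Stand-in for the return value of search(): a list of (document, score) tuples.
# Only the titles and scores reported above are used; other fields are omitted.
results = [
    ({'title': 'Forrest Gump'}, 8.343408593803858),
    ({'title': 'Finding Forrester'}, 4.021189299069938),
]

def summarize(results):
    # Keep just the title and a rounded score for each hit, preserving the ranking order.
    return [(doc['title'], round(score, 3)) for doc, score in results]

for title, score in summarize(results):
    print(f'{title}: {score}')
```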