# Boolean Model from Scratch

### The Boolean model is used in information retrieval systems to retrieve relevant documents from a corpus. A query is fed into the model, which evaluates it using set operations and returns the relevant documents.
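
To make the set-operation idea concrete, here is a minimal sketch that is separate from the notebook's own functions. The `postings` dictionary, its terms, and the document IDs are hypothetical and chosen only for illustration.

```python
# Hypothetical postings: each term maps to the set of document IDs containing it.
postings = {
    "cricket": {1, 3},
    "india": {1, 2, 3},
    "army": {2},
}
all_docs = {1, 2, 3}

and_result = postings["cricket"] & postings["india"]  # AND -> set intersection
or_result = postings["cricket"] | postings["army"]    # OR  -> set union
not_result = all_docs - postings["army"]              # NOT -> complement

print(sorted(and_result))
print(sorted(or_result))
print(sorted(not_result))
```

The rest of this notebook represents each term as a 0/1 vector over the documents instead of a set of IDs; the two representations express the same operations.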

The *make_doc()* function initializes the corpus of the Boolean model. Six documents are used here for demonstration. <br>
doc 1: MS Dhoni <br>
doc 2: Persistent Systems <br>
doc 3: Indian Army <br>
doc 4: Question Answering Systems <br>
doc 5: GATE <br>
doc 6: Internals Best Talk Show <br>

In [3]:
def make_doc():
    doc1 = "MS Dhoni is former Indian Cricketer and plays in IPL"
    doc2 = "Persistent systems is the only software company which comes for placements"
    doc3 = "Personnel who serve in the Para (SF) are allowed to wear the Balidan (Sacrifice) patch on their right pocket"
    doc4 = "Question Answering System can pull of answers from unstructured collection of natural language"
    doc5 = "It is an examination that primarily tests the comprehensive understanding of various undergraduate subjects"
    doc6 = "Internals is the best talk show youtube have ever hosted"
    return (doc1, doc2, doc3, doc4, doc5, doc6)

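Import the libraries required for preprocessing the documents. The NLTK stopword list must be available locally (for example via `nltk.download('stopwords')`).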

In [4]:
import re
import string
import numpy as np
import pandas as pd
from collections import deque
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import SpaceTokenizer
from nltk.tokenize import TweetTokenizer

The *process_doc()* function reduces a document to a list of words: it tokenizes the text, removes stopwords and punctuation, and stems each remaining word.

In [5]:
def process_doc(doc):
    """Process doc function.
    Input:
        doc: a string containing the document text
    Output:
        doc_clean: a list of words containing the processed doc

    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    doc_tokens = tokenizer.tokenize(doc)

    doc_clean = []
    for word in doc_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # doc_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            doc_clean.append(stem_word)

    return doc_clean
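
As a rough illustration of what the preprocessing does, the following sketch reproduces the tokenize-and-filter steps with the standard library only. It is not the notebook's implementation: `toy_process_doc` and `TOY_STOPWORDS` are hypothetical names, the stopword list is a tiny stand-in for NLTK's, a regular expression replaces the tweet tokenizer, and stemming is skipped entirely.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration; NLTK's English list is much longer.
TOY_STOPWORDS = {"is", "and", "in", "the", "a", "an"}

def toy_process_doc(doc):
    """Lowercase, tokenize, and drop stopwords and punctuation (no stemming)."""
    # Split into runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", doc.lower())
    return [t for t in tokens
            if t not in TOY_STOPWORDS and t not in string.punctuation]

tokens = toy_process_doc("MS Dhoni is former Indian Cricketer and plays in IPL")
print(tokens)
```

With the real *process_doc()*, each surviving token would additionally be passed through the Porter stemmer.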

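The *process_query()* function returns the list of words in the query after preprocessing. Unlike *process_doc()*, it does not remove stopwords, because the operators 'and', 'or', and 'not' are themselves English stopwords and must survive preprocessing.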

In [6]:
def process_query(query):
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    doc_tokens = tokenizer.tokenize(query)

    doc_clean = []
    for word in doc_tokens:
        if (word not in string.punctuation):  # remove punctuation
            # doc_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            doc_clean.append(stem_word)

    return doc_clean

The *build_bag()* function collects the list of all unique words across all documents.

* It returns the list of words present in the corpus, the original documents, and the preprocessed documents
* These returned values are then used to build the dataframe of the corpus

In [7]:
def build_bag():
    word_bag = []    # unique processed words, in first-seen order
    docs_clean = []  # one list of processed words per document
    docs = make_doc()
    for doc in docs:
        doc = process_doc(doc)
        for word in doc:
            if word not in word_bag:
                word_bag.append(word)
        docs_clean.append(doc)
    return word_bag, docs, docs_clean

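The *make_data()* function is the driver for building the dataframe of the corpus: for each document it produces a list of 0/1 flags marking which words of the bag the document contains.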

In [8]:
def make_data():
    data = {}
    word_bag, docs, docs_clean = build_bag()
    for i in range(len(make_doc())):
        data[i] = []
        for word in word_bag:
            if word in docs_clean[i]:
                data[i].append(1)
            else:
                data[i].append(0)
    return data, word_bag

In [10]:
data, labels = make_data()
df = pd.DataFrame(data = data, index = labels)
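
The resulting dataframe is a term-document incidence matrix: one row per word, one column per document. The sketch below builds the same structure with plain dictionaries so the layout is easy to inspect. The `toy_docs` token lists are hypothetical and assumed to be already preprocessed.

```python
# Hypothetical, already-preprocessed documents (lists of stemmed tokens).
toy_docs = [
    ["ms", "dhoni", "cricket"],
    ["indian", "armi"],
    ["indian", "cricket"],
]

# Vocabulary in first-seen order, mirroring what build_bag() does.
vocab = []
for doc in toy_docs:
    for word in doc:
        if word not in vocab:
            vocab.append(word)

# One row per term, one entry per document: 1 if the term occurs, else 0.
incidence = {word: [int(word in doc) for doc in toy_docs] for word in vocab}

for word, row in incidence.items():
    print(f"{word:8s} {row}")
```

Looking up a row by word, as `df.loc[word]` does in the notebook, gives the vector of documents that contain that word.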

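The *predict()* function works together with the *boolean_model()* function to evaluate the **infix** query. Operators are applied strictly left to right, with no precedence rules or parentheses. Every query term must occur in the corpus vocabulary; looking up an unknown term with `df.loc` raises a `KeyError`.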

In [11]:
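def predict(query, df):
    query = process_query(query)
    stack = deque()
    i = 0
    while i < len(query):
        word = query[i]
        if word == 'not':
            # unary NOT: negate the term that follows
            op1 = list(df.loc[query[i + 1]] > 0)
            stack.append(boolean_model(op1, word))
            i += 1  # skip the term just consumed
        elif word in ['or', 'and']:
            # binary operator: combine the running result with the next operand
            op1 = stack.pop()
            if query[i + 1] == 'not':
                # right-hand operand is negated, e.g. "ms and not dhoni"
                op2 = boolean_model(list(df.loc[query[i + 2]] > 0), 'not')
                i += 1  # skip 'not'
            else:
                op2 = list(df.loc[query[i + 1]] > 0)
            stack.append(boolean_model(op1, word, op2))
            i += 1  # skip the term just consumed
        else:
            # plain term: look up its row in the term-document dataframe
            stack.append(list(df.loc[word] > 0))
        i += 1
    output = stack.pop()
    # translate the boolean vector into 1-based document names
    return ['doc ' + str(i + 1) for i, hit in enumerate(output) if hit]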
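
Because the query is evaluated left to right, a mixed query such as `a or b and c` is treated as `(a or b) and c`, which differs from the conventional precedence where AND binds tighter than OR. The following standalone sketch illustrates this behaviour on hypothetical incidence rows; `eval_left_to_right` and `rows` are illustrative names, not part of the notebook.

```python
# Hypothetical incidence rows for three terms over four documents.
rows = {
    "a": [True, False, False, True],
    "b": [False, True, False, True],
    "c": [False, False, True, True],
}

def eval_left_to_right(tokens, rows):
    """Evaluate an infix Boolean query strictly left to right (no precedence)."""
    def operand(pos):
        # Read one operand starting at pos, honouring an optional leading 'not'.
        if tokens[pos] == "not":
            return [not v for v in rows[tokens[pos + 1]]], pos + 2
        return list(rows[tokens[pos]]), pos + 1

    result, pos = operand(0)
    while pos < len(tokens):
        op = tokens[pos]
        right, pos = operand(pos + 1)
        if op == "and":
            result = [x and y for x, y in zip(result, right)]
        else:  # "or"
            result = [x or y for x, y in zip(result, right)]
    return result

mixed = eval_left_to_right("a or b and c".split(), rows)
negated = eval_left_to_right("a and not b".split(), rows)
print(mixed)
print(negated)
```

Here `mixed` matches only the fourth document, whereas `a or (b and c)` would also match the first. Supporting precedence or parentheses would need a proper expression parser, which is beyond this notebook.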

The *boolean_model()* function is the model function that performs the set operations (NOT, AND, OR) on the document vectors.

### Inputs are 
* operand 1 = [1,0,0,0,1,0]
* operator = and
* operand 2 = [1,1,0,0,1,0]

### Output is
* a list of the same length as operand 1, here [1,0,0,0,1,0]

In [12]:
def boolean_model(op1, oper, op2=None):
    if oper == 'not':
        # negate the operand that was passed in (op2 is unused)
        return [not i for i in op1]
    elif oper == 'and':
        # a document matches only if both operands match
        return [a and b for a, b in zip(op1, op2)]
    else:
        # 'or': a document matches if either operand matches
        return [a or b for a, b in zip(op1, op2)]
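
As a quick check of the example operands listed above, this small sketch applies the three operations element by element. It uses plain list comprehensions rather than the notebook's function, and casts the results back to 0/1 for readability.

```python
op1 = [1, 0, 0, 0, 1, 0]
op2 = [1, 1, 0, 0, 1, 0]

and_vec = [int(a and b) for a, b in zip(op1, op2)]  # both operands match
or_vec = [int(a or b) for a, b in zip(op1, op2)]    # either operand matches
not_vec = [int(not a) for a in op1]                 # operand 1 does not match

print(and_vec)
print(or_vec)
print(not_vec)
```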

Here we read the user's query and pass it to the Boolean model.

* *predict()* returns the list of documents that match the query, or an empty list if none match

In [13]:
query = input()
result = predict(query, df)
if len(result) == 0:
    print('Query does not retrieve document')
else:
    for i in result:
        print('{} matches'.format(i))

ms and dhoni
doc 1 matches
