# Boolean retrieval with inverted files

We begin with a simple scenario: a handful of terms, each assigned random document references. The goal is to demonstrate Boolean retrieval with inverted files. We explore two aspects: 1) evaluating queries with set operators, and 2) using streaming interfaces to retrieve postings efficiently. To keep things simple, we work with document IDs rather than actual documents.

## Create inverted index
The next section generates random inverted index postings for a set of terms. It simulates the indexing process for Boolean retrieval by associating random document IDs with each term. The `vocabulary` dictionary defines terms and their desired document frequencies (as a %-figure). The generated postings are stored in the `index` dictionary, with each term mapped to a set of corresponding document IDs.

* `nDocs = 100`: Defines the total number of documents (document IDs) as 100.
* `index = {}`: Initializes an empty dictionary to store the postings for each term.
* `DEBUG = False`: A debug flag (we use it later to illustrate code execution).
* `vocabulary`: Defines a dictionary where each term is associated with its desired document frequency (expressed as a percentage).

`createPostings(term: str, docFreq: int = None)` takes a term (string) and an optional document frequency (docFreq, integer) as arguments. It generates random postings for the term by creating a set of document IDs. If docFreq is not provided, it generates a random document frequency between 1 and nDocs. The for-loop iterates through each term in the vocabulary dictionary and calls the createPostings function. For each term, it fetches the desired document frequency from the vocabulary (values are percentages) and passes it to createPostings.

In [101]:
import random

DEBUG = False
nDocs = 100
index = {}

def createPostings(term: str, docFreq: int = None):
    # create random postings for the term for ids between 1 and nDocs
    if docFreq is None:
        docFreq = random.randint(1, nDocs)
    # create sets with random ids
    index[term] = set(random.sample(range(1, nDocs + 1), docFreq))

# define vocabulary and create random postings with given document frequency (in percents)
vocabulary = {
    'dog':       33,
    'cat':       28,
    'horse':     12,
    'rabit':     9,
    'ostrich':   5,
    'bear':      4,
    'tiger':     7,
    'lion':      5,
    'bird':      18
}

# call createPostings for each entry in vocabulary to create the inverted index
for word in vocabulary:
    createPostings(word, vocabulary[word] * nDocs // 100)

Let's have a look at the postings for each term:

In [102]:
# print postings with term and list of documents
for term, posting in index.items():
    # format: term + doc_list as array; pad term to 10 characters
    print(term.ljust(10), sorted(posting))

dog        [1, 2, 6, 10, 11, 14, 15, 16, 23, 24, 26, 28, 30, 34, 35, 39, 41, 43, 46, 51, 55, 58, 66, 67, 73, 79, 80, 82, 87, 91, 92, 93, 99]
cat        [1, 4, 7, 9, 16, 20, 26, 28, 30, 35, 41, 42, 43, 47, 58, 65, 66, 70, 72, 76, 82, 84, 88, 90, 91, 92, 94, 98]
horse      [1, 7, 14, 36, 61, 65, 67, 84, 87, 88, 89, 91]
rabit      [7, 8, 34, 45, 56, 60, 76, 78, 86]
ostrich    [13, 21, 23, 32, 56]
bear       [31, 38, 82, 100]
tiger      [18, 21, 32, 34, 36, 64, 69]
lion       [15, 31, 33, 47, 56]
bird       [1, 7, 21, 22, 31, 33, 34, 37, 43, 48, 62, 68, 71, 72, 73, 90, 91, 97]


## Perform set operations to answer boolean queries
The next section demonstrates how to perform Boolean queries using the inverted index with set operations. It showcases simple queries, AND, OR, AND-NOT operations, complex queries in disjunctive normal form, and arbitrary queries with NOT operations. The results of these operations on the posting sets are printed for each query scenario.

| Boolean Operator | Set Operator for Postings |
| :--- | :--- |
| cat AND dog | `index['cat'] & index['dog']` |
| cat OR dog | `index['cat'] \| index['dog']` |
| cat AND NOT dog | `index['cat'] - index['dog']` |

Using these rules, we can evaluate a wide range of Boolean queries. However, there are some limitations:
- We cannot evaluate OR-queries when one sub-expression is of the form NOT(expr). While it's technically possible to construct NOT(expr) by using all documents except those returned by expr, this approach becomes inefficient for large collections.
- In AND-queries, NOT(expr)-parts need to be re-ordered to the end to apply the `-` set operator. Additionally, at least one element of the AND-query must not be in the form NOT(expr).

While these limitations constrain our implementation, they have little impact in practice. Queries like "cat OR NOT dog" rarely match a typical search intention: they select all documents except those that contain dog but not cat, i.e., the query can be rephrased as "NOT(dog AND NOT cat)".
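The equivalence of the two forms can be checked with plain sets. The sketch below uses a small hypothetical collection (`all_docs`, `cat_docs`, and `dog_docs` are illustrative names and values, not the random index from above). It materializes NOT as a complement against all document IDs, which is the step that becomes expensive for large collections.

```python
# Hypothetical toy collection: document IDs 1..10 (illustrative values only)
all_docs = set(range(1, 11))
cat_docs = {1, 4, 8, 10}
dog_docs = {3, 4, 10}

# "cat OR NOT dog": NOT dog has to be materialized as a complement
cat_or_not_dog = cat_docs | (all_docs - dog_docs)

# "NOT(dog AND NOT cat)": the rephrased form of the same query
not_dog_and_not_cat = all_docs - (dog_docs - cat_docs)

print(sorted(cat_or_not_dog))                 # [1, 2, 4, 5, 6, 7, 8, 9, 10]
print(cat_or_not_dog == not_dog_and_not_cat)  # True
```

Both forms return almost the entire collection, which illustrates why such queries are rarely useful in practice.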

**Extra challenge: add a parser for Boolean expression and evaluate queries dynamically**
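One possible starting point for this challenge is a small recursive-descent parser that evaluates directly against sets. The sketch below is not part of the notebook's implementation: the function name `evaluate`, the data in `toy_index`, the requirement that operators are written in upper case, and the choice to treat unknown terms as empty postings are all assumptions made for illustration. It follows the limitations listed above by accepting NOT only as an operand of AND.

```python
import re

def evaluate(query, index):
    """Evaluate a Boolean query string against a dict mapping term -> set of doc IDs."""
    tokens = re.findall(r"\(|\)|\w+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        # or_expr := and_expr ('OR' and_expr)*
        result = parse_and()
        while peek() == "OR":
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        # and_expr := ['NOT'] atom ('AND' ['NOT'] atom)*
        positives, negatives = [], []
        while True:
            negated = False
            if peek() == "NOT":
                advance()
                negated = True
            (negatives if negated else positives).append(parse_atom())
            if peek() != "AND":
                break
            advance()
        if not positives:
            raise ValueError("AND needs at least one operand without NOT")
        result = set.intersection(*positives)
        for docs in negatives:
            result = result - docs
        return result

    def parse_atom():
        # atom := '(' or_expr ')' | TERM
        token = advance()
        if token == "(":
            result = parse_or()
            if advance() != ")":
                raise ValueError("missing closing parenthesis")
            return result
        if token is None or token in (")", "AND", "OR", "NOT"):
            raise ValueError(f"unexpected token: {token!r}")
        return index.get(token, set())  # assumption: unknown terms match nothing

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return result

# Hypothetical postings, for illustration only
toy_index = {"cat": {1, 4, 8, 10}, "dog": {3, 4, 10, 12}, "bird": {1, 3}}
print(sorted(evaluate("cat AND NOT dog", toy_index)))            # [1, 8]
print(sorted(evaluate("(cat OR dog) AND NOT bird", toy_index)))  # [4, 8, 10, 12]
```

A query such as "cat OR NOT dog" raises a `ValueError` in this sketch, mirroring the first limitation above.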

In [103]:
# let's do simple queries by hand for cats, dogs, horses, and birds
cat = index['cat']
dog = index['dog']
horse = index['horse']
bird = index['bird']
for term in ['cat', 'dog', 'horse', 'bird']:
    print(term.rjust(45), sorted(index[term]))
print()

# AND operator
query = 'cat AND dog'
result = cat & dog
print(query.rjust(45),sorted(result))

# OR operator
query = 'cat OR dog'
result = cat | dog
print(query.rjust(45),sorted(result))

query = 'horse OR bird'
result = horse | bird
print(query.rjust(45),sorted(result))

# AND-NOT operator
query = 'cat AND NOT dog'
result = cat - dog
print(query.rjust(45),sorted(result))

query = 'horse AND cat AND NOT bird'
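# note: '-' binds tighter than '&' in Python, so this is horse & (cat - bird),
# which yields the same set as (horse & cat) - bird
result = horse & cat - bird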
print(query.rjust(45),sorted(result))

# disjunctive normal form
print()
query = '(cat AND dog) OR (horse AND cat AND NOT bird)'
result = (cat & dog) | (horse & cat - bird)
print(query.rjust(45),sorted(result))

# arbitrary queries
query = '(cat OR dog) AND (horse OR bird)'
result = (cat | dog) & (horse | bird)
print(query.rjust(45),sorted(result))

query = '(cat OR dog) AND NOT(horse OR bird)'
result = (cat | dog) - (horse | bird)
print(query.rjust(45),sorted(result))

                                          cat [1, 4, 7, 9, 16, 20, 26, 28, 30, 35, 41, 42, 43, 47, 58, 65, 66, 70, 72, 76, 82, 84, 88, 90, 91, 92, 94, 98]
                                          dog [1, 2, 6, 10, 11, 14, 15, 16, 23, 24, 26, 28, 30, 34, 35, 39, 41, 43, 46, 51, 55, 58, 66, 67, 73, 79, 80, 82, 87, 91, 92, 93, 99]
                                        horse [1, 7, 14, 36, 61, 65, 67, 84, 87, 88, 89, 91]
                                         bird [1, 7, 21, 22, 31, 33, 34, 37, 43, 48, 62, 68, 71, 72, 73, 90, 91, 97]

                                  cat AND dog [1, 16, 26, 28, 30, 35, 41, 43, 58, 66, 82, 91, 92]
                                   cat OR dog [1, 2, 4, 6, 7, 9, 10, 11, 14, 15, 16, 20, 23, 24, 26, 28, 30, 34, 35, 39, 41, 42, 43, 46, 47, 51, 55, 58, 65, 66, 67, 70, 72, 73, 76, 79, 80, 82, 84, 87, 88, 90, 91, 92, 93, 94, 98, 99]
                                horse OR bird [1, 7, 14, 21, 22, 31, 33, 34, 36, 37, 43, 48, 61, 62, 65, 67, 68, 71, 72, 73, 84

## Illustration of efficient, stream based evaluation

The set-based evaluation from above does not scale well with the number of documents. With millions or billions of postings for a term, we want to fetch data from an external storage device (which is also a good idea for persistence). Instead of reading all postings into main memory, we read them as streams sorted by document ID. Take the postings of cat and dog as an example:

|term | postings|
|:-- | :-- |
| cat | `[1, 4, 8, 10]` |
| dog | `[3, 4, 10, 12]` |

To evaluate a query like "cat AND dog", we fetch the first entry for each term, that is `1` for cat and `3` for dog. If they are the same, we know that the corresponding document fulfills the predicate. If not, we read the next entry for the term that currently has the smallest doc ID. In our example, we read the next cat posting, which is `4`. Again, we have no match, so we now progress the postings of dog as it currently has the smallest value. The next posting for dog is `4`, which matches the one of cat; hence, we found our first document and return it (in Python, we use generators with `yield` because we do not want to return all results at once but in batches as the user browses through pages). If we need more results, we fetch the next posting for both terms and repeat. Eventually, we find `10` and return it as a second answer. If we need more results, we again fetch the next posting for both terms. But since cat has no more postings, we can terminate the evaluation and stop the iteration (dog still has `12`, but we already know from the exhausted cat postings that it cannot match the query). The following table visualizes the approach:

|step|cat (next) |dog (next) | action|
|:-- |:-- |:-- |:-- |
| 1 | `1` | `3` | no match, progress cat |
| 2 | `4` | `3` | no match, progress dog |
| 3 | `4` | `4` | match, return `4` as result, wait to provide next result, and progress both cat and dog |
| 4 | `8` | `10` | no match, progress cat |
| 5 | `10` | `10` | match, return `10` as result, wait to provide next result, and progress both cat and dog |
| 6 | - | `12` | stop iteration as all cat postings are visited; remaining postings in dog cannot fulfill predicate |
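The steps in the table can be sketched as a stand-alone generator over two sorted lists. This is a minimal two-stream illustration; the function name `intersect_sorted` and the use of `None` as an end-of-stream marker are choices made for this sketch only.

```python
def intersect_sorted(left, right):
    """Yield the IDs present in both sorted iterables (two-pointer merge)."""
    left_it, right_it = iter(left), iter(right)
    a, b = next(left_it, None), next(right_it, None)
    # stop as soon as one stream is exhausted: no further match is possible
    while a is not None and b is not None:
        if a == b:
            yield a  # match: return it, then progress both streams
            a, b = next(left_it, None), next(right_it, None)
        elif a < b:
            a = next(left_it, None)   # left has the smallest ID, progress it
        else:
            b = next(right_it, None)  # right has the smallest ID, progress it

cat_postings = [1, 4, 8, 10]
dog_postings = [3, 4, 10, 12]
and_result = list(intersect_sorted(cat_postings, dog_postings))
print(and_result)  # [4, 10]
```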

The OR-operator is implemented similarly, but each iteration returns the smallest entry among the sub-expressions. For the example above, the OR-operator would first return `1`, progress cat, return `3`, progress dog, return `4`, progress both cat and dog, return `8`, progress cat, return `10`, progress both cat and dog, and finally return `12`. The evaluation stops when all postings are consumed.
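The same flow for OR, again as a minimal two-stream sketch (the name `union_sorted` is hypothetical and chosen for this example only):

```python
def union_sorted(left, right):
    """Yield every ID that occurs in at least one of two sorted iterables, without duplicates."""
    left_it, right_it = iter(left), iter(right)
    a, b = next(left_it, None), next(right_it, None)
    # continue until both streams are exhausted
    while a is not None or b is not None:
        # smallest available head; None marks an exhausted stream
        smallest = min(x for x in (a, b) if x is not None)
        yield smallest
        # progress every stream whose head equals the value just returned
        if a == smallest:
            a = next(left_it, None)
        if b == smallest:
            b = next(right_it, None)

cat_postings = [1, 4, 8, 10]
dog_postings = [3, 4, 10, 12]
or_result = list(union_sorted(cat_postings, dog_postings))
print(or_result)  # [1, 3, 4, 8, 10, 12]
```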

The evaluation of "cat AND NOT dog" progresses through the postings in the same way as the AND flow, but the match condition differs: a cat posting is a result if it is smaller than the next dog posting, because dog can then no longer contain it:

|step|cat (next) |dog (next) | action|
|:-- |:-- |:-- |:-- |
| 1 | `1` | `3` | match as cat is smaller than dog, return `1` as result, wait to provide next result, and progress cat |
| 2 | `4` | `3` | no decision yet as cat is not the smallest, so we progress dog |
| 3 | `4` | `4` | no match as both have the same value, so we progress both cat and dog |
| 4 | `8` | `10` | match as cat is smaller than dog, return `8` as result, wait to provide next result, and progress cat |
| 5 | `10` | `10` | no match as both have the same value, so we progress both cat and dog |
| 6 | - | `12` | stop iteration as all cat postings are visited; remaining postings in dog cannot fulfill predicate |
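As a minimal two-stream sketch of this flow (the name `difference_sorted` is hypothetical), note that the loop only depends on the left stream: once dog is exhausted, every remaining cat posting is a result.

```python
def difference_sorted(left, right):
    """Yield the IDs of the sorted iterable `left` that do not occur in the sorted iterable `right`."""
    left_it, right_it = iter(left), iter(right)
    a, b = next(left_it, None), next(right_it, None)
    while a is not None:
        if b is None or a < b:
            yield a  # right has moved past a (or is exhausted): a is a result
            a = next(left_it, None)
        elif a == b:
            # same ID in both streams: excluded, progress both
            a, b = next(left_it, None), next(right_it, None)
        else:
            b = next(right_it, None)  # right is behind, progress it

cat_postings = [1, 4, 8, 10]
dog_postings = [3, 4, 10, 12]
and_not_result = list(difference_sorted(cat_postings, dog_postings))
print(and_not_result)  # [1, 8]
```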



### Class Term

This class implements a retriever for an atomic term query:
- `__iter__(self)`: provides an iterator interface for Term to simplify enumeration of results; we use the `retrieve`-generator for that
- `retrieve(self)`: implements the generator function enumerating all postings for the term from the index. If `DEBUG = True`, it prints the next posting so we can observe the evaluation later on

Note: This class is intentionally kept basic for illustrative purposes. It doesn't involve file reading; rather, it relies on the global `index` object. If needed, we can effortlessly replace the for-loop with file reading operations. However, this change introduces complexities because terms may appear multiple times in a Boolean query (e.g., "cat AND dog OR cat AND horse"). Preventing duplicate file reads demands additional buffering logic. Furthermore, for data efficiency, it's advisable to apply compression techniques to reduce the data volume.
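To make the file-based variant more concrete, here is a minimal sketch of lazy reading. The on-disk layout (one text file per term, one document ID per line, sorted ascending) and the helper names `write_postings` and `stream_postings` are assumptions for illustration; a real system would more likely use a compressed binary format and the buffering mentioned above.

```python
import os
import tempfile

def write_postings(path, doc_ids):
    """Write document IDs in ascending order to a text file, one ID per line (assumed layout)."""
    with open(path, "w") as f:
        for doc_id in sorted(doc_ids):
            f.write(f"{doc_id}\n")

def stream_postings(path):
    """Yield document IDs lazily; the file is read line by line as the consumer asks for more."""
    with open(path) as f:
        for line in f:
            yield int(line)

with tempfile.TemporaryDirectory() as tmp:
    cat_path = os.path.join(tmp, "cat.postings")
    write_postings(cat_path, {10, 1, 8, 4})
    streamed = list(stream_postings(cat_path))

print(streamed)  # [1, 4, 8, 10]
```

A generator like `stream_postings` could take the place of the for-loop over `sorted(index[self.term])` in `retrieve`.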

In [104]:
class Term:
    """
        Retriever class for atomic term queries
    """
    def __init__(self, term: str):
        self.term = term

    def __iter__(self):
        return self.retrieve()

    def retrieve(self):
        for posting in sorted(index[self.term]):
            if DEBUG:
                print(self.term, posting)
            yield posting

### Class Not

This is a simple marker class that the And/Or classes are using to implement (or reject) NOT-expressions. Both iterator and generator functions raise an error when invoked to prevent top-level query evaluation of NOT-expressions.

In [105]:
class Not:
    """
        Marker class for NOT operator on sub-expression. The retrieve method raises an exception.
        When used during AND operation, the retrieve method of the sub-expression is called.
    """
    def __init__(self, expression):
        self.expression = expression
    
    def __iter__(self):
        return self.retrieve()

    def retrieve(self):
        raise Exception("NOT operator not allowed at top-level of query")

### Class And
This class implements an (arbitrary) AND-expression with multiple sub-expressions (which can be of any type). It can handle NOT(expr)-type sub-expressions and implements the correct '-' semantics of "cat AND NOT dog":
- `__iter__(self)`: provides an iterator interface for the AND-expression to simplify enumeration of results; we use the `retrieve`-generator for that
- `retrieve(self)`: implements the generator function enumerating all document IDs that fulfill the AND-expression

The implementation is kept simple to illustrate the algorithm, and further optimizations are feasible. The implementation first differentiates between positive and negative expressions (for lack of better words). Positive means sub-expressions without top-level NOT-operator, and negative means sub-expressions with a top-level NOT-operator. The `pos_next` and `neg_next` lists contain the next posting for each of the corresponding sub-expressions. Values are `yield`-ed if all 'positive' expressions equal the smallest value over all sub-expressions, and if none of the 'negative' sub-expressions equals that smallest value. The final 2 for-loops progress sub-expressions that have the smallest value as their next value.

In [106]:
class And:
    """
        AND-expression with multiple sub-expressions. This operator can handle NOT(expr)-type 
        subexpressions and implements the correct '-' semantics of "cat AND NOT dog".
    """
    def __init__(self, *expressions):
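        # split into 'positive' sub-expressions and 'negative' ones wrapped in Not
        self.pos = [e for e in expressions if not isinstance(e, Not)]
        self.neg = [e for e in expressions if isinstance(e, Not)]
        # at least one sub-expression must not be of the form NOT(expr)
        if not self.pos:
            raise Exception("AND-expression needs at least one sub-expression that is not NOT(expr)")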

    def __iter__(self):
        return self.retrieve()

    def retrieve(self):
        pos_stream = [e.retrieve() for e in self.pos]
        # use None as end-of-stream marker; a bare next() on an empty stream
        # would raise StopIteration inside this generator
        pos_next = [next(e, None) for e in pos_stream]
        neg_stream = [iter(e.expression.retrieve()) for e in self.neg]
        neg_next = [next(e, None) for e in neg_stream]

        # iterate until one positive stream is exhausted (its next element is None)
        while None not in pos_next:
            # get smallest value from pos_next and neg_next, ignoring None values in neg_next
            smallest = min(pos_next + neg_next, key=lambda x: x if x is not None else float('inf'))
            # check if all entries of pos_next equal smallest, and no entry in neg_next equals smallest
            if all(e == smallest for e in pos_next) and smallest not in neg_next:
                yield smallest
            # for each entry in pos_next and neg_next, fetch next item if entry equals smallest
            # (compare with ==, not 'is': identity of equal ints is not guaranteed)
            for i, e in enumerate(pos_next):
                if e == smallest:
                    pos_next[i] = next(pos_stream[i], None)
            for i, e in enumerate(neg_next):
                if e == smallest:
                    neg_next[i] = next(neg_stream[i], None)

### Class Or
This class implements an (arbitrary) OR-expression with multiple sub-expressions (which can be of any type). It **cannot** handle NOT(expr)-type sub-expressions and raises an error if a sub-expression has a top-level NOT-operator:
- `__iter__(self)`: provides an iterator interface for the OR-expression to simplify enumeration of results; we use the `retrieve`-generator for that
- `retrieve(self)`: implements the generator function enumerating all document IDs that fulfill the OR-expression

The implementation is kept simple to illustrate the algorithm, and further optimizations are feasible. The implementation `yield`s the smallest values found as next posting in its sub-expressions. Then it progresses sub-expressions with that smallest value as their next element. The iteration stops once all sub-expressions are exhausted.

In [107]:
class Or:
    """
        OR-expression with multiple sub-expressions. This operator cannot handle NOT(expr)-type subexpressions
    """
    def __init__(self, *expressions):
        # check that there are no NOT(expr)-type subexpressions
        if any(isinstance(e, Not) for e in expressions):
            raise Exception("OR-expression cannot handle NOT(expr)-type subexpressions")
        self.expressions = expressions
    
    def __iter__(self):
        return self.retrieve()

    def retrieve(self):
        iters = [iter(e) for e in self.expressions]
        nexts = [next(e, None) for e in iters]

        while not all(e is None for e in nexts):
            # get smallest value from nexts, ignoring None values
            smallest = min(nexts, key=lambda x: x if x is not None else float('inf'))
            yield smallest
            # for each entry in nexts, fetch next item if entry equals smallest
            # (compare with ==, not 'is': identity of equal ints is not guaranteed)
            for i, e in enumerate(nexts):
                if e == smallest:
                    nexts[i] = next(iters[i], None)

### Sample queries and comparison with set-based implementation

We compute the same queries as above, but this time we construct them with the classes defined above. Using iterators and generators greatly simplifies the evaluation of queries. Although all examples below fetch the complete result, the next section shows that results are generated lazily, with minimal effort.

Assertions verify that the operators are evaluated correctly. Do we still have bugs in the implementation?

In [108]:
DEBUG = False

# AND operator
query = 'cat AND dog'
expr = And(Term('cat'), Term('dog'))
print(query.rjust(45), sorted(expr))
assert sorted(expr) == sorted(cat & dog)

# OR operator
query = 'cat OR dog'
expr =  Or(Term('cat'), Term('dog'))
print(query.rjust(45),sorted(expr))
assert sorted(expr) == sorted(cat | dog)

query = 'horse OR bird'
expr = Or(Term('horse'), Term('bird'))
print(query.rjust(45),sorted(expr))
assert sorted(expr) == sorted(horse | bird)

# AND-NOT operator
query = 'cat AND NOT dog'
expr = And(Term('cat'), Not(Term('dog')))
print(query.rjust(45),sorted(expr))
assert sorted(expr) == sorted(cat - dog)

query = 'horse AND cat AND NOT bird'
expr = And(Term('horse'), Term('cat'), Not(Term('bird')))
print(query.rjust(45),sorted(expr))
assert sorted(expr) == sorted(horse & cat - bird)

# disjunctive normal form
print()
query = '(cat AND dog) OR (horse AND cat AND NOT bird)'
expr = Or(And(Term('cat'), Term('dog')), And(Term('horse'), Term('cat'), Not(Term('bird'))))
print(query.rjust(45),sorted(expr))
assert sorted(expr) == sorted((cat & dog) | (horse & cat - bird))

# arbitrary queries
query = '(cat OR dog) AND (horse OR bird)'
expr = And(Or(Term('cat'), Term('dog')), Or(Term('horse'), Term('bird')))
print(query.rjust(45),sorted(expr))
assert sorted(expr) == sorted((cat | dog) & (horse | bird))

query = '(cat OR dog) AND NOT(horse OR bird)'
expr = And(Or(Term('cat'), Term('dog')), Not(Or(Term('horse'), Term('bird'))))
print(query.rjust(45),sorted(expr))
assert sorted(expr) == sorted((cat | dog) - (horse | bird))

                                  cat AND dog [1, 16, 26, 28, 30, 35, 41, 43, 58, 66, 82, 91, 92]
                                   cat OR dog [1, 2, 4, 6, 7, 9, 10, 11, 14, 15, 16, 20, 23, 24, 26, 28, 30, 34, 35, 39, 41, 42, 43, 46, 47, 51, 55, 58, 65, 66, 67, 70, 72, 73, 76, 79, 80, 82, 84, 87, 88, 90, 91, 92, 93, 94, 98, 99]
                                horse OR bird [1, 7, 14, 21, 22, 31, 33, 34, 36, 37, 43, 48, 61, 62, 65, 67, 68, 71, 72, 73, 84, 87, 88, 89, 90, 91, 97]
                              cat AND NOT dog [4, 7, 9, 20, 42, 47, 65, 70, 72, 76, 84, 88, 90, 94, 98]
                   horse AND cat AND NOT bird [65, 84, 88]

(cat AND dog) OR (horse AND cat AND NOT bird) [1, 16, 26, 28, 30, 35, 41, 43, 58, 65, 66, 82, 84, 88, 91, 92]
             (cat OR dog) AND (horse OR bird) [1, 7, 14, 34, 43, 65, 67, 72, 73, 84, 87, 88, 90, 91]
          (cat OR dog) AND NOT(horse OR bird) [2, 4, 6, 9, 10, 11, 15, 16, 20, 23, 24, 26, 28, 30, 35, 39, 41, 42, 46, 47, 51, 55, 58, 66, 70

### Magic generators

Generators are great for avoiding the evaluation of results that are not needed. Assume the user is querying with "(cat OR dog) AND NOT(horse OR bird)", which generates a lot of results. Rather than receiving hundreds of results at once, a user may want to browse through the results page by page. Our generators do exactly this; moreover, we only read the postings needed to produce each batch of results as the user browses.

Let's verify this and set `DEBUG = True`. Every time we fetch a posting, the class Term prints a line with the term and the next posting. The code below first fetches 5 results and then, as we imagine the user moving to the next page, fetches the next 5 results. The output shows that the evaluation indeed only reads postings as needed.

In [109]:
from itertools import islice
DEBUG = True

expr = And(Or(Term('cat'), Term('dog')), Not(Or(Term('horse'), Term('bird'))))
result = expr.retrieve()

print("retrieving first 5 documents for (cat OR dog) AND NOT(horse OR bird)")
print(list(islice(result, 5)))

print("\nretrieving next 5 documents")
print(list(islice(result, 5)))
    

retrieving first 5 documents for (cat OR dog) AND NOT(horse OR bird)
cat 1
dog 1
horse 1
bird 1
cat 4
dog 2
horse 7
bird 7
dog 6
cat 7
dog 10
cat 9
horse 14
bird 21
cat 16
[2, 4, 6, 9, 10]

retrieving next 5 documents
dog 11
dog 14
dog 15
horse 36
dog 16
cat 20
dog 23
cat 26
bird 22
bird 31
[11, 15, 16, 20, 23]


### What's next
We could extend the code to parse query strings and build the expressions needed for the evaluation. We could also process real documents, creating a document dictionary and an index to demonstrate retrieval over real text.
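As a starting point for the second idea, the sketch below builds an inverted index from a few short texts. The documents, the function name `build_index`, and the very simple tokenization (lower-casing and keeping only runs of the letters a to z) are assumptions for illustration, not a recommendation for real text processing.

```python
import re
from collections import defaultdict

# Hypothetical toy documents (illustrative text only)
documents = {
    1: "The cat chased the dog.",
    2: "A dog barked at the horse.",
    3: "The bird watched the cat.",
}

def build_index(docs):
    """Map each lower-cased word to the set of IDs of the documents containing it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            inverted[word].add(doc_id)
    return inverted

text_index = build_index(documents)
print(sorted(text_index["cat"] & text_index["dog"]))  # cat AND dog     -> [1]
print(sorted(text_index["cat"] - text_index["dog"]))  # cat AND NOT dog -> [3]
```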