# BIR retrieval with inverted files

Let's begin with a simple scenario: a handful of terms, each assigned random document IDs. The goal is to demonstrate inverted files for Boolean retrieval from two angles: 1) evaluating queries with set operators, and 2) using streaming interfaces to retrieve postings efficiently. To keep things simple, we work with document IDs rather than actual documents.

## Create inverted index
The next section generates random inverted index postings for a set of terms. It simulates the indexing process for Boolean retrieval by associating random document IDs with each term. The `vocabulary` dictionary defines terms and their desired document frequencies (as a %-figure). The generated postings are stored in the `index` dictionary, with each term mapped to a set of corresponding document IDs.

* `nDocs = 40`: Defines the total number of documents (document IDs) as 40.
* `index = {}`: Initializes an empty dictionary to store the postings for each term.
* `DEBUG = False`: A debug flag (we use it later to illustrate code execution).
* `vocabulary`: Defines a dictionary where each term is associated with its desired document frequency (expressed as a percentage).

`createPostings(term: str, docFreq: int = None)` takes a term and an optional document frequency `docFreq` (an absolute number of documents). It generates random postings for the term by sampling `docFreq` distinct document IDs between 1 and `nDocs` and storing them as a set; it also adds the term to the feature vector of every sampled document in `documents`. If `docFreq` is not provided, a random value between 1 and `nDocs` is used. The for-loop iterates over the terms in `vocabulary`, converts each percentage to an absolute count (`vocabulary[word] * nDocs // 100`), and passes that count to `createPostings`.
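As a quick check of the percentage conversion, the following sketch recomputes the absolute document frequencies for three of the terms, assuming `nDocs = 40` as set in the code cell. The name `doc_freqs` is only used here for illustration.

```python
nDocs = 40
vocabulary = {'dog': 33, 'cat': 28, 'horse': 12}  # document frequencies in percent

# integer division rounds down, so 33% of 40 documents becomes 13 postings
doc_freqs = {term: pct * nDocs // 100 for term, pct in vocabulary.items()}
print(doc_freqs)  # {'dog': 13, 'cat': 11, 'horse': 4}
```

These counts match the lengths of the posting lists printed for `dog`, `cat`, and `horse`.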

In [14]:
import random

DEBUG = False
nDocs = 40
index = {}
documents = []

def createPostings(term: str, docFreq: int = None):
    # create random postings for the term for ids between 1 and nDocs
    if docFreq is None:
        docFreq = random.randint(1, nDocs)
    # create sets with random ids
    index[term] = set(random.sample(range(1, nDocs + 1), docFreq))
    # extend feature vectors for documents
    for doc in index[term]:
        documents[doc].add(term)

# define vocabulary and create random postings with given document frequency (in percents)
vocabulary = {
    'dog':       33,
    'cat':       28,
    'horse':     12,
    'rabit':     19,
    'ostrich':   15,
    'bear':      14,
    'tiger':     17,
    'lion':      15,
    'bird':      18
}

# set all feature vectors of documents to empty. We use sets since BIR uses set-of-word model
for doc in range(nDocs + 1):
    documents.append(set())

# call createPostings for each entry in vocabulary to create the inverted index
for word in vocabulary:
    createPostings(word, vocabulary[word] * nDocs // 100)

Let's have a look at the postings for each term:

In [15]:
# print postings with term and list of documents
for term, posting in index.items():
    # format: term + sorted doc list; pad term to 10 characters
    print(term.ljust(10), sorted(posting))

dog        [1, 2, 7, 10, 14, 17, 19, 22, 24, 30, 37, 38, 40]
cat        [3, 11, 20, 21, 22, 25, 32, 35, 36, 38, 40]
horse      [4, 10, 14, 21]
rabit      [5, 6, 22, 28, 35, 38, 39]
ostrich    [12, 25, 27, 28, 31, 32]
bear       [5, 6, 17, 37, 40]
tiger      [2, 17, 23, 25, 34, 37]
lion       [9, 11, 14, 17, 21, 38]
bird       [4, 10, 14, 19, 27, 31, 39]
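With postings stored as Python sets, Boolean queries map directly onto set operators. The sketch below hard-codes four of the posting lists printed above so that it runs on its own; AND becomes `&`, OR becomes `|`, and AND NOT becomes set difference `-`. A stand-alone NOT would additionally need the set of all document IDs, which is not shown here.

```python
# postings copied from the sample output above
cat   = {3, 11, 20, 21, 22, 25, 32, 35, 36, 38, 40}
dog   = {1, 2, 7, 10, 14, 17, 19, 22, 24, 30, 37, 38, 40}
horse = {4, 10, 14, 21}
bird  = {4, 10, 14, 19, 27, 31, 39}

cat_and_dog = cat & dog                      # cat AND dog
cat_or_dog = cat | dog                       # cat OR dog
filtered = (cat | dog) - (horse | bird)      # (cat OR dog) AND NOT (horse OR bird)

print(sorted(cat_and_dog))  # [22, 38, 40]
print(len(cat_or_dog))      # 21
print(sorted(filtered))
```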


In [16]:
# print the feature vector (set of terms) of every document
for doc in range(1, nDocs + 1):
    print(doc, documents[doc])

1 {'dog'}
2 {'tiger', 'dog'}
3 {'cat'}
4 {'horse', 'bird'}
5 {'bear', 'rabit'}
6 {'bear', 'rabit'}
7 {'dog'}
8 set()
9 {'lion'}
10 {'horse', 'dog', 'bird'}
11 {'lion', 'cat'}
12 {'ostrich'}
13 set()
14 {'horse', 'lion', 'dog', 'bird'}
15 set()
16 set()
17 {'bear', 'lion', 'tiger', 'dog'}
18 set()
19 {'dog', 'bird'}
20 {'cat'}
21 {'horse', 'cat', 'lion'}
22 {'cat', 'dog', 'rabit'}
23 {'tiger'}
24 {'dog'}
25 {'tiger', 'ostrich', 'cat'}
26 set()
27 {'ostrich', 'bird'}
28 {'ostrich', 'rabit'}
29 set()
30 {'dog'}
31 {'ostrich', 'bird'}
32 {'ostrich', 'cat'}
33 set()
34 {'tiger'}
35 {'cat', 'rabit'}
36 {'cat'}
37 {'bear', 'tiger', 'dog'}
38 {'lion', 'cat', 'dog', 'rabit'}
39 {'bird', 'rabit'}
40 {'bear', 'cat', 'dog'}


### Class BIRPosting

This class implements a retriever for an atomic term query (a single term):
- `__iter__(self)`: provides an iterator interface to simplify enumeration of results; it delegates to the `retrieve` generator
- `retrieve(self)`: implements the generator function enumerating all postings for the term from the index in ascending document ID order. If `DEBUG = True`, it prints each posting so we can observe the evaluation later on
- `weight`: attribute holding the term weight, computed as `(docFreq + 1) / (nDocs + 1)`

Note: This class is intentionally kept basic for illustrative purposes. It doesn't involve file reading; rather, it relies on the global `index` object. If needed, we can effortlessly replace the for-loop with file reading operations. However, this change introduces complexities because terms may appear multiple times in a Boolean query (e.g., "cat AND dog OR cat AND horse"). Preventing duplicate file reads demands additional buffering logic. Furthermore, for data efficiency, it's advisable to apply compression techniques to reduce the data volume.
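The note above mentions compression without naming a technique. One common approach, shown here only as an illustration and not necessarily what a file-based version of this notebook would use, is to store the gaps between consecutive sorted document IDs instead of the IDs themselves; small gaps can then be packed into fewer bits by a subsequent bit-level or byte-level code. The helper names `to_gaps` and `from_gaps` are hypothetical.

```python
from itertools import accumulate

def to_gaps(postings):
    """Return the first doc ID followed by the differences between neighbours."""
    postings = sorted(postings)
    return [b - a for a, b in zip([0] + postings, postings)]

def from_gaps(gaps):
    """Rebuild the sorted doc IDs by taking running sums of the gaps."""
    return list(accumulate(gaps))

horse = [4, 10, 14, 21]   # posting list from the sample output
gaps = to_gaps(horse)
print(gaps)               # [4, 6, 4, 7]
print(from_gaps(gaps))    # [4, 10, 14, 21]
```

Because decoding is a running sum, it fits the streaming interface: a generator can yield each document ID as soon as its gap has been read.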

In [21]:
class BIRPosting:
    """
        Stream of postings for BIR retrieval. Iterator/generator returns only the document IDs.
        Attribute weight holds the BIR weight, which can be adjusted with relevance feedback
    """
    def __init__(self, term: str, feedback: dict = None):
        # feedback is accepted but not used yet; None avoids a shared mutable default
        self.term = term
        self.docFreq = len(index[term])
        # smoothed fraction of documents that contain the term
        self.weight = (self.docFreq + 1) / (nDocs + 1)

    def __iter__(self):
        return self.retrieve()

    def retrieve(self):
        for posting in sorted(index[self.term]):
            if DEBUG:
                print(self.term, posting)
            yield posting

In [22]:
# print each term with its weight and sorted postings
for term, posting in index.items():
    # format: term + weight + sorted doc list; pad term to 10 characters
    print(term.ljust(10), BIRPosting(term).weight, sorted(posting))

dog        0.34146341463414637 [1, 2, 7, 10, 14, 17, 19, 22, 24, 30, 37, 38, 40]
cat        0.2926829268292683 [3, 11, 20, 21, 22, 25, 32, 35, 36, 38, 40]
horse      0.12195121951219512 [4, 10, 14, 21]
rabit      0.1951219512195122 [5, 6, 22, 28, 35, 38, 39]
ostrich    0.17073170731707318 [12, 25, 27, 28, 31, 32]
bear       0.14634146341463414 [5, 6, 17, 37, 40]
tiger      0.17073170731707318 [2, 17, 23, 25, 34, 37]
lion       0.17073170731707318 [9, 11, 14, 17, 21, 38]
bird       0.1951219512195122 [4, 10, 14, 19, 27, 31, 39]


In [45]:
class BIRModel_DAAT:
    """
        Document at a time
    """
    def __init__(self, terms):
        self.terms = [BIRPosting(term) for term in terms]
        self.visited = set()
        self.relevant = set()
    
    def __iter__(self):
        return self.retrieve()

    def _score(self, indexes: list):
        score = 0
        for i in indexes:
            score += self.terms[i].weight
        return score

    def retrieve(self):
        iters = [iter(e) for e in self.terms]
        nexts = [next(e, None) for e in iters]
        scored_docs = []

        while not all(e is None for e in nexts):
            # get smallest value from nexts, ignoring None values
            smallest = min(e for e in nexts if e is not None)
            # positions of all terms whose current posting is the smallest document ID
            terms = [i for i, e in enumerate(nexts) if e == smallest]
            print(smallest, nexts, terms)
            scored_docs.append((smallest, self._score(terms)))
            # advance every stream that contributed the smallest document ID
            # (compare with ==, not `is`: identity of equal ints is not guaranteed)
            for i in terms:
                nexts[i] = next(iters[i], None)
        
        # sort result and yield all
        print(scored_docs)
        for item in sorted(scored_docs, key = lambda x: -x[1]):
            yield item
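The same document-at-a-time idea can also be expressed with the standard library: `heapq.merge` interleaves sorted streams lazily, and `itertools.groupby` collects all entries that share a document ID. The following is a minimal sketch of that alternative, not the implementation used in this notebook; `postings`, `weights`, `tagged`, and `daat` are names chosen for the example, and the data is copied from the sample output above.

```python
from heapq import merge
from itertools import groupby

nDocs = 40
postings = {
    'cat': [3, 11, 20, 21, 22, 25, 32, 35, 36, 38, 40],
    'dog': [1, 2, 7, 10, 14, 17, 19, 22, 24, 30, 37, 38, 40],
}
weights = {t: (len(p) + 1) / (nDocs + 1) for t, p in postings.items()}

def tagged(term):
    """Yield (doc_id, term) pairs so the merged stream remembers its source."""
    for doc in postings[term]:
        yield doc, term

def daat(terms):
    """Yield (doc_id, score) in ascending doc ID order, one document at a time."""
    merged = merge(*(tagged(t) for t in terms))   # tuples sort by doc ID first
    for doc, group in groupby(merged, key=lambda pair: pair[0]):
        yield doc, sum(weights[t] for _, t in group)

# rank by descending score; sorted() is stable, so ties keep doc ID order
ranked = sorted(daat(['cat', 'dog']), key=lambda item: -item[1])
print([doc for doc, _ in ranked[:3]])   # [22, 38, 40]
```

Documents 22, 38, and 40 rank first because they contain both terms and therefore collect both weights.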

### Sample queries and comparison with set-based implementation

We now evaluate a sample query with the classes defined above. Using iterators and generators greatly simplifies query evaluation. Although the example below fetches all results, the next section shows that results can be generated lazily, with minimal effort.

Assertions verify that the operator evaluations are implemented correctly. Do we still have bugs in the implementation?

In [46]:
DEBUG = False

# two-term query: documents are ranked by the sum of the weights of matching terms
query = ['cat', 'dog']
bir = BIRModel_DAAT(query)
print(' '.join(query).rjust(45), list(bir))

1 [3, 1] [1]
2 [3, 2] [1]
3 [3, 7] [0]
7 [11, 7] [1]
10 [11, 10] [1]
11 [11, 14] [0]
14 [20, 14] [1]
17 [20, 17] [1]
19 [20, 19] [1]
20 [20, 22] [0]
21 [21, 22] [0]
22 [22, 22] [0, 1]
24 [25, 24] [1]
25 [25, 30] [0]
30 [32, 30] [1]
32 [32, 37] [0]
35 [35, 37] [0]
36 [36, 37] [0]
37 [38, 37] [1]
38 [38, 38] [0, 1]
40 [40, 40] [0, 1]
[(1, 0.34146341463414637), (2, 0.34146341463414637), (3, 0.2926829268292683), (7, 0.34146341463414637), (10, 0.34146341463414637), (11, 0.2926829268292683), (14, 0.34146341463414637), (17, 0.34146341463414637), (19, 0.34146341463414637), (20, 0.2926829268292683), (21, 0.2926829268292683), (22, 0.6341463414634146), (24, 0.34146341463414637), (25, 0.2926829268292683), (30, 0.34146341463414637), (32, 0.2926829268292683), (35, 0.2926829268292683), (36, 0.2926829268292683), (37, 0.34146341463414637), (38, 0.6341463414634146), (40, 0.6341463414634146)]
                                      cat dog [(22, 0.6341463414634146), (38, 0.6341463414634146), (40, 0.6341463

### Magic generators

Generators are a great way to avoid computing results that are not needed. Assume the user queries "(cat OR dog) AND NOT(horse OR bird)", which produces many results. Rather than receiving hundreds of results at once, the user may want to browse them page by page. Our generators do exactly this; moreover, we only read the postings needed to produce each batch of results as the user browses.

Let's verify this by setting `DEBUG = True`. Every time a posting is fetched, the class `Term` prints a line with the term and the posting. The code below first fetches 5 results and then, as if the user had moved to the next page, fetches the next 5. The output is expected to show that the evaluation reads postings only as needed.

In [43]:
from itertools import islice
DEBUG = True

expr = And(Or(Term('cat'), Term('dog')), Not(Or(Term('horse'), Term('bird'))))
result = expr.retrieve()

print("retrieving first 5 documents for (cat OR dog) AND NOT(horse OR bird)")
print(list(islice(result, 5)))

print("\nretrieving next 5 documents")
print(list(islice(result, 5)))
    

NameError: name 'And' is not defined
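The `NameError` occurs because the classes `And`, `Or`, `Not`, and `Term` are not defined in this notebook. The sketch below shows one way such streaming operators might look; it is an assumption made for illustration, not the original implementation. It supports `Not` only as the right operand of `And`, relies on sorted posting lists copied from the sample output, and records every fetched posting in a hypothetical `reads` list instead of printing.

```python
from itertools import islice

index = {
    'cat':   [3, 11, 20, 21, 22, 25, 32, 35, 36, 38, 40],
    'dog':   [1, 2, 7, 10, 14, 17, 19, 22, 24, 30, 37, 38, 40],
    'horse': [4, 10, 14, 21],
    'bird':  [4, 10, 14, 19, 27, 31, 39],
}
reads = []  # log of every posting actually fetched, as (term, doc_id)

class Term:
    """Streams the sorted postings of one term and logs each read."""
    def __init__(self, term):
        self.term = term
    def retrieve(self):
        for doc in index[self.term]:
            reads.append((self.term, doc))
            yield doc

class Or:
    """Merges two sorted streams, emitting each document ID once."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def retrieve(self):
        a, b = self.left.retrieve(), self.right.retrieve()
        x, y = next(a, None), next(b, None)
        while x is not None or y is not None:
            if y is None or (x is not None and x < y):
                yield x
                x = next(a, None)
            elif x is None or y < x:
                yield y
                y = next(b, None)
            else:  # same document in both streams
                yield x
                x, y = next(a, None), next(b, None)

class Not:
    """Marker for negation; only meaningful as the right operand of And."""
    def __init__(self, operand):
        self.operand = operand

class And:
    """Intersection of two streams, or difference if the right side is Not."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def retrieve(self):
        a = self.left.retrieve()
        if isinstance(self.right, Not):
            return self._difference(a, self.right.operand.retrieve())
        return self._intersection(a, self.right.retrieve())
    @staticmethod
    def _intersection(a, b):
        x, y = next(a, None), next(b, None)
        while x is not None and y is not None:
            if x < y:
                x = next(a, None)
            elif y < x:
                y = next(b, None)
            else:
                yield x
                x, y = next(a, None), next(b, None)
    @staticmethod
    def _difference(a, b):
        y = next(b, None)
        for x in a:
            while y is not None and y < x:   # skip excluded IDs below x
                y = next(b, None)
            if y is None or y != x:
                yield x

expr = And(Or(Term('cat'), Term('dog')), Not(Or(Term('horse'), Term('bird'))))
result = expr.retrieve()
first_page = list(islice(result, 5))
reads_first_page = list(reads)           # snapshot of postings read so far
second_page = list(islice(result, 5))
print(first_page, second_page)
print(len(reads_first_page), 'of', sum(len(p) for p in index.values()), 'postings read for page one')
```

Because every operator is a generator, fetching the first page pulls only a prefix of each posting list; the remaining postings are read when the consumer asks for more results.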

### What's next
We could extend the code to parse query strings and build the expressions needed for evaluation. We could also process real documents, creating document and index dictionaries to demonstrate real text retrieval.