# Assignment 1 - _Foundations of Information Retrieval 2024_

This assignment is divided in 4 parts, which have to be delivered all together no later than 04/10/2023 at 23:59 (strict - no extensions will be granted!) via Canvas. Delivery of the assignment solutions is mandatory (_see grading conditions on Canvas and in slides of Lecture01_).

We will use [ElasticSearch](https://www.elastic.co/) as search engine. It provides state-of-the-art tools to implement your own engine, index your documents, and let you focus on methodological aspects of search models and optimization. 

The assignment is about text-based Information Retrieval and it is structured in three parts:
1. IR performance evaluation (implementation of performance metrics)
2. Setting up a search engine, pre-processing and indexing using ElasticSearch (Indexing, Analyzers)
3. Implementation and optimization of models of search (Similarity)


This assignment contains exercises, marked with the section title __Exercise 01.(x)__, which are evaluated, and other sections that contain support code which you should study and use as it is. Write your answers between the comments `BEGIN ANSWER` and `END ANSWER`. 

_Note:_ the comment `#THIS IS GRADED!` in a section indicates that it will be graded.


### Initial preparation (self-study)
For the first part, it is good to acquire (or refresh) basic knowledge of Python. Please use the [Python tutorials](https://docs.python.org/3/tutorial/) if needed.

For the second and third part of the assignment, please study yourself the [Getting Started guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html) of ElasticSearch and get acquainted with the framework.


***
***
***

# PART 01 - Performance evaluation


### Background information and reading
Study the slides of Lecture 01 (available on Canvas) and the reference book chapter (Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, [Chapter 8, Evaluation in information retrieval](http://nlp.stanford.edu/IR-book/pdf/08eval.pdf), Cambridge University Press. 2008)

### Basic concepts
Suppose the set of relevant documents (the document identifiers - _doc-IDs_) is called `relevant`, then we  define it as follows (in Python):

In [22]:
relevant = set([2, 3, 5, 8, 13, 17, 21, 34, 38])

A perfect run would retrieve exactly these 9 documents in any order. Now, suppose the list of retrieved documents (the document identifiers - _doc-IDs_) is called `retrieved`, and contains the following _doc-IDs_:

In [23]:
retrieved = [14, 4, 2, 18, 16, 8, 46, 32, 17, 34, 33, 22, 47, 39, 11]

One of the simplest evaluation measures is the _Success at rank 1_, i.e. `Is the first document retrieved a relevant document?`

_Success at rank 1_ returns 1 if the first document is relevant, and 0 otherwise. A possible implementation is: 

In [24]:
def success_at_1 (relevant, retrieved):
    if len(retrieved) > 0 and retrieved[0] in relevant:
        return 1
    else:
        return 0

success_at_1(relevant, retrieved)

0

The first retrieved documentid is 14 which is not in the set of relevant documents, so the `success_at_1` is 0.

_________________

> Note how easy it is to check if an item occurs in a Python set or list by using the keyword: `in`. Similarly, you can loop over all items in a set of list with: 
`for doc in retrieved:`, 
where doc will refer to each item in the set or list. 

Be sure to use the internet to sharpen your knowledge about Python constructs, for instance on [Python list slicing](https://duckduckgo.com/?q=python+list+slicing). Also note that the code above checks if at least one document is retrieved to avoid an index out of bounds exception (i.e. we avoid to access an empty vector).

> ___Suggestion:___ _to be sure of the correctness of the implementation of the performance metrics, you can compute their values manually and compare them with those computed by your functions. This is important, as you will use these metrics for later exercises and to compare the results of differentmodels._

## Preparation exercise: _Success at k_
The measure _Success at k_ returns 1 if a relevant document is among the first _k_ documents retrieved and zero otherwise.

> Success at _k_ measures are well-suited in case there is typically only one relevant document (or retrieving one relevant document is enough).

 __Implement _Success at 5_ below.__ 
 > The correct result is 1.

In [25]:
def success_at_5(relevant, retrieved):
    # BEGIN ANSWER
    for i in range(min(5, len(retrieved))):
        if retrieved[i] in relevant:
            return 1
    return 0
    # END ANSWER
    
success_at_5(relevant, retrieved)

1

Similarly __implement success at rank 10__

> The correct result is 1.

In [26]:
def success_at_10(relevant, retrieved):
    # BEGIN ANSWER
    for i in range(min(10, len(retrieved))):
        if retrieved[i] in relevant:
            return 1
    return 0
    # END ANSWER
    
success_at_10(relevant, retrieved)

1

## Exercise 01.A: _Precision, Recall and F-measure_
__1. Implement _Precision_ using Formula 8.1 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).__

>_Hint:_ one can count the number of documents in a list using the built-in Python function [len()](https://docs.python.org/3/library/functions.html#len) \
> _example:_ `len(retrieved)` for the number of retrieved documents. 

In [27]:
#THIS IS GRADED!

def precision(relevant, retrieved):
    # BEGIN ANSWER
    relCount = 0
    for i in range(len(retrieved)):
        if retrieved[i] in relevant:
            relCount += 1
    return relCount / len(retrieved)
    # END ANSWER
    
precision(relevant, retrieved)

0.26666666666666666

__2. Implement _Recall_ using Formula 8.2 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).__

In [28]:
#THIS IS GRADED!

def recall(relevant, retrieved):
    # BEGIN ANSWER
    relCount = 0
    for i in range(len(retrieved)):
        if retrieved[i] in relevant:
            relCount += 1
    return relCount / len(relevant)
    # END ANSWER
    
recall(relevant, retrieved)

0.4444444444444444

__3. Implement the balanced F measure (_F_ with β=1) using Formula 8.6 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).__

> Tip: you may reuse your implementations of precision and recall

In [29]:
#THIS IS GRADED!

def f_measure(relevant, retrieved):
    # BEGIN ANSWER
    beta = 1
    prec = precision(relevant, retrieved)
    rec = recall(relevant, retrieved)
    return (2 * prec * rec) / (prec + rec)
    # END ANSWER
    
f_measure(relevant, retrieved)

0.33333333333333337

## Exercise 01.B: _Precision at rank k_ and  _R-Precision_

Precision, Recall and F are _set_-based measures and suited for unranked lists of documents. If our search system returns a ranked _list_ of results, we can measure precision for several cut-off levels _k_ in the ranked list, i.e. we evaluate the relevance of the TOP-_k_ retrieved documents _(see lecture 01 slides and the book chapter)_. 


**1. Implement the function `precision_at_k()` that measures the precision at rank _k_**

> Interesting fact: For _k_=1, the _Precision at rank 1_ would be the samen as _Success at rank 1_ (why?) 

In [30]:
#THIS IS GRADED!

def precision_at_k(relevant, retrieved, k):
    # BEGIN ANSWER
    relCount = 0
    for i in range(min(len(retrieved), k)):
        if retrieved[i] in relevant:
            relCount += 1
    return relCount / len(retrieved)
    # END ANSWER

print('Pr@1: %1.2f' % precision_at_k(relevant, retrieved, k=1))
print('Pr@5: %1.2f' % precision_at_k(relevant, retrieved, k=5))
print('Pr@10: %1.2f' % precision_at_k(relevant, retrieved, k=10))


Pr@1: 0.00
Pr@5: 0.07
Pr@10: 0.27


__2. Implement R-Precision as defined in Chapter 8 of [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book)__.

In [46]:
#THIS IS GRADED!

def r_precision(relevant, retrieved):
    # BEGIN ANSWER
    relCount = 0
    for i in range(min(len(retrieved), len(relevant))):
        if retrieved[i] in relevant:
            relCount += 1
    return relCount / len(relevant)
    # END ANSWER
    
r_precision(relevant, retrieved)

0.3333333333333333

## Exercise 01.D:  Interpolated precision at _recall_ X

Another way to address ranked retrieval is to measure precision for several _recall_ levels _X_.

__Implement the function `interpolated_precision_at_recall_X()` that measures the interpolated precision at recall level _X_ as defined by formula 8.7 of [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).__

> Tip: calculate for each rank the recall. If the recall is greater than or equal to X, 
> calculate the precision. Keep the highest (maximum) precision of those to be returned at the end.

In [32]:
#THIS IS GRADED!

def interpolated_precision_at_recall_X (relevant, retrieved, X):
    # BEGIN ANSWER
    relFound = 0
    highestPrec = 0
    for i in range (len(retrieved)):
        if retrieved[i] in relevant:
            relFound += 1
        calcRecall = relFound / len(relevant)
        if calcRecall >= X:
            calcPrecision = relFound / (i + 1)
            highestPrec = max(highestPrec, calcPrecision)
    return highestPrec
    # END ANSWER
    
 

print('Pr_i@Re01: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.1))
print('Pr_i@Re02: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.2))
print('Pr_i@Re03: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.3))
print('Pr_i@Re04: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.4))
print('Pr_i@Re05: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.5))
print('Pr_i@Re06: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.6))
print('Pr_i@Re07: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.7))
print('Pr_i@Re08: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.8))
print('Pr_i@Re09: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.9))
print('Pr_i@Re10: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=1))

Pr_i@Re01: 0.40
Pr_i@Re02: 0.40
Pr_i@Re03: 0.40
Pr_i@Re04: 0.40
Pr_i@Re05: 0.00
Pr_i@Re06: 0.00
Pr_i@Re07: 0.00
Pr_i@Re08: 0.00
Pr_i@Re09: 0.00
Pr_i@Re10: 0.00


## Exercise 01.E:  _Average Precision_

For a single information need, _Average Precision_ is the average of the precision value obtained for the set of top k documents existing after each relevant document is retrieved (see [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book), Pages 159 and 160). 

__Implement _Average Precision_ for a single information need.__

In [42]:
#THIS IS GRADED!

def average_precision(relevant, retrieved):
    # BEGIN ANSWER
    accumPrec = 0
    relCount = 0

    for i in range(len(retrieved)):
        if retrieved[i] in relevant:
            accumPrec += precision_at_k(relevant, retrieved, i + 1)
            relCount += 1

    if relCount == 0:
        return 0
    else:
        return accumPrec / relCount
    # END ANSWER

average_precision(relevant, retrieved)

0.16666666666666669

***
## Performance measures in TREC benchmarks

The relevance judgments are provided by TREC in so-called _"qrels"_ files that look as follows:

    1000 Q0 1341 1
    1000 Q0 1231 0
    1001 Q0 12332 1
     ...

The columns of the _qrels_ file contain:
1. the query identifier
2. the query number within that topic (currently unused and should always be Q0)
3. the document identifier that was examined by the judges
4. the relevance of the document (_1_:relevant; _0_: not relevant).

Below we provide some Python code that reads the _qrels_ and the _run_. The qrels will be put in the Python dictionary `all_relevant`. A [Python dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) provides quick lookup of a set of values given a key. We will use the `query_id` as a key, and a [Python set](https://docs.python.org/3/tutorial/datastructures.html#sets) of relevant document identifiers. For the partial qrels file above, `all_relevant` would look as follows:

    {
        "1000": set(["1341", "1231"]),
        "1001": set(["12332"])
    }
    
We will use a dictionary called `all_retrieved` with `query_id` as key, and as value a [Python list](https://docs.python.org/3/tutorial/introduction.html#lists) of document identifiers retrieved by the IR system:

    {
        "1000": ["1341", "12346, "2345"],
        "1001": [..., ..., ...],
        ...
    }

Note that, with this data structure, for each `query_id` we can easily access the list of retrieved and relevant documents, and compute the performance metrics. We can then average these measures over all the queries to compute the mean performance of the IR system on the given retrieval task.

Please examine the code below, and make sure you understand every line. Use the Python documentation where needed.

### DATA: the TREC genomics benchmark

For the following exercises, we will use a subset of the TREC genomics document collection and queries. 
It is stored in the folder `data01/` in the directory where you have been instructed to place the assignment notebooks (`/`).

The collections contains:

* `FIR-s05-medline.json` (the collection in Elasticsearch batch format - because of its size it cannot be indexed with a single curl command!)
* `FIR-s05-training-queries-simple.txt` (test queries)
* `FIR-s05-training-qrels.txt` (the "relevance judgements" for the test queries, i.e. the correct answers)

> ___Note___ that these files contain a subset of the documents and queries of the TREC genomics track benchmark, to facilitate experimentations with less computation time needed.
> The original files are also included in the `data01/` directory, withouth the `FIR-s05-` prefix (you may use them for the final project).

To make things easy, the data is already provided in Elasticsearch' batch processing format. 
Inspect the collection file in the terminal:

`head FIR-s05-medline.json`

This shows the first 5 documents in the collection (in JSON format prepared for ElasticSearch, as you have seen in the tutorial)

#### Baseline model and results
We also provide the list of retrieved documents by a _baseline_ model, in the file `data01/baseline.run`. For each query, it contains the list of document IDs of the retrieved documents (to be compared with those in the qrels file). We use this file in the examples and evaluation exercises below. 

In [34]:
def read_qrels_file(qrels_file):  # reads the content of he qrels file
    trec_relevant = dict()  # query_id -> set([docid1, docid2, ...])
    with open(qrels_file, 'r') as qrels:
        for line in qrels:
            (qid, q0, doc_id, rel) = line.strip().split()
            if qid not in trec_relevant:
                trec_relevant[qid] = set()
            if (rel == "1"):
                trec_relevant[qid].add(doc_id)
    return trec_relevant

def read_run_file(run_file):  
    # read the content of the run file produced by our IR system 
    # (in the following exercises you will create your own run_files)
    trec_retrieved = dict()  # query_id -> [docid1, docid2, ...]
    with open(run_file, 'r') as run:
        for line in run:
            (qid, q0, doc_id, rank, score, tag) = line.strip().split()
            if qid not in trec_retrieved:
                trec_retrieved[qid] = []
            trec_retrieved[qid].append(doc_id) 
    return trec_retrieved
    

def read_eval_files(qrels_file, run_file):
    return read_qrels_file(qrels_file), read_run_file(run_file)

(all_relevant, all_retrieved) = read_eval_files('data01/FIR-s05-training-qrels.txt', 'data01/baseline.run')

### _Number of queries_ and _number of retrieved documents per query_
 
The following code counts the number of queries evaluated in the file `baseline.run` (provided in the `data01/` folder, containing the list of doc-ids retrieved using a baseline model) and prints it (use the result from the cell above). For each query, it also prints the number of documents that were retrieved for that query.

In [35]:
print('Number of retrieved documents: %d' % len(all_retrieved))

for qid in all_retrieved:
    print ('Docs retrieved for query #{}: {}'.format(qid, str(len(all_retrieved[qid]))))

Number of retrieved documents: 38
Docs retrieved for query #1: 1000
Docs retrieved for query #3: 1000
Docs retrieved for query #4: 1000
Docs retrieved for query #5: 1000
Docs retrieved for query #6: 1000
Docs retrieved for query #7: 1000
Docs retrieved for query #8: 1000
Docs retrieved for query #9: 1000
Docs retrieved for query #10: 1000
Docs retrieved for query #11: 1000
Docs retrieved for query #12: 1000
Docs retrieved for query #13: 1000
Docs retrieved for query #14: 1000
Docs retrieved for query #15: 1000
Docs retrieved for query #16: 1000
Docs retrieved for query #18: 1000
Docs retrieved for query #20: 1000
Docs retrieved for query #22: 1000
Docs retrieved for query #23: 1000
Docs retrieved for query #24: 1000
Docs retrieved for query #25: 1000
Docs retrieved for query #27: 1000
Docs retrieved for query #28: 1000
Docs retrieved for query #29: 1000
Docs retrieved for query #31: 1000
Docs retrieved for query #32: 1000
Docs retrieved for query #34: 1000
Docs retrieved for query #36:

For your own understanding, __inspect the structure and content of the `all_retrieved` and `all_relevant` data structures__ to understand them better. Use the `print()` function to see the content of the data structures.

In [36]:
# write here the code to inspect the data structures
print("RETRIEVED:")
print(all_retrieved.keys())
print(all_retrieved['1'])
print(all_retrieved['3'])
print("RELEVANT:")
print(all_relevant.keys())
print(all_relevant['1'])
print(all_relevant['3'])


RETRIEVED:
dict_keys(['1', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '18', '20', '22', '23', '24', '25', '27', '28', '29', '31', '32', '34', '36', '37', '38', '39', '40', '42', '44', '45', '46', '48', '50'])
['11929828', '11751903', '12384701', '12065641', '11980715', '12126481', '12455049', '12444545', '12431783', '12204896', '12119358', '12242284', '11886527', '11779850', '12203364', '12110586', '11767002', '12115564', '11827966', '12112322', '11762751', '12368211', '12055678', '11940356', '11989975', '11862714', '11756412', '12203371', '12173048', '11809764', '12124333', '11879190', '12080324', '12079680', '12363184', '12214254', '11950701', '11882322', '12098019', '12495933', '11948417', '11931851', '12151347', '12161501', '11953864', '11795494', '12411199', '12171792', '12445676', '12014641', '12167152', '12185267', '11783178', '12226108', '12485877', '11724777', '11870216', '12242109', '12088113', '12488548', '12466360', '12100577', '12203123', 

## Exercise 01.F: _mean average precision_
__Using the `average_precision()` function you implemented above, write the code to compute the _Mean Average Precision_ for the `baseline.run` results.__

In [56]:
#THIS IS GRADED!

def mean_average_precision(all_relevant, all_retrieved):
    # BEGIN ANSWER
    total = 0
    count = 0
    for key in all_relevant:
        precision =  average_precision(all_relevant[key], all_retrieved.get(key, []))
        total += precision
        count += 1
    # END ANSWER
    return total / count

mapr = mean_average_precision(all_relevant, all_retrieved)
print('Mean Average Precision (MAP): %1.3f\n' % mapr)

Mean Average Precision (MAP): 0.002



***
## TREC benchmark evaluation

Below you find a function that take `all_relevant` and `all_retrieved` to compute the mean value of the `measure` over all queries. 

The function `mean_metric()`'s first function argument, `measure`, is a special argument: it is a function too! The `mean_metric` function sums the total score for the particular measure and divides it by the number of queries. It computes the average measures over all the query results.

_This part will be reused later to compare the results of different models._

In [48]:
def mean_metric(measure, all_relevant, all_retrieved):
    total = 0
    count = 0
    for qid in all_relevant:
        relevant  = all_relevant[qid]
        retrieved = all_retrieved.get(qid, [])
        value = measure(relevant, retrieved)
        total += value
        count += 1
    return "mean " + measure.__name__, total / count

# Example of use of the mean_metric function: computing the average r_precision
mean_metric(r_precision, all_relevant, all_retrieved)

('mean r_precision', 0.09155954402134368)

### TREC overview of the results of a run
The following two functions use your implementation of the metrics to create an overview of the performance metrics on the TREC benchmark data. Give a look at the numbers and make your own interpretations of the results. 

In [54]:
def trec_eval(qrels_file, run_file):

    def precision_at_1(rel, ret): return precision_at_k(rel, ret, k=1)
    def precision_at_5(rel, ret): return precision_at_k(rel, ret, k=5)
    def precision_at_10(rel, ret): return precision_at_k(rel, ret, k=10)
    def precision_at_50(rel, ret): return precision_at_k(rel, ret, k=50)
    def precision_at_100(rel, ret): return precision_at_k(rel, ret, k=100)
    def precision_at_recall_00(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.0)
    def precision_at_recall_01(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.1)
    def precision_at_recall_02(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.2)
    def precision_at_recall_03(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.3)
    def precision_at_recall_04(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.4)
    def precision_at_recall_05(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.5)
    def precision_at_recall_06(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.6)
    def precision_at_recall_07(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.7)
    def precision_at_recall_08(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.8)
    def precision_at_recall_09(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.9)
    def precision_at_recall_10(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=1.0)

    (all_relevant, all_retrieved) = read_eval_files(qrels_file, run_file)
    
    unknown_qids = set(all_retrieved.keys()).difference(all_relevant.keys())
    if len(unknown_qids) > 0:
        raise ValueError("Unknown qids in run: {}".format(sorted(list(unknown_qids))))

    metrics = [success_at_1,
               success_at_5,
               success_at_10,
               r_precision,
               precision_at_1,
               precision_at_5,
               precision_at_10,
               precision_at_50,
               precision_at_100,
               precision_at_recall_00,
               precision_at_recall_01,
               precision_at_recall_02,
               precision_at_recall_03,
               precision_at_recall_04,
               precision_at_recall_05,
               precision_at_recall_06,
               precision_at_recall_07,
               precision_at_recall_08,
               precision_at_recall_09,
               precision_at_recall_10,
               average_precision]

    return [mean_metric(metric, all_relevant, all_retrieved) for metric in metrics]


def print_trec_eval(qrels_file, run_file):
    results = trec_eval(qrels_file, run_file)
    print("Results for {}".format(run_file))
    for (metric, score) in results:
        print("{:<30} {:.4}".format(metric, score))

print_trec_eval('data01/FIR-s05-training-qrels.txt', 'data01/baseline.run')

Results for data01/baseline.run
mean success_at_1              0.1053
mean success_at_5              0.2632
mean success_at_10             0.3158
mean r_precision               0.09156
mean precision_at_1            0.0001053
mean precision_at_5            0.0003947
mean precision_at_10           0.0004737
mean precision_at_50           0.0009737
mean precision_at_100          0.001395
mean precision_at_recall_00    0.2015
mean precision_at_recall_01    0.1898
mean precision_at_recall_02    0.1683
mean precision_at_recall_03    0.1333
mean precision_at_recall_04    0.1236
mean precision_at_recall_05    0.1227
mean precision_at_recall_06    0.08744
mean precision_at_recall_07    0.08435
mean precision_at_recall_08    0.05999
mean precision_at_recall_09    0.05803
mean precision_at_recall_10    0.05803
mean average_precision         0.001789


## Exercise 01.G: _Significance testing_

Testing the statistical significance of differences of the results of different IR systems is important (see slides of lecture 01 and course book, Section 8.8). One of the basic tests one can perform is the two-tailed [sign test](https://en.wikipedia.org/wiki/Sign_test).

Only for this exercise, we use the run files obtained by  [Hiemstra and Aly](https://djoerdhiemstra.com/wp-content/uploads/trec2014mirex-draft.pdf) for the TREC Web track 2014 benchmark (note these files are from a different benchmark from what we have been working with so far). The `utbase.run` file was generated using Language Modeling, while `utexact.run` was generated using an IR system based on matching the exact query string, and ranking the documents by  the number of exact matches found.  The exact run improves the _Precision at 5_ to 0.456 (compared to 0.440 for the baseline run).  

__Implement the code to perform the _sign test_ of statistical significance.__
> _Hint:_ for each sign, compute the number of queries that increase/descrease performance (called `better, worse` in the code below). How would you use these values to compute the _p_ value of the two-tailed sign test? Is the difference between _utbase_ and _utexact_ significant?

In [None]:
#THIS IS GRADED!

def sign_test_values(measure, qrels_file, run_file_1, run_file_2):
    all_relevant = read_qrels_file(qrels_file)
    all_retrieved_1 = read_run_file(run_file_1)
    all_retrieved_2 = read_run_file(run_file_2)
    better = 0
    worse  = 0
    # BEGIN ANSWER
    # END ANSWER
    return(better, worse)
    
def precision_at_rank_5(rel, ret):
    return precision_at_k(rel, ret, k=5)

sign_test_values(precision_at_rank_5, 'data01/trec.qrels', 'data01/utbase.run', 'data01/utexact.run')

In [None]:
# BEGIN ANSWER
# END ANSWER

***
***
***
***
***

# Part 02 - Indexing and querying with ElasticSearch

## Preparation: Getting started with Elasticsearch

The following parts of the assignment will be based on ElasticSearch. you are adviced to go through the "Elasticsearch, [reference guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html)", and work on the tutorials. You can skip the section on [Installation](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html), as we provide it already installed in the Virtual Machine.

> If you want (disclaimer: we do __not__ give help with this!), you can 
> follow the [Installation](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html) to run Elasticsearch on your laptop without VM. Beware your system will likely be different from the 
> one of your colleagues and they might not be able to help you if 
> you have problems that are specific to your system, your operating
> system, or your Elasticsearch version.

### Starting/Stopping ElasticSearch
To start ElasticSearch on the virtual machine, you can type `sudo service elasticsearch start` in a Terminal.
To stop the ElasticSearch server, instead, you can type `sudo service elasticsearch stop`. Refer at the [the official guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/deb.html#deb-running-init), for more information.

### The REST API

Elasticsearch runs its own server that can be accessed by a regular web browser by opening this link: http://localhost:9200. 

Elasticsearch will respond with something like:

    {
        "name" : "fir-machine",
        "cluster_name" : "elasticsearch",
        "cluster_uuid" : "w7SBVo1ESVivMApbLIqRvA",
        "version" : {
            "number" : "7.9.0",
            "build_flavor" : "default",
            "build_type" : "deb",
            "build_hash" : "a479a2a7fce0389512d6a9361301708b92dff667",
            "build_date" : "2020-08-11T21:36:48.204330Z",
            "build_snapshot" : false,
            "lucene_version" : "8.6.0",
            "minimum_wire_compatibility_version" : "6.8.0",
            "minimum_index_compatibility_version" : "6.0.0-beta1"
        },
        "tagline" : "You Know, for Search"
    }


If you see this, then your Elasticsearch node is up and running. The RESTful API uses simple text or JSON over HTTP. 

> REST, API, JSON, HTTP, that's a lot of abbreviations! It is good to
> be familiar with the terminology. Let us explain: The Elasticsearch
> response is not (only) intended for humans. It is supposed to be used 
> by applications that run on the client machines, and therefore the
> interface is called an Application Programming Interface (API). The 
> API uses a format called JSON (JavaScript Object Notation), which 
> can be easily read by machines (and humans). The API sends its JSON
> response using the same method as your web browser displays web
> pages. This method is called HTTP (Hyper Text Transfer Protocol), 
> and it is the reason you can inspect the response in a normal web
> browser. APIs that use HTTP are called RESTful interfaces. REST 
> stands for REpresentational State Transfer, arguably one of the
> simplest ways to define an API.


### Interacting with the ElasticSearch server

You can interact with your Elasticsearch service in different ways. In this first part we explore Kibana, a dashboard for inspection of your indices. Later during the practical work we will use the Python Elasticsearch client or the DSL library. You can also start yourself with Python.

#### Kibana
Kibana provides a web interface to interact with your Elasticsearch service. It's available from http://localhost:5601. You can use Kibana to create interactive dashboards visualizing data in your Elasticsearch indices. It also provides a console to execute Elasticsearch commands. It's available from http://localhost:5601/app/kibana#/dev_tools

To start Kibana on the virtual machine, you can type `sudo service kibana start` in a Terminal. \
To stop the Kibana server, instead, you can type `sudo service kibana stop`.

Many examples from the Elasticsearch user guide can be directly executed in Kibana by clicking on the `CONSOLE` button.



# Indexing and queries (Exercises - Part 02)

_You can work on this part after Lecture 01 already_


## Collection indexing: useful code

We provide some code to read the TREC collection documents and index them into the ElasticSearch engine.
As we need to re-index the document collection when we use a different indexing configurations (called Mappings in ElasticSearch), we developed some functions to support a quick re-indexing in the following exercises.

Below you find the Python code for bulk-indexing our (FIR)Medline collection. Execute the following cells to index the collection in an Elasticsearch index called `genomics'. Study the code carefully, as you will use the indexing functions later for the completion of the assignment.

> The code uses additional helper functions 
> (`elasticsearch.helpers`) and a library for processing JSON.
> The function `read_documents()` reads the bulk collection file: The 
> function is a [Python generator](https://wiki.python.org/moin/Generators) function. It generates an 'on-demand' list
> by using the statement `yield` for every item of the list. It
> is used in the helper function `elasticsearch.helpers.bulk()`.
> The statement `raise` is Python's approach to throw exceptions: it exits the program with an error.
> Note the (keyword) arguments to bulk:
> `chunk_size` indicates the number of documents to be processed by
> elasticsearch in one batch. 
> The request_timeout is set to 30 seconds because processing a single batch
> of documents can take some time.

> __Note:__ _when processing a bulk index, be sure to have few GigaBytes free on the hard drive of the VM. If you get a BulkIndexError with read-only/FORBIDDEN errors, you probably have too little hard drive space available for ElasticSearch to work properly._


**_Note:_ indexing the (FIR)TREC genomics collection can take some time, be patient.**

In [None]:
import elasticsearch
import elasticsearch.helpers
import json

def read_documents(file_name):
    """
    Returns a generator of documents to be indexed by elastic, read from file_name
    """
    with open(file_name, 'r') as documents:
        for line in documents:
            doc_line = json.loads(line)
            if ('index' in doc_line):
                id = doc_line['index']['_id']
            elif ('PMID' in doc_line):
                doc_line['_id'] = id
                yield doc_line
            else:
                raise ValueError('Woops, error in index file')

def create_index(es, index_name, body={}):
    # delete index when it already exists
    es.indices.delete(index=index_name, ignore=[400, 404])
    # create the index 
    es.indices.create(index=index_name, body=body)
                
def index_documents(es, collection_file_name, index_name, body={}):
    create_index(es, index_name, body)
    # bulk index the documents from file_name
    return elasticsearch.helpers.bulk(
        es, 
        read_documents(collection_file_name),
        index=index_name,
        chunk_size=2000,
        request_timeout=30
    )

In [None]:
# Connect to the ElasticSearch server
es = elasticsearch.Elasticsearch(host='localhost')  # in case you use Docker, the host is 'elasticsearch'

# Index the collection into the index called 'genomics'
body = {} # no indexing options (leave default)
index_documents(es, 'data01/FIR-s05-medline.json', 'genomics-base', body)

> You can change the name of the index, in case you want to have different indices of the same collection created with different indexing settings, and compare the performance on the test queries. 

> E.g. you create two indices 'genomics01' and 'genomics02': genomics01 uses the default options, while genomics02 uses custom tokenizers. You will then have two indices with different characteristics (and probably different performance). 

## Exercise 02.A: index properties and querying

__1. Query the index called 'genomics-base' and determine how many documents are indexed.__

You can use Kibana (suggested for the time being - you can use the command line in Kibana), the Python ElasticSearch library or DSL. Report the code you implemented and the resulting number of documents.

In [None]:
#THIS IS GRADED!

# write the code here
# BEGIN ANSWER
# END ANSWER

__2. How many documents containing the term `molecule` are there in your index? (searching all fields of the documents).__

You can use Kibana (suggested for the time being - you can use the command line in Kibana), the Python ElasticSearch library or DSL. Report the code you implemented and the resulting number of documents.

In [None]:
#THIS IS GRADED!

# write the code that generates the answer here (you may also use Kibana)
# BEGIN ANSWER
# END ANSWER

__3. How many documents containing the term `molecular` are there in your index? (searching all fields of the documents).__

You can use Kibana (suggested for the time being - you can use the command line in Kibana), the Python ElasticSearch library or DSL. Report the code you implemented and the resulting number of documents.

In [None]:
#THIS IS GRADED!

# write the code that generates the answer here (you may also use Kibana)
# BEGIN ANSWER
# END ANSWER

__4. How many documents containing the terms `cell` AND `blood` are there in your index? (searching all fields of the documents).__

You can use Kibana (suggested for the time being - you can use the command line in Kibana), the Python ElasticSearch library or DSL. Report the code you implemented and the resulting number of documents.

In [None]:
#THIS IS GRADED!

# write the code that generates the answer here (you may also use Kibana)
# BEGIN ANSWER
# END ANSWER

In [None]:
import elasticsearch
es = elasticsearch.Elasticsearch(host='localhost')  # in case you use Docker, the host is 'elasticsearch'

# this is another solution (if query_string is used, be sure that AND is in the query, otherwise it will not search properly)
term = 'blood AND cell'
body = {"track_total_hits": True, "query": {"query_string": 
                                            {"query": term, 
                                             "default_operator":"AND", 
                                             "auto_generate_synonyms_phrase_query": True }}}
result = es.search(index='genomics-base', body=body)
print("Number of results: {}".format(result['hits']['total']['value']))

## Exercise 02.B: the Python ElasticSearch library

#### Preparation
The command line is fine for doing basic operations on your Elasticsearch indices, but as soon as things get more complex, you better use custom client programs.
We will use the [Elasticsearch client library for Python](https://elasticsearch-py.readthedocs.io). This library will execute the HTTP requests that you have used before (with CURL or Kibana). The library is pre-installed on the VM.

#### Exercise

__Write the code that searches the index for _"molecule"_ using the [search()](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.search) function.__ Your code will take at minimum the following steps:

1. import the python library `elasticsearch`.
2. open a connection with the Elasticsearch host `'elasticsearch'` with `Elasticsearch()`.
3. execute a search with `search()` using the index `genomics-base`, and a correct query body.
4. print the JSON output of Elasticsearch 

How many hits are there in your index? Is the result the same as in Exercise 02.A?

> Elasticsearch runs on localhost on your laptop, at port 9200 (so as http://localhost:9200)


In [None]:
#THIS IS GRADED!

import elasticsearch

# your code below
# BEGIN ANSWER
# END ANSWER



The Python client library returns Python objects, that use [dictionaries](https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries) and [lists](https://docs.python.org/3.6/tutorial/introduction.html#lists).
Use a [for loop](https://docs.python.org/3.6/tutorial/controlflow.html#for-statements) to inspect each hit, and print the retrieved document's titles one by one. 

In [None]:
#example
print("Number of results: {}".format(response['hits']['total']['value']))
# your code below


## Exercise 02.C: _Search using the Elasticsearch DSL_

You will notice that the native query format of Elasticsearch can be quite verbose.
Elasticsearh provides the Python library `elasticsearch_dsl` to write more concise Elasticsearch queries. 
This is only to simplify the syntax: the library still issues Elasticsearch queries.

For example, a simple `multi_match` query looks as follows:
```python
query = {
   "query": {
       "multi_match": {}
   }
}
```

The same query can be created with the DSL as follows:
```python
query = Q("multi_match")
```

Especially for more complicated boolean queries, to use the native query format can become complicated.
Read more about the DSL [here](https://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html)

__1. Search for the query `molecule` and check whether you get the same number of results as for exercise 02.A(2).__

In [None]:
#THIS IS GRADED!

# your code here
# BEGIN ANSWER
# END ANSWER

__2. Search for the documents that contain the words `cell` AND `blood`, using the DSL library. Check whether you get the same number of results as for exercise 02.A(4).__

In [None]:
#THIS IS GRADED!

# your code here
# BEGIN ANSWER
# END ANSWER

***
##  Exercise 02.D: Making your own TREC run

We will adopt a scientific approach to building search engines. That is, we are not only going to build a search engine and see that it works, but we are also going to _measure_ how well it works, by measuring the search engine's quality. We will adopt the method from the [Text Retrieval Conference](http://trec.nist.gov) (TREC). TREC provides researchers with test collections, that consists of 3 parts:

1. the document collection (in our case a part of the MEDLINE database)
2. the topics (which are natural language descriptions of what the user is searching for: you can think of the as the _queries_)
3. the relevance judgments (for each topic, what documents are relevant)



__Exercise: Complete the code of the Python function `make_trec_run()` that reads the topics [FIR-s05-training-queries-simple.txt](data01/FIR-s05-training-queries-simple.txt), and for each topic does a search using Elasticsearch.__ The program should output a file in the [TREC submission format](https://trec-core.github.io/2017/#submission-guidelines). We already provided the first  lines for this exercise, which include:

1. Open the file `'run_file_name'`' for writing and call it `run_file`.
2. Open the file `'topics_file_name'` for reading, call it `test_queries`.
3. For each line in `test_queries`:
4. Remove the newline using `strip()`, then split the string on the tab character (`'\t'`). The first part of the line is now `qid` (the query identifier) and the last part is `query` (a textual description of the query).
5. complete the Python program such that the correct TREC run file is written to `'run_file_name'`.

> **Note**: Make sure you output the `PMID` (pubmed identifier) of the document `hit['_source']['PMID']`. Do **not** use the elasticsearch identifier `_id` because they do not match the document identifiers in the relevance judgements. They were randomly generated by Elasticsearch during indexing.


__Make sure to search in the fiels `TI` and `AB`, which correspond to the title and abstract, respectivelt, of the scientific papers of the MEDLINE collection.__

In [None]:
#THIS IS GRADED!

def make_trec_run(es, topics_file_name, run_file_name, index_name="genomics", run_name="test"):
    with open(run_file_name, 'w') as run_file:
        with open(topics_file_name, 'r') as test_queries:
            for line in test_queries:
                (qid, query) = line.strip().split('\t')
                # BEGIN ANSWER
                # END ANSWER
                
# connect to ES server             
es = elasticsearch.Elasticsearch('localhost')
# Write the results of the queries contained in the topic file `'data/training-queries-simple.txt'` 
# to the run file `'baseline.run'`, and name this test as `test01`
make_trec_run(es, 'data01/FIR-s05-training-queries-simple.txt', 'baseline.run', "genomics-base", run_name='test01')

In [None]:
# this prints out (it is a shell command) the content of the file baseline.run 
!cat baseline.run

> Tip: Write a line to `run_file` using `run_file.write(line)`. 
> The newline character is: `'\n'`. Before writing a number to
> the file, cast it to a string using `str()`.
>
> The TREC Submission guidelines allow you to submit up to 1000
> documents per topic. Keep this in mind!

***
***
***
***
***
***

# Part 03: Search models 


<span style="background:red; color: white;">__You are advised to work on this part after Lecture 02 (Models of search)__</span>


### Background
The way documents are indexed influences the performance of the IR systems. 
Elasticsearch [Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/mapping.html) define how a document, and its properties (fields) are stored and indexed, but also provides tools to implement and execute different document similarity measures (i.e. search models).  When using a different configuration of an ElasticSearch Mapping, the document collection needs to be re-indexed (or a new index need to be created - use the functions we provided above to do that).

> See again: [Index Settings and Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/indices-create-index.html).
> _Note: the default model (similarity) in ElasticSearch is BM25. Different models need to be specified (see example)._

For instance, we can add a new field `"title-abstract"` that uses the  [similarity measure](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/similarity.html) _Boolean_, and let it serve as an index for the fields `"TI"` and `"AB"` (title and abstract):

> Plase note that if you want to use the `boolean` similarity for the single fields, you need to specify it for each field. Otherwise, the default BM25 will be used.

In [None]:
boolean = {
  "settings" : {
    # a single shard, so we do not suffer from approximate document frequencies
    "number_of_shards" : 1
  },
  "mappings": {
      "properties": {
        "AB": {
          "type": "text",
          "copy_to": "title-abstract",
          "similarity": "boolean"
        },
        "TI": {
          "type": "text",
          "copy_to": "title-abstract",
          "similarity": "boolean"
        },
        "title-abstract": {  # compound field
          "type": "text",
          "similarity": "boolean"
        }
      }
  }
}

es = elasticsearch.Elasticsearch('localhost')
index_documents(es, 'data01/FIR-s05-medline.json', 'genomics-bool', body=boolean)

> Most changes to the mappings cannot be done on an existing index. Some (for instance
> similarity measures) can be changed if the index is first closed. Nevertheless, we 
> will in this notebook _re-index_ the collection for every change to the mappings
> using the function `index_documents()` that we defined above. Mappings (and settings)
> can be passed to the function using the `body` parameter.

<span style="background:#444; color: white;">__We suggest you to create different indices using different models of search (according to the available disk space on your VM). This will avoid that changes are not correctly applied, and you won't see the expected results.__</span>

<span style="background:#444; color: white;">E.g. for the 'boolean' model, we created the 'genomics-bool' index.</span>

Let's have a look at the mappings and settings for our index as follows:

In [None]:
es.indices.get(index='genomics-bool')

Now let's search our new field `"title-abstract"` as follows:

In [None]:
query = "molecule"
search_type = "dfs_query_then_fetch" # this will use exact document frequencies even for multiple shards
body = {
  "query": {
    "match" : { "title-abstract" : query }
  },
  "size": 10
}
es.search(index="genomics-bool", search_type=search_type, body=body)

## Exercise 04.A: _new run and evaluation_
Create a new run file (e.g. `boolean.run`), compute the retrieval performance with the function `print_trec_eval()` and compare the results with the baseline run file `baseline.run`.

In [None]:
#THIS IS GRADED!

# write your code here
# BEGIN ANSWER
# END ANSWER

## Exercise 04.B: _Language models_

Custom similarities can be configured by tuning the parameters of the built-in similarities. Read more about these (expert) options in the [similarity module](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/index-modules-similarity.html).

> Tip: the example similarity settings have to be used in a `"settings"` object.
> Check your settings and mappings with: `es.indices.get(index='NAME-OF-INDEX')`.

__1. Make a run that uses Language Models with [Jelinek-Mercer smoothing](http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html) (linear interpolation smoothing) on the field `"all"` that indexes the fields `"TI"` and `"AB"`. Use the parameter `lambda=0.2`.__

In [None]:
#THIS IS GRADED!

lmjelinekmercer = {
  # BEGIN ANSWER
  # END ANSWER
}

In [None]:
index_documents(es, 'data01/FIR-s05-medline.json', 'genomics-jm', body=lmjelinekmercer)
make_trec_run(es, 'data01/FIR-s05-training-queries-simple.txt', 'lmjelinekmercer.run', 'genomics-jm')

__2. Make a run that uses Language Models with [Dirichelet smoothing](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html) to index the fields `"TI"` and `"AB"`. Use the parameter `mu=2000`.__

In [None]:
#THIS IS GRADED!

dirichlet = {
  # BEGIN ANSWER
  # END ANSWER
}

In [None]:
index_documents(es, 'data01/FIR-s05-medline.json', 'genomics-dirichlet', body=dirichlet)
make_trec_run(es, 'data01/FIR-s05-training-queries-simple.txt', 'dirichlet.run', 'genomics-dirichlet')

## Exercise 04.C: _Model comparison_


__1. Compute the performance results of the `lmjelinekmercer.run` and `dirichelet.run`. Compare them with those of the `baseline.run` and `boolean.run`. Evaluate the runs using the `print_trec_eval` function. Performing statistical tests may help strengthen your claims.__

In [None]:
#THIS IS GRADED!

# your comments here
# BEGIN ANSWER
# END ANSWER

In [None]:
print('Top20 retrieved documents baseline.run')
! head -10 baseline.run

print('\nTop20 retrieved documents boolean.run')
! head -10 boolean.run

print('\nTop20 retrieved documents lmjelinekmercer.run')
! head -10 lmjelinekmercer.run

print('\nTop20 retrieved documents dirichelet.run')
! head -10 dirichelet.run



__2. Provide below your comments and interpretations of the results. Why, in your opinion, one model of search is better than the others?__

In [None]:
# answer as a comment:
# 



## Example: _ElasticSearch Analyzers for tokenization_

The amount and quality of the tokens used to construct the inverted index are of great importance. In ElasticSearch, mappings and settings also allow specifying what [Analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html) is used to tokenize your documents and queries. In the mappings below, use the _Dutch_ analyzer for the field `"all"`):

> Usually, the same analyzer should be applied to documents and queries, but 
> Elasticsearch allows you to specify a `"search_analyzer"` that is used on 
> your queries (which we do not need to use in the assignment).

In [None]:
analyzer_test = {
  "mappings": {
      "properties": {
        "all": {
          "type": "text",
          "analyzer": "dutch"
        }
      }
  }
}

# create the index, but don't index any documents:
create_index(es, 'test-tokens', body=analyzer_test)

The analyzer defined for the `"all"` field can be tested [as follows](https://elasticsearch-py.readthedocs.io/en/master/api.html#indices). Translated to English the text says: _"This is a Dutch sentence"_. 

> The following script identifies the tokens (based on the use of the dutch tokenizer): try with different tokenizers and different sentences to see how the tokens are created.

In [None]:
from pprint import pprint # pretty print

body = { "field": "all", "text": "dit zijn nederlandse zinnen"}
tokens = es.indices.analyze(index='test-tokens', body=body)
pprint(tokens)

***
***
***
***
***
***

# Part 04: Index improvements: Tokenization
<span style="background:red; color: white;">__You are advised to work on this part after Lecture 03 (Conceptual Indexing)__</span>



## Background
The following part of the assignment requires some self-study of the ElasticSearch tools to support the improvemnet of the indexing. Please read the:
* [Index Settings and Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-create-index.html).
* Elasticsearch [Analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/analysis.html) contain many options for improving your search engine.

> You are suggested to use the [Python Elasticsearch Client](https://elasticsearch-py.readthedocs.io) library documentation.

##  Exercise 03.A: _chat language analyzer_

Read the documentation for [Custom Analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/analysis-custom-analyzer.html). 
Make a custom analyzer for _English chat language_. The analyzer should do the following:
* change common abbreviations to the full forms: 
  * _b4_ to _before_, 
  * _abt_ to _about_, 
  * _chk_ to _check_, 
  * _dm_ to _direct message_,
  * _f2f_ to _face-to-face_
* use the _standard_ tokenizer;
* put everything to lower-case;
* filter English stopwords.

In [None]:
#THIS IS GRADED!

tweet_analyzer = {
  # BEGIN ANSWER
  # END ANSWER
}

# create the index, but don't index any documents:
create_index(es, 'genomics', body=tweet_analyzer)
body = { "field": "all", "text": "done it b4! what abt dm me?"}
tokens = es.indices.analyze(index='genomics', body=body)
pprint(tokens)

## Exercise 03.B: Stemmers

Referring at Exercise 02.A, we have seen that queries like `molecule` and `molecular` retrieve different sets of documents. Lemmatizer and stemmers can help the indexing and search of 'similar' terms, and retrieve more consistent sets of documents.

__Use the ElasticSearch [Stemming](https://www.elastic.co/guide/en/elasticsearch/reference/current/stemming.html) to index the document collection. Then retrieve documents with the queries `molecule` and `molecular` and comment on the eventual differences with the previous query results.__


In [None]:
#THIS IS GRADED!

body = {
  # BEGIN ANSWER
  # END ANSWER
}

body = {
  # BEGIN ANSWER
  # END ANSWER
}

In [None]:
# Connect to the ElasticSearch server
es = elasticsearch.Elasticsearch(host='localhost')  # in case you use Docker, the host is 'elasticsearch'

# Index the collection into the index called 'genomics-stem'
index_documents(es, 'data01/FIR-s05-medline.json', 'genomics-stem', body)

__Retrieve documents with the queries `molecule` and `molecular` and comment on the eventual differences with the previous query results.__

In [None]:
#THIS IS GRADED!
# BEGIN ANSWER
# END ANSWER

In [None]:
# Comment here about the eventual different results you get 
# -> words that are stemmed, like molecule and molecules, should improve the retrieval results (in terms of amount of retrieved dos)
# -> words that are not stemmed, like 'molecular' should not see much differernt results

# the point of the exercise is not to have the same results with molecule and molecular, but 
# what the stemmer does, and reason after that.

# BONUS PART: _Implement your own similarity measure_ 

We have only seen the results of using the analyzer to queries. The analyzer results from the _documents_ are available using the `termvectors()` function, as follows for document `id=3`: (Additionally, we can get overall field statistics, such as the number of documents)

> First, index the collection again. While waiting, have a coffee or tea :) 

> `id=3` refers to the internal document identifiers, so not to the Pubmed identifier.

_The bonus exercise is not mandatory. It can compensate for missing other exercises._

In [None]:
import elasticsearch
es = elasticsearch.Elasticsearch(host='localhost')

# index_documents(es, 'data/FIR-05-medline.json', 'genomics-base')

es.termvectors(index="genomics-base", id="3", fields="TI", 
               term_statistics=True, field_statistics=True, offsets=False)

### Implement the BM25 similarity

Complete the function `bm25_similarity()` below by implementing the BM25 similarity as described by in Section 11.4.3 of [Manning, Raghavan and Schuetze, Chapter 11](https://nlp.stanford.edu/IR-book/pdf/11prob.pdf). Are you able to replicate the score of ElasitcSearch (9.55)? If not, are you using a different variant of the BM25 model? Provide your comments in plain text.

In [None]:
#THIS IS GRADED!

import math

# math.log(x) computes the logarithm of x

def bm25_similarity (query, doc_id):

    # Get the query tokens (see above)
    query_tokens = es.indices.analyze(index='genomics-base', body={"field":"TI", "text": query})
    tokens = query_tokens['tokens']

    # Get the term vector for doc_id and the field statistics
    term_vector = es.termvectors(index="genomics-base", id=doc_id, fields="TI", 
                  term_statistics=True, field_statistics=True, offsets=False)
    vector = term_vector['term_vectors']['TI']['terms']
    f_stats = term_vector['term_vectors']['TI']['field_statistics']

    # The answer should sum over 'tokens', check if the tokens exists in the 'vector',
    # and if so, add the appropriate value to 'similarity'.
    # Tip: add print statements to your code to see what each variable contains.
    
    similarity = 0

    # BEGIN ANSWER
    # END ANSWER
    return similarity

bm25_similarity("structure refinement", 3)

In [None]:
# eventual comments here

See below the 'reference score' computed by ElasticSearch:

In [None]:
body = {
  "query": {
    "match" : { "TI" : "structure refinement" }
  }
}
explain = es.explain(index="genomics-base", id="3", body=body)
print (explain['explanation']['value'])  # BM25 score computed by ElasticSearch