# Scaffolding project

_IN4325: Information retrieval lecture, TU Delft_

Welcome to the **IN4325: Information retrieval** lecture!

This project acts as a gentle introduction to information retrieval for you. You do not need any prior knowledge about IR for this task. Only some Python programming skills are required.

## Getting started

Under the hood, this notebook uses a library called **PyTerrier**. Please check out the first part of our _Introduction to PyTerrier_ series to learn how to install PyTerrier. However, you do not need to interact with PyTerrier directly for now; instead, we provide simple utility functions you can use. Feel free to look at how these are implemented, but it's not required.

**Task 1**: Install PyTerrier (see the `01-setup.ipynb` notebook).

Now you should be able to import the utility functions. The first time you do this, a dataset will be downloaded and indexed automatically (this will take a minute). If you have any issues running this cell, try removing the `index` directory (if it exists) and restarting the kernel of this notebook.


In [None]:
from util import search, evaluate, evaluate_all
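
If the import above fails, the troubleshooting step of removing the `index` directory can be done from Python. This is a minimal sketch that assumes the index lives in a directory named `index` next to this notebook; the helper name `remove_index_dir` is hypothetical and not part of `util`.

```python
import shutil
from pathlib import Path


def remove_index_dir(path: str = "index") -> bool:
    """Delete the index directory if it exists.

    Returns True if a directory was removed, False if there was nothing to remove.
    The default path "index" is an assumption about where the index is stored.
    """
    index_dir = Path(path)
    if index_dir.is_dir():
        shutil.rmtree(index_dir)
        return True
    return False
```

After calling `remove_index_dir()`, restart the kernel and run the import cell again so the index is rebuilt.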

Now that we have loaded the data, you can run search queries. For example:


In [None]:
search("what is the meaning of life")

What you get here is a list of ten documents from the corpus that are ordered by how relevant they are to our query (according to the search engine).

## Query rewriting

The goal of this task is to come up with a way of **rewriting queries** such that the search engine can "understand" them better.

In order to do this, let's first take a look at some example queries from our dataset. We represent these queries using a `pandas.DataFrame`, where the first column corresponds to the **query ID** and the second column corresponds to the **query**:


In [None]:
import pandas as pd

example_queries = pd.DataFrame(
    [
        [
            "443848",
            "does anybody know where i could get a free guide on how to train a siberian husky",
        ],
        [
            "1783010",
            "what is blaphsemy",
        ],
        [
            "2838988",
            "how can i get a cork out of not into a wine bottle without a corkscrew",
        ],
    ],
    columns=["qid", "query"],
)

Since these queries are taken from the dataset, we can **evaluate the performance** of our search engine on these queries. This means that we know which documents the system should retrieve for each query.

You can use the following evaluation function to do this. It takes your queries and returns a score (mean average precision; you will learn about this later). For now, all you need to know is that the higher the score, the better the system works.

Let's evaluate the queries we have:


In [None]:
print("score:", evaluate(example_queries))
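
If you are curious about what the score measures, the textbook definition of mean average precision can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the document IDs are made up, and the `evaluate` utility may compute its score differently (for example, with a rank cutoff).

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list of document IDs.

    ranking:  document IDs in the order the search engine returned them
    relevant: the document IDs known to be relevant for the query
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank: relevant documents seen so far / rank
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevant_sets):
    """Mean of the average precision values over several queries."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevant_sets)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up document IDs
toy_rankings = [["d1", "d2", "d3", "d4"], ["d2", "d1"]]
toy_relevant = [{"d1", "d3"}, {"d1"}]
print(mean_average_precision(toy_rankings, toy_relevant))
```

In the first toy query, relevant documents appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2. Moving relevant documents closer to the top of the ranking raises the score, which is what a good query rewrite tries to achieve.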

Now it's up to you to figure out if and how it's possible to make the search engine perform better on these queries. How would you query a search engine if you wanted to know about these topics? Experiment a bit.

**Task 2**: Try to manually come up with ways to rewrite or reformulate the queries so the performance improves.

**Important**: Make sure that the query IDs match! Otherwise, evaluation will not work.


In [None]:
example_queries_rewritten = pd.DataFrame(
    [
        [
            "443848",
            "",  # TODO: add rewritten query here
        ],
        [
            "1783010",
            "",  # TODO: add rewritten query here
        ],
        [
            "2838988",
            "",  # TODO: add rewritten query here
        ],
    ],
    columns=["qid", "query"],
)
print("score after rewriting:", evaluate(example_queries_rewritten))
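
Since evaluation depends on matching query IDs, it can help to check them before calling `evaluate`. The sketch below uses a hypothetical helper, `check_query_ids`, which is not part of `util`; it accepts any two collections of IDs and raises an error if they differ.

```python
def check_query_ids(original_qids, rewritten_qids):
    """Raise ValueError if the two collections do not contain the same query IDs."""
    original = set(original_qids)
    rewritten = set(rewritten_qids)
    missing = sorted(original - rewritten, key=str)
    unexpected = sorted(rewritten - original, key=str)
    if missing or unexpected:
        raise ValueError(
            f"query IDs do not match: missing={missing}, unexpected={unexpected}"
        )
```

With the data frames from this notebook, a call could look like `check_query_ids(example_queries["qid"], example_queries_rewritten["qid"])`. Note that the IDs are compared as-is, so the string `"443848"` and the integer `443848` count as different.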

## An automatic approach

In this last part, we'll try to come up with an automatic approach to query rewriting. Use your findings from Task 2 for this.

**Task 3**: Implement a function that automatically rewrites any input query.

You can use any approach or library you want for this task. However, keep in mind that simple ideas often work well!


In [None]:
def rewrite_query(query: str) -> str:
    # TODO: rewrite the query here
    return query
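
As one example of a simple idea, the sketch below lowercases the query and drops a few filler words. It is only an illustration, not the intended solution: the function name and the word list are assumptions made up for this example, and whether this helps at all is something you would have to verify with `evaluate_all`.

```python
import re

# Illustrative filler words only; this list is an assumption, not a tuned stopword list.
FILLER_WORDS = frozenset(
    {
        "a", "an", "the", "is", "are", "of", "to", "in", "on", "i",
        "what", "does", "do", "anybody", "know", "could", "can",
    }
)


def rewrite_query_remove_fillers(query: str) -> str:
    """Lowercase the query and remove filler words, keeping the original word order."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    kept = [token for token in tokens if token not in FILLER_WORDS]
    # Fall back to the original query if every word was filtered out
    return " ".join(kept) if kept else query
```

A function like this has the same signature as `rewrite_query`, so it could be passed to `evaluate_all` in the same way.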

This time, we'll evaluate on _all_ queries in the dataset. This will give us a more general result:


In [None]:
print("score:", evaluate_all())

Are you able to improve the overall performance using your rewriting approach?


In [None]:
print("score after rewriting:", evaluate_all(rewrite_query))