# Scaffolding project

_DSAIT4050: Information retrieval lecture, TU Delft_

Welcome to the **DSAIT4050: Information retrieval** lecture!

This project is a gentle introduction to information retrieval (IR). You do not need any prior knowledge of IR for this task; only some Python programming skills are required.

## Getting started

Under the hood, this notebook uses a library called **PyTerrier**. Please check out the first part of our _Introduction to PyTerrier_ series to learn how to install PyTerrier. However, you do not need to interact with PyTerrier directly for now; instead, we provide simple utility functions you can use. Feel free to have a look at how these are implemented, but it's not required.

**Task 1**: Install PyTerrier (see the `01-setup.ipynb` notebook).

Now you should be able to import the utility functions. A dataset will be downloaded and indexed automatically (this may take a few minutes).


In [1]:
import pyterrier as pt
pt.init(mem=8192)   # 8GB heap (recommended if you have 16GB+ RAM)

Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
The following code will have the same effect:
pt.java.set_memory_limit(8192)
pt.java.init() # optional, forces java initialisation
  pt.init(mem=8192)   # 8GB heap (recommended if you have 16GB+ RAM)


In [2]:
from util import search, evaluate, evaluate_all

antique/test/non-offensive documents: 100%|██████████| 403666/403666 [02:13<00:00, 3023.17it/s]


Now that we have loaded the data, you can run search queries. For example:


In [60]:
search("Remove cork wine bottle corkscrew")

Unnamed: 0,qid,docid,docno,text,rank,score,query
0,1,151275,860258_0,Twist the corkscrew into the cork. pull the ha...,0,37.810061,Remove cork wine bottle corkscrew
1,1,46192,2567318_1,You smell the cork. If it smells funny or cork...,1,32.618232,Remove cork wine bottle corkscrew
2,1,46195,2567318_4,Just because there was no discoloration or gro...,2,32.466415,Remove cork wine bottle corkscrew
3,1,46199,2567318_8,"it's old-fashioned, and really rather pointles...",3,30.115309,Remove cork wine bottle corkscrew
4,1,41948,2293005_1,baking soda and viniger in a wine bottle with ...,4,30.068586,Remove cork wine bottle corkscrew
5,1,46191,2567318_0,"If the wine has gone bad, you can tell by smel...",5,30.005108,Remove cork wine bottle corkscrew
6,1,389027,2838988_2,if you dont have access to a corkscrew cut it ...,6,29.606434,Remove cork wine bottle corkscrew
7,1,46194,2567318_3,Primarily so you can check the cork matches th...,7,27.755242,Remove cork wine bottle corkscrew
8,1,18134,2943171_1,"If you are desperate, break the top off and po...",8,27.676813,Remove cork wine bottle corkscrew
9,1,215342,1357216_4,"You better just push the cork all the way in, ...",9,27.508018,Remove cork wine bottle corkscrew


What you get here is a list of ten documents from the corpus that are ordered by how relevant they are to our query (according to the search engine).

## Query rewriting

The goal of this task is to come up with a way of **rewriting queries** such that the search engine can "understand" them better.

In order to do this, let's first take a look at some example queries from our dataset. We represent these queries using a `pandas.DataFrame`, where the first column corresponds to the **query ID** and the second column corresponds to the **query**:


In [4]:
import pandas as pd

example_queries = pd.DataFrame(
    [
        [
            "443848",
            "does anybody know where i could get a free guide on how to train a siberian husky",
        ],
        [
            "1783010",
            "what is blaphsemy",
        ],
        [
            "2838988",
            "how can i get a cork out of not into a wine bottle without a corkscrew",
        ],
    ],
    columns=["qid", "query"],
)

In [None]:
import util
dataset = util.DATASET
bm25 = util.BM25
bm25(dataset.get_topics())

Unnamed: 0,qid,docid,docno,text,rank,score,query
0,3990512,60507,1806638_9,"if your name was Robert Geoffrey Howe , then G...",0,21.741962,how can we get concentration onsomething
1,3990512,402618,1903438_2,TO LISTEN TO fucking bad bitches in the backse...,1,20.794981,how can we get concentration onsomething
2,3990512,183741,2090551_2,Well i never saw that. Hows that possible. huh,2,20.256742,how can we get concentration onsomething
3,3990512,256477,3178012_5,"if its brown flush it down, if its yellow let ...",3,17.447058,how can we get concentration onsomething
4,3990512,91987,2534403_0,Easter is the first Sunday after the Paschal F...,4,17.196628,how can we get concentration onsomething
...,...,...,...,...,...,...,...
159835,1340574,132629,3067105_8,Start going to bed earlier in the night. Make...,995,7.160186,why do some people only go to church on easter...
159836,1340574,133713,2097261_3,I'm going to cover you like a chicken on a jun...,996,7.160186,why do some people only go to church on easter...
159837,1340574,135325,3574610_0,That's a pretty personal question - I think yo...,997,7.160186,why do some people only go to church on easter...
159838,1340574,135495,2449369_1,pay the ticket now avoid going to court,998,7.160186,why do some people only go to church on easter...


In [92]:
search("blasphemy definition meaning of blasphemy what is blasphemy religion")

Unnamed: 0,qid,docid,docno,text,rank,score,query
0,1,342040,314742_3,The best 'religion' is the one that is meaning...,0,8.729942,blasphemy definition meaning of blasphemy what...
1,1,269119,698757_13,religion. without religion there would be no w...,1,7.100619,blasphemy definition meaning of blasphemy what...
2,1,3162,3090366_0,The best map I could find of is below (first l...,2,6.780162,blasphemy definition meaning of blasphemy what...
3,1,179316,1974867_10,good question; but those who do follow religio...,3,6.748019,blasphemy definition meaning of blasphemy what...
4,1,166020,2714495_7,do something meaningful and leave me alone.,4,6.421543,blasphemy definition meaning of blasphemy what...
5,1,265528,4188756_15,Because you only get one shot to do something ...,5,6.421543,blasphemy definition meaning of blasphemy what...
6,1,284036,327334_2,If life did not suck sometime it would not be ...,6,6.421543,blasphemy definition meaning of blasphemy what...
7,1,161990,2344586_12,a better question would be : why did god allow...,7,6.346504,blasphemy definition meaning of blasphemy what...
8,1,24563,514843_5,"To live until we die,and have a meaningful. li...",8,6.214717,blasphemy definition meaning of blasphemy what...
9,1,57646,3078918_0,You've found something more meaningful? Or you...,9,6.214717,blasphemy definition meaning of blasphemy what...


In [93]:
example_queries_rewritten = pd.DataFrame(
    [
        [
            "1783010",
            "blasphemy definition meaning of blasphemy what is blasphemy religion",
        ],

    ],
    columns=["qid", "query"],
)
print("score after rewriting:", evaluate(example_queries_rewritten))

score after rewriting: 0.0003675593290791388


Since these queries are taken from the dataset, we can **evaluate the performance** of our search engine on these queries. This means that we know which documents the system should retrieve for each query.

You can use the following evaluation function to do this. This function takes your queries and returns a score (mean average precision -- you will learn about this later). For now, all you need to know is that the higher this score, the better the system works.

Let's evaluate the queries we have:


In [5]:
print("score:", evaluate(example_queries))

[INFO] Please confirm you agree to the authors' data usage agreement found at <https://ciir.cs.umass.edu/downloads/Antique/readme.txt>
[INFO] [starting] https://ciir.cs.umass.edu/downloads/Antique/test-queries-blacklist.txt
[INFO] [finished] https://ciir.cs.umass.edu/downloads/Antique/test-queries-blacklist.txt: [00:00] [184B] [184kB/s]
[INFO] [starting] https://ciir.cs.umass.edu/downloads/Antique/antique-test.qrel                  
[INFO] [finished] https://ciir.cs.umass.edu/downloads/Antique/antique-test.qrel: [00:00] [150kB] [494kB/s]
                                                                                        

score: 0.07906002902973568
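
To give a rough idea of what such a score measures, here is a minimal sketch of mean average precision on toy data. The function names, document IDs and relevance judgments below are made up for illustration, and relevance is treated as binary; the `evaluate` function in `util` may compute the metric differently (for example, with a rank cutoff or graded relevance labels).

```python
def average_precision(ranked_docnos, relevant_docnos):
    """Average precision of one ranked list, given the relevant documents."""
    relevant = set(relevant_docnos)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            # Precision at this rank: relevant documents seen so far / rank
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of the per-query average precision.

    runs:  dict mapping qid -> ranked list of docnos (best first)
    qrels: dict mapping qid -> set of relevant docnos
    """
    scores = [average_precision(runs.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example (hypothetical IDs, not taken from the dataset)
toy_runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z"]}
toy_qrels = {"q1": {"a", "c"}, "q2": {"z"}}
print("toy MAP:", mean_average_precision(toy_runs, toy_qrels))
```

In this sketch, a query scores higher when its relevant documents appear closer to the top of the ranking, which is why rewriting a query so that relevant documents move up improves the overall score.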


Now it's up to you to figure out if and how it's possible to make the search engine perform better on these queries. How would you query a search engine if you wanted to know about these topics? Experiment a bit.

**Task 2**: Try to manually come up with ways to rewrite or reformulate the queries so the performance improves.

**Important**: Make sure that the query IDs match! Otherwise, evaluation will not work.
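
A quick way to catch mismatched query IDs before evaluating is to compare the two sets of IDs. The helper below is a hypothetical sketch (`check_qids` is not part of `util`); it accepts any iterable of IDs, so a DataFrame column such as `example_queries["qid"]` should work as well.

```python
def check_qids(original_qids, rewritten_qids):
    """Return (missing, unexpected) query IDs, compared as strings."""
    original = {str(qid) for qid in original_qids}
    rewritten = {str(qid) for qid in rewritten_qids}
    missing = sorted(original - rewritten)     # in the original, but not rewritten
    unexpected = sorted(rewritten - original)  # rewritten, but unknown to the dataset
    return missing, unexpected


# The last rewritten ID contains a deliberate typo (…998 instead of …988)
missing, unexpected = check_qids(
    ["443848", "1783010", "2838988"],
    ["443848", "1783010", "2838998"],
)
print("missing:", missing)
print("unexpected:", unexpected)
```

If both lists are empty, every rewritten query can be matched to an original one.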


In [None]:
example_queries_rewritten = pd.DataFrame(
    [
        [
            "443848",
            "siberian husky training guide how to train a siberian husky puppy siberian husky training tips",
        ],
        [
            "1783010",
            "blasphemy definition meaning of blasphemy what is blasphemy religion",
        ],
        [
            "2838988",
            "remove cork from wine bottle without corkscrew",
        ],
    ],
    columns=["qid", "query"],
)
print("score after rewriting:", evaluate(example_queries_rewritten))

score after rewriting: 0.13508076017038914


## An automatic approach

In this last part, we'll try to come up with an automatic approach to query rewriting. Use your findings from Task 2 for this.

**Task 3**: Implement a function that automatically rewrites any input query.

You can use any approach or library you want for this task. However, keep in mind that simple ideas often work well!


In [90]:
import re
import nltk
from nltk.corpus import stopwords

# Ensure the stopwords list is downloaded
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

def rewrite_query(query: str) -> str:
    # 1. Convert the entire query to lowercase for consistency
    query = query.lower()
    
    # 2. Remove punctuation (keeps only word characters, i.e. letters, digits and underscores, plus whitespace)
    query = re.sub(r'[^\w\s]', '', query)
    
    # 3. Split the query into individual words
    words = query.split()
    
    # 4. Filter out any words that are in the NLTK English stop words list
    filtered_words = [word for word in words if word not in stop_words]
    
    # 5. Rejoin the useful keywords into a single string
    return " ".join(filtered_words)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yutin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
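
To see what this kind of rewriting does to a single query, here is a minimal sketch that needs no downloads. It uses a tiny, hand-picked stop word list (an assumption made for illustration; NLTK's English list is much longer and contains different words), so its output will generally differ from that of `rewrite_query`.

```python
import re

# Hypothetical, hand-picked stop words for illustration only (not NLTK's list)
TOY_STOP_WORDS = {
    "a", "an", "the", "of", "is", "what", "how", "can", "i", "get", "out",
    "not", "into", "without", "does", "do", "to", "on", "where", "could",
}


def toy_rewrite(query: str) -> str:
    # Lowercase, strip punctuation, then drop the toy stop words
    query = re.sub(r"[^\w\s]", "", query.lower())
    return " ".join(word for word in query.split() if word not in TOY_STOP_WORDS)


print(toy_rewrite("how can i get a cork out of not into a wine bottle without a corkscrew"))
print(toy_rewrite("What is blaphsemy?"))
```

Note that stop word removal leaves the misspelling "blaphsemy" untouched; handling cases like this would need an additional step such as spelling correction.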


This time, we'll evaluate on _all_ queries in the dataset. This will give us a more general result:


In [87]:
print("score:", evaluate_all())

score: 0.06179994498738492


Are you able to improve the overall performance using your rewriting approach?


In [91]:
print("score after rewriting", evaluate_all(rewrite_query))

score after rewriting 0.08758435072026084
