# Université Paul Sabatier
# M1 IAFA - Foundations of Information Retrieval - 2025

Instructors: Lynda Tamine and Jesús Lovón

Notebook proposed by: José G. Moreno

---

💡 Consider developing auxiliary scripts and functions that let you reuse recurring commands in this practical work (PW) and in future ones. This will help you maintain good coding practice and make debugging easier.


### Attention❗ About TP grading:
🚨 *Code questions*: Fill in the missing code in the corresponding sections (commented code gets the best marks).

🚨 *Open questions*: Write your textual answer as a comment in the corresponding cells.

🚨 *Keep your outputs*: **Empty outputs (notebook or non-executed cells) correspond to 0 points**.

---

# TP 3. PyTerrier - Learning to Rank

In this PW, we focus on constructing **retrieval pipelines** using [PyTerrier](https://github.com/terrier-org/pyterrier). We will conduct experiments with the previously used dataset **TREC-CORD19**, and then we will apply the same methodology to the **FIQA** question-answering dataset to explore retrieval in a different context.  

This lab is divided into two parts:  

### I. Understanding PyTerrier and Retrieval Pipelines
In this section, you will:  
- Learn how PyTerrier structures and processes data.  
- Understand the core concepts of retrieval pipelines and how to combine multiple search operators.  
- Conduct experiments using the **TREC-CORD19 test collection** to apply these principles in practice.  

### II. Learning to Rank (LTR) Pipelines  
In the second part, you will:  
- Build and train **Learning to Rank (LTR) pipelines** to optimize retrieval effectiveness.  
- Evaluate and analyze LTR-based pipelines using standard **IR evaluation metrics**.  

By the end of this lab, you will have hands-on experience with **PyTerrier pipelines**, retrieval experiments, and ranking models.



## Installations and Setup

> 👉 This PW only requires a *CPU runtime*.

In [None]:
# Some libraries to use later
!pip install python-terrier
!pip install scikit-learn matplotlib
!pip install datasets
!pip install -q --upgrade fastrank lightgbm==3.1.1

Collecting python-terrier
  Downloading python_terrier-0.13.0-py3-none-any.whl.metadata (11 kB)
Collecting ir-datasets>=0.3.2 (from python-terrier)
  Downloading ir_datasets-0.5.10-py3-none-any.whl.metadata (12 kB)
Collecting wget (from python-terrier)
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting pyjnius>=1.4.2 (from python-terrier)
  Downloading pyjnius-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting ir-measures>=0.3.1 (from python-terrier)
  Downloading ir_measures-0.3.7-py3-none-any.whl.metadata (7.0 kB)
Collecting pytrec-eval-terrier>=0.5.3 (from python-terrier)
  Downloading pytrec_eval_terrier-0.5.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (777 bytes)
Collecting dill (from python-terrier)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting chest (from python-terrier)
  Downloading chest-0.2.3.tar.gz (9.6 kB)
  Preparing metadata (setup.py

In [None]:
# Load pyterrier and CORD19 dataset
# Initialize the JVM
import pyterrier as pt
if not pt.started():
  pt.init(tqdm='notebook')
cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')





terrier-assemblies 5.11 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
The following code will have the same effect:
pt.utils.set_tqdm('notebook')
pt.java.init() # optional, forces java initialisation
  pt.init(tqdm='notebook')


In [None]:
# Index the collection
import os
!rm -rf ./terrier_cord19/

pt_index_path = './terrier_cord19'

if not os.path.exists(pt_index_path + "/data.properties"):
    # create the index with IterDictIndexer: the abstract is indexed as text,
    # while the title and docno are recorded as document metadata
    indexer = pt.index.IterDictIndexer(pt_index_path, text_attrs=['abstract'], meta=['title','docno'])

    # pass the dataset's corpus iterator directly to the indexer
    indexref = indexer.index(cord19.get_corpus_iter())

else:
    # if you already have the index, use it.
    indexref = pt.IndexRef.of(pt_index_path + "/data.properties")

index = pt.IndexFactory.of(indexref)

[INFO] [starting] building docstore
[INFO] If you have a local copy of https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/80d664e496b8b7e50a39c6f6bb92e0ef
[INFO] [starting] https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv

cord19/trec-covid documents:   0%|          | 0/192509 [00:00<?, ?it/s]

19:55:07.624 [ForkJoinPool-1-worker-3] ERROR org.terrier.structures.indexing.Indexer -- Could not finish MetaIndexBuilder: 
java.io.IOException: Key 8lqzfj2e is not unique: 37597,11755
For MetaIndex, to suppress, set metaindex.compressed.reverse.allow.duplicates=true
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.mergeTwo(FSOrderedMapFile.java:1374)
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.close(FSOrderedMapFile.java:1308)
	at org.terrier.structures.indexing.BaseMetaIndexBuilder.close(BaseMetaIndexBuilder.java:321)
	at org.terrier.structures.indexing.classical.BasicIndexer.indexDocuments(BasicIndexer.java:270)
	at org.terrier.structures.indexing.classical.BasicIndexer.createDirectIndex(BasicIndexer.java:388)
	at org.terrier.structures.indexing.Indexer.index(Indexer.java:377)
	at org.terrier.python.ParallelIndexer$3.apply(ParallelIndexer.java:131)
	at org.terrier.python.ParallelIndexer$3.apply(ParallelIndexer.java:120)
	at java

# I. Understanding PyTerrier and Retrieval Pipelines

Recall that `BatchRetrieve` has a `transform()` method that takes a dataframe as input and returns another dataframe, which is in some way a *transformation* of the input (for example, a search result).

In fact, `BatchRetrieve` is just one of many similar objects in PyTerrier, which we call [transformers](https://pyterrier.readthedocs.io/en/latest/transformer.html) (represented by the `TransformerBase` class).

For example, a retriever that applies the TF-IDF model can be stored explicitly in the variable `tfidf`.

In [None]:
tfidf = pt.BatchRetrieve(indexref, wmodel="TF_IDF")



The interesting thing about all transformers is that they can be combined using Python operators (this is called operator overloading).

In concrete terms, imagine you want to chain transformers together, for example to rank documents first by Tf and then re-rank them by TF-IDF. We can do this using the `>>` operator, which we call *composition*.

PyTerrier provides several other operators; more examples are given in the [PyTerrier documentation on operators](https://pyterrier.readthedocs.io/en/latest/operators.html).
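To make the operator semantics concrete, here is a minimal pure-Python sketch (it does not use PyTerrier) of what rank cutoff (`%`), set union (`|`) and composition (`>>`) do to ranked lists. The helper names `cutoff`, `set_union` and `rerank`, as well as the toy documents and scores, are hypothetical and only serve as an illustration.

```python
# Pure-Python illustration of three PyTerrier operator semantics.
# A ranking is a list of (docno, score) pairs sorted by decreasing score.

def cutoff(ranking, k):
    """Mimics `ranker % k`: keep only the k top-ranked documents."""
    return ranking[:k]

def set_union(ranking_a, ranking_b):
    """Mimics `a | b`: documents returned by either ranker, without duplicates."""
    docnos = []
    for docno, _score in ranking_a + ranking_b:
        if docno not in docnos:
            docnos.append(docno)
    return docnos

def rerank(docnos, scores):
    """Mimics `candidates >> scorer`: score only the candidate documents."""
    rescored = [(docno, scores[docno]) for docno in docnos]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy rankings from two first-stage rankers, and toy scores from a re-ranker.
tf_ranking = [("d1", 5.0), ("d2", 4.0), ("d3", 3.0)]
tfidf_ranking = [("d2", 2.5), ("d4", 2.0), ("d1", 1.0)]
rerank_scores = {"d1": 1.2, "d2": 0.3, "d3": 2.2, "d4": 1.9}

candidates = set_union(cutoff(tf_ranking, 2), cutoff(tfidf_ranking, 2))
final_ranking = rerank(candidates, rerank_scores)
print(final_ranking)
```

Only the documents that appear in one of the two truncated lists are re-scored: `d3` is dropped even though the re-ranker would have scored it highest. With PyTerrier transformers, the same idea would be written directly with the operators, along the lines of `((a % k) | (b % k)) >> c`.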

#### Questions ✍
#### **1. Pipeline construction**

Create a *ranker* (an object that can be used to search the index) that performs the following operations:
 - retrieve the 10 documents ranked highest by Tf (`wmodel="Tf"`);
 - retrieve the 10 documents ranked highest by TF-IDF (`wmodel="TF_IDF"`);
 - re-rank with BM25 only the documents returned by either of the two previous retrievals.

Build it by combining different instances of `BatchRetrieve` with PyTerrier operators.


In [None]:
#### Your code here
# Step 1: create a retriever for the Tf model
ranker_tf = pt.terrier.Retriever(index, wmodel="Tf")

# Step 2: create a retriever for the TF-IDF model
ranker_tfidf = pt.terrier.Retriever(index, wmodel="TF_IDF")

# Step 3: create a retriever for the BM25 model
ranker_bm25 = pt.terrier.Retriever(index, wmodel="BM25")

# Step 4: chain the three retrievers with the >> operator.
# Note: >> alone passes every document retrieved by Tf (up to 1000) on to
# TF-IDF and then to BM25; it applies neither the top-10 cutoffs nor the
# union that the question asks for (see the % and | operators).
pipeline = ranker_tf >> ranker_tfidf >> ranker_bm25

# Run the pipeline on a single query with search()
query = "COVID-19 vaccine"
result = pipeline.search(query)

# Display the results
print(result)

    qid   docid     docno  rank     score             query
0     1   58893  cd5dyof9     0  8.659311  COVID-19 vaccine
1     1   29970  jwd96s79     1  8.438493  COVID-19 vaccine
2     1   82260  91rm1uvs     2  8.375178  COVID-19 vaccine
3     1   51059  xhe9nuvt     3  8.333328  COVID-19 vaccine
4     1  175627  4xkux5z4     4  8.314911  COVID-19 vaccine
..   ..     ...       ...   ...       ...               ...
995   1   74965  5zcydnre   995  3.349829  COVID-19 vaccine
996   1  127211  gseo0glh   996  3.316770  COVID-19 vaccine
997   1   68328  u85q2r4x   997  3.252767  COVID-19 vaccine
998   1   73125  tkvgzbuk   998  3.216122  COVID-19 vaccine
999   1  138791  9q4dsfyy   999  3.125063  COVID-19 vaccine

[1000 rows x 6 columns]


#### Questions ✍

2. How many documents are retrieved by this complete pipeline for the query `"chemical"`?
> Hint: If you get the solution right, the document with docno `"8hykq71k"` should have a score close to $12.413269$ for the query `"chemical"`.

Tips:
 - choose your [PyTerrier operators](https://pyterrier.readthedocs.io/en/latest/operators.html) carefully
 - you shouldn't have to perform any operations on the dataframes.

In [None]:
#### Your code here
# Run the search with the query "chemical"
query = "chemical"
result = pipeline.search(query)

# Display the first 10 results
print(result.head(10))

# Check the total number of documents retrieved
print(f"Total number of documents retrieved: {len(result)}")

# Check the score of the specific document "8hykq71k"
specific_doc = result[result['docno'] == '8hykq71k']
if not specific_doc.empty:
    print(f"Score for document 8hykq71k: {specific_doc['score'].values[0]}")
else:
    print("Document 8hykq71k not found.")


  qid   docid     docno  rank      score     query
0   1   37771  jn5qi1jb     0  12.426309  chemical
1   1   15671  8hykq71k     1  12.413269  chemical
2   1  134305  0smev8vt     2  12.292890  chemical
3   1  142104  77c9ohxj     3  12.226076  chemical
4   1   87642  ck6clsty     4  12.155804  chemical
5   1   18717  iavwkdpr     5  12.036691  chemical
6   1   56631  sps45fj5     6  11.642770  chemical
7   1   11310  3ehh7wme     7  11.564529  chemical
8   1  183314  65e8ol64     8  11.525981  chemical
9   1    2524  ifebw24e     9  11.439890  chemical
Total number of documents retrieved: 1000
Score for document 8hykq71k: 12.413269060886398


## Pipeline Evaluation
Unlike TP2, where we implemented the evaluation ourselves, in this TP we will use PyTerrier's built-in evaluation module. To run experiments (that is, to evaluate and compare models), we can use `pt.Experiment` ([documentation](https://pyterrier.readthedocs.io/en/latest/experiments.html)).

Here's the code that evaluates the performance of `tfidf` for the cord19 collection.  

In [None]:
# Download data
topics = cord19.get_topics(variant='description')
qrels = cord19.get_qrels()

print(f"Total topics {len(topics)}, and qrels: {len(qrels)}")

# Code to evaluate using Experiment
from pyterrier.measures import *
pt.Experiment(
  #The pipeline
  [tfidf],
  topics,
  qrels,
  eval_metrics=[MAP, nDCG, nDCG@10],
  # we use TFIDF for the statistical tests
  baseline=0,
  names=["TFIDF"]
)


[INFO] [starting] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml: [00:00] [18.7kB] [36.0MB/s]
[INFO] [starting] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt: [00:00] [1.14MB] [8.05MB/s]


Total topics 50, and qrels: 69318


Unnamed: 0,name,AP,nDCG,nDCG@10,AP +,AP -,AP p-value,nDCG +,nDCG -,nDCG p-value,nDCG@10 +,nDCG@10 -,nDCG@10 p-value
0,TFIDF,0.188578,0.400915,0.638799,,,,,,,,,
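
As a reminder of what the nDCG@10 column measures, here is a minimal pure-Python sketch of nDCG@k for a single query. It assumes the linear-gain formulation (gain equal to the relevance label, discount $\log_2(\text{rank}+1)$); other toolkits may use an exponential gain instead. The function names and the toy judgments are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (ranks start at 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))

def ndcg_at_k(ranked_docnos, judgments, k):
    """nDCG@k for one query; `judgments` maps docno -> relevance label."""
    gains = [judgments.get(docno, 0) for docno in ranked_docnos]  # unjudged = 0
    ideal_gains = sorted(judgments.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy judgments for one query: "a" is highly relevant, "b" partially relevant.
toy_judgments = {"a": 2, "b": 1, "c": 0}
perfect = ndcg_at_k(["a", "b", "c"], toy_judgments, k=10)
swapped = ndcg_at_k(["b", "a", "c"], toy_judgments, k=10)
print(perfect, swapped)
```

A ranking that places the documents in decreasing order of relevance obtains 1.0; swapping the two relevant documents lowers the score, because the highly relevant document receives a larger discount.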


# II. Learning to Rank


In this part of the course, you will build, train, evaluate and analyze Learning to Rank pipelines.

First, let's divide the topics into training, validation and test sets. TREC Covid has only 50 topics, which is few for learning. We use 30 topics for training, 5 for validation and 15 for testing. We will also look at statistical significance, although 15 test topics give the tests little power.

We will only re-rank the top 10 documents for each query: we hope that learning to rank will reorder these 10 documents to make the ranking more effective.

In [None]:
RANK_CUTOFF = 10
SEED=42

from sklearn.model_selection import train_test_split
tr_va_topics, test_topics = train_test_split(topics, test_size=15, random_state=SEED)
train_topics, valid_topics =  train_test_split(tr_va_topics, test_size=5, random_state=SEED)


test_qrels = qrels # only the judgments of the topics being evaluated are used, so using all of them is not a problem
train_qrels = qrels
valid_qrels = qrels

## 1. Feature Set

Let's define our feature set. We'll have a total of 6 features:

1. the TF-IDF score of the abstract for the query;
2. the TF-IDF score of the abstract for the fixed query "coronavirus covid";
3. the TF-IDF score of the title (even though we didn't index it as text earlier!);
4. whether the article was published in 2020 (we hypothesize that recent articles are more useful for this task);
5. whether the article has a DOI, i.e. whether it is an official publication;
6. the coordinate match score for the query, i.e. how many query terms appear in the abstract.

Many of these features require additional metadata `["title", "date", "doi"]`. Fortunately, the TREC Covid dataset allows us to obtain more metadata after indexing. We use `pt.text.get_text(cord19, ["title", "date", "doi"])` to retrieve these additional metadata columns.
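
To illustrate how per-document features end up in a single feature vector, here is a minimal pure-Python sketch of features 4, 5 and 6. It is a simplification: the coordinate match below uses lowercase whitespace tokens, whereas the index works on processed terms. The function names and the sample document are hypothetical.

```python
def published_in_2020(doc):
    """Feature 4: 1 if the publication date mentions 2020, else 0."""
    return int("2020" in doc["date"])

def has_doi(doc):
    """Feature 5: 1 if the document has a non-empty DOI, else 0."""
    doi = doc.get("doi")
    return int(doi is not None and len(doi) > 0)

def coordinate_match(query, doc):
    """Feature 6: number of distinct query terms found in the abstract."""
    query_terms = set(query.lower().split())
    abstract_terms = set(doc["abstract"].lower().split())
    return len(query_terms & abstract_terms)

def feature_vector(query, doc):
    """Concatenate the individual feature values, one position per feature."""
    return [published_in_2020(doc), has_doi(doc), coordinate_match(query, doc)]

# Hypothetical document: published in 2020, no DOI.
sample_doc = {
    "date": "2020-05-06",
    "doi": "",
    "abstract": "sentiment analysis of movie scripts",
}
sample_features = feature_vector("movie review analysis", sample_doc)
print(sample_features)
```

The position of each value in the vector is fixed, which is why it is useful to record the feature names in the same order as the features themselves.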

In [None]:
ltr_feats1 = (tfidf % RANK_CUTOFF) >> pt.text.get_text(cord19, ["title", "date", "doi"]) >> (
    pt.transformer.IdentityTransformer()
    ** # score of text for query 'coronavirus covid'
    (pt.apply.query(lambda row: 'coronavirus covid') >> tfidf)
    ** # score of title (not originally indexed)
    (pt.text.scorer(body_attr="title", takes='docs', wmodel='TF_IDF') )
    ** # date 2020
    (pt.apply.doc_score(lambda row: int("2020" in row["date"])))
    ** # has doi
    (pt.apply.doc_score(lambda row: int( row["doi"] is not None and len(row["doi"]) > 0) ))
    ** # abstract coordinate match
    pt.BatchRetrieve(indexref, wmodel="CoordinateMatch")
)

# for reference, lets record the feature names here too
fnames=["TFIDF", 'coronavirus covid', 'title', "2020", "hasDoi", "CoordinateMatch"]



Let's look at the result for a particular query. We can see that we now have additional document metadata columns `["title", "date", "doi"]`, as well as the all-important `"features"` column. Indeed, this is the column we use for learning.


In [None]:
ltr_feats1.search("Movie")



Unnamed: 0,qid,docid,docno,rank,score,query,title,date,doi,features
0,1,23347,qiwq0pe5,0,12.970648,Movie,Sentiment Analysis on Movie Scripts and Review...,2020-05-06,10.1007/978-3-030-49161-1_36,"[12.97064764826613, 0.0, 0.7458647062217206, 1..."
1,1,23343,vmetwotq,1,12.923459,Movie,Improving Movie Recommendation Systems Filteri...,2020-05-04,10.1007/978-3-030-49190-1_17,"[12.923458775835545, 0.0, 1.0350775514913675, ..."
2,1,78848,mmq44kwx,2,12.207116,Movie,"Smoking in top-grossing movies--United States,...",2011,,"[12.207115891520253, 0.0, 0.8744620693633965, ..."
3,1,70132,eynhsuz8,3,11.372312,Movie,The Post: A token woman leader's transformation,2020,10.1002/hrdq.21391,"[11.372311997669033, 1.499168191418449, 0.0, 1..."
4,1,24731,o7ckdng4,4,10.805086,Movie,Movies Emotional Analysis Using Textual Contents,2020-05-26,10.1007/978-3-030-51310-8_19,"[10.80508574921903, 0.0, 0.9277829272514087, 1..."
5,1,118935,25khbzk0,5,9.594602,Movie,CinemaGazer: a System for Watching Video at Ve...,2011-10-04,,"[9.594602481582845, 0.0, 0.0, 0.0, 0.0, 1.0]"
6,1,16001,3lhpdpiv,6,9.058977,Movie,The Aliens in Us and the Aliens Out There: Sci...,2013-11-17,10.1007/978-1-4614-7175-2_2,"[9.058977465455175, 0.0, 0.9880285718781234, 0..."
7,1,18722,gt3xayqp,7,8.579994,Movie,The Unfairness of Popularity Bias in Music Rec...,2020-03-24,10.1007/978-3-030-45442-5_5,"[8.579993656638317, 0.0, 0.0, 1.0, 1.0, 1.0]"
8,1,86076,opbwnnai,8,8.242861,Movie,Preparing for an influenza pandemic: mental he...,2009,,"[8.242860905714242, 0.0, 0.0, 0.0, 0.0, 1.0]"
9,1,5238,5z3pbbfb,9,8.042828,Movie,Characteristics of airborne Staphylococcus aur...,2014-05-07,10.1007/s10453-014-9342-6,"[8.042828188746888, 0.0, 0.0, 0.0, 1.0, 1.0]"


We can also look at the raw feature values (in this case, for the first ranked document). Note that the TF-IDF score in the "score" column above is also the first value in the features array (close to 13), because we used an identity transformer.


In [None]:
ltr_feats1.search("Movie").iloc[0]["features"]



array([12.97064765,  0.        ,  0.74586471,  1.        ,  1.        ,
        1.        ])

## 2. Analysis

We analyze the performance of each feature independently. To do this, we compose the feature pipeline (`ltr_feats1`) with `pt.ltr.feature_to_score(i)` for a number of features $i$.
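
Conceptually, this step replaces each document's score with one entry of its feature vector and re-sorts the results. A minimal pure-Python sketch of the idea follows; the function `rank_by_feature` and the toy rows are hypothetical and do not reproduce the PyTerrier implementation.

```python
def rank_by_feature(results, i):
    """Use the i-th feature as the score and re-sort by decreasing score."""
    rescored = [{**row, "score": row["features"][i]} for row in results]
    return sorted(rescored, key=lambda row: row["score"], reverse=True)

# Toy results for one query: the first feature equals the original score.
toy_results = [
    {"docno": "d1", "score": 12.9, "features": [12.9, 0.0, 0.7]},
    {"docno": "d2", "score": 11.3, "features": [11.3, 1.5, 0.0]},
    {"docno": "d3", "score": 9.5, "features": [9.5, 0.0, 1.0]},
]

by_third_feature = rank_by_feature(toy_results, 2)
print([row["docno"] for row in by_third_feature])
```

Because only the order of the same documents changes, set-based measures such as the number of relevant documents retrieved stay identical across features, while rank-sensitive measures such as nDCG@10 can differ.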

In [None]:
pt.Experiment(
    [ltr_feats1 >> pt.ltr.feature_to_score(i) for i in range(len(fnames))],
    test_topics,
    test_qrels,
    names=fnames,
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "num_rel_ret"])




Unnamed: 0,name,map,ndcg,ndcg_cut_10,num_rel_ret
0,TFIDF,0.010519,0.047832,0.589368,96.0
1,coronavirus covid,0.010816,0.047858,0.58478,96.0
2,title,0.0122,0.054099,0.647125,96.0
3,2020,0.010942,0.048751,0.591473,96.0
4,hasDoi,0.010437,0.04781,0.575902,96.0
5,CoordinateMatch,0.010204,0.046617,0.570048,96.0


Interestingly, the “coronavirus covid” feature achieved an NDCG@10 of 0.5847. It is therefore a strong baseline for this task.

## 3. Learning


In this part of the TP, we apply three different learning-to-rank techniques:

 - coordinate ascent from FastRank, a listwise linear technique
 - random forests from `scikit-learn`, a pointwise regression tree technique
 - LambdaMART from LightGBM, a listwise regression tree technique.
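
To give an intuition for the first technique, here is a deliberately simplified pure-Python sketch of coordinate ascent for a linear ranking model: one weight is changed at a time, and a change is kept only if a ranking metric improves. It is not the FastRank implementation; the metric (precision at rank 1 with binary labels), the step sizes and the toy data are all assumptions made for illustration.

```python
# Each query is a list of (feature_vector, relevance_label) pairs.

def linear_score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def precision_at_1(weights, queries):
    """Fraction of queries whose top-scored document is relevant."""
    hits = 0
    for docs in queries:
        _features, label = max(docs, key=lambda d: linear_score(weights, d[0]))
        hits += label
    return hits / len(queries)

def coordinate_ascent(queries, n_features, steps=(-1.0, -0.5, 0.5, 1.0), max_rounds=10):
    weights = [1.0] * n_features
    best = precision_at_1(weights, queries)
    for _ in range(max_rounds):
        improved = False
        for i in range(n_features):      # optimize one weight at a time
            for step in steps:
                candidate = list(weights)
                candidate[i] += step
                value = precision_at_1(candidate, queries)
                if value > best:         # keep the change only if the metric improves
                    weights, best, improved = candidate, value, True
        if not improved:                 # stop when a full round brings no gain
            break
    return weights, best

# Toy data: the second feature is the one that identifies relevant documents.
toy_queries = [
    [([1.0, 0.0], 0), ([0.0, 1.0], 1)],
    [([2.0, 0.5], 0), ([0.5, 2.0], 1)],
]
learned_weights, learned_metric = coordinate_ascent(toy_queries, n_features=2)
print(learned_weights, learned_metric)
```

The metric is computed on whole ranked lists rather than on individual documents, which is what makes this kind of approach listwise.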

In each case, we take our feature pipeline, `ltr_feats1`, and compose it (`>>`) with the learned model. We use `pt.ltr.apply_learned_model()` which knows how to handle different learners.

The complete pipeline is then fitted (learned) using `.fit()`, specifying the training topics and qrels. It's important to note that the previous pipeline steps (retrieval and feature computation) are applied to the training topics to obtain results, which are then passed on to the learning-to-rank technique. LightGBM has early stopping enabled, which uses a set of validation topics: in the same way, the validation topics are transformed into validation results.
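
The step that turns retrieved results into training examples can be sketched in pure Python as a join between results and qrels. This sketch assumes that unjudged documents receive the label 0; the function `add_labels` and the toy data are hypothetical.

```python
def add_labels(results, judgments):
    """Attach a relevance label to every retrieved (qid, docno) pair."""
    return [
        {**row, "label": judgments.get((row["qid"], row["docno"]), 0)}
        for row in results
    ]

# Toy results (with feature vectors) and toy judgments keyed by (qid, docno).
toy_results = [
    {"qid": "1", "docno": "d1", "features": [12.9, 1.0]},
    {"qid": "1", "docno": "d2", "features": [11.3, 0.0]},
    {"qid": "2", "docno": "d1", "features": [8.2, 1.0]},
]
toy_judgments = {("1", "d1"): 2, ("2", "d1"): 0, ("2", "d7"): 1}

training_examples = add_labels(toy_results, toy_judgments)
print([example["label"] for example in training_examples])
```

Judged documents that were never retrieved (here `d7` for query 2) contribute nothing to training, which is one reason the choice of the first-stage retriever and of the rank cutoff matters.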

Finally, `%time` is the “magic command” that displays the learning time for each technique. Each technique takes < 30 seconds to learn.

In [None]:
import fastrank

train_request = fastrank.TrainRequest.coordinate_ascent()

params = train_request.params
params.init_random = True
params.normalize = True
params.seed = 1234567

ca_pipe = ltr_feats1 >> pt.ltr.apply_learned_model(train_request, form='fastrank')

%time ca_pipe.fit(train_topics, train_qrels)



  warn(


CPU times: user 5.44 s, sys: 113 ms, total: 5.56 s
Wall time: 4.99 s


In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=400, verbose=1, random_state=42, n_jobs=2)

# using ca_pipe as an example, define rf_pipe
rf_pipe = ltr_feats1 >> pt.ltr.apply_learned_model(rf)

%time rf_pipe.fit(train_topics, train_qrels)



  warn(
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.1s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    0.4s


CPU times: user 4.02 s, sys: 124 ms, total: 4.14 s
Wall time: 3.09 s


[Parallel(n_jobs=2)]: Done 400 out of 400 | elapsed:    0.7s finished


In [None]:
import lightgbm as lgb

# this configures LightGBM as LambdaMART
lmart_l = lgb.LGBMRanker(
    task="train",
    silent=False,
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=1,
    max_bin=255,
    num_leaves=31,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[10],
    ndcg_at=[10],
    eval_at=[10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=100,
    early_stopping_rounds=5
)

lmart_x_pipe = ltr_feats1 >> pt.ltr.apply_learned_model(lmart_l, form="ltr", fit_kwargs={'eval_at':[10]})

%time lmart_x_pipe.fit(train_topics, train_qrels, valid_topics, valid_qrels)





You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 268
[LightGBM] [Info] Number of data points in the train set: 1830, number of used features: 6
[1]	valid_0's ndcg@10: 0.871605
Training until validation scores don't improve for 5 rounds
[2]	valid_0's ndcg@10: 0.904581
[3]	valid_0's ndcg@10: 0.929706
[4]	valid_0's ndcg@10: 0.955958
[5]	valid_0's ndcg@10: 0.955958
[6]	valid_0's ndcg@10: 0.955958
[7]	valid_0's ndcg@10: 0.928186
[8]	valid_0's ndcg@10: 0.926953
[9]	valid_0's ndcg@10: 0.926363
Early stopping, best iteration is:
[4]	valid_0's ndcg@10: 0.955958
CPU times: user 3.61 s, sys: 40.4 ms, total: 3.65 s
Wall time: 3.55 s


## 4. Evaluation


Let's now compare our ranking pipelines on the 15 test topics against the TF-IDF baseline. In all cases, we only rank 10 results per query, so MAP will be significantly lower than when 1000 results are ranked.

We'll report the mean response time (`"mrt"`) as well as the MAP, NDCG and NDCG@10 metrics.

In [None]:
pt.Experiment(
    [tfidf % RANK_CUTOFF, ca_pipe, rf_pipe, lmart_x_pipe],
    test_topics,
    test_qrels,
    names=["TFIDF",  "TFIDF + CA(6f)", "TFIDF + RF(6f)", "TFIDF + LMart(6f)"],
    baseline=0,
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "mrt"])



  warn(




  warn(
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.0s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    0.1s
[Parallel(n_jobs=2)]: Done 400 out of 400 | elapsed:    0.1s finished




  warn(


Unnamed: 0,name,map,ndcg,ndcg_cut_10,mrt,map +,map -,map p-value,ndcg +,ndcg -,ndcg p-value,ndcg_cut_10 +,ndcg_cut_10 -,ndcg_cut_10 p-value
0,TFIDF,0.010519,0.047832,0.589368,52.137179,,,,,,,,,
1,TFIDF + CA(6f),0.0119,0.052403,0.63298,81.413537,9.0,1.0,0.045345,9.0,1.0,0.010426,9.0,1.0,0.0057
2,TFIDF + RF(6f),0.01198,0.052298,0.63466,100.631412,9.0,2.0,0.023384,9.0,2.0,0.033955,9.0,2.0,0.031752
3,TFIDF + LMart(6f),0.012822,0.054481,0.642113,75.6199,8.0,4.0,0.090902,8.0,4.0,0.119737,8.0,4.0,0.092779
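
To help read the `+`, `-` and p-value columns, here is a minimal pure-Python sketch of a per-query comparison against a baseline. It counts the queries that improve or degrade and computes a paired t statistic, assuming the significance test is a paired t-test over per-query scores. The p-value itself is not computed here, since it requires the t distribution. The function name and the per-query scores are hypothetical.

```python
import math
from statistics import mean, stdev

def compare_to_baseline(baseline_scores, system_scores):
    """Return (#queries improved, #queries degraded, paired t statistic).

    Assumes at least two queries and differences that are not all identical.
    """
    diffs = [s - b for b, s in zip(baseline_scores, system_scores)]
    wins = sum(1 for d in diffs if d > 0)
    losses = sum(1 for d in diffs if d < 0)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return wins, losses, t_stat

# Hypothetical per-query nDCG@10 values for four queries.
baseline_ndcg = [0.5, 0.6, 0.4, 0.7]
system_ndcg = [0.6, 0.8, 0.4, 0.6]

wins, losses, t_stat = compare_to_baseline(baseline_ndcg, system_ndcg)
print(wins, losses, round(t_stat, 4))
```

A system can have a higher mean than the baseline and still fail the test when its gains are inconsistent across queries, which is worth keeping in mind with only 15 test topics.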


#### Question ✍
3. Did the three learned models improve NDCG@10 over TFIDF? Analyze the results, indicating whether any of the improvements are statistically significant.

The analysis below is based on the NDCG@10 values and the corresponding p-values reported in the table above.

### Model results

| Model                  | NDCG@10   | p-value (NDCG@10)  |
|------------------------|-----------|--------------------|
| TFIDF                  | 0.589368  | n/a (baseline)     |
| TFIDF + CA(6f)         | 0.63298   | 0.0057             |
| TFIDF + RF(6f)         | 0.63466   | 0.031752           |
| TFIDF + LMart(6f)      | 0.642113  | 0.092779           |

### Analysis of the results

1. **Comparison of NDCG@10**:
   - **TFIDF** has an NDCG@10 of **0.589368**.
   - **TFIDF + CA(6f)** (Coordinate Ascent with 6 features) has an NDCG@10 of **0.63298**, an improvement over TFIDF.
   - **TFIDF + RF(6f)** (Random Forest with 6 features) has an NDCG@10 of **0.63466**, which is also better than TFIDF.
   - **TFIDF + LMart(6f)** (LambdaMART with 6 features) has an NDCG@10 of **0.642113**, the largest improvement of the three models.

2. **Statistical significance**:
   - To assess whether these improvements are statistically significant, we look at the **p-values** reported for NDCG@10.
     - **TFIDF + CA(6f)** has a p-value of **0.0057**, which is below 0.05: the improvement over TFIDF is statistically significant (9 queries improved, 1 degraded).
     - **TFIDF + RF(6f)** has a p-value of **0.031752**, also below 0.05: the improvement over TFIDF is statistically significant as well (9 queries improved, 2 degraded).
     - **TFIDF + LMart(6f)** has a p-value of **0.092779**, which is above 0.05: the improvement over TFIDF is not statistically significant. Although its mean NDCG@10 is the highest, the gains are less consistent across queries (8 improved, 4 degraded).

### Conclusion

- **The three learned models (Coordinate Ascent, Random Forest and LambdaMART)** all improve **NDCG@10** over **TFIDF**, but not to the same degree of reliability:
  - **Coordinate Ascent (CA)** and **Random Forest (RF)** show statistically significant improvements over **TFIDF**, with p-values below 0.05.
  - **LambdaMART (LMart)** obtains the highest NDCG@10 but does not reach statistical significance (p-value > 0.05).

We can therefore conclude that **Coordinate Ascent (CA)** and **Random Forest (RF)** yield a statistically significant improvement in **NDCG@10** over **TFIDF**, whereas the improvement obtained by **LambdaMART (LMart)** is not statistically significant on these 15 test topics.

## Application - Concatenation

Our learned models have low recall, as only 10 documents are re-ranked per query. Let's create a small function, `append_baseline()`, which appends the results of the baseline retriever (TF-IDF in this notebook) to the output of the learned model. It is defined using the [transformer operators](https://pyterrier.readthedocs.io/en/latest/operators.html) (`^` and `%`).

As an exercise, apply `append_baseline()` to each of the learned model pipelines defined above, and report the MAP and NDCG computed on the 1000 ranked results.
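
The effect of concatenation followed by a rank cutoff can be sketched in pure Python as follows. This is an illustration only: the way the appended scores are shifted below the existing ones (the `gap` parameter) is an assumption and may differ from what PyTerrier does internally, and the function name and toy rankings are hypothetical.

```python
def concatenate(first, second, gap=0.001):
    """Append to `first` the documents of `second` that it does not contain.

    Both rankings are lists of (docno, score) sorted by decreasing score.
    Appended scores are shifted so that they fall below every score of `first`.
    """
    seen = {docno for docno, _score in first}
    extra = [(docno, score) for docno, score in second if docno not in seen]
    if not first or not extra:
        return first + extra
    shift = min(s for _d, s in first) - max(s for _d, s in extra) - gap
    return first + [(docno, score + shift) for docno, score in extra]

# Two documents re-ranked by a learned model, and a longer baseline ranking.
reranked = [("d2", 3.0), ("d1", 2.0)]
baseline = [("d1", 9.0), ("d2", 8.5), ("d3", 8.0), ("d4", 7.0)]

combined = concatenate(reranked, baseline)[:3]  # the slice plays the role of a cutoff
print(combined)
```

The re-ranked documents keep their learned order at the top, and the remaining baseline documents follow in their original order, so recall-oriented measures benefit from the longer list.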


#### Question ✍
4. Which of the learned models results in a significant improvement in MAP and NDCG?


In [None]:
#### Your code here

def append_baseline(system, baseline, max_results=1000):
    # `^` (concatenate) keeps the documents ranked by the learned model first
    # and appends the baseline documents that it did not already return;
    # `%` (rank cutoff) then keeps at most max_results documents per query.
    return (system ^ baseline) % max_results

# Apply it to each learned pipeline, using the TF-IDF retriever as the baseline
ca_full = append_baseline(ca_pipe, tfidf)
rf_full = append_baseline(rf_pipe, tfidf)
lmart_full = append_baseline(lmart_x_pipe, tfidf)

# Compare the extended pipelines with the TF-IDF baseline on the test topics
pt.Experiment(
    [tfidf, ca_full, rf_full, lmart_full],
    test_topics,
    test_qrels,
    names=["TFIDF", "TFIDF + CA(6f)", "TFIDF + RF(6f)", "TFIDF + LMart(6f)"],
    baseline=0,
    eval_metrics=["map", "ndcg"])


# Application

#### Question ✍

Reuse the pipelines implemented for CORD19 in a question-answering task. In this context, queries are questions, and documents are passages that might contain the answer. Note that you will need to redo the indexing as well as the other steps studied in this tutorial. Here's an example from the dataset to be used:

```json
"question": "Why are big companies like Apple or Google not included in the Dow Jones Industrial Average (DJIA) index?",

"answers":{
  "290156": {
    "text":" That is a pretty exclusive club and for the most part they are not interested in highly volatile companies like Apple and Google. Sure, IBM is part of the DJIA, but that is about as stalwart as you can get these days. The typical profile for a DJIA stock would be one that pays fairly predictable dividends, has been around since money was invented, and are not going anywhere unless the apocalypse really happens this year. In summary, DJIA is the boring reliable company index." ,
    "timestamp": "Sep 11 '12 at 0:53"}
 }

```
Similar to the previous PW, you can download the data with the following code:

In [None]:
fiqa = {}
fiqa['train'] = pt.datasets.get_dataset('irds:beir/fiqa/train')
fiqa['valid'] = pt.datasets.get_dataset('irds:beir/fiqa/dev')
fiqa['test'] = pt.datasets.get_dataset('irds:beir/fiqa/test')

test_topics = fiqa['test'].get_topics(variant='text')
test_qrels = fiqa['test'].get_qrels()

train_topics = fiqa['train'].get_topics(variant='text')
train_qrels = fiqa['train'].get_qrels()

valid_topics = fiqa['valid'].get_topics(variant='text')
valid_qrels = fiqa['valid'].get_qrels()

[INFO] [starting] opening zip file
[INFO] If you have a local copy of https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/17918ed23cd04fb15047f73e6c3bd9d9
[INFO] [starting] https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip
[INFO] [finished] https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip: [00:10] [17.9MB] [1.63MB/s]
[INFO] [finished] opening zip file [11.57s]
[INFO] [starting] opening zip file
[INFO] [finished] opening zip file [0ms]
[INFO] [starting] opening zip file
[INFO] [finished] opening zip file [0ms]
[INFO] [starting] opening zip file
[INFO] [finished] opening zip file [1ms]


In [None]:
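#### Your code here
if not pt.started():
    pt.init()

# Directory for the FIQA index (kept separate from the CORD19 index)
index_path = "./index"
if not os.path.exists(index_path):
    os.makedirs(index_path)

def doc_iterator():
    # Yield one dict per document, keeping the dataset's own docno so that
    # retrieved documents can be matched against the qrels
    for doc in fiqa['train'].get_corpus_iter():
        yield {
            "docno": doc['docno'],
            "text": doc['text']
        }

# Build the index from the iterator
indexref = pt.IterDictIndexer(index_path).index(doc_iterator())

# Retriever over the FIQA index (default weighting model)
retriever = pt.BatchRetrieve(indexref)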


[INFO] [starting] building docstore
[INFO] [starting] opening zip file
[INFO] [finished] opening zip file [4ms]
docs_iter: 100%|██████████████████████| 57638/57638 [00:02<00:00, 28189.38doc/s]
[INFO] [finished] docs_iter: [00:02] [57638doc] [28167.16doc/s]
[INFO] [finished] building docstore [2.05s]


beir/fiqa/train documents:   0%|          | 0/57638 [00:00<?, ?it/s]

20:40:06.314 [ForkJoinPool-2-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 39 empty documents


