# NIR 2022 - Lab 7: Learning to Rank in PyTerrier

Learning to Rank (LTR) refers to the application of machine learning to re-rank a candidate set of retrieved documents.

By manually engineering features and assigning them to each document, LTR techniques aim to get the order of the top-ranked documents right.
Three main types of loss functions are used:
- Pointwise: One instance of the set is considered at a time, predicting how relevant it is for the current query. At inference, the predicted relevance scores of the documents are used to order the set.
- Pairwise: A pair of instances is chosen and the order of those two is predicted. At inference, this is repeated for each pair of documents for the given query to find the final order of the entire candidate set.
- Listwise: Find the optimal order (most relevant document at the top of the ranking) by considering many or all instances at once.
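
To make the three families concrete, here is a minimal sketch of one common loss per family, using only the standard library. The specific choices (mean squared error, a RankNet-style logistic loss, a ListNet-style cross-entropy) and the toy scores and labels are assumptions for illustration; they are not the losses used by the models later in this lab.

```python
import math

# Toy query: model scores and graded relevance labels for three documents.
scores = [2.0, 1.0, 0.5]
labels = [2.0, 0.0, 1.0]

def pointwise_loss(scores, labels):
    """Mean squared error between each score and its label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def pairwise_loss(scores, labels):
    """Logistic loss over pairs (i, j) where doc i is more relevant than doc j."""
    losses = [
        math.log(1.0 + math.exp(-(scores[i] - scores[j])))
        for i in range(len(scores))
        for j in range(len(scores))
        if labels[i] > labels[j]
    ]
    return sum(losses) / len(losses)

def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_loss(scores, labels):
    """Cross-entropy between softmax(labels) and softmax(scores) over the whole list."""
    target = softmax(labels)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

print(pointwise_loss(scores, labels))
print(pairwise_loss(scores, labels))
print(listwise_loss(scores, labels))
```

Note how the pointwise loss looks at one document at a time, the pairwise loss only at score differences, and the listwise loss at the full score distribution of the query.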

Different models are commonly used: linear, trees and neural networks.

In this lab, we will look at tree-based approaches, trained using either a pointwise objective (random forests) or the LambdaMART objective, which combines pairwise comparisons with a listwise metric (NDCG).

The material for this lab is largely based on the PyTerrier ECIR 2021 tutorial.

## Data and PyTerrier Setup

In [6]:
# Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

In [7]:
# Check that you can `ls` your directory with NIR material
# !ls "/content/drive/My Drive/nir2021"

In [13]:
# Load the data
import pandas as pd

# BASEDIR = "/content/drive/My Drive/nir2021/"
BASEDIR = './'
# corpus
docs_df = pd.read_csv(BASEDIR + 'data/lab_docs.csv', dtype=str)
print(docs_df.shape)
print(docs_df.head())

# topics
topics_df = pd.read_csv(BASEDIR + 'data/lab_topics.csv', dtype=str)
print(topics_df.shape)
print(topics_df.head())

# Load qrels
qrels_df = pd.read_csv(BASEDIR + 'data/lab_qrels.csv',dtype=str)
print(qrels_df.shape)
print(qrels_df.head())

(2453, 2)
     docno                                               text
0   935016  he emigrated to france with his family in 1956...
1  2360440  after being ambushed by the germans in novembe...
2   347765  she was the second ship named for captain alex...
3  1969335  world war ii was a global war that was under w...
4  1576938  the ship was ordered on 2 april 1942 laid down...
(9, 2)
       qid                 query
0  1015979    president of chile
1     2674    computer animation
2   340095  2020 summer olympics
3  1502917         train station
4     2574       chinese cuisine
(2454, 4)
       qid    docno label iteration
0  1015979  1015979     2         0
1  1015979  2226456     1         0
2  1015979  1514612     1         0
3  1015979  1119171     1         0
4  1015979  1053174     1         0


In [9]:
# !pip install python-terrier==0.5.0

Collecting python-terrier==0.5.0
  Downloading python-terrier-0.5.0.tar.gz (74 kB)
     74.1/74.1 KB 839.7 kB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: python-terrier
  Building wheel for python-terrier (setup.py) ... done
  Created wheel for python-terrier: filename=python_terrier-0.5.0-py3-none-any.whl size=79551 sha256=e20f25114a9044a5dfc23d44b39ba92f227409e82064b2b3f2b8c06d13de11cd
  Stored in directory: /maps/projects/futhark1/data/wzm289/.cache/pip/wheels/9f/f3/5f/4c8a196749598775e042028034c1c87b2e1525543481446b15
Successfully built python-terrier
Installing collected packages: python-terrier
  Attempting uninstall: python-terrier
    Found existing installation: python-terrier 0.8.1
    Uninstalling python-terrier-0.8.1:
      Successfully uninstalled python-terrier-0.8.1
Successfully install

In [2]:
# Init PyTerrier
import pyterrier as pt
if not pt.started():
    pt.init()

PyTerrier 0.8.1 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [3]:
# Build index
# BASEDIR already ends with a slash, so append the relative path directly
indexer = pt.DFIndexer(BASEDIR + "indexes/default", overwrite=True, blocks=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

Number of documents: 2453
Number of terms: 23693
Number of postings: 208487
Number of fields: 0
Number of tokens: 273373
Field names: []
Positions:   true



In [4]:
# Build IR systems
tf = pt.BatchRetrieve(index, wmodel="Tf")
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

## Learning to Rank

We will now look at how to construct, train and evaluate LTR pipelines in PyTerrier.

### Data Splitting

First, let's split our topics into train, validation and test sets.

Our lab data only has 9 topics, which is ridiculously small for training.
We will split these into: 4 for training, 2 for validation and 3 for testing.

In [5]:
from sklearn.model_selection import train_test_split

SEED=42

tr_val_topics, test_topics = train_test_split(topics_df, test_size=3, random_state=SEED)
train_topics, valid_topics = train_test_split(tr_val_topics, test_size=2, random_state=SEED)

### Feature Set

In order to learn a mapping between a document and its relevance score for a given query, our LTR model needs query-document features.
That is, each query-document pair is represented by a multi-dimensional feature vector, where each dimension (feature) captures one signal of how relevant or important the document is with respect to the query.

Here, for the sake of simplicity, we only consider three features:
1. the BM25 score;
2. the TF score;
3. the TF-IDF score.
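
As a rough illustration of what these three features measure, the sketch below computes textbook variants of TF, TF-IDF and BM25 on a tiny made-up corpus. The corpus, the function names and the exact formulas (for example `k1=1.2`, `b=0.75` and the IDF smoothing) are assumptions; Terrier's `Tf`, `TF_IDF` and `BM25` weighting models use their own formulations, so the numbers will not match the lab output.

```python
import math
from collections import Counter

# Hypothetical three-document corpus (not the lab data).
docs = {
    "d1": "the train left the train station".split(),
    "d2": "the station was closed".split(),
    "d3": "computer animation of a train".split(),
}
N = len(docs)
avg_len = sum(len(tokens) for tokens in docs.values()) / N

def doc_freq(term):
    """Number of documents containing the term."""
    return sum(1 for tokens in docs.values() if term in tokens)

def tf_score(query, docno):
    """Raw count of query terms in the document."""
    counts = Counter(docs[docno])
    return float(sum(counts[t] for t in query))

def tfidf_score(query, docno):
    """Term frequency weighted by log inverse document frequency."""
    counts = Counter(docs[docno])
    return float(sum(counts[t] * math.log(N / doc_freq(t)) for t in query if counts[t] > 0))

def bm25_score(query, docno, k1=1.2, b=0.75):
    """Textbook BM25 with saturation (k1) and length normalisation (b)."""
    counts = Counter(docs[docno])
    length_norm = 1 - b + b * len(docs[docno]) / avg_len
    score = 0.0
    for t in query:
        if counts[t] == 0:
            continue
        idf = math.log(1 + (N - doc_freq(t) + 0.5) / (doc_freq(t) + 0.5))
        score += idf * counts[t] * (k1 + 1) / (counts[t] + k1 * length_norm)
    return score

def feature_vector(query, docno):
    # Same order as the list above: BM25, TF, TF-IDF.
    return [bm25_score(query, docno), tf_score(query, docno), tfidf_score(query, docno)]

query = ["train"]
features = {docno: feature_vector(query, docno) for docno in docs}
for docno, vector in features.items():
    print(docno, vector)
```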

In your project, you should explore more established information retrieval features (see the [LETOR paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/letor3.pdf)) and features that are more specific to your task (e.g. was the article published after 2019?).

Today, we will re-rank the top-K (STAGE1_CUTOFF) documents for each query and evaluate the top-100 (STAGE2_CUTOFF) ones.

In [6]:
STAGE1_CUTOFF = 100

# We retrieve the top (% operator) STAGE1_CUTOFF documents
# And we concatenate (** operator) their BM25, TF and TF-IDF scores as features
ltr_feats1 = (bm25 % STAGE1_CUTOFF) >> (bm25 ** tf ** tfidf)
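
Conceptually, the pipeline above keeps the best K candidates from the first stage and then attaches one score per scorer to each of them. The following plain-Python sketch mimics that behaviour on made-up scores; `top_k` and `feature_union` are hypothetical helpers written for illustration and are not how PyTerrier implements the `%` and `**` operators.

```python
# Hypothetical per-scorer scores for one query (docno -> score).
bm25_scores = {"d1": 4.4, "d2": 3.9, "d3": 4.1, "d4": 1.2}
tf_scores = {"d1": 12.0, "d2": 5.0, "d3": 7.0, "d4": 1.0}
tfidf_scores = {"d1": 2.9, "d2": 2.6, "d3": 2.7, "d4": 0.8}

def top_k(scores, k):
    """Mimics the rank cutoff: keep the k highest-scoring docnos, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def feature_union(candidates, *scorers):
    """Mimics the feature union: one score per scorer, in scorer order."""
    return {docno: [scorer[docno] for scorer in scorers] for docno in candidates}

candidates = top_k(bm25_scores, k=3)
features = feature_union(candidates, bm25_scores, tf_scores, tfidf_scores)
for docno in candidates:
    print(docno, features[docno])
```

Documents below the cutoff (here `d4`) never reach the feature stage, which is why the choice of K matters for both training and re-ranking.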

In [7]:
# Example of stage1 output
ltr_feats1.search("train")

| | qid | docid | docno | rank | score | query | features |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 236 | 234372 | 0 | 4.455773 | train | [4.455772798261049, 12.0, 2.983525429580879] |
| 1 | 1 | 2057 | 1418389 | 1 | 4.185537 | train | [4.185537262862662, 6.0, 2.8025793561742307] |
| 2 | 1 | 1801 | 2400360 | 2 | 4.177455 | train | [4.177454903290807, 7.0, 2.7971675171049095] |
| 3 | 1 | 2095 | 1441398 | 3 | 4.149063 | train | [4.149062711261565, 7.0, 2.778156487872484] |
| 4 | 1 | 1005 | 2373010 | 4 | 3.942148 | train | [3.942147700083634, 5.0, 2.639608984316581] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 1 | 2231 | 1129049 | 95 | 3.031105 | train | [3.0311049015705924, 2.0, 2.029587001629092] |
| 96 | 1 | 18 | 1556726 | 96 | 3.023598 | train | [3.0235984921969976, 2.0, 2.0245608110522957] |
| 97 | 1 | 728 | 1556725 | 97 | 3.023598 | train | [3.0235984921969976, 2.0, 2.0245608110522957] |
| 98 | 1 | 151 | 2029140 | 98 | 3.016129 | train | [3.0161291696212325, 2.0, 2.0195594532956265] |


### Learning

We now train two learning to rank techniques:
- [Random forests from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), a pointwise regression tree technique.
- LambdaMART from [LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html), a pairwise regression tree technique.

In each case, we take our feature pipeline, `ltr_feats1`, and we compose it (`>>` operator) with the learned model. 
We use PyTerrier's `pt.ltr.apply_learned_model()` interface to directly access the different learners.

The full pipeline is then fitted (learned) using `.fit()`, specifying the training topics and qrels. 

Importantly, the preceding stages of the pipeline (retrieval and feature calculation) are applied to the training topics in order to obtain the results, which are then passed to the learning to rank technique.

**NB:** Usually, only the documents with associated train qrels are used for LTR. This means that a small K (STAGE1_CUTOFF) might lead to fewer observed query-document-score data points. On the other hand, choosing a large K might be unfeasible due to long training time.
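
The effect of K on the amount of training data can be sketched as follows. The ranking, the qrels and the helper `labelled_candidates` are made up for illustration, and the sketch assumes, as stated above, that only retrieved documents with a qrel contribute training points.

```python
# Hypothetical ranked list for one training query and its judged documents.
ranking = ["d7", "d2", "d9", "d4", "d1", "d5", "d3", "d8"]
qrels = {"d2": 1, "d4": 2, "d3": 0, "d6": 1}  # docno -> label; d6 is never retrieved

def labelled_candidates(ranking, qrels, k):
    """Return (docno, label) pairs for the judged documents within the top k."""
    return [(docno, qrels[docno]) for docno in ranking[:k] if docno in qrels]

for k in (2, 4, 8):
    print(k, labelled_candidates(ranking, qrels, k))
```

A judged document that the first stage does not retrieve within the top K (such as `d6` here) can never become a training point, whatever its label.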


#### Bootstrap Aggregation: Random Forest

Random forest is a supervised learning algorithm that relies on an ensemble learning method for classification and regression.

Decision trees in a random forest are built independently of each other (and can therefore be trained in parallel), with no interaction between any two trees.
After constructing a multitude of decision trees at training time, a random forest outputs the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression).

To prevent growing highly correlated trees, random forests introduce two modifications:
- The number of features that can be split on at each node can be limited to some percentage of the total (a hyperparameter), ensuring a fair use of all potentially predictive features.
- Each tree can draw a random sample from the original dataset when generating its splits (known as bootstrapping), adding a further element of randomness that prevents overfitting.
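
The two sources of randomness and the final aggregation step can be sketched with the standard library alone. The helper names below are hypothetical and no actual trees are grown; the sketch only shows what each tree would receive as input and how the per-tree predictions are combined.

```python
import random
from statistics import mean, mode

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as in bagging."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, max_features, rng):
    """Pick the feature indices a node is allowed to split on."""
    return sorted(rng.sample(range(n_features), max_features))

def aggregate_regression(tree_predictions):
    """Regression forests average the predictions of the individual trees."""
    return mean(tree_predictions)

def aggregate_classification(tree_predictions):
    """Classification forests take the most frequent predicted class."""
    return mode(tree_predictions)

rng = random.Random(42)
rows = list(range(10))  # stand-in for 10 training examples
sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(n_features=3, max_features=2, rng=rng)

print(sample)  # some rows repeat, others are left out
print(subset)  # two of the three feature indices
print(aggregate_regression([1.0, 2.0, 3.0]))
print(aggregate_classification([1, 0, 1]))
```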

In [8]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, verbose=1, random_state=SEED, n_jobs=2)

rf_pipe = ltr_feats1 >> pt.ltr.apply_learned_model(rf)

rf_pipe.fit(train_topics, qrels_df)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.1s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    0.1s finished


#### Boosting: LambdaMART

Boosting refers to algorithms that combine many weak learners into a stronger one through a weighted sum of their predictions. In boosting, models are trained sequentially: the errors of each model determine what the next model will focus on.
That is, each model learns from the mistakes of its predecessors, which in turn boosts the overall performance.
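
A minimal sketch of this idea is gradient boosting for squared error with one-split trees (stumps) on a single feature: each round fits a stump to the residuals left by the previous rounds. The data, the function names and the choice of squared error are assumptions for illustration; real libraries grow deeper trees over many features.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a single feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=10, learning_rate=0.1):
    """Return final predictions and the training MSE after each round."""
    predictions = [0.0] * len(xs)
    history = []
    for _ in range(n_rounds):
        # The residuals are what the ensemble so far still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return predictions, history

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
predictions, history = boost(xs, ys)
print(history)  # training error after each boosting round
```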

In this lab, we will use [LightGBM](https://github.com/microsoft/LightGBM) to implement LambdaMART, a pairwise technique based on gradient boosted decision trees with a cost function derived from LambdaRank.
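
To give an intuition for the LambdaRank part, the sketch below computes, for one query, how strongly each document is pushed up or down: every pair where one document is more relevant than the other contributes a pairwise logistic term, scaled by how much NDCG would change if the two documents swapped ranks. The function name, the toy data, `sigma=1.0` and the exponential gain `2^label - 1` are assumptions for illustration; this is not LightGBM's implementation, and it assumes at least one label is positive.

```python
import math

def lambda_pushes(scores, labels, sigma=1.0):
    """Per-document push (negative LambdaRank gradient): positive means 'move up'."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    rank = {doc: pos + 1 for pos, doc in enumerate(order)}  # 1-based ranks
    gain = [2 ** y - 1 for y in labels]

    def discount(r):
        return 1.0 / math.log2(r + 1)

    ideal_dcg = sum(g * discount(r) for r, g in enumerate(sorted(gain, reverse=True), start=1))
    pushes = [0.0] * n
    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue  # only pairs where doc i is more relevant than doc j
            # Change in NDCG if documents i and j swapped their current ranks.
            delta_ndcg = abs((gain[i] - gain[j]) * (discount(rank[i]) - discount(rank[j]))) / ideal_dcg
            force = sigma / (1.0 + math.exp(sigma * (scores[i] - scores[j]))) * delta_ndcg
            pushes[i] += force
            pushes[j] -= force
    return pushes

scores = [0.2, 1.5, 0.9]  # current model scores
labels = [2, 0, 1]        # graded relevance
pushes = lambda_pushes(scores, labels)
print(pushes)
```

In LambdaMART, each new regression tree is fitted to quantities of this kind, so that mis-ordered pairs near the top of the ranking have the largest influence.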

LightGBM (LGBM) is a gradient boosting framework that uses tree-based learning algorithms.
LGBM can handle large datasets and is memory-efficient. It is also popular for the accuracy of its results, and it supports GPU learning for faster training.
However, it is not advisable to use LGBM on small datasets, as it is prone to overfitting.
While LGBM is easy to use, tuning its hyperparameters may not be. Check out [this blogpost](https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc) for a good description of available parameters.


Another popular library for gradient boosting algorithms is [XGBoost](https://github.com/dmlc/xgboost).


In [9]:
import lightgbm as lgb

# this configures LightGBM as LambdaMART
lmart_l = lgb.LGBMRanker(
    task="train",
    silent=False,
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=1,
    max_bin=255,
    num_leaves=31,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[10],
    ndcg_at=[10],
    eval_at=[10],
    learning_rate=0.1,
    importance_type="gain",
    num_iterations=10,
    early_stopping_rounds=5
)

lmart_x_pipe = ltr_feats1 >> pt.ltr.apply_learned_model(lmart_l, form="ltr", fit_kwargs={'eval_at':[10]})

# LightGBM has early stopping enabled, which uses a validation topics set
lmart_x_pipe.fit(train_topics, qrels_df, valid_topics, qrels_df)



You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 223
[LightGBM] [Info] Number of data points in the train set: 349, number of used features: 3
[1]	valid_0's ndcg@10: 0.689492
[2]	valid_0's ndcg@10: 0.668603
[3]	valid_0's ndcg@10: 0.668603
[4]	valid_0's ndcg@10: 0.668603
[5]	valid_0's ndcg@10: 0.668603
[6]	valid_0's ndcg@10: 0.668603
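
The log stops after 6 of the 10 requested iterations because the validation NDCG@10 did not improve for 5 consecutive rounds. The sketch below replays that logic on the logged values; `early_stopping` is a hypothetical helper, not LightGBM's implementation.

```python
def early_stopping(valid_scores, patience):
    """Return (best_iteration, stop_iteration), both 1-based; higher score is better."""
    best_score, best_iteration = float("-inf"), 0
    for iteration, score in enumerate(valid_scores, start=1):
        if score > best_score:
            best_score, best_iteration = score, iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    return best_iteration, len(valid_scores)

# Validation NDCG@10 values copied from the log above.
log = [0.689492, 0.668603, 0.668603, 0.668603, 0.668603, 0.668603]
print(early_stopping(log, patience=5))  # (1, 6)
```

With so few validation topics, the best iteration being the very first one is more likely a sign of noisy validation data than of a well-tuned model.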


### Evaluation

Finally, we compare our ranking pipelines with the BM25 baseline on our 3 test topics. In all cases, we keep only the top 100 (STAGE2_CUTOFF) results per query.

We'll report MAP, NDCG and NDCG@10 measures as well as mean response time (`"mrt"`).
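
For reference, here is a minimal sketch of how average precision (whose mean over queries is MAP) and NDCG can be computed for a single query. It assumes binary relevance for AP (label > 0) and linear gains with a log2 discount for NDCG; the function names and toy data are made up, and the evaluation tool used by `pt.Experiment` may differ in details such as tie handling.

```python
import math

def average_precision(ranking, qrels):
    """AP for one query; a document counts as relevant if its label is > 0."""
    n_relevant = sum(1 for label in qrels.values() if label > 0)
    hits, total = 0, 0.0
    for position, docno in enumerate(ranking, start=1):
        if qrels.get(docno, 0) > 0:
            hits += 1
            total += hits / position
    return total / n_relevant if n_relevant else 0.0

def dcg(labels):
    """Discounted cumulative gain with linear gains."""
    return sum(label / math.log2(position + 1) for position, label in enumerate(labels, start=1))

def ndcg(ranking, qrels, k=None):
    """NDCG for one query; k=None evaluates the full ranking."""
    gains = [qrels.get(docno, 0) for docno in ranking][:k]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0

# Hypothetical ranking and judgements for one query.
ranking = ["d1", "d2", "d3"]
qrels = {"d1": 2, "d3": 1, "d4": 1}

print(average_precision(ranking, qrels))
print(ndcg(ranking, qrels))
print(ndcg(ranking, qrels, k=10))
```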

In [14]:
STAGE2_CUTOFF = 100
# qrels were loaded with dtype=str; the evaluation needs integer labels
qrels_df = qrels_df.astype({'label': 'int32'})
pt.Experiment(
    [bm25 % STAGE2_CUTOFF, rf_pipe % STAGE2_CUTOFF, lmart_x_pipe % STAGE2_CUTOFF],
    test_topics,
    qrels_df, 
    names=["BM25", "BM25 + RF", "BM25 + Lmart"],
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "mrt"])

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.0s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    0.0s finished


| | name | map | ndcg | ndcg_cut_10 | mrt |
|---|---|---|---|---|---|
| 0 | BM25 | 0.319667 | 0.528478 | 0.818269 | 17.131405 |
| 1 | BM25 + RF | 0.262618 | 0.496678 | 0.761424 | 88.714593 |
| 2 | BM25 + Lmart | 0.223307 | 0.478142 | 0.671745 | 54.55159 |
