# Homework 4, Part 2

In Part 1, we saw how to create a bi-encoder to estimate the relevance of a query-document pair and generate these relevance scores. In Part 2, we'll see how to integrate those scores into a learning to rank (L2R) model with a few features.

For this part, you are going to:
1. Create the dataset ready to use for Pyterrier.
2. integrate the cosine similarity you have got in part 1 into the features of learning to rank models.


Learning goals for Homework 4, Part 2:
* Improve familiarity with installing and running Pyterrier code
* Learn how to use L2R models in Pyterrier
* Learn how to add custom features to L2R models with Pyterrier.
* Deepen your understanding of how different models perform in mixed-domain settings (e.g., text queries / code docs)


### Step 0: install things as needed

In case you didn't do any of Homework 3 (which was extra credit), please be sure to have the following libraries installed and ready. The installation command is commented out for now but uncomment and run each as needed.

In [3]:
!pip install fastrank
!pip install lightgbm
!pip install python-terrier

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Task 1: Creating a dataset with precomputed features

## Task 1.1

Load in the dataset used for evaluation as a pandas data frame, which is in `final_evaluation_set.csv`. Then print the number of unique queries (99), unique code-documents in the dataset (958) to verify it was loaded correctly.

In [4]:
import pandas as pd
import numpy as np

In [117]:
# TODO
finalFile = pd.read_csv('final_evaluation_set.csv')
finalFile = pd.DataFrame(finalFile)
# finalFile.insert(0, 'docid', range(len(finalFile)))
# finalFile['docid'] = finalFile['docid'].astype(str)
finalFile = finalFile.rename(columns = {'code':'text'})

uni_query = set(finalFile['Query'])
uni_code = set(finalFile['text'])

df_code = pd.DataFrame(list(uni_code), columns=['text'])
df_code.insert(0, 'docno', range(1,len(df_code)+1))
df_code['docno'] = df_code['docno'].astype(str)

print(len(uni_query))
print(len(df_code))
df_code


99
958


Unnamed: 0,docno,text
0,1,"def linear_regression(self, target, regression..."
1,2,def ConsoleType(t=gtk.TextView): class console...
2,3,"def _do_post(self, url, **kwargs): """""" Convini..."
3,4,"def readcsv(fn): """""" Wrapper to read arbitrary..."
4,5,"def scatter(self, ax, X, Y, Z=None, color=Tang..."
...,...,...
953,954,"def decode_longitude(self, longitude): match =..."
954,955,"def unzip_unicode(output, version): """"""Unzip t..."
955,956,"def get_enum_from_name(self, enum_name): """""" R..."
956,957,"def __call__(self, value): for substring in se..."


## Task 1.2: Creating an index  (5 points)

Since the code documents are text, we can still create an index to store them (just like regular documents before). Before, we mostly used pre-built indices or loaded them from file. In this part, you'll see how to create your own index from a pandas dataframe. 

The rough steps are as follows:
* Start pyterrier
* Map each unique code document to a unique string identifier (keep this around in a dictionary!)
* Create a pandas DataFrame of each unique code-document with two columns:
  * `text` containing the contents of the code-document 
  * `docid` a unique string identifier for that code-document
* use pyterrier's [`DFIndexer`](https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html) to create an index from the data frame.

Once you're finished with these steps, print the collection statistics, which should look something like this:
```
Number of documents: 958
Number of terms: 4929
Number of postings: 26358
Number of fields: 0
Number of tokens: 65017
Field names: []
Positions:   false
```

In [6]:
# TODO: Set this based on where Java is installed
!export JAVA_HOME=/usr/lib/jvm/java-18-openjdk-amd64/

In [7]:
!which java

/usr/bin/java


In [118]:
# TODO
import pyterrier as pt
import os
if not pt.started():
    pt.init()
    
index_dir = './final_index'
indexer = pt.DFIndexer(index_dir, overwrite=True)
index_ref = indexer.index(df_code["text"], df_code["docno"])
index_ref.toString()

index = pt.IndexFactory.of(index_ref)

print(index.getCollectionStatistics().toString())

Number of documents: 958
Number of terms: 4929
Number of postings: 26358
Number of fields: 0
Number of tokens: 65017
Field names: []
Positions:   false



## Task 1.3: Preparing the query data

We'll be using Pyterrier's `Experiment` framework to do our evaluation so we'll need to organize our queries in the test set into a pandas `DataFrame`. Create a new dataframe for all unique queries with two columns:
* `query` the text of the query
* `qid` a unique string identifier for that query

In [119]:
# TODO

query_df = pd.DataFrame(list(uni_query), columns=['query'])
query_df.insert(0, 'qid', range(1,len(query_df)+1))
query_df['qid'] = query_df['qid'].astype(str)
query_df

Unnamed: 0,qid,query
0,1,get executable path
1,2,format date
2,3,encrypt aes ctr mode
3,4,json to xml conversion
4,5,httpclient post json
...,...,...
94,95,extract data from html content
95,96,html encode string
96,97,finding time elapsed using a timer
97,98,convert int to bool


## Task 1.4: Preparing the Evaluation data

In the final step, we'll create a single data frame that contains the queries, documents, and true relevance scores, which we'll use to evaluate our models using `pt.Experiment`. Your dataframe should have three columns:
* `qid` the unique string identifier for a query
* `docno` the unique string identifier for a code-document
* `label` the relevance score for that query-document pair

In [120]:
# TODO
final_df = finalFile[['Query', 'text', 'relevance']]
# print(final_df)
# final_df[final_df['Query'] == ]

# qid_docno_dict = {}
qid_docno = []


def lookup_row(row):
  a = []
  qid = query_df[query_df['query'] == row['Query']]['qid']
  docno = df_code[df_code['text'] == row['text']]['docno']
  # qid_docno_dict[int(qid), int(docno)] = int(row['relevance'])
  a.extend(qid)
  a.extend(docno)
  a.extend([row['relevance']])
  # print(a)
  qid_docno.append(a)

bb = final_df.apply(lookup_row, axis = 1)
qrels_df = pd.DataFrame(qid_docno, columns = ['qid','docno', 'label'])

# all_df = []
# for q in query_df['qid']:
#   for d in df_code['docno']:
#     if (int(q), int(d)) in qid_docno_dict:
#       all_df.append([q,d,int(qid_docno_dict[int(q), int(d)])])
#     else:
#       all_df.append([q,d,''])
# qrels_df = pd.DataFrame(all_df, columns = ['qid','docno', 'label'])


# Task 2: Learning to Rank

The steps in Task 2 will have you running some evaluations and setting up a Learning to Rank model that we'll extend later to incorporate the bi-encoder features.

First, we'll split our labeled query-document data into train, development, and test sets so we can train models and evaluate unsupervised models.

In [121]:
SEED=42
from sklearn.model_selection import train_test_split
tr_va_topics, test_topics = train_test_split(query_df, test_size=30, random_state=SEED)
train_topics, valid_topics =  train_test_split(tr_va_topics, test_size=10, random_state=SEED)

## Task 3.1: Test baseline models (5 points)

In this initial step, create two `BatchRetrieve` rankers that use "BM25" or "TF_IDF" and run an `pt.Experiment` using them on the code index, using "map" and "ndcg" to evaluate their performance. We'll evaluate these only on the test data (no hyperparameter fine-tuning).

In [122]:
# TODO
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")

pt.Experiment(
    [bm25, tfidf],
    query_df,
    qrels_df,
    eval_metrics=["map", "ndcg"])


Unnamed: 0,name,map,ndcg
0,BR(BM25),0.76065,0.823841
1,BR(TF_IDF),0.76082,0.826136


## Task 3.2: Creating our first pipeline (10 points)

Let's start getting more complex with our pipelines. Create a feature pipeline that has three features:
1.   the BM25 code score;
2.   the TF-IDF code score;
3.   the coordinate match score for the query--i.e. how many query terms appear in the code;

We'll use these features later in learning to rank. Fo

In [123]:
# TODO

ltr_feats1 = bm25 >> pt.transformer.IdentityTransformer() ** tfidf ** pt.BatchRetrieve(index, wmodel="CoordinateMatch")
# qrels = qrels_df.rename(columns = {'label':'score'})
qrels = qrels_df



## Setting up the Learning to Rank (L2R) models

For the next part, you won't need to write any code (we've done it for you) but you will need to run the cells to train a few different kinds of L2R models on the training set. Each of the models captures a different kind of L2R that we talked about.

Train the following three models on our training set:
 - random forests from `scikit-learn`, a pointwise regression tree technique
 - coordinate ascent from FastRank, a listwise linear technique
 - LambdaMART from LightGBM, a listwise regression tree technique

In [124]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=400, verbose=1, random_state=SEED, n_jobs=2)

rf_pipe = ltr_feats1 >> pt.ltr.apply_learned_model(rf)

%time rf_pipe.fit(train_topics, qrels)

# print(qrels)
# %time rf_pipe.fit(train_topics, qrels_df)


[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.4s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    1.6s


CPU times: user 7.72 s, sys: 76.9 ms, total: 7.8 s
Wall time: 5.2 s


[Parallel(n_jobs=2)]: Done 400 out of 400 | elapsed:    3.4s finished


In [125]:
import fastrank
train_request = fastrank.TrainRequest.coordinate_ascent()

params = train_request.params
params.init_random = True
params.normalize = True
params.seed = 1234567

ca_pipe = ltr_feats1 >> pt.ltr.apply_learned_model(train_request, form='fastrank')

%time ca_pipe.fit(train_topics, qrels)

CPU times: user 4.12 s, sys: 22.2 ms, total: 4.14 s
Wall time: 2.99 s


In [126]:
import lightgbm as lgb

# this configures LightGBM as LambdaMART
lmart_l = lgb.LGBMRanker(
    task="train",
    silent=False,
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=1,
    max_bin=255,
    num_leaves=31,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[10],
    ndcg_at=[10],
    eval_at=[10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=100,
    early_stopping_rounds=5
)

lmart_x_pipe = ltr_feats1 >> pt.ltr.apply_learned_model(lmart_l, form="ltr", fit_kwargs={'eval_at':[20]})

%time lmart_x_pipe.fit(train_topics, qrels, valid_topics, qrels)

[1]	valid_0's ndcg@20: 0.586793
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's ndcg@20: 0.702743
[3]	valid_0's ndcg@20: 0.707713
[4]	valid_0's ndcg@20: 0.676361
[5]	valid_0's ndcg@20: 0.700949
[6]	valid_0's ndcg@20: 0.683647
[7]	valid_0's ndcg@20: 0.699805
[8]	valid_0's ndcg@20: 0.696416
Early stopping, best iteration is:
[3]	valid_0's ndcg@20: 0.707713
CPU times: user 2.25 s, sys: 93 ms, total: 2.34 s
Wall time: 2.22 s




## Task 3.4: Comparing L2R performance (10 points)

Now that we have all of our models, let's compare them with the baselines we had before. Run another `Experiment` that compare the three L2R models with the two baselines (BM25 and tf-idf). This time, we'll add "ndcg_cut_10" to see their performance on just the top 10 docs and "mrt" to see how fast the models are.

In [149]:
# TODO

pt.Experiment(
    [bm25 , tfidf, ca_pipe, rf_pipe , lmart_x_pipe ],
    query_df,
    qrels,
    names = ['bm25', 'tfidf', 'ca_pipe', 'rf_pipe', 'lmart_x_pipe'],
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "mrt"])



[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.0s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    0.2s
[Parallel(n_jobs=2)]: Done 400 out of 400 | elapsed:    0.3s finished


Unnamed: 0,name,map,ndcg,ndcg_cut_10,mrt
0,bm25,0.76065,0.823841,0.740103,3.897158
1,tfidf,0.76082,0.826136,0.740142,3.757861
2,ca_pipe,0.765443,0.832403,0.745021,38.839609
3,rf_pipe,0.817961,0.886145,0.82531,45.126647
4,lmart_x_pipe,0.730713,0.81035,0.713836,28.634719


# Task 4: Incorporating new features

We didn't expect those approaches to do too well since queries might not reflect the content in the code-documents. But our bi-encoder model knows how to compare both! In Task 4's steps, you'll incorporate it's relevance predictions into the model as another feature.

**Note**: For your course projects, if you use Pyterrier, this code should give you some idea of how to incorporate ranking features (or other information) that you've calculated from elsewhere.

## Task 4.1 Loading in the precomputed relevance data

Read in the dataframe with the bi-encoder's estimated relevance scores for each query-document pair (i.e., its cosine similarity), which we produced in Part 1. The length of the dataframe should be (number of unique query) * (number of unique documents).

In [138]:
# TODO

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
import numpy as np
import heapq
import spacy
from tqdm import tqdm

bi_encoder = pd.read_csv('bi-encoder.csv')
bi_encoder['sim'].astype('float')
bi_encoder = bi_encoder.rename(columns = {'sim':'label'})

## Task 4.2: Adding new features (10 points)

Once we have our bi-encoder estimates, we'll create a new pipeline that adds the score as a new feature. Recall that Pyterrier's [Pipeline](https://pyterrier.readthedocs.io/en/latest/pipeline_examples.html) is a transformation on a pandas `DataFrame` object. For us, that means we can write a function that operates on each row of the data frame and use pyterrier's [`apply`](https://pyterrier.readthedocs.io/en/latest/apply.html) (whhich is much like pand'as [`apply`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)). Specifically, we'll write some code that for a given row with a document and query, looks up the precomputed relevance score.

While there's many ways to do this, your steps should probably look something like this:
* Create some data structure that can map a tuple of the query id and document id to the bi-encoder's relevance score
* Write a function takes in a row from a `DataFrame` and uses the query id and document id in the row's columns to look up the bi-encoder's relevance.
* Copy and extend your earlier pipeline by adding one new feature that uses pyterrier's `apply` function with your new function. Call this new pipeline `bienc_ltr_feats` so the later training functions can use it

Once you have this pipeline in place, use the code below to retrain the models. 

Add the feature of cosine similarity between query and code embedding into the feaure pipeline. Train the three models and run the experiements again.

In [145]:
from decimal import localcontext
# TODO
# ltr_feats1 = bm25 >> pt.transformer.IdentityTransformer() ** tfidf ** pt.BatchRetrieve(index, wmodel="CoordinateMatch")

bi_encoder_dict = {}
def _to_dict(row):
  bi_encoder_dict[int(row['qid']), int(row['docno'])] = float(row['label'])
bi_encoder.apply(_to_dict,axis = 1)


def lookup_fct(key):
  return bi_encoder_dict[key]
bienc_ltr_feats =  bm25 >> pt.transformer.IdentityTransformer() ** tfidf  ** (pt.apply.lookup_fct(lambda row: (row['qid'], row['docno']))) ** pt.BatchRetrieve(index, wmodel="CoordinateMatch")

# def lookup_fct(row):
#   r = dict(row)
#   key = (int(r['qid']), int(r['docno']))
#   return bi_encoder_dict[key]
# row = {'qid':'2', 'docno':'5'}
# a = lookup_fct(row)
# a
# bienc_ltr_feats =  bm25 >> pt.transformer.IdentityTransformer() ** tfidf  ** (bm25 >> pt.apply.doc_score(lookup_fct)) ** pt.BatchRetrieve(index, wmodel="CoordinateMatch")
# bienc_ltr_feats = ltr_feats1


In [152]:
rf = RandomForestRegressor(n_estimators=400, verbose=1, random_state=SEED, n_jobs=2)
rf_pipe = bienc_ltr_feats >> pt.ltr.apply_learned_model(rf)
rf_pipe.fit(train_topics, qrels)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.5s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    2.1s
[Parallel(n_jobs=2)]: Done 400 out of 400 | elapsed:    4.3s finished


In [153]:
train_request = fastrank.TrainRequest.coordinate_ascent()
params = train_request.params
params.init_random = True
params.normalize = True
params.seed = 1234567
ca_pipe = bienc_ltr_feats >> pt.ltr.apply_learned_model(train_request, form='fastrank')
ca_pipe.fit(train_topics, qrels)

In [154]:
lmart_l = lgb.LGBMRanker(
    task="train",
    silent=False,
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=1,
    max_bin=255,
    num_leaves=31,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[10],
    ndcg_at=[10],
    eval_at=[10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=100,
    early_stopping_rounds=5
)
lmart_x_pipe.fit(train_topics, qrels, valid_topics, qrels)

[1]	valid_0's ndcg@20: 0.586793
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's ndcg@20: 0.702743
[3]	valid_0's ndcg@20: 0.707713
[4]	valid_0's ndcg@20: 0.676361
[5]	valid_0's ndcg@20: 0.700949
[6]	valid_0's ndcg@20: 0.683647
[7]	valid_0's ndcg@20: 0.699805
[8]	valid_0's ndcg@20: 0.696416
Early stopping, best iteration is:
[3]	valid_0's ndcg@20: 0.707713




## Task 4.3 Re-run the experiment here using the new features! (10 points)

In [155]:
pt.Experiment(
    [bm25, tfidf, ca_pipe, rf_pipe, lmart_x_pipe],
    query_df,
    qrels,
    names = ['bm25', 'tfidf', 'ca_pipe', 'rf_pipe', 'lmart_x_pipe'],
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "mrt"])
# TODO

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.0s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    0.1s
[Parallel(n_jobs=2)]: Done 400 out of 400 | elapsed:    0.3s finished


Unnamed: 0,name,map,ndcg,ndcg_cut_10,mrt
0,bm25,0.76065,0.823841,0.740103,3.380184
1,tfidf,0.76082,0.826136,0.740142,3.268419
2,ca_pipe,0.76671,0.830632,0.745759,34.894478
3,rf_pipe,0.817961,0.886145,0.82531,39.911935
4,lmart_x_pipe,0.730713,0.81035,0.713836,25.324495


# _Optional_: Evaluating the different models (20 points total; this is part 2)

How much training does the model actually need to recognize relevance? Would one epoch be enough? What if we did 10? or 100? (100 might be too many for Great Lakes limits...). In this **optional part**, we'll describe a series of steps you can take to explore this part!
 
The instructions in Part 1 had you update that notebook to save the model after each epoch and then generate relevance predictions for each, saving those to a file. In Part 2, we'll load those files and compare the performance:
 
Here's what you need to do:
* Using the code from the blocks above, create new version of the test data DataFrame that have predictions from each trained bi-encoder model. (i.e., you should have predictions from the model trained on one epoch worth of data, predictions from the model trained on two epochs, etc.)
* Retrain each L2R model using each of these new features, using just one feature at a time. This should give you (number of L2R models) * (number of different-epoch-trained-biencoder-models) worth of results.
* Create a line plot where
  * the x-axis is the number of epochs the bi-encoder model was trained
  * the y-axis is NDCG_cut_10
  * there are different lines for each L2R model (with different colors/hues for each model)
 
This plot should show you how much the bi-encoder's training time influences the scores. Compare that with the F1 performance plot you produced for Part 1. Does increasing F1 performance lead to increasing NDCG@10? How many epochs do you think you need to train to maximize performance?

**TODO:** For full credit, submit a separate doc/pdf with the plots from Parts 1 and 2 and a short paragraph describing your observations on the performance (see the questions above).