# LambdaMART algorithm

LambdaMART is perhaps the most popular "classic" Learning to Rank model. It's a [Gradient Boosted Decision Tree model](https://scikit-learn.org/stable/modules/ensemble.html). We'll get more into what that means in a bit.

What's important to note is why LambdaMART works well. Search typically has a wide array of features. Sometimes you have similarity calculations output from embedding. Sometimes you have other traditional search scores like BM25. Or you have statistics like popularity or recency or popularity to take into account.

It's been noted that for this kind of tabular data, [tree based models often outperform deep learning models](https://arxiv.org/abs/2207.08815).

Additionally, tree based models are easier to interpret. We can see the decisions being made in the model as a series of if statements around the feature values. Interpretability is very important. They also integrate well with existing search systems and query syntax, that let us mix and match different relevance signals.

First we'll overview using [these slides](https://docs.google.com/presentation/d/1LW2Nmy7GeTFGchUibnFR_BDas6ruBVo0YM72hFX9Hhs/edit#slide=id.g149c47b1ee6_1_0)

In [1]:
from aips import get_engine, indexer
engine = get_engine()

In [2]:
indexer.download_data_files("tmdb") # -> Holds "tmdb.json", big json dict with corpus
indexer.download_data_files("judgments") # -> Holds "ai_pow_search_judgments.txt", which is our labeled judgment list
tmdb_collection = indexer.build_collection(engine, "tmdb")

## Precanned training set

For our purposes, we've generated a training set with the title and overview 'match' BM25 scores already logged. It has the usual columns, we'll just point out the additions:

* qid - the query id, uniquely identifying a query
* features - an array holding the `[title, overview]` features (BM25 score of query in each)

In [3]:
import requests
from io import BytesIO
import pandas as pd


pkl_file = requests.get("http://softwaredoug.com/data/title_judgments_logged.json.gz")
judgments = pd.read_json(BytesIO(pkl_file.content), compression='gzip')
judgments = judgments.rename(columns={'docId': 'doc_id'})
judgments

Unnamed: 0,uid,qid,keywords,doc_id,grade,features
0,17555,1,rambo,7555,4,"[11.657399, 10.083591]"
1,11370,1,rambo,1370,3,"[9.456276, 13.265001]"
2,11369,1,rambo,1369,3,"[6.036743, 11.113943]"
3,113258,1,rambo,13258,2,"[0.0, 6.8695450000000005]"
4,11368,1,rambo,1368,4,"[0.0, 11.113943]"
...,...,...,...,...,...,...
1385,4037079,40,star wars,37079,0,"[0.0, 0.0]"
1386,40126757,40,star wars,126757,0,"[0.0, 0.0]"
1387,4039797,40,star wars,39797,0,"[0.0, 0.0]"
1388,4018112,40,star wars,18112,0,"[0.0, 0.0]"


## Initialize 'prediction' ~ a relevance score

Prediction is what our model will attempt to approximate given the features. You can think of it as an arbitrary score that attempts to rank by DCG (more on this in a bit).

Here we initialize it to 0

In [4]:
lambdas_per_query = judgments.copy()


lambdas_per_query['last_prediction'] = 0.0  # 'relevance score' f(title, overview) | DCG
lambdas_per_query.sort_values(['qid', 'last_prediction'], ascending=[True, False], kind='stable')

Unnamed: 0,uid,qid,keywords,doc_id,grade,features,last_prediction
0,17555,1,rambo,7555,4,"[11.657399, 10.083591]",0.0
1,11370,1,rambo,1370,3,"[9.456276, 13.265001]",0.0
2,11369,1,rambo,1369,3,"[6.036743, 11.113943]",0.0
3,113258,1,rambo,13258,2,"[0.0, 6.8695450000000005]",0.0
4,11368,1,rambo,1368,4,"[0.0, 11.113943]",0.0
...,...,...,...,...,...,...,...
1385,4037079,40,star wars,37079,0,"[0.0, 0.0]",0.0
1386,40126757,40,star wars,126757,0,"[0.0, 0.0]",0.0
1387,4039797,40,star wars,39797,0,"[0.0, 0.0]",0.0
1388,4018112,40,star wars,18112,0,"[0.0, 0.0]",0.0


## Compute per-position DCG stats

If our prediction is a relevance score, what happens when we sort by that score? Recall from previous notebooks the DCG stats we compute per rank.


In [5]:
import numpy as np

# Sort by our prediction
lambdas_per_query.sort_values(['qid', 'last_prediction'], ascending=[True, False], kind='stable')
lambdas_per_query['display_rank'] = lambdas_per_query.groupby('qid').cumcount()

# Compute stats for display rank
lambdas_per_query['discount'] = 1 / np.log2(2 + lambdas_per_query['display_rank']) # How much weight for this position
lambdas_per_query['gain'] = (2**lambdas_per_query['grade'] - 1)   # the 'gain' of this rank basically 'grade'

lambdas_per_query[['qid', 'display_rank', 'discount', 'grade', 'gain', 'features']]

Unnamed: 0,qid,display_rank,discount,grade,gain,features
0,1,0,1.000000,4,15,"[11.657399, 10.083591]"
1,1,1,0.630930,3,7,"[9.456276, 13.265001]"
2,1,2,0.500000,3,7,"[6.036743, 11.113943]"
3,1,3,0.430677,2,3,"[0.0, 6.8695450000000005]"
4,1,4,0.386853,4,15,"[0.0, 11.113943]"
...,...,...,...,...,...,...
1385,40,25,0.210310,0,0,"[0.0, 0.0]"
1386,40,26,0.208015,0,0,"[0.0, 0.0]"
1387,40,27,0.205847,0,0,"[0.0, 0.0]"
1388,40,28,0.203795,0,0,"[0.0, 0.0]"


## Pairwise swaps

Many Learning to Rank algorithms are **pairwise**. Instead of predicting the grades directly, we want to see what happens to our list-wise statistic (DCG) when we swap the display rank of two results.

Essentially we simulate every possible move in a query, and get a result of the DCG swap. `display_rank_x` is like a 'before' while `display_rank_y` is an after. Here `delta` is the corresponding impact to DCG (the absolute value).

We're getting close to something we can train - `delta` is actually going to become what we train on, after a bit of finegling.

Reference corresponding [google slides](https://docs.google.com/presentation/d/1LW2Nmy7GeTFGchUibnFR_BDas6ruBVo0YM72hFX9Hhs/edit#slide=id.g149c47b1ee6_1_0) to explore this in detail.

In [6]:
import numpy as np

# each group paired with each other group
swaps = lambdas_per_query.merge(lambdas_per_query, on='qid', how='outer')
# Why is the diff not? This is what my slides say
# (discount x * gain x) - (discount y * gain y) +
# delta_rank_1 + delta_rank_4
swaps['dcg_delta'] = np.abs((swaps['discount_x'] - swaps['discount_y']) * (swaps['gain_x'] - swaps['gain_y']))
swaps[['qid', 'discount_x', 'discount_y', 'gain_x', 'gain_y', 'display_rank_x', 'display_rank_y', 'dcg_delta']]

Unnamed: 0,qid,discount_x,discount_y,gain_x,gain_y,display_rank_x,display_rank_y,dcg_delta
0,1,1.000000,1.000000,15,15,0,0,0.000000
1,1,1.000000,0.630930,15,7,0,1,2.952562
2,1,1.000000,0.500000,15,7,0,2,4.000000
3,1,1.000000,0.430677,15,3,0,3,6.831881
4,1,1.000000,0.386853,15,15,0,4,0.000000
...,...,...,...,...,...,...,...,...
49019,40,0.201849,0.210310,0,0,29,25,0.000000
49020,40,0.201849,0.208015,0,0,29,26,0.000000
49021,40,0.201849,0.205847,0,0,29,27,0.000000
49022,40,0.201849,0.203795,0,0,29,28,0.000000


In [7]:
swaps[swaps['qid'] == 1][:10][['qid', 'discount_x', 'discount_y', 'gain_x', 'gain_y', 'display_rank_x', 'display_rank_y', 'dcg_delta']]

Unnamed: 0,qid,discount_x,discount_y,gain_x,gain_y,display_rank_x,display_rank_y,dcg_delta
0,1,1.0,1.0,15,15,0,0,0.0
1,1,1.0,0.63093,15,7,0,1,2.952562
2,1,1.0,0.5,15,7,0,2,4.0
3,1,1.0,0.430677,15,3,0,3,6.831881
4,1,1.0,0.386853,15,15,0,4,0.0
5,1,1.0,0.356207,15,1,0,5,9.013099
6,1,1.0,0.333333,15,1,0,6,9.333333
7,1,1.0,0.315465,15,0,0,7,10.268027
8,1,1.0,0.30103,15,0,0,8,10.48455
9,1,1.0,0.289065,15,0,0,9,10.664028


## rho - a different delta

We can think of `rho` as how CLOSE / FAR the document's relevance scores are from one another. Higher rho means we're currently predicting them to be very similar in relevance score.

For example `last_prediction_x=100` but `last_prediction_y=1`. `rho` would be `(1 / (1 + e^(100-1)))` or _very small_ - 1e-43.  `last_prediction_x=2` but `last_prediction_y=1` `rho` is larger - 0.23.

We'll compare `rho` to `delta` in a bit. Spoiler - if they point in the same direction - its good! If not, well the model still has work to do.

Starting out, our predictions are always 0, so rho is 0.5.

See [these slides](https://docs.google.com/presentation/d/1LW2Nmy7GeTFGchUibnFR_BDas6ruBVo0YM72hFX9Hhs/edit#slide=id.g163bf97d989_0_533) how rho works.

In [8]:
swaps['rho'] = 1 / (1 + np.exp(swaps['last_prediction_x'] - swaps['last_prediction_y']))
swaps[['qid', 'display_rank_x', 'display_rank_y', 'dcg_delta', 'last_prediction_x', 'last_prediction_y', 'rho']]

Unnamed: 0,qid,display_rank_x,display_rank_y,dcg_delta,last_prediction_x,last_prediction_y,rho
0,1,0,0,0.000000,0.0,0.0,0.5
1,1,0,1,2.952562,0.0,0.0,0.5
2,1,0,2,4.000000,0.0,0.0,0.5
3,1,0,3,6.831881,0.0,0.0,0.5
4,1,0,4,0.000000,0.0,0.0,0.5
...,...,...,...,...,...,...,...
49019,40,29,25,0.000000,0.0,0.0,0.5
49020,40,29,26,0.000000,0.0,0.0,0.5
49021,40,29,27,0.000000,0.0,0.0,0.5
49022,40,29,28,0.000000,0.0,0.0,0.5


## Compute lambdas

Next we compute `delta*rho`

If predection (`rho`) and `actual` point in the same direction, then lambda becomes very LOW. If they point in opposite directions `lambda` becomes HIGH.

Why is this? `lambda` actually corresponds to the *error in the current prediction*.

If `rho` is low, but `delta` is high, we know the predictions already correspond to a high `delta` (remember low `rho` means big delta in predictions). However if `rho` is high AND `delta` is high prediction and actual point in OPPOSITE directions for this pairwise swap.

In [9]:
swaps["lambda"] = 0

x_better = (swaps["grade_x"] > swaps["grade_y"])
swaps.loc[x_better, "lambda"] = swaps.loc[x_better, "dcg_delta"] * swaps.loc[x_better, "rho"]

swaps[["qid", "display_rank_x", "display_rank_y", "dcg_delta",
       "last_prediction_x", "last_prediction_y", "rho", "lambda",]]

  swaps.loc[x_better, "lambda"] = swaps.loc[x_better, "dcg_delta"] * swaps.loc[x_better, "rho"]


Unnamed: 0,qid,display_rank_x,display_rank_y,dcg_delta,last_prediction_x,last_prediction_y,rho,lambda
0,1,0,0,0.000000,0.0,0.0,0.5,0.000000
1,1,0,1,2.952562,0.0,0.0,0.5,1.476281
2,1,0,2,4.000000,0.0,0.0,0.5,2.000000
3,1,0,3,6.831881,0.0,0.0,0.5,3.415941
4,1,0,4,0.000000,0.0,0.0,0.5,0.000000
...,...,...,...,...,...,...,...,...
49019,40,29,25,0.000000,0.0,0.0,0.5,0.000000
49020,40,29,26,0.000000,0.0,0.0,0.5,0.000000
49021,40,29,27,0.000000,0.0,0.0,0.5,0.000000
49022,40,29,28,0.000000,0.0,0.0,0.5,0.000000


## Accumulate back to each query-doc pair

Now that we've played with stats in pair-wise swap land, we have to bring it back together into a table of accumulated lambdas for each query-doc pair.

The accumulated lambda *becomes the relevance score we want to predict*.

Think about what this means. A high accumulated lambda means that this document has a very very high impact on DCG. All the per-position swaps in the previous steps point at big DCG changes when this row is moved around with its peers of lower grades.

So ideally we could just rank by lambda. But our goal is to actually produce a model that predicts lambda given the features.

In [10]:
# Better minus worse
lambdas_x = swaps.groupby(['qid', 'display_rank_x'])['lambda'].sum().rename('lambda')
lambdas_y = swaps.groupby(['qid', 'display_rank_y'])['lambda'].sum().rename('lambda')
lambdas = lambdas_x - lambdas_y
lambdas
lambdas_per_query = lambdas_per_query.merge(lambdas, left_on=['qid', 'display_rank'], right_on=['qid', 'display_rank_x'], how='left')
lambdas_per_query[['qid', 'doc_id', 'grade', 'features', 'lambda']]

Unnamed: 0,qid,doc_id,grade,features,lambda
0,1,7555,4,"[11.657399, 10.083591]",211.781688
1,1,1370,3,"[9.456276, 13.265001]",46.938369
2,1,1369,3,"[6.036743, 11.113943]",30.637615
3,1,13258,2,"[0.0, 6.8695450000000005]",5.888732
4,1,1368,4,"[0.0, 11.113943]",43.177578
...,...,...,...,...,...
1385,40,37079,0,"[0.0, 0.0]",-9.853045
1386,40,126757,0,"[0.0, 0.0]",-9.911575
1387,40,39797,0,"[0.0, 0.0]",-9.966853
1388,40,18112,0,"[0.0, 0.0]",-10.019174


In [11]:
lambdas_per_query[lambdas_per_query['qid'] == 12]

Unnamed: 0,uid,qid,keywords,doc_id,grade,features,last_prediction,display_rank,discount,gain,lambda
388,1236685,12,rocky horror,36685,4,"[11.207376, 5.7777519999999996]",0.0,0,1.0,15,180.364938
389,121374,12,rocky horror,1374,0,"[8.115546, 9.808358]",0.0,1,0.63093,0,-2.768027
390,121246,12,rocky horror,1246,0,"[8.115546, 9.540649]",0.0,2,0.5,0,-3.75
391,121371,12,rocky horror,1371,0,"[8.115546, 5.557296]",0.0,3,0.430677,0,-4.269926
392,121375,12,rocky horror,1375,0,"[8.115546, 9.472043]",0.0,4,0.386853,0,-4.598604
393,121367,12,rocky horror,1367,0,"[8.115546, 8.262948]",0.0,5,0.356207,0,-4.828446
394,1281830,12,rocky horror,81830,0,"[5.8909245, 7.5825286]",0.0,6,0.333333,0,-5.0
395,1260375,12,rocky horror,60375,0,"[8.115546, 7.819334]",0.0,7,0.315465,0,-5.134013
396,12110123,12,rocky horror,110123,0,"[5.8909245, 0.0]",0.0,8,0.30103,0,-5.242275
397,1233475,12,rocky horror,33475,0,"[0.0, 6.275657]",0.0,9,0.289065,0,-5.332014


## Train!

Here we go ahead and train a LambdaMART - trying to predict lambda as a function of the features (remember our title and overview BM25 scores!)

In [12]:
from sklearn.tree import DecisionTreeRegressor, plot_tree

#2. Train a regression tree on this round's lambdas
features = lambdas_per_query['features'].tolist()
tree = DecisionTreeRegressor(max_leaf_nodes=10)
tree.fit(features, lambdas_per_query['lambda'])

tree

In [13]:
tree.predict([[0,0]])

array([-5.34640374])

In [14]:
tree.predict([[11.6,13.26]])

array([43.24172417])

## Repeat!

Gradient Boosted Decision Treees are an *ensemble model*. Each subsequent tree learns to compensate for the errors in the previous decision tree. Remember above the comparison of `rho` to the `deltas` to compute `lambda` - that's a reflection of this. That's where our error comes in.

So the next model learns NOT directly on the impact to DCG, but the current error in trying to achieve the 'relevance score' of the previous model.

All we need to do to add a new tree is to repeat the steps above again, but this time `last_prediction` is the outcome of the model.

In [15]:
lambdas_per_query = lambdas_per_query

learning_rate = 0.1 # optional learning rate
features = lambdas_per_query['features'].tolist()
predictions = tree.predict(features)
lambdas_per_query['last_prediction'] += predictions * learning_rate
lambdas_per_query = lambdas_per_query.drop('lambda', axis=1)
lambdas_per_query.sort_values(['qid', 'last_prediction'], ascending=[True, False], kind='stable')

Unnamed: 0,uid,qid,keywords,doc_id,grade,features,last_prediction,display_rank,discount,gain
0,17555,1,rambo,7555,4,"[11.657399, 10.083591]",17.936963,0,1.000000,15
1,11370,1,rambo,1370,3,"[9.456276, 13.265001]",4.163755,1,0.630930,7
2,11369,1,rambo,1369,3,"[6.036743, 11.113943]",-0.534640,2,0.500000,7
3,113258,1,rambo,13258,2,"[0.0, 6.8695450000000005]",-0.534640,3,0.430677,3
4,11368,1,rambo,1368,4,"[0.0, 11.113943]",-0.534640,4,0.386853,15
...,...,...,...,...,...,...,...,...,...,...
1385,4037079,40,star wars,37079,0,"[0.0, 0.0]",-0.534640,25,0.210310,0
1386,40126757,40,star wars,126757,0,"[0.0, 0.0]",-0.534640,26,0.208015,0
1387,4039797,40,star wars,39797,0,"[0.0, 0.0]",-0.534640,27,0.205847,0
1388,4018112,40,star wars,18112,0,"[0.0, 0.0]",-0.534640,28,0.203795,0


In [16]:
# Sort by our prediction
lambdas_per_query.sort_values(['qid', 'last_prediction'], ascending=[True, False], kind='stable')
lambdas_per_query['display_rank'] = lambdas_per_query.groupby('qid').cumcount()

# Compute DCG stats for display rank
lambdas_per_query['discount'] = 1 / np.log2(2 + lambdas_per_query['display_rank']) # How much weight for this position
lambdas_per_query['gain'] = (2**lambdas_per_query['grade'] - 1)   # the 'gain' of this rank

# each group paired with each other group
swaps = lambdas_per_query.merge(lambdas_per_query, on='qid', how='outer')
# changes[j][i] = changes[i][j] = (discount(i) - discount(j)) * (gain(rel[i]) - gain(rel[j]));
swaps['dcg_delta'] = np.abs((swaps['discount_x'] - swaps['discount_y']) * (swaps['gain_x'] - swaps['gain_y']))

# Rho - magnitude of prediction difference per pairwise swap
swaps['rho'] = 1 / (1 + np.exp(swaps['last_prediction_x'] - swaps['last_prediction_y']))
swaps[['qid', 'display_rank_x', 'display_rank_y', 'dcg_delta', 'last_prediction_x', 'last_prediction_y', 'rho']]

Unnamed: 0,qid,display_rank_x,display_rank_y,dcg_delta,last_prediction_x,last_prediction_y,rho
0,1,0,0,0.000000,17.936963,17.936963,5.000000e-01
1,1,0,1,2.952562,17.936963,4.163755,1.043209e-06
2,1,0,2,4.000000,17.936963,-0.534640,9.503519e-09
3,1,0,3,6.831881,17.936963,-0.534640,9.503519e-09
4,1,0,4,0.000000,17.936963,-0.534640,9.503519e-09
...,...,...,...,...,...,...,...
49019,40,29,25,0.000000,-0.534640,-0.534640,5.000000e-01
49020,40,29,26,0.000000,-0.534640,-0.534640,5.000000e-01
49021,40,29,27,0.000000,-0.534640,-0.534640,5.000000e-01
49022,40,29,28,0.000000,-0.534640,-0.534640,5.000000e-01


In [17]:
swaps[['qid', 'display_rank_x', 'display_rank_y', 'dcg_delta', 'last_prediction_x', 'last_prediction_y', 'rho']].sort_values('rho')

Unnamed: 0,qid,display_rank_x,display_rank_y,dcg_delta,last_prediction_x,last_prediction_y,rho
12198,10,0,27,11.912298,18.679182,-0.534640,4.524211e-09
36620,31,0,36,12.141729,18.679182,-0.534640,4.524211e-09
36621,31,0,37,11.351195,18.679182,-0.534640,4.524211e-09
36622,31,0,38,12.181473,18.679182,-0.534640,4.524211e-09
36623,31,0,39,12.200214,18.679182,-0.534640,4.524211e-09
...,...,...,...,...,...,...,...
36830,31,6,0,9.333333,-0.534640,18.679182,1.000000e+00
12202,10,1,0,2.952562,-0.534640,18.679182,1.000000e+00
12698,10,17,0,11.468866,-0.534640,18.679182,1.000000e+00
37363,31,19,0,10.812617,-0.534640,18.679182,1.000000e+00


In [18]:
swaps['lambda'] = 0
slice_x_better =swaps[swaps['grade_x'] > swaps['grade_y']]
swaps.loc[swaps['grade_x'] > swaps['grade_y'], 'lambda'] = slice_x_better['dcg_delta'] * slice_x_better['rho']
swaps[['qid', 'display_rank_x', 'display_rank_y', 'dcg_delta', 'last_prediction_x', 'last_prediction_y', 'rho', 'lambda']]

 1.02243886e-02 1.11973688e-02]' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
  swaps.loc[swaps['grade_x'] > swaps['grade_y'], 'lambda'] = slice_x_better['dcg_delta'] * slice_x_better['rho']


Unnamed: 0,qid,display_rank_x,display_rank_y,dcg_delta,last_prediction_x,last_prediction_y,rho,lambda
0,1,0,0,0.000000,17.936963,17.936963,5.000000e-01,0.000000e+00
1,1,0,1,2.952562,17.936963,4.163755,1.043209e-06,3.080139e-06
2,1,0,2,4.000000,17.936963,-0.534640,9.503519e-09,3.801408e-08
3,1,0,3,6.831881,17.936963,-0.534640,9.503519e-09,6.492691e-08
4,1,0,4,0.000000,17.936963,-0.534640,9.503519e-09,0.000000e+00
...,...,...,...,...,...,...,...,...
49019,40,29,25,0.000000,-0.534640,-0.534640,5.000000e-01,0.000000e+00
49020,40,29,26,0.000000,-0.534640,-0.534640,5.000000e-01,0.000000e+00
49021,40,29,27,0.000000,-0.534640,-0.534640,5.000000e-01,0.000000e+00
49022,40,29,28,0.000000,-0.534640,-0.534640,5.000000e-01,0.000000e+00


In [19]:
lambdas_per_query

Unnamed: 0,uid,qid,keywords,doc_id,grade,features,last_prediction,display_rank,discount,gain
0,17555,1,rambo,7555,4,"[11.657399, 10.083591]",17.936963,0,1.000000,15
1,11370,1,rambo,1370,3,"[9.456276, 13.265001]",4.163755,1,0.630930,7
2,11369,1,rambo,1369,3,"[6.036743, 11.113943]",-0.534640,2,0.500000,7
3,113258,1,rambo,13258,2,"[0.0, 6.8695450000000005]",-0.534640,3,0.430677,3
4,11368,1,rambo,1368,4,"[0.0, 11.113943]",-0.534640,4,0.386853,15
...,...,...,...,...,...,...,...,...,...,...
1385,4037079,40,star wars,37079,0,"[0.0, 0.0]",-0.534640,25,0.210310,0
1386,40126757,40,star wars,126757,0,"[0.0, 0.0]",-0.534640,26,0.208015,0
1387,4039797,40,star wars,39797,0,"[0.0, 0.0]",-0.534640,27,0.205847,0
1388,4018112,40,star wars,18112,0,"[0.0, 0.0]",-0.534640,28,0.203795,0


In [20]:
# Better minus worse
lambdas_x = swaps.groupby(['qid', 'display_rank_x'])['lambda'].sum().rename('lambda')
lambdas_y = swaps.groupby(['qid', 'display_rank_y'])['lambda'].sum().rename('lambda')
lambdas = lambdas_x - lambdas_y
lambdas
lambdas_per_query = lambdas_per_query.merge(lambdas, left_on=['qid', 'display_rank'], right_on=['qid', 'display_rank_x'], how='left')

In [21]:
lambdas_per_query

Unnamed: 0,uid,qid,keywords,doc_id,grade,features,last_prediction,display_rank,discount,gain,lambda
0,17555,1,rambo,7555,4,"[11.657399, 10.083591]",17.936963,0,1.000000,15,0.000007
1,11370,1,rambo,1370,3,"[9.456276, 13.265001]",4.163755,1,0.630930,7,-1.043223
2,11369,1,rambo,1369,3,"[6.036743, 11.113943]",-0.534640,2,0.500000,7,32.637615
3,113258,1,rambo,13258,2,"[0.0, 6.8695450000000005]",-0.534640,3,0.430677,3,9.697947
4,11368,1,rambo,1368,4,"[0.0, 11.113943]",-0.534640,4,0.386853,15,44.136259
...,...,...,...,...,...,...,...,...,...,...,...
1385,4037079,40,star wars,37079,0,"[0.0, 0.0]",-0.534640,25,0.210310,0,-3.533441
1386,40126757,40,star wars,126757,0,"[0.0, 0.0]",-0.534640,26,0.208015,0,-3.571462
1387,4039797,40,star wars,39797,0,"[0.0, 0.0]",-0.534640,27,0.205847,0,-3.607371
1388,4018112,40,star wars,18112,0,"[0.0, 0.0]",-0.534640,28,0.203795,0,-3.641358


In [22]:
from sklearn.tree import DecisionTreeRegressor, plot_tree

#2. Train a regression tree on this round's lambdas
features = lambdas_per_query['features'].tolist()
tree2 = DecisionTreeRegressor(max_leaf_nodes=10)
tree2.fit(features, lambdas_per_query['lambda'])

tree2

## Now we have two trees - an ensemble!!

We would keep predicting trees until we're satisfied.

In [23]:
predictions =  tree2.predict(features)
lambdas_per_query['last_prediction'] += predictions * learning_rate

lambdas_per_query

Unnamed: 0,uid,qid,keywords,doc_id,grade,features,last_prediction,display_rank,discount,gain,lambda
0,17555,1,rambo,7555,4,"[11.657399, 10.083591]",17.990589,0,1.000000,15,0.000007
1,11370,1,rambo,1370,3,"[9.456276, 13.265001]",4.217381,1,0.630930,7,-1.043223
2,11369,1,rambo,1369,3,"[6.036743, 11.113943]",-0.601604,2,0.500000,7,32.637615
3,113258,1,rambo,13258,2,"[0.0, 6.8695450000000005]",-0.601604,3,0.430677,3,9.697947
4,11368,1,rambo,1368,4,"[0.0, 11.113943]",-0.601604,4,0.386853,15,44.136259
...,...,...,...,...,...,...,...,...,...,...,...
1385,4037079,40,star wars,37079,0,"[0.0, 0.0]",-0.601604,25,0.210310,0,-3.533441
1386,40126757,40,star wars,126757,0,"[0.0, 0.0]",-0.601604,26,0.208015,0,-3.571462
1387,4039797,40,star wars,39797,0,"[0.0, 0.0]",-0.601604,27,0.205847,0,-3.607371
1388,4018112,40,star wars,18112,0,"[0.0, 0.0]",-0.601604,28,0.203795,0,-3.641358


## Exercises

* Can you put the code above into a loop that does N rounds of building an ensemble?
* What would happen if you tried to optimize a statistic other than DCG? How would you use your own statistic?
* Are _trees_ strictly nescesarry here? What if we used a different model than a tree based model for one of the rounds?
* Can you log / experiment with additional features for ranking?