[WIP] first cut at LambdaMART #2580

Open

wants to merge 21 commits into scikit-learn:master from jwkvam:lambdamart

Contributor

jwkvam commented Nov 9, 2013

This PR is an attempt to implement LambdaMART [1]. I imagine the biggest hurdle will be coming to some conclusion over the correct API since we need to include the queries somehow. In my implementation I use an extra keyword argument, I don't know if this causes problems elsewhere. My hope is that this PR can serve as a catalyst to resolve that, and then I can finish up the PR.

TODO

  • tests
  • docs
  • gbm comparison
  • Ranklib comparison
  • support an optional cutoff parameter for NDCG, usually denoted as NDCG@k
  • rewrite the example to use the yandex 2009 dataset instead of MQ2007/8, which would require the unrar command to extract
  • add a max_rank=10 cutoff parameter for NDCG and LambdaMART and use it to compute the lambdas too
  • factorize a public function ndcg_score in sklearn.metrics in pure Python first (see this comment)
  • benchmark python and cython DCG
  • implement pessimistic tie break for ndcg_score and bench it against no-tie break (default implementation sort order) and random tie break
  • clarify whether the pessimistic tie break is only needed for the stability of the ndcg_score score or for its derivatives in the LambdaMART optimization loop as well (see this comment)
  • investigate if pre-computing (caching) the ideal DCG for each query on the training set as done by Ranklib can speed up learning

Possibly for another PR:

  • abstract code to support measures other than NDCG

There was also a brief discussion on the mailing list [2]. Pinging @mblondel @ogrisel @pprett from that discussion.

[1] https://research.microsoft.com/en-us/um/people/cburges/tech_reports/msr-tr-2010-82.pdf
[2] http://sourceforge.net/mailarchive/forum.php?thread_name=CAP%2B3rpGVbSux5u4dZiatV3p1f1zUvndHXoi-oh4CjMRtSpjFsw%40mail.gmail.com&forum_name=scikit-learn-general

Coverage Status

Coverage remained the same when pulling 1afaf09 on jwkvam:lambdamart into c0e686b on scikit-learn:master.

Owner

agramfort commented Nov 9, 2013

one option we considered with @fabianp to support the query without changing the API too much is to have a 2-column y where the second column is the query index. It just means that y.ndim can now be 2.
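A minimal sketch of that encoding (hypothetical layout, plain numpy):

import numpy as np

# Hypothetical 2-column target: first column holds the relevance labels,
# second column holds the query (group) index of each sample.
relevance = np.array([2.0, 0.0, 1.0, 3.0, 0.0])
query_id = np.array([0, 0, 1, 1, 1])
y = np.column_stack([relevance, query_id])   # y.ndim == 2

# An estimator could split the two back apart inside fit():
labels, groups = y[:, 0], y[:, 1].astype(np.int32)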

Owner

pprett commented Nov 9, 2013

@agramfort I think that might be indeed a better solution than having an additional fit argument.

Owner

mblondel commented Nov 11, 2013

one option we considered with @fabianp to support the query without changing the API too much is to have a 2-column y where the second column is the query index. It just means that y.ndim can now be 2.

Sounds good. Plus, such an API would be compatible with GridSearchCV.

In the future, we can create query-aware cross-validation generators.
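A rough sketch of what such a generator could look like (hypothetical leave-one-query-out splitter, not an existing scikit-learn API):

import numpy as np

def leave_one_query_out(query_id):
    # Hypothetical query-aware CV generator: each fold holds out all
    # the samples belonging to one query.
    query_id = np.asarray(query_id)
    for q in np.unique(query_id):
        yield np.where(query_id != q)[0], np.where(query_id == q)[0]

for train_idx, test_idx in leave_one_query_out([0, 0, 1, 1, 2]):
    pass  # fit on train_idx, evaluate NDCG on test_idx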

@jwkvam I think I would use a dedicated class for LambdaMART.

Coverage Status

Coverage remained the same when pulling b217597 on jwkvam:lambdamart into c0e686b on scikit-learn:master.

Contributor

jwkvam commented Nov 11, 2013

Thanks for the feedback, adding the queries into the labels seems reasonable to me.

If this has its own class, I'm trying to think of the ways it would be different from GradientBoostingRegressor. I don't think the RegressorMixin makes much sense; I could write a score function which depends on the loss, and maybe we could abstract that out later into a mixin? Would it be worthwhile to let fit() in BaseGradientBoosting take an additional keyword argument, so I could split up y in LambdaMART's fit()? Or should I just leave all the fits alone?

Owner

mblondel commented Nov 13, 2013

In the case of ranking with a single query, LambdaMART could accept a 1d y too.

Owner

ogrisel commented Nov 13, 2013

one option we considered with @fabianp to support the query without changing the API too much is to have a 2-column y where the second column is the query index. It just means that y.ndim can now be 2.
Sounds good. Plus, such an API would be compatible with GridSearchCV.

I am not strongly opposed to it but still I find it a bit weird to use integer coding for the relevance scores or floating point coding for the query ids.

In the future, we can create query-aware cross-validation generators.

Indeed we will need it at some point.

Owner

ogrisel commented Nov 13, 2013

@jwkvam I think it's ok not to use the regressor mixin and not introduce a new ranking mixin yet. If we ever implement more ranking models in the future, it should be ok to devise a new ranking mixin to factorize the common code at that point.

Owner

mblondel commented Nov 13, 2013

I am not strongly opposed to it but still I find it a bit weird to use integer coding for the relevance scores or floating point coding for the query ids.

Also 2d y could be used for multi-output ranking.

Owner

ogrisel commented Nov 13, 2013

I think we need a general concept of "sample metadata": data structures whose first axis is n_samples and that should be sliced or resampled consistently with the first axis of the input data when doing cross-validation or resampling. Potentially impacted data structures:

  • sample groups: integers to encode sample groups per-subject, per-experiment, per-session or per-query; useful for scoring learning-to-rank models but also for informed cross-validation that takes the non-IIDness of the data into account (e.g. with LeavePLabelOut).
  • sample weights: floating point values to reweight the cost function, e.g. for boosting or for dealing with imbalanced datasets.
  • sample unique ids: integer or string ids to trace the provenance of individual samples, especially useful when re-sampling or nudging the dataset, or when introspecting prediction outcomes in cross-validation folds.
  • the target variable y for supervised learning (integers for binary, multiclass or multilabel classification, floating point for regression with potentially several outputs, or even a mix of the two for multitask learning).

There might be more cases that I forgot.

The policy until recently was to pass them as separate variables to the fit method, either as positional args (for the target variable(s) y) or as kwargs (e.g. for sample_weight).

However some parts of our API are currently broken / too limited. For instance transformers currently do not transform (or preserve) y nor sample_weight, making it impossible to pipeline transformers that change the number of samples (e.g. nudging transformers for data expansion or resampling transformers).

For now I would thus keep the query ids as a separate argument to fit and to the NDCG scoring function, and open a separate issue for devising how to make the API evolve to better deal with sample-wise auxiliary data.

There is also the problem of feature-wise metadata, such as feature names, that could be interesting to preserve or trace when working with FeatureUnion or feature selection transformers, but this is completely unrelated to this PR.

Owner

ogrisel commented Nov 13, 2013

I am not strongly opposed to it but still I find it a bit weird to use integer coding for the relevance scores or floating point coding for the query ids.

Also 2d y could be used for multi-output ranking.

This is an even better reason not to implicitly stick the query membership info directly as a new column in y.

Owner

mblondel commented Nov 13, 2013

I think it's ok not to use the regressor mixin and not introduce a new ranking mixin yet

Alright, then LambdaMART will need to implement its own score method.

Owner

mblondel commented Nov 13, 2013

For now I would thus keep the query ids as a separate argument to fit

In any case, query should be optional (if not provided, a single query is assumed). Another remark is that query is an information retrieval term. We could use the more neutral groups instead.
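A sketch of how the optional argument could be normalized inside fit (hypothetical helper; n_samples taken from X):

import numpy as np

def _check_group(group, n_samples):
    # Hypothetical helper: with no query information, treat the whole
    # dataset as a single group.
    if group is None:
        return np.zeros(n_samples, dtype=np.int32)
    group = np.asarray(group, dtype=np.int32)
    if group.shape[0] != n_samples:
        raise ValueError("group must have shape (n_samples,)")
    return group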

Owner

pprett commented Nov 13, 2013

@mblondel makes sense for some ranking models (e.g. RankingSVM), but does it make sense for LambdaMART? Does it optimize AUC if you feed it binary relevance scores?

Owner

ogrisel commented Nov 13, 2013

In any case, query should be optional (if not provided, a single query is assumed). Another remark is that query is an information retrieval term. We could use the more neutral groups instead.

+1 or maybe even sample_group to be consistent with sample_weight? This way we could use a naming convention in CV or resampling utilities to know which auxiliary data params should be consistently sliced and resampled with the input data and target value.

@mblondel makes sense for some ranking models (e.g. RankingSVM), but does it make sense for LambdaMART? Does it optimize AUC if you feed it binary relevance scores?

If you feed LambdaMART binary relevance scores it just optimizes NDCG with rel in {0, 1}, which is different from AUC (but a good ranking metric anyway).

Owner

mblondel commented Nov 13, 2013

@mblondel makes sense for some ranking models (e.g. RankingSVM), but does it make sense for LambdaMART? Does it optimize AUC if you feed it binary relevance scores?

The way I see it, ranking with a single query is similar to ordinal regression in the case when the supervision is given as relevance scores (from which you can derive a ranking). In the case of pairwise or listwise methods, the supervision can also take the form of pairwise or listwise orderings of the samples, i.e., without the notion of relevance score.

NDCG is nice because it focuses on the top k elements in the ranking whereas AUC takes into account the entire ranking (usually you care more about the top elements).
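For illustration, a minimal pure-Python NDCG@k along these lines (exponential gains and log2 discounts; a sketch, not the PR's Cython code):

import numpy as np

def ndcg_at_k(y_true, y_score, k=10):
    # Rank by decreasing predicted score, keep the top k.
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(y_score)[::-1][:k]
    gains = 2.0 ** y_true - 1.0
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = np.sum(gains[order] * discounts)
    ideal = np.sort(gains)[::-1][:len(order)]
    idcg = np.sum(ideal * discounts)
    return dcg / idcg if idcg > 0 else 0.0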

@ogrisel ogrisel and 1 other commented on an outdated diff Nov 21, 2013

sklearn/ensemble/gradient_boosting.py
@@ -585,6 +665,22 @@ def fit(self, X, y):
X, = check_arrays(X, dtype=DTYPE, sparse_format="dense",
check_ccontiguous=True)
y = column_or_1d(y, warn=True)
+ ranking = self.loss in ('ndcg')
+ if ranking:
+ if query is None:
+ raise ValueError("query must not be none with ranking measure")
@ogrisel

ogrisel Nov 21, 2013

Owner

As @mblondel said, I think we should treat the X, y data as stemming from a single query in that case.

@jwkvam

jwkvam Nov 22, 2013

Contributor

Thanks for the reminder, I will get around to it eventually.

Contributor

jwkvam commented Jan 2, 2014

Sorry for the lack of updates; admittedly I've been kind of lazy. I benchmarked the code (there are some unpushed changes) against GBM and the performance seems comparable, execution times aside. I used 50 trees, a depth of 3 for GBRT and interaction.depth of 8 for GBM; the learning rate is 0.1 with no subsampling of features or data. GBRT seems to be overfitting slightly more than GBM; this is just a guess, but it could be because GBM does some sort of normalization of the gradient [1]. I tried doing something similar out of curiosity but didn't see much difference; it's possible I miscoded it, however.

Dataset   Software   Training NDCG        Validation NDCG      Training Time
MQ2007    GBM        0.7206621488609668   0.6737084491077485   10.3 s
MQ2007    GBRT       0.7405864270405591   0.6777565353942717   30.6 s
MQ2008    GBM        0.7971141434532627   0.7728533219225161    2.1 s
MQ2008    GBRT       0.8105098974258365   0.7762440078297614    9.6 s

[1] https://code.google.com/p/gradientboostedmodels/source/browse/gbm/src/pairwise.cpp#786

@ogrisel ogrisel commented on an outdated diff Jan 13, 2014

sklearn/ensemble/gradient_boosting.py
@@ -585,6 +665,22 @@ def fit(self, X, y):
X, = check_arrays(X, dtype=DTYPE, sparse_format="dense",
check_ccontiguous=True)
y = column_or_1d(y, warn=True)
+ ranking = self.loss in ('ndcg')
@ogrisel

ogrisel Jan 13, 2014

Owner

You need a ('ndcg',) here for the single element tuple. I wonder how this code can work otherwise...

Owner

ogrisel commented Jan 13, 2014

@jwkvam could you please rebase this branch on top of master to take @pprett's recent refactoring into account?

Also, could you write a dataset loader for the MQ2007 and MQ2008 datasets, use them as part of a new example in the examples/applications folder, and compare the results with a pointwise linear regression model?

Owner

ogrisel commented Jan 13, 2014

Also could you please run the MQ2008 benchmark with line_profiler enabled on the fit method and report the results here?

Contributor

jwkvam commented Jan 25, 2014

@ogrisel I rebased so it's on top of @pprett's changes.

Other changes include:

  • The group argument is no longer necessary; if it isn't present, the entire dataset is treated as one group.
  • Followed gbm's approach of not modifying the target values by (2**y - 1); if the user wants that, they will need to preprocess the labels. Unfortunately this can lead to degenerate NDCG cases: for example, let the relevance values for two samples be log(2) and -log(3); the optimal DCG would then be zero, which could easily lead to -Inf NDCG values (see the sketch after this list). I like the power this gives to the user but it can lead to bad things happening.
  • If all the relevance values in a group are identical, gbm makes NDCG NaN. I like this approach and was going to incorporate it, since this case can inflate the score of the overall set, but I thought I'd check with others first.
  • Modified ZeroEstimator so it can be used with LambdaMART with integer values. It was a quick hack, so I expect people may want a different approach.
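A quick numeric check of the degenerate case from the second bullet, using the natural-log discounts 1 / log(2 + i) of the PR's _ndcg:

import numpy as np

# Relevance values chosen so that the ideal DCG cancels to zero.
y = np.array([np.log(2), -np.log(3)])
ideal = np.sort(y)[::-1]                           # best possible ordering
discounts = 1.0 / np.log(np.arange(2, len(y) + 2))
print(np.dot(ideal, discounts))                    # ~0.0, so NDCG divides by zero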

Running my same benchmark on MQ2008 takes 6.4 seconds now. I wasn't able to get line_profiler working, but here are the main culprits. I imagine I could speed up the NDCG code by presorting the labels and pushing the loops into Cython.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    2.309    0.046    2.329    0.047 {method 'build' of 'sklearn.tree._tree.DepthFirstTreeBuilder' objects}
       50    2.165    0.043    2.942    0.059 gradient_boosting.py:394(negative_gradient)
      154    2.118    0.014    2.620    0.017 gradient_boosting.py:364(__call__)
        4    0.448    0.112    0.486    0.121 npyio.py:880(savetxt)
        1    0.436    0.436    9.121    9.121 parse.py:3(<module>)

I'll add writing up that example to my TODO list.

I added LambdaMART to all the tests of the form for Cls in ...; they all pass aside from test_warm_start_oob_switch. I still need to figure out why that one is failing.

Owner

ogrisel commented Jan 26, 2014

Hi @jwkvam, could you please re-run some benchmarks using the max_leaf_nodes parameter instead of max_depth? It should be closer to GBM's behavior.

Contributor

jwkvam commented Jan 27, 2014

@ogrisel yes, thanks for your continued feedback.

I set max_leaf_nodes=9 which is equivalent to GBM's interaction.depth=8.

Dataset   Software                Training NDCG        Validation NDCG      Training Time
MQ2007    GBM                     0.7206621488609668   0.6737084491077485   10.3 s
MQ2007    GBRT max_depth=3        0.7405864270405591   0.6777565353942717   21.7 s
MQ2007    GBRT max_leaf_nodes=9   0.730921178027       0.679968120327       21.6 s
MQ2008    GBM                     0.7971141434532627   0.7728533219225161    2.1 s
MQ2008    GBRT max_depth=3        0.8105098974258365   0.7762440078297614    6.1 s
MQ2008    GBRT max_leaf_nodes=9   0.794975081183       0.773451965618        5.9 s

@ogrisel ogrisel commented on an outdated diff Jan 28, 2014

sklearn/ensemble/gradient_boosting.py
+ J. Friedman, Stochastic Gradient Boosting, 1999
+
+ T. Hastie, R. Tibshirani and J. Friedman.
+ Elements of Statistical Learning Ed. 2, Springer, 2009.
+ """
+
+ _SUPPORTED_LOSS = ('ndcg',)
+
+ def __init__(self, loss='ndcg', learning_rate=0.1, n_estimators=100,
+ subsample=1.0, min_samples_split=2, min_samples_leaf=1,
+ max_depth=3, init=None, random_state=None,
+ max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None,
+ warm_start=False):
+
+ super(LambdaMART, self).__init__(
+ loss, learning_rate, n_estimators, min_samples_split,
@ogrisel

ogrisel Jan 28, 2014

Owner

Please hardcode 'ndcg' for now as there is no alternative ranking loss implemented for lambdamart.

@ogrisel ogrisel commented on an outdated diff Jan 28, 2014

sklearn/ensemble/gradient_boosting.py
+class LambdaMART(BaseGradientBoosting):
+ """LambdaMART for learning to rank.
+
+ GB builds an additive model in a forward stage-wise fashion;
+ it allows for the optimization of arbitrary differentiable loss functions.
+ In each stage a regression tree is fit on the negative gradient of the
+ given loss function.
+
+ Parameters
+ ----------
+ loss : {'ndcg'}, optional (default='ndcg')
+ loss function to be optimized. 'ls' refers to least squares
+ regression. 'lad' (least absolute deviation) is a highly robust
+ loss function solely based on order information of the input
+ variables. 'huber' is a combination of the two. 'quantile'
+ allows quantile regression (use `alpha` to specify the quantile).
@ogrisel

ogrisel Jan 28, 2014

Owner

Please remove the loss parameter as it's always 'ndcg' for LambdaMART.

poulejapon commented Jan 28, 2014

I started testing this implementation's performance and validating its results against Ranklib yesterday evening on the Yandex dataset.

It runs smoothly with subsample=1.0, but subsample < 1.0 fails with the following error.

[2014/01/28-01:09:29.870] [Exec-20] [INFO] [process]  -   File "/tmp/1390867768644-0/script.py", line 51, in <module>
[2014/01/28-01:09:29.870] [Exec-20] [INFO] [process]  -     model.fit(X, score, monitor=None, group=group)
[2014/01/28-01:09:29.870] [Exec-20] [INFO] [process]  -   File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 1720, in fit
[2014/01/28-01:09:29.871] [Exec-20] [INFO] [process]  -     return super(LambdaMART, self).fit(X, y, monitor, group)
[2014/01/28-01:09:29.871] [Exec-20] [INFO] [process]  -   File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 901, in fit
[2014/01/28-01:09:29.871] [Exec-20] [INFO] [process]  -     begin_at_stage, monitor, group)
[2014/01/28-01:09:29.871] [Exec-20] [INFO] [process]  -   File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 958, in _fit_stages
[2014/01/28-01:09:29.872] [Exec-20] [INFO] [process]  -     y_pred[~sample_mask], group=group_oob)
[2014/01/28-01:09:29.872] [Exec-20] [INFO] [process]  -   File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 369, in __call__
[2014/01/28-01:09:29.872] [Exec-20] [INFO] [process]  -     last_group = group[0]
[2014/01/28-01:09:29.872] [Exec-20] [INFO] [process]  - IndexError: index out of bounds

Has someone encountered the same error?
Also, let me know if you want access to the dataset (around 11 GB).

Owner

ogrisel commented Jan 28, 2014

@poulejapon I was about to report the exact same error when trying on MSLR-10K:

Traceback (most recent call last):

  File "<ipython-input-47-8b8bb31a05fb>", line 1, in <module>
    get_ipython().run_cell_magic(u'time', u'', u'\nfrom sklearn.ensemble import LambdaMART\n\nlmart= LambdaMART(n_estimators=100, max_leaf_nodes=7,\n                  learning_rate=0.03,\n                  subsample=0.5, random_state=1, verbose=1)\nlmart.fit(X_train_small, y_train_small, group=qid_train_small)')
  File "/volatile/ogrisel/envs/py27/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2129, in run_cell_magic
    result = fn(magic_arg_s, cell)
  File "<string>", line 2, in time
  File "/volatile/ogrisel/envs/py27/local/lib/python2.7/site-packages/IPython/core/magic.py", line 191, in <lambda>
    call = lambda f, *a, **k: f(*a, **k)
  File "/volatile/ogrisel/envs/py27/local/lib/python2.7/site-packages/IPython/core/magics/execution.py", line 1045, in time
    exec code in glob, local_ns
  File "<timed exec>", line 7, in <module>
  File "/volatile/ogrisel/code/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 1720, in fit
    return super(LambdaMART, self).fit(X, y, monitor, group)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 901, in fit
    begin_at_stage, monitor, group)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 950, in _fit_stages
    random_state)
  File "_gradient_boosting.pyx", line 315, in sklearn.ensemble._gradient_boosting._ranked_random_sample_mask (sklearn/ensemble/_gradient_boosting.c:4479)
ValueError: Buffer dtype mismatch, expected 'int32' but got 'long'

Not quite the same. To get to mine, I think you need to .astype(np.int32) your group vector. :)

Owner

ogrisel commented Jan 28, 2014

Yes indeed. I get the same error as yours once I switch the dtype. So we need:

  • use cython fused types to support int32 and int64 indices (or convert to int32 by assuming that we have less than 4B queries in the dataset which is reasonable)
  • fix the actual sampling bug.

I'm unfamiliar with cython so I didn't try to take a look at the cython types thingy.

For the other bug, I patched it this way:
fulmicoton/scikit-learn@8088b52

I just ran a larger test, but on my first test the oob_improvement_ column showed 0 for every new iteration/tree.

I don't know if that is the right way to do it, but I tracked the oob_score_ using a monitor, and it looked just fine.

def monitor(i, model, *args, **kargs):
    print i, model.oob_score_.mean()
Owner

ogrisel commented Jan 29, 2014

@jwkvam could you please rebase this branch on top of master and recythonize the _tree.c and _gradient_boosting.c to take the latest optims from master into account?

Contributor

jwkvam commented Jan 29, 2014

@poulejapon Thanks for catching the sampling bug; I missed a snippet of code when I last rebased. Obviously I still have many tests to add. I created a fused type so you don't need to explicitly cast to int32, if that's a reasonable approach. I'm not sure about your lack of oob improvement at the moment; thanks for testing it on another dataset though.

jwkvam and others added some commits Jan 25, 2014

@jwkvam jwkvam lambdamart, rebased on master c423210
@fulmicoton @jwkvam fulmicoton + jwkvam Fixes bug happening when subsample < 1 with non-individual groups in LambdaMART.

Basically _ranked_random_sample_mask does a perfect job selecting a sample mask consistent with the bags: a group should be either fully out of the bag or fully within it.

However, the n_total_in_bag argument passed to it gives the number of elements to put in the sample rather than the number of groups.
1367bf9
@jwkvam jwkvam - hardcode ndcg loss
- use fused type for group to support int32 and int64
- add test for ranked sample mask
9f869e5
Contributor

jwkvam commented Jan 29, 2014

@ogrisel I rebased and ran my test case; I didn't notice a discernible difference, however.

Owner

ogrisel commented Jan 29, 2014

Thanks for the rebase. Why did you create a new test file test_sample_mask.py with such a generic name rather than adding the test to test_gradient_boosting.py? Also travis is reporting a failure in the latter.

Owner

ogrisel commented Jan 29, 2014

About the speed difference: it is expected not to have much impact for GBRT models, as it's just a CPU cache optimization. On my box a 33min GBRT fit was accelerated down to 30min. Anyway, having rebased on master now makes it possible for travis to run the full test suite.

Contributor

jwkvam commented Jan 29, 2014

Actually, after I committed it I was thinking that maybe it should just go in test_gradient_boosting.py; since you mentioned it, I'll move it over there. Yeah, I'm aware of the current travis failure: the oob improvement is 0 on occasion for LambdaMART, causing a failure. At the moment I'm not sure whether it is a poor test for LambdaMART or a real flaw.

@jwkvam jwkvam Added application example comparing LambdaMART with gradient boosting
and linear regression. Moved sampling test into main
test_gradient_boosting.py.
e7fc6b3
Contributor

jwkvam commented Jan 29, 2014

Here's the output I get from the script I put in examples/applications.

LambdaMART fit in 12.386242s
LambdaMART training score is 0.832991
LambdaMART validation score is 0.769186
GradientBoostingRegressor training score is 0.817139
GradientBoostingRegressor validation score is 0.756183
LinearRegression training score is 0.729005
LinearRegression validation score is 0.738941
Owner

ogrisel commented Jan 29, 2014

I had not noticed that the LETOR MQ2007 / MQ2008 datasets were shipped under the rar archive format. This is unfortunate because we cannot write a generic loader using only the Python standard library. Is there any other publicly available learning to rank dataset that is small enough (e.g. less than 50MB) but under a zip or tar.gz format?

Owner

ogrisel commented Jan 29, 2014

This yandex 2009 dataset might be interesting to build an example:

http://imat2009.yandex.ru/en/datasets

@mblondel mblondel commented on the diff Jan 29, 2014

sklearn/ensemble/_gradient_boosting.pyx
+
+ Returns
+ -------
+ sample_mask : np.ndarray, shape=[n_total_samples]
+ An ndarray where ``n_total_in_bag`` elements are set to ``True``
+ the others are ``False``.
+ """
+ cdef np.ndarray[float64, ndim=1, mode="c"] rand = \
+ random_state.rand(n_total_samples)
+ cdef np.ndarray[int8, ndim=1, mode="c"] sample_mask = \
+ np_zeros((n_total_samples,), dtype=np_int8)
+
+ cdef int n_bagged = 0
+ cdef int i = 0
+
+ for i from 0 <= i < n_total_samples:
@mblondel

mblondel Jan 29, 2014

Owner

for i in xrange(n_total_samples) is more Pythonic and should result in the same C code.

@glouppe

glouppe Jan 29, 2014

Owner

Even better, use range instead of xrange :)

@jwkvam

jwkvam Jan 30, 2014

Contributor

Sure, I'm still pretty inexperienced with cython, so I was following some example code; it seems the from style is no longer necessary.

@mblondel mblondel and 1 other commented on an outdated diff Jan 30, 2014

sklearn/ensemble/_gradient_boosting.pyx
+ else:
+ mask = 0
+ sample_mask[i] = mask
+
+ return sample_mask.astype(np_bool)
+
+
+def _ndcg(all32_64_t [::1] y, all32_64_t [:] sorty):
+ """Computes Normalized Discounted Cumulative Gain
+ Currently there is no iteration cap.
+ """
+ cdef int i
+ cdef double dcg = 0
+ cdef double max_dcg = 0
+ for i from 0 <= i < y.shape[0]:
+ dcg += y[i] / log(2 + i)
@mblondel

mblondel Jan 30, 2014

Owner

I think an option to choose between linear (y[i]) and exponential gains (2 ** y[i]) would be nice. In a project of mine, I prefer linear gains but it seems to me that most IR papers use exponential gains.

@jwkvam

jwkvam Jan 30, 2014

Contributor

To me, it would seem reasonable to let the user preprocess the labels. Perhaps a downside to this would be if a new user is expecting the code to automatically process the labels as 2 ** y - 1.
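The preprocessing in question is a one-liner, e.g.:

import numpy as np

y = np.array([0.0, 1.0, 2.0, 3.0])   # graded relevance labels
y_exp = 2.0 ** y - 1.0               # exponential gains, as in most IR papers
# linear gains would simply use y unchanged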

Owner

mblondel commented Jan 30, 2014

It seems to me that this implementation computes NDCG over the entire dataset. In a project of mine, I want to optimize NDCG@k so a constructor option to let the user specify this would be nice.

I've tried this implementation on my project and LambdaMART performs slightly worse than a RandomForestRegressor (sampling values of n_estimators and max_features) w.r.t. NDCG@k.

Owner

ogrisel commented Jan 30, 2014

@jwkvam I have opened a PR with cosmits: jwkvam/scikit-learn#1. I also noticed that the training loss can increase even with very low learning rates. I suspect ties handling might be the culprit.

Edit: ignore my remark on the increasing training loss: the criterion here is the NDCG score, which is a score to maximize rather than a loss to minimize. A deterministic handling of ties might be interesting nonetheless, as I suspect that the NDCG implementation in this LambdaMART is too optimistic.

Is there a standard way to handle ties in NDCG? For instance, what is the true NDCG for:

>>> y_toy_true = np.array([1, 0, 0], dtype=np.float32)
>>> y_toy_pred = np.array([0, 0, 0], dtype=np.float32)
>>> lmart.loss_(y_toy_true, y_toy_pred.reshape(-1, 1))
1.0

I doubt that always predicting zero should have a score of 1. WDYT?
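For comparison, a pessimistic tie break (ties in y_pred ordered by increasing y_true) would score this toy case well below 1. A sketch with linear gains and log2 discounts:

import numpy as np

y_true = np.array([1.0, 0.0, 0.0])
y_pred = np.array([0.0, 0.0, 0.0])

# Pessimistic order: decreasing y_pred, ties broken by increasing y_true,
# so the single relevant document sinks to the last position.
ix = np.lexsort([y_true, -y_pred])
discounts = 1.0 / np.log2(np.arange(2, len(y_true) + 2))
dcg = np.dot(y_true[ix], discounts)
idcg = np.dot(np.sort(y_true)[::-1], discounts)
print(dcg / idcg)   # 0.5 instead of the optimistic 1.0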

Owner

ogrisel commented Jan 30, 2014

It seems to me that this implementation computes NDCG over the entire dataset. In a project of mine, I want to optimize NDCG@k so a constructor option to let the user specify this would be nice.

I agree that the literature often reports NDCG@k for k in the 3–10 range. It seems that the implementation in ranklib can use a cutoff point to compute lambda (the DCG derivative) only for the top samples:

http://sourceforge.net/p/lemur/code/HEAD/tree/RankLib/trunk/src/ciir/umass/edu/learning/tree/LambdaMART.java#l377

Owner

ogrisel commented Jan 30, 2014

Maybe a good way to handle the ties would be to sort by decreasing y_pred first and then by increasing y_true in case of ties, to be pessimistic. We can use np.lexsort for this, apparently. We can also use it to sort several groups of samples in one go (sort by group, then by decreasing y_pred, then by increasing y_true):

>>> group  = np.array([1, 1, 1, 0, 0, 2, 2, 2, 2])
>>> y_true = np.array([3, 4, 0, 4, 2, 4, 3, 0, 0])
>>> y_pred = np.array([3, 0, 0, 2, 3, 2, 3, 0, 1])
>>> y_pred_inv = y_pred * -1
>>> ix = np.lexsort([y_true, y_pred_inv, group])
>>> ix
array([4, 3, 0, 2, 1, 6, 5, 8, 7])
>>> group[ix]
array([0, 0, 1, 1, 1, 2, 2, 2, 2])
>>> y_pred[ix]
array([3, 2, 3, 0, 0, 3, 2, 1, 0])
>>> y_true[ix]
array([2, 4, 3, 0, 4, 3, 4, 0, 0])

This way, in one call to numpy.lexsort we can sort all the samples at once in a manner that penalizes y_pred ties.

Contributor

jwkvam commented Jan 30, 2014

@ogrisel I'll take a look at the yandex dataset. I did see your PR, I'll merge it soon.

I guess I shouldn't be surprised that people would like a cutoff parameter; I'll add that to this PR. RankLib forces you to have a cutoff parameter [1]. GBM calls it max.rank; I would call the parameter max_rank unless there are objections/suggestions.

I haven't thought about handling ties up to this point; I've blissfully ignored the issue. At first glance I like your pessimistic approach, though. I'd be interested in seeing what other packages do.

[1] http://sourceforge.net/p/lemur/wiki/RankLib%20How%20to%20use/

Owner

glouppe commented Jan 30, 2014

Regarding ties, I think we should instead have a look at the literature to see what is the standard way to handle them instead of making something up. Rank-based metrics are often sensitive to ties handling. We should really be cautious about that.


Contributor

jwkvam commented Jan 30, 2014

Section 2.3 from [1] also mentions pessimistic handling of ties.

[1] http://web.mit.edu/rudin/www/BertsimasChRuOR38811.pdf

Owner

mblondel commented Jan 30, 2014

Most papers I've read assume no ties or don't even mention them. So I think results reported in various papers are implementation-dependent.

Owner

mblondel commented Jan 30, 2014

BTW, I've started to implement a few metrics in this branch:
https://github.com/mblondel/scikit-learn/tree/ranking_metrics

Contributor

jwkvam commented Jan 30, 2014

GBM avoids ties by perturbing the scores by small random values [1]; it scores the data as follows:

> perf.pairwise(c(1,0,0), c(0,0,0), c(0,0,0))
[1] 1
> perf.pairwise(c(0,1,0), c(0,0,0), c(0,0,0))
[1] 0.6309298
> perf.pairwise(c(0,0,1), c(0,0,0), c(0,0,0))
[1] 0.5

I would prefer a deterministic approach myself.

[1] https://code.google.com/p/gradientboostedmodels/source/browse/gbm/src/pairwise.cpp#40

Contributor

jwkvam commented Feb 11, 2014

Okay, I suppose following the principle of least surprise is good.

@mblondel, is my implementation fine by you? I also thought gains plural sounded a little awkward so I made it singular.

Owner

mblondel commented Feb 11, 2014

@jwkvam I personally prefer the plural form since each instance induces a gain. Also, could you use a private method instead of a lambda? Attributes with a trailing underscore are usually used for fitted parameters.

Contributor

jwkvam commented Feb 12, 2014

Okay that makes sense, it's private. I guess my thinking was that you choose a single transform function. But it's not a big deal to me and I don't want to turn this into a bikeshed, so if you feel strongly about it or if someone else wants to briefly chime in, I'll change it.

Contributor

jwkvam commented Feb 16, 2014

I tested different tie handling schemes on the MQ2007 and MQ2008 datasets. I only included the aggregate data over the 5 folds. I wouldn't read too much into the training times; they seem sensitive to my web browsing activity (they are all comparable, however). Unfortunately there doesn't appear to be a dominant strategy, at least not for this particular test. I was leaning towards pessimistic to begin with, so that's what I've implemented. At the very least I don't think it's a bad choice.

I profiled the script: for pessimistic, the cumulative time spent in np.lexsort was 5.5s out of a total runtime of 302s (1.82%); for no tie breaks, the cumulative time in np.argsort was 5s out of 300s (1.67%).

Note that all final scores were performed using pessimistic tie breaks.

Scheme        Dataset    Mean      Std       Train Time
Pessimistic   07 Train   0.58460   0.00926   40.47283s
Random        07 Train   0.58607   0.00858   41.63353s
None          07 Train   0.58618   0.01016   42.57884s
Pessimistic   07 Valid   0.52699   0.01604
Random        07 Valid   0.53066   0.01796
None          07 Valid   0.53054   0.01754
Pessimistic   08 Train   0.79043   0.00743   12.23776s
Random        08 Train   0.79386   0.00855   13.00861s
None          08 Train   0.79175   0.00744   12.89668s
Pessimistic   08 Valid   0.71254   0.02501
Random        08 Valid   0.70815   0.02286
None          08 Valid   0.71157   0.02759

Edit: I had to update the numbers because I was exponentiating the labels twice.

I futzed a little with the parameters: with n_estimators=250 and learning_rate=0.03 (both experiments were done with max_leaf_nodes=10), I get the following.

Scheme        Dataset    Mean      Std
Pessimistic   07 Train   0.58605   0.00997
None          07 Train   0.58611   0.01016
Pessimistic   07 Valid   0.53167   0.01418
None          07 Valid   0.52654   0.01796
Pessimistic   08 Train   0.79527   0.00725
None          08 Train   0.79579   0.00581
Pessimistic   08 Valid   0.71485   0.01989
None          08 Valid   0.70897   0.02246
Owner

ogrisel commented Feb 21, 2014

@jwkvam Nice work.

@poulejapon Indeed, pessimistic tie breaks are expected to have an influence on the final score (and also on the optimization). The factorization you mentioned earlier only solves the tie issue for the first model described in the paper (RankNet). In LambdaMART there is a delta-NDCG term that depends on the ordering. Furthermore, the truncation is also impacted by the way ties are dealt with.
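For reference, the |ΔNDCG| factor from the Burges tech report [1] for swapping the documents at ranks r_i and r_j is roughly (with exponential gains; the -1 gain offsets cancel in the difference):

|\Delta \mathrm{NDCG}_{ij}| =
  \frac{\bigl| 2^{y_i} - 2^{y_j} \bigr|}{\mathrm{IDCG}}
  \left| \frac{1}{\log_2(1 + r_i)} - \frac{1}{\log_2(1 + r_j)} \right|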

Owner

ogrisel commented Feb 21, 2014

@jwkvam could you please add more tests for the NDCG score function on toy predicted and true ranks with ties? For instance you could check assertions like:

ndcg_score([2, 1, 0], [1, 1, 1]) == ndcg_score([0, 1, 2], [1, 1, 1]) == ndcg_score([0, 1, 2], [0, 0, 0]) ==  1 + 1. / np.log2(1 + 1) + 2. / np.log2(1 + 2)

@mblondel mblondel and 1 other commented on an outdated diff Feb 25, 2014

sklearn/ensemble/gradient_boosting.py
+ T. Hastie, R. Tibshirani and J. Friedman.
+ Elements of Statistical Learning Ed. 2, Springer, 2009.
+ """
+
+ _SUPPORTED_LOSS = ('ndcg',)
+
+ def __init__(self, learning_rate=0.1, n_estimators=100,
+ subsample=1.0, min_samples_split=2, min_samples_leaf=1,
+ max_depth=3, init=None, random_state=None,
+ max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None,
+ max_rank=10, gain='exponential', warm_start=False):
+
+ self.gain = gain
+ self._gain = lambda y: y
+ if gain == 'exponential':
+ self._gain = lambda y: 2**y - 1
@mblondel

mblondel Feb 25, 2014

Owner

Could you use a private method and not a lambda? Lambdas cannot be pickled.

@jwkvam

jwkvam Feb 25, 2014

Contributor

@mblondel Nice catch, I'll add a test and change it.

Contributor

jwkvam commented Feb 25, 2014

@jwkvam could you please add more tests...

@ogrisel Definitely, it's still on my todo :)

@jwkvam jwkvam - Provide rationale for tree leaf updates, leaf = numerator /
  (denominator + eps)
- Give a consistent NDCG score of 1 to groups where all the labels are
  identical
- Moved lambda to private function, since cPickle doesn't support
  lambdas
- Added a couple tests for pessimistic NDCG
bfa2882
Contributor

jwkvam commented Feb 25, 2014

factorize a public function ndcg_score in sklearn.metrics

@ogrisel @mblondel Do you want this item to be included in this PR? It looks like it will be handled in #2805; then we can refactor this code when that is merged. What are your thoughts?

Owner

ogrisel commented Feb 25, 2014

factorize a public function ndcg_score in sklearn.metrics

@ogrisel @mblondel Do you want this item to be included in this PR? It looks like it will be handled in #2805; then we can refactor this code when that is merged. What are your thoughts?

We can delay that until #2805 is merged.

Owner

mblondel commented Feb 25, 2014

+1 to factoring code later.

@ogrisel about the ties: I see! Thank you for the explanation. Sorry for not helping much; I'm out of my depth here.

@glouppe glouppe commented on an outdated diff Feb 26, 2014

sklearn/ensemble/gradient_boosting.py
+
+ `loss_` : LossFunction
+ The concrete ``LossFunction`` object.
+
+ `init` : BaseEstimator
+ The estimator that provides the initial predictions.
+ Set via the ``init`` argument or ``loss.init_estimator``.
+
+ `estimators_`: list of DecisionTreeRegressor
+ The collection of fitted sub-estimators.
+
+ See also
+ --------
+ DecisionTreeRegressor, RandomForestRegressor
+
+ References
@glouppe

glouppe Feb 26, 2014

Owner

The proper reference for LambdaMART is the following:
Q. Wu, C.J.C. Burges, K. Svore and J. Gao. Adapting Boosting for Information Retrieval Measures. Information Retrieval, 2010.

@glouppe glouppe commented on an outdated diff Feb 26, 2014

sklearn/ensemble/gradient_boosting.py
@@ -1416,3 +1537,265 @@ def staged_predict(self, X):
"""
for y in self.staged_decision_function(X):
yield y.ravel()
+
+
+class LambdaMART(BaseGradientBoosting):
+ """LambdaMART for learning to rank.
+
+ GB builds an additive model in a forward stage-wise fashion;
+ it allows for the optimization of arbitrary differentiable loss functions.
+ In each stage a regression tree is fit on the negative gradient of the
+ given loss function.
+
@glouppe

glouppe Feb 26, 2014

Owner

The docstring here makes no reference to LambdaMART itself. It should be improved.

@glouppe glouppe commented on an outdated diff Feb 26, 2014

sklearn/ensemble/gradient_boosting.py
+ return y
+
+ def fit(self, X, y, monitor=None, group=None):
+ """Fit the gradient boosting model.
+
+ Parameters
+ ----------
+ X : array-like, shape = [n_samples, n_features]
+ Training vectors, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ y : array-like, shape = [n_samples]
+ Target values (integers in classification, real numbers in
+ regression)
+ For classification, labels must correspond to classes
+ ``0, 1, ..., n_classes_-1``.
@glouppe

glouppe Feb 26, 2014

Owner

This should be updated.

@glouppe

glouppe Feb 26, 2014

Owner

What I mean is that the docstring should explain what is the expected format for learning to rank problems.

@glouppe glouppe commented on an outdated diff Feb 26, 2014

sklearn/ensemble/gradient_boosting.py
+ regression)
+ For classification, labels must correspond to classes
+ ``0, 1, ..., n_classes_-1``.
+
+ monitor : callable, optional
+ The monitor is called after each iteration with the current
+ iteration, a reference to the estimator and the local variables of
+ ``_fit_stages`` as keyword arguments ``callable(i, self,
+ locals())``. If the callable returns ``True`` the fitting procedure
+ is stopped. The monitor can be used for various things such as
+ computing held-out estimates, early stopping, model introspect, and
+ snapshoting.
+
+ group : array-like, shape = [n_samples], optional (default=None)
+ Used to group samples. If not present, then all the
+ samples are treated as one group.
@glouppe

glouppe Feb 26, 2014

Owner

Can you be more explicit? I am afraid users won't understand what this means without some more context. (I don't.)

@glouppe

glouppe Feb 26, 2014

Owner

As far as I understand, group is specific to LambdaMART and only affects the underlying loss function ndcg. Is it really necessary to pass this argument down to the parent BaseGradientBoosting? Wouldn't it be possible to simply instantiate the loss function in LambdaMART, passing group in its constructor? That way, the group argument wouldn't sneak everywhere into the base implementation.
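A sketch of that idea (hypothetical stand-in classes; not what the PR currently does):

import numpy as np

class BaseGB:
    # Stand-in for BaseGradientBoosting: knows nothing about groups.
    def fit(self, X, y):
        self.loss_(y, np.zeros(len(y)))   # toy call to the loss
        return self

class NDCGLoss:
    # Hypothetical loss object carrying the group info itself.
    def __init__(self, group):
        self.group = group

    def __call__(self, y_true, y_pred):
        return 0.0   # per-group NDCG computation elided

class LambdaMART(BaseGB):
    def fit(self, X, y, group=None):
        self.loss_ = NDCGLoss(group)   # group injected here, nowhere else
        return super(LambdaMART, self).fit(X, y)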

@glouppe glouppe commented on an outdated diff Feb 26, 2014

sklearn/ensemble/gradient_boosting.py
@@ -735,7 +837,11 @@ def fit(self, X, y, monitor=None):
locals())``. If the callable returns ``True`` the fitting procedure
is stopped. The monitor can be used for various things such as
computing held-out estimates, early stopping, model introspect, and
- snapshoting.
+ snapshotting.
+
+ group : array-like, shape = [n_samples], optional (default=None)
+ Only used with LambdaMART, to group samples. If not present,
+ then all the samples are treated as one group.
@glouppe

glouppe Feb 26, 2014

Owner

Same comment here.

@glouppe glouppe commented on the diff Feb 26, 2014

sklearn/ensemble/gradient_boosting.py
@@ -822,26 +931,38 @@ def _fit_stages(self, X, y, y_pred, random_state, begin_at_stage=0,
# subsampling
if do_oob:
@glouppe

glouppe Feb 26, 2014

Owner

I think it is time for this to be factorized into a _set_oob_score method, as we do in forest. That way, _set_oob_score could be specialized in LambdaMART to take group into account, without complexifying the base implementation of GBRT.
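A sketch of the suggested factorization (hypothetical method and attribute names mirroring the forest code):

class BaseGBSketch:
    # Hypothetical base: OOB scoring factored into one overridable hook.
    def _set_oob_score(self, y, y_pred, sample_mask):
        self.oob_score_ = self.loss_(y[~sample_mask], y_pred[~sample_mask])

class LambdaMARTSketch(BaseGBSketch):
    def _set_oob_score(self, y, y_pred, sample_mask):
        # Ranking version: slice the stored group ids the same way.
        group_oob = self._group[~sample_mask]
        self.oob_score_ = self.loss_(y[~sample_mask], y_pred[~sample_mask],
                                     group=group_oob)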

@glouppe glouppe commented on an outdated diff Feb 26, 2014

sklearn/ensemble/gradient_boosting.py
+ Parameters
+ ----------
+ X : array-like, shape = (n_samples, n_features)
+ Test samples.
+
+ y : array-like, shape = (n_samples,)
+ True labels for X.
+
+ group : array-like, shape = [n_samples], optional (default=None)
+ Used to group samples. If not present, then the all the
+ samples are treated as one group.
+
+ Returns
+ -------
+ score : float
+ Mean accuracy of self.predict(X) wrt. y.
@glouppe

glouppe Feb 26, 2014

Owner

It is NDCG rather than accuracy, isn't it?

@glouppe glouppe commented on an outdated diff Feb 26, 2014

sklearn/ensemble/gradient_boosting.py
@@ -348,6 +355,97 @@ def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
tree.value[leaf, 0] = val
+class NormalizedDiscountedCumulativeGain(RegressionLossFunction):
+ """Quantify ranking by weighing more top higher ranked samples.
+
+ Contrary to other subclasses of RegressionLossFunction, this is not a loss
+ function to minimize but a score function to maximize.
+ """
+ def __init__(self, n_classes, max_rank=10):
+ super(NormalizedDiscountedCumulativeGain, self).__init__(n_classes)
+ if max_rank is not None:
+ assert max_rank > 0
@glouppe

glouppe Feb 26, 2014

Owner

Parameters should rather be checked at the beginning of fit.

@glouppe glouppe commented on an outdated diff Feb 26, 2014

sklearn/ensemble/_gradient_boosting.pyx
+
+def _max_dcg(all32_64_t [:] y_sorted):
+ """Computes Maximum Discounted Cumulative Gain
+ """
+ cdef int i
+ cdef double max_dcg = 0
+ for i in range(y_sorted.shape[0]):
+ max_dcg += y_sorted[i] / log(2 + i)
+ return max_dcg
+
+
+def _lambda(all32_64_t [::1] y_true, double [::1] y_pred,
+ max_rank):
+ """Computes the gradient and second derivatives for NDCG
+
+ This part of the LambdaMART algorithm.
@glouppe

glouppe Feb 26, 2014

Owner

typo: +is

Contributor

jwkvam commented Feb 27, 2014

@glouppe Thanks for the comments. I'm responding here since some of your still-relevant comments were folded by the commits. Please let me know if you still find those comments unclear.

As far as I understand, group is specific to to LambdaMART and only affects the underlying loss function ndcg. Is it really necessary to pass this argument down to the parent BaseGradientBoosting. Wouldn't it be possible to simply instantiate the loss function in LambdaMART, passing group in its constructor? That way, we wouldn't have the group argument to sneak everywhere into the base implementation.

The one other place group is being used is for random subsampling. I don't particularly care for the current state of that code either; the group_inbag and group_oob variables are kind of warty. I don't think I would want to make group a class variable, however. Would it be reasonable to pass group into BaseGradientBoosting and make specialized functions for the oob-related parts and the random sampling?

Owner

ogrisel commented Feb 27, 2014

I would rename that variable sample_group, as it's similar in spirit to the sample_weight convention.

Owner

ogrisel commented Feb 27, 2014

group or sample_group is data dependent (has shape (n_samples,)) so it should not be a parameter of the constructor of the estimators but rather an argument to the fit and score methods as sample_weight is for other estimators.

Contributor

jwkvam commented Feb 27, 2014

@ogrisel I'm probably confusing the issue, but I believe @glouppe was suggesting putting group in the NDCG class constructor.

Owner

glouppe commented Feb 27, 2014

My biggest concern is that, since group is specific to learning to rank, I don't think it should be hard-coded within the base implementation of GBRT. It really complexifies our base implementation while it shouldn't even touch it. Ideally, it should only appear within LambdaMART, and nowhere else.

I think this can be made possible by:

  • Injecting group into the NormalizedDiscountedCumulativeGain state, or redefining the __call__ and negative_gradient methods as partial functions where group has been set from the argument passed in fit or score (sketched after this list). That way, the base implementation of GBRT would be totally oblivious of this additional argument. (This is subject to discussion; there may be cleaner solutions.)
  • Factorizing the computation of the OOB estimates into a _set_oob_score method and then specializing it in LambdaMART.
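The first bullet could look roughly like this with functools.partial (a hypothetical sketch, including the set-before-fit / unset-after-fit caveat discussed below):

from functools import partial

class LambdaMARTPartialSketch:
    def _ndcg_loss(self, y_true, y_pred, group=None):
        return 0.0   # stand-in for the real per-group NDCG

    def _base_fit(self, X, y):
        return self  # stand-in for BaseGradientBoosting.fit

    def fit(self, X, y, group=None):
        # Bind group into the loss so the base implementation can keep
        # calling loss_(y_true, y_pred) with no extra argument.
        self.loss_ = partial(self._ndcg_loss, group=group)
        try:
            return self._base_fit(X, y)
        finally:
            self.loss_ = self._ndcg_loss   # unset: drop the reference to group
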
Owner

glouppe commented Feb 27, 2014

Please let me know if you still find those comments to be unclear.

Thanks for the improvements on the documentation. It is much clearer!

Contributor

jwkvam commented Feb 27, 2014

My issue with partial is that it would keep a reference to group. Would that be acceptable?

Owner

ogrisel commented Feb 27, 2014

@glouppe I think I agree with the plan expressed in #2580 (comment) although I have not checked the details.

Owner

glouppe commented Feb 27, 2014

My issue with partial is that it would keep a reference to group. Would that be acceptable?

It could be temporarily set as a partial: set before fit and unset after fit. (The same for score.)
Note that there may be a more elegant solution -- I have to think about it.

Contributor

jwkvam commented Feb 27, 2014

Keep in mind I would also need to create a partial function for _ranked_random_sample_mask, or deal with it in some other way.

Owner

glouppe commented Feb 27, 2014

Keep in mind I would also need to create a partial function for _ranked_random_sample_mask, or deal with it in some other way.

Indeed, but it could be factorized out e.g. in a _random_sample_mask method in the same way as _set_oob_score and then specialized in LambdaMART to invoke the ranked version.

Owner

glouppe commented Feb 27, 2014

Btw, sorry to bother you with these design issues. Your contribution is really great! But we really have to be careful not to complexify our implementations; otherwise, with time and additional features, they will blow up to the point where nobody can understand what is going on...

Contributor

jwkvam commented Feb 27, 2014

No, I agree completely. I'm not too happy with the new ZeroEstimator either, if you would like to help think of something better.

Contributor

jwkvam commented Feb 28, 2014

@glouppe When you have some time could you look at the refactoring done in jwkvam/scikit-learn@12b54ca?

Basically I replaced sample_group with **kargs; it seems cleaner to me, and the complexity introduced by LambdaMART now lives in that class. Unfortunately I had to introduce a mask parameter to the loss functions, but if people don't like it, I'll revert the commit.

Owner

ogrisel commented Feb 28, 2014

I don't see the point in adding **kwargs to the __call__ method of loss functions that don't use it (e.g. LeastSquares). Is it really necessary?

Contributor

jwkvam commented Feb 28, 2014

You're right, I removed them.

Owner

glouppe commented Mar 5, 2014

@jwkvam I had a quick look and it really is less intrusive! Thanks for the refactoring :)

Since this is still a WIP, do you need a review of something specific? Do you need help for something? We can give a hand if needed, just ask for it :)

Contributor

jwkvam commented Mar 10, 2014

@glouppe I just haven't had much time available; I'm not looking for any reviews yet. I should probably edit the initial comment to reflect the current status. Basically the plan I had was to:

  1. Determine if caching maximum DCGs is worthwhile.
  2. Run final benchmarks against ranklib and gbm.
  3. Add the yandex usage example.
  4. Fill gaps in documentation and tests.
  5. Ask for final reviews.

Sorry for how slow this is going; if you would like to help out, I'd welcome it. Maybe just let me know what you are tackling?
Owner

ogrisel commented Mar 10, 2014

Note the yandex dataset is the small yandex 2009 dataset from:

http://imat2009.yandex.ru/en/datasets

not the big 2013 Kaggle dataset about personalized search.

Contributor

jwkvam commented Mar 10, 2014

Right, that's what I meant.

@ogrisel ogrisel referenced this pull request in ogrisel/scikit-learn Apr 5, 2014

Closed

Fix a broadcasting error. #9

@mblondel mblondel commented on the diff Jul 11, 2014

sklearn/ensemble/_gradient_boosting.pyx
+ cdef double score_diff
+ cdef double ndcg_diff
+ cdef double rho
+ cdef double max_dcg
+ cdef int sign
+
+ if max_rank is None:
+ max_rank = len(y_true)
+ max_dcg = _max_dcg(np.sort(y_true)[::-1][:max_rank])
+ cdef double ndcg = 0
+ if max_dcg != 0:
+ for i in range(max_rank):
+ for j in range(i + 1, y_true.shape[0]):
+ if y_true[i] != y_true[j]:
+ if j < max_rank:
+ ndcg_diff = ((y_true[j] - y_true[i]) / log(2 + i)
@mblondel

mblondel Jul 11, 2014

Owner

You can precompute the logs in an array of size n_samples. Also make sure to use logarithms in base 2.
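The suggestion, sketched in Python (the PR code is Cython, but the idea is the same):

import numpy as np

# Precompute the base-2 discount table once per lambda computation
# instead of calling log() inside the double loop.
n_samples = 1000
log2_discounts = 1.0 / np.log2(np.arange(2, n_samples + 2))
# Inside the loops, position i then contributes gain[i] * log2_discounts[i].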

Contributor

lazywei commented Jun 8, 2015

How is this PR going? I have some use cases for LambdaMART and hence would love to see it merged.
Thanks

Owner

agramfort commented Jun 8, 2015

See the todo list at the top. Also, it needs a rebase. Please take over if you can.

Owner

mblondel commented Jun 8, 2015

I personally have had a bad experience with LambdaMART. In my experience, it performs either worse than or comparably to random forests or gradient boosting.

Ulden commented Aug 10, 2016

So how are things going? Is LambdaMART available?
