[WIP] sample_weight support #1574

Open
ndawe wants to merge 5 commits into scikit-learn:master from ndawe:weighted_scores

Conversation

Owner

ndawe commented Jan 14, 2013

Adding sample_weight support to:

  • metrics (see #3043)
  • scorer interface (see #3098 and continued in #3401)
  • grid_search
  • cross_validation
  • learning_curve
  • rfe

ndawe added a commit to ndawe/scikit-learn that referenced this pull request Jan 14, 2013

Owner

amueller commented Jan 15, 2013

I'd really like to get in #1381 first, hopefully shortly after the release. Not sure how this fits into the interface implemented there.

@jnothman jnothman and 1 other commented on an outdated diff Jun 1, 2013

sklearn/grid_search.py
@@ -110,21 +111,28 @@ def fit_grid_point(X, y, base_clf, clf_params, train, test, scorer,
X_train = X[safe_mask(X, train)]
X_test = X[safe_mask(X, test)]
+ score_params = dict()
+ if sample_weight is not None:
+ sample_weight_train = sample_weight[safe_mask(sample_weight, train)]
+ sample_weight_test = sample_weight[safe_mask(sample_weight, test)]
+ fit_params['sample_weight'] = sample_weight_train
+ score_params['sample_weight'] = sample_weight_test
+
if y is not None:
y_test = y[safe_mask(y, test)]
y_train = y[safe_mask(y, train)]
clf.fit(X_train, y_train, **fit_params)
@jnothman

jnothman Jun 1, 2013

Owner

If there are sample weights, shouldn't we be training with them too? Or do we need separate parameters for fit_sample_weight and score_sample_weight??? (Perhaps such prefixing is a general solution to this trouble.)

@jnothman

jnothman Jun 1, 2013

Owner

Especially when fit is called with a sample_weight arg for best_estimator_

@ndawe

ndawe Jun 2, 2013

Owner

@jnothman yes, sample_weight is used in both the calls to fit and score in fit_grid_point. There is no need for separate sample_weight arrays since it is treated the same way X and y are (split according to test and train subarrays). Also, if refit is true the sample weights are again used.

@jnothman

jnothman Jun 2, 2013

Owner

Ahh. I missed the modification of fit_params below.

Owner

ndawe commented Jun 2, 2013

@jnothman @amueller I will bring this PR back to life now. I need to rebase and resolve a few conflicts.

Owner

ndawe commented Jun 2, 2013

@jnothman @amueller I've rebased and cleaned up a few things. A few of the metrics now support sample_weight, but I think someone more knowledgeable about the other metrics should implement it there. This PR should otherwise be ready for review, particularly for grid searching with sample weights.

@jnothman jnothman commented on an outdated diff Jun 2, 2013

sklearn/metrics/metrics.py
@@ -496,7 +505,10 @@ def _binary_clf_curve(y_true, y_score, pos_label=None):
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
# accumulate the true positives with decreasing threshold
- tps = y_true.cumsum()[threshold_idxs]
+ if sample_weight is not None:
+ tps = (y_true * sample_weight).cumsum()[threshold_idxs]
+ else:
+ tps = y_true.cumsum()[threshold_idxs]
fps = 1 + threshold_idxs - tps
@jnothman

jnothman Jun 2, 2013

Owner

I think this line must also take account of sample_weight, or else fps might be less than 0.

The test for this function will be that pr_curve outputs values corresponding to naively calculating precision_recall_fscore_support at each threshold. It would be a nice test to have anyway to check our implementation is sane.
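A minimal sketch of how both counts could take the weights into account (illustrative only, not the PR's diff): tps becomes the cumulative weight of positives and fps the cumulative weight of negatives, so the two stay consistent for non-negative weights. Here y_true is assumed to be a 0/1 array sorted by decreasing score, and threshold_idxs the indices of the distinct thresholds.

    import numpy as np

    def _weighted_clf_counts(y_true, threshold_idxs, sample_weight=None):
        if sample_weight is None:
            sample_weight = np.ones_like(y_true, dtype=float)
        # cumulative weight of positives / negatives at each threshold
        tps = (y_true * sample_weight).cumsum()[threshold_idxs]
        fps = ((1 - y_true) * sample_weight).cumsum()[threshold_idxs]
        return fps, tps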

@jnothman jnothman commented on an outdated diff Jun 2, 2013

sklearn/metrics/metrics.py
@@ -989,7 +1005,7 @@ def jaccard_similarity_score(y_true, y_pred, normalize=True, pos_label=1):
return np.sum(score)
-def accuracy_score(y_true, y_pred, normalize=True):
+def accuracy_score(y_true, y_pred, sample_weight=None, normalize=True):
@jnothman

jnothman Jun 2, 2013

Owner

I don't think it's safe to reorder the parameters on public functions.

Owner

jnothman commented Jun 2, 2013

It probably comes down to personal preference, but I generally dislike the metrics being calculated multiple times depending on the parameters. I.e. I would much rather see something like this at the top, and then treat having sample_weight as the normal case:

if sample_weight is None:
    sample_weight = 1
    total_weight = len(y)
else:
    total_weight = np.sum(sample_weight)

(assuming sample_weight is being used as a multiplier; where it's passed to np.average, np.bincount, etc, it's fine to leave sample_weight=None.)
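As a rough illustration of this pattern (a toy function, not code from the PR), treating sample_weight as a multiplier and dividing by the total weight might look like:

    import numpy as np

    def toy_accuracy(y_true, y_pred, sample_weight=None, normalize=True):
        correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
        if sample_weight is None:
            sample_weight = np.ones_like(correct)
        total_weight = np.sum(sample_weight)
        score = np.sum(correct * sample_weight)
        return score / total_weight if normalize else score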

@jnothman jnothman commented on an outdated diff Jun 2, 2013

sklearn/grid_search.py
@@ -190,7 +190,8 @@ def __len__(self):
return self.n_iter
-def fit_grid_point(X, y, base_clf, clf_params, train, test, scorer,
+def fit_grid_point(X, y, sample_weight, base_clf, clf_params,
@jnothman

jnothman Jun 2, 2013

Owner

I guess for clean code, handling sample_weight specially makes some sense, but I wonder: what other parameters will require being sliced per fold that users will expect similar support for?

Owner

jnothman commented Jun 2, 2013

Couple of questions:

  1. Does wanting to weight samples when fitting necessarily imply weighting should be used in scoring, and vice-versa?
  2. Are there references or reference implementations of the metrics with weights? Or are we following our intuitions?
Owner

jnothman commented Jun 2, 2013

In particular, how does weighting interact with multilabel outputs? I assume the weight is applied to the instance as a whole, spread uniformly over the proposed labels for that instance (where that's meaningful).
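For instance, for subset accuracy on a multilabel indicator matrix, per-instance weighting might look like this hypothetical helper (an illustration, not the PR's code): each instance contributes its whole weight when its full label set matches.

    import numpy as np

    def weighted_subset_accuracy(y_true, y_pred, sample_weight=None):
        per_sample = np.all(np.asarray(y_true) == np.asarray(y_pred), axis=1)
        return np.average(per_sample.astype(float), weights=sample_weight)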

Owner

jnothman commented Jun 2, 2013

You also need a more general invariance test: for any metric, integer array sample_weight, and real k, the following should be equal: metric(y1, y2, sample_weight * k) == metric(np.repeat(y1, sample_weight), np.repeat(y2, sample_weight)).

Owner

jnothman commented Jun 3, 2013

I wrote:

Does wanting to weight samples when fitting necessarily imply weighting should be used in scoring, and vice-versa?
Thinking further: assuming weighting at fit and score time is the most common use case, the user that really wants this flexibility can wrap their scorer or estimator to discard the sample_weight parameter.

@jnothman jnothman commented on an outdated diff Jun 3, 2013

sklearn/grid_search.py
@@ -683,8 +706,11 @@ def fit(self, X, y=None, **params):
Target vector relative to X for classification;
None for unsupervised learning.
+ sample_weight : array-like, shape = [n_samples], optional
+ Sample weights.
@jnothman

jnothman Jun 3, 2013

Owner

This comment should specify that the weights are applied both in fitting and scoring, and that estimator.fit and scoring must support the sample_weight parameter. (And same in RandomizedSearchCV of course.)

Owner

jnothman commented Jun 5, 2013

This may be close to all the testing you need:

def test_sample_weight_invariance():
    y1, y2, _ = make_prediction(binary=True)
    int_weights = np.random.randint(10, size=y1.shape)
    for name, metric in METRICS_WITH_SAMPLE_WEIGHT:
        unweighted_score = metric(y1, y2, sample_weight=None)
        assert_equal(unweighted_score,
                     metric(y1, y2, sample_weight=np.ones(shape=y1.shape)),
                     msg='For %s sample_weight=None is not equivalent to '
                         'sample_weight=ones' % name)

        weighted_score = metric(np.repeat(y1, int_weights),
                                np.repeat(y2, int_weights))
        assert_not_equal(unweighted_score, weighted_score,
                         msg="Unweighted and weighted scores are unexpectedly "
                             "equal for %s" % name)
        assert_equal(weighted_score,
                     metric(y1, y2, sample_weight=int_weights),
                     msg='Weighting %s is not equal to repeating samples'
                         % name)
        for scaling in [2, 0.3]:
            assert_equal(weighted_score,
                         metric(y1, y2, sample_weight=int_weights * scaling),
                         msg='%s sample_weight is not invariant under scaling'
                             % name)
        assert_equal(weighted_score,
                     metric(y1, y2, sample_weight=list(int_weights)),
                     msg='%s sample_weight is not invariant to list vs array'
                         % name)
Owner

ndawe commented Aug 22, 2013

@jnothman thanks for your help! I finally had some time to get back to this. This PR should now be ready.

Regarding your questions:

  1. Does wanting to weight samples when fitting necessarily imply weighting should be used in scoring, and vice-versa?

Yes. Performance on samples with larger weights matters more than performance on samples with smaller weights. So each sample should contribute by its weight to the overall score.

  2. Are there references or reference implementations of the metrics with weights? Or are we following our intuitions?

Probably, but as you can see in the PR the changes are quite trivial.

Owner

ndawe commented Aug 22, 2013

In other words, I am unaware of existing implementations of all of these metrics that account for sample weights. That would be helpful though. But, again, the modifications required here are not too serious.

In my field of research (high energy particle physics), samples are almost always weighted and the weights can vary a lot. So any classifier I use must account for the sample weights and any evaluation of a classifier's performance must also account for the sample weights.

Owner

arjoly commented Aug 23, 2013

One thing that would be worth thinking about is the integration with the new scorer interface.

Owner

arjoly commented Aug 23, 2013

Hmm, we could have reviewed two PRs: one with the metrics and one on top of the other with grid searching and cross-validation scoring.

@arjoly arjoly and 1 other commented on an outdated diff Aug 23, 2013

sklearn/metrics/metrics.py
@@ -1812,7 +1839,7 @@ def recall_score(y_true, y_pred, labels=None, pos_label=1, average='weighted'):
@deprecated("Function zero_one_score has been renamed to "
            "'accuracy_score' and will be removed in release 0.15.")
-def zero_one_score(y_true, y_pred):
+def zero_one_score(y_true, y_pred, sample_weight=None):
@arjoly

arjoly Aug 23, 2013

Owner

This function is deprecated and will be removed in the next release.

@arjoly

arjoly Aug 23, 2013

Owner

Maybe you are looking for zero_one_loss.

@ndawe

ndawe Aug 23, 2013

Owner

Ah, thanks.

Owner

ndawe commented Aug 23, 2013

@arjoly yes, true. But, I need the weighted score metrics from this PR (for the grid_search unit tests), and I don't want to have a bunch of cherry-picked commits in the other PR.

Owner

ndawe commented Aug 23, 2013

I see... would it be fine to keep rebasing a grid_search PR on this one? I can split them again... Although all I have left is to add sample_weight to the scorers and then add unit tests.

Owner

arjoly commented Aug 23, 2013

Are there references or reference implementations of the metrics with weights? Or are we following our intuitions?

Probably, but as you can see in the PR the changes are quite trivial.

For some metrics, such as the mean squared error, I agree that taking the weight into account is straightforward. But for other metrics such as roc_auc_score or r2_score, I would be glad to see some references.

Owner

jnothman commented Aug 24, 2013

@arjoly, I think the definitions are pretty straightforward in terms of invariance with integer weights.

Coverage Status

Coverage remained the same when pulling 1ea2c7e on ndawe:weighted_scores into 613cf8e on scikit-learn:master.

Owner

jnothman commented Nov 22, 2013

@ndawe, is this still a WIP, or is it ready to be reviewed for merging?

Owner

ndawe commented Nov 23, 2013

@jnothman, yes this PR is ready to be reviewed for merging. Thanks a lot!

@jnothman jnothman and 1 other commented on an outdated diff Nov 25, 2013

sklearn/cross_validation.py
@@ -1125,7 +1139,8 @@ def cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=1,
scores : array of float, shape=(len(list(cv)),)
Array of scores of the estimator for each run of the cross validation.
"""
- X, y = check_arrays(X, y, sparse_format='csr', allow_lists=True)
+ X, y, sample_weight = check_arrays(X, y, sample_weight,
+ sparse_format='csr', allow_lists=True)
@jnothman

jnothman Nov 25, 2013

Owner

I think this will pass through a sample_weight that is a list, which will break on sample_weight[train], etc. Could you please make every test with sample_weight ensure that a list can be used? You don't seem to have any tests for this module yet.

@ndawe

ndawe Feb 23, 2014

Owner

Done. Modified _safe_split accordingly.

@jnothman jnothman and 1 other commented on an outdated diff Nov 25, 2013

sklearn/ensemble/weight_boosting.py
@@ -104,7 +104,8 @@ def fit(self, X, y, sample_weight=None):
sample_weight[:] = 1. / X.shape[0]
else:
# Normalize existing weights
- sample_weight = np.copy(sample_weight) / sample_weight.sum()
+ sample_weight = (np.copy(sample_weight) /
@jnothman

jnothman Nov 25, 2013

Owner

I don't understand why an explicit copy is necessary.

@ndawe

ndawe Jan 16, 2014

Owner

Fixed. Indeed, that copy is not needed.

@jnothman jnothman and 1 other commented on an outdated diff Nov 25, 2013

sklearn/ensemble/weight_boosting.py
@@ -190,7 +191,7 @@ def _boost(self, iboost, X, y, sample_weight):
"""
pass
- def staged_score(self, X, y):
+ def staged_score(self, X, y, sample_weight=None):
@jnothman

jnothman Nov 25, 2013

Owner

Please add a quick test for this.

@jnothman jnothman and 1 other commented on an outdated diff Nov 25, 2013

sklearn/grid_search.py
@@ -261,21 +266,28 @@ def fit_grid_point(X, y, base_estimator, parameters, train, test, scorer,
X_train = X[safe_mask(X, train)]
X_test = X[safe_mask(X, test)]
+ score_params = dict()
+ if sample_weight is not None:
+ sample_weight_train = sample_weight[safe_mask(sample_weight, train)]
+ sample_weight_test = sample_weight[safe_mask(sample_weight, test)]
+ fit_params['sample_weight'] = sample_weight_train
@jnothman

jnothman Nov 25, 2013

Owner

Should fit_params be copied first?

@ndawe

ndawe Jan 16, 2014

Owner

Yes, fixed.

@jnothman jnothman and 1 other commented on an outdated diff Nov 25, 2013

sklearn/grid_search.py
@@ -261,21 +266,28 @@ def fit_grid_point(X, y, base_estimator, parameters, train, test, scorer,
X_train = X[safe_mask(X, train)]
X_test = X[safe_mask(X, test)]
+ score_params = dict()
+ if sample_weight is not None:
+ sample_weight_train = sample_weight[safe_mask(sample_weight, train)]
@jnothman

jnothman Nov 25, 2013

Owner

safe_mask shouldn't be necessary (sparse sample_weight is not supported).

@ndawe

ndawe Jan 16, 2014

Owner

Fixed. This logic is now part of _safe_split in cross_validation.py.

@jnothman jnothman and 1 other commented on an outdated diff Nov 25, 2013

sklearn/metrics/metrics.py
@@ -684,7 +736,7 @@ def roc_curve(y_true, y_score, pos_label=None):
fps = np.r_[0, fps]
thresholds = np.r_[thresholds[0] + 1, thresholds]
- if fps[-1] == 0:
+ if fps[-1] <= 0:
@jnothman

jnothman Nov 25, 2013

Owner

Why did we need this change?

@ndawe

ndawe Jan 16, 2014

Owner

Negative weights could yield an overall negative value there.

@jnothman

jnothman Jan 16, 2014

Owner

Ah! I didn't realise we were handling negative weights.

@ndawe

ndawe Jan 17, 2014

Owner

Well, I'm just handling this edge case to avoid surprises.

@jnothman

jnothman Jan 17, 2014

Owner

Is there literature that considers such things?

@ndawe

ndawe Jan 17, 2014

Owner

I doubt there is significant literature that considers negative sample weights. The one brief reference I am familiar with is section 3.1.2 in http://tmva.sourceforge.net/docu/TMVAUsersGuide.pdf. I can't find anything else on the great Internet. That shouldn't stop us from protecting against strangeness that may occur if fps[-1] goes negative here. Maybe this warrants a separate warning with a clearer message stating that the weighted number of negative samples in y_true is negative (similarly for positive samples).

@jnothman jnothman and 1 other commented on an outdated diff Nov 25, 2013

sklearn/metrics/tests/test_metrics.py
@@ -1999,3 +2009,39 @@ def test_log_loss():
y_pred = np.asarray(y_pred) > .5
loss = log_loss(y_true, y_pred, normalize=True, eps=.1)
assert_almost_equal(loss, log_loss(y_true, np.clip(y_pred, .1, .9)))
+
+
+def test_sample_weight_invariance():
+ """Test weighted metrics"""
+ y1, y2, _ = make_prediction(binary=True)
+ int_weights = np.random.randint(10, size=y1.shape)
@jnothman

jnothman Nov 25, 2013

Owner

Please use a fixed seed so that test result (and debugging output) is deterministic.

@jnothman jnothman and 1 other commented on an outdated diff Nov 25, 2013

sklearn/metrics/tests/test_metrics.py
@@ -1999,3 +2009,39 @@ def test_log_loss():
y_pred = np.asarray(y_pred) > .5
loss = log_loss(y_true, y_pred, normalize=True, eps=.1)
assert_almost_equal(loss, log_loss(y_true, np.clip(y_pred, .1, .9)))
+
+
+def test_sample_weight_invariance():
+ """Test weighted metrics"""
+ y1, y2, _ = make_prediction(binary=True)
+ int_weights = np.random.randint(10, size=y1.shape)
+ for name, metric in METRICS_WITH_SAMPLE_WEIGHT.items():
+ unweighted_score = metric(y1, y2, sample_weight=None)
+ assert_equal(
+ unweighted_score,
+ metric(y1, y2, sample_weight=np.ones(shape=y1.shape)),
+ msg="For %s sample_weight=None is not equivalent to "
+ "sample_weight=ones" % name)
+ weighted_score = metric(y1, y2, sample_weight=int_weights)
@jnothman

jnothman Nov 25, 2013

Owner

Please insert a blank line before this one, and after each assertion in this test.

@jnothman jnothman and 1 other commented on an outdated diff Nov 25, 2013

sklearn/tests/test_grid_search.py
@@ -642,3 +644,42 @@ def test_grid_search_with_multioutput_data():
correct_score = est.score(X[test], y[test])
assert_almost_equal(correct_score,
cv_validation_scores[i])
+
+
+def test_grid_search_with_sample_weights():
@jnothman

jnothman Nov 25, 2013

Owner

I think I would prefer a test with a dummy/mock estimator, cv and scorer, to ensure the correct weights are being passed to the correct methods; no more and no less. Also, please test *SearchCV.score and RandomizedSearchCV.fit.
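A rough sketch of the kind of mock-based check being asked for (the class and attribute names are illustrative, not the PR's tests): the estimator records and sanity-checks the weights it receives, so a test can assert that only the training fold's weights reach fit and only the test fold's weights reach score.

    import numpy as np

    class MockWeightedClassifier:
        def __init__(self, dummy=0):
            self.dummy = dummy

        def get_params(self, deep=False):
            return {'dummy': self.dummy}

        def set_params(self, **params):
            self.dummy = params.get('dummy', self.dummy)
            return self

        def fit(self, X, y, sample_weight=None):
            # Remember what we were given so the test can inspect it.
            self.fit_sample_weight_ = sample_weight
            assert sample_weight is None or len(sample_weight) == len(X)
            return self

        def predict(self, X):
            return np.zeros(len(X))

        def score(self, X, y, sample_weight=None):
            assert sample_weight is None or len(sample_weight) == len(X)
            return 1.0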

Owner

jnothman commented Nov 25, 2013

I also think sample weights deserve a mention in the narrative docs on model evaluation?

@jnothman jnothman and 2 others commented on an outdated diff Nov 25, 2013

sklearn/metrics/tests/test_metrics.py
@@ -1999,3 +2009,39 @@ def test_log_loss():
y_pred = np.asarray(y_pred) > .5
loss = log_loss(y_true, y_pred, normalize=True, eps=.1)
assert_almost_equal(loss, log_loss(y_true, np.clip(y_pred, .1, .9)))
+
+
+def test_sample_weight_invariance():
+ """Test weighted metrics"""
+ y1, y2, _ = make_prediction(binary=True)
@jnothman

jnothman Nov 25, 2013

Owner

Please also test the multilabel case for applicable metrics.

@arjoly

arjoly Jan 16, 2014

Owner

Can you also test multioutput metrics? You can identify the metrics that support those two formats using the appropriate constants defined at the beginning of the module.

@ndawe

ndawe Jan 17, 2014

Owner

Yes, I'll add tests for multioutput.

Owner

ndawe commented Nov 29, 2013

@jnothman thanks a lot for these very thorough comments. I will address each one next week.

Owner

jnothman commented Nov 30, 2013

I look forward to it!

Owner

ndawe commented Jan 16, 2014

Sorry for the delay... Rebasing on top of recent changes to grid_search and metrics is getting tedious. I wish sklearn had a more natural way to integrate sample weights from the beginning. Maybe sample weights are a bit of a foreign concept in most applications but sklearn would be nothing more than a toy in my field without support for sample weights. Ideas from others here on the best way forward would be helpful. I'm going to try finishing the rebase and add sample_weight to a few new places in cross_validation but I am worried that this might be all for nothing if the interface continues to change underneath me.

Owner

jnothman commented Jan 16, 2014

Ideas from others here on the best way forward would be helpful.

What concerns you in the way forward? I'm not sure what you're asking for here.


Owner

ndawe commented Jan 16, 2014

Well, maybe there is a better way of passing around the X, y, ... dataset that is more easily extended to contain additional vectors (at least internally, so as not to change the public API). Or maybe sample_weight is really the only additional vector one would need to pass along as part of the dataset and I should just continue to implement it as is done in this PR. In any case it would be helpful to collaborate with other PRs making changes that conflict with this PR. If/when this PR is merged, supporting sample_weight where possible should hopefully become a new standard.

Owner

ndawe commented Jan 16, 2014

I've finished the rebase and am updating this PR.

Owner

ndawe commented Jan 16, 2014

I'm aware of a failing test. Will fix after addressing the comments above.

Coverage Status

Coverage remained the same when pulling 3d84d45 on ndawe:weighted_scores into de2be61 on scikit-learn:master.

Owner

ndawe commented Jan 16, 2014

Tests are green again. @jnothman I will implement the extra tests you suggest above.

Owner

jnothman commented Jan 16, 2014

I note that _fit_and_score, which is now used by both cross_val_score and the grid searchers, has a rudimentary solution for training with sample_weight or similar parameters that are coindexed with X; but it does not pass them to scoring.

Btw, the scorer interface could change again (see #2759), so we should try to merge this before that :)

@arjoly arjoly and 1 other commented on an outdated diff Jan 16, 2014

sklearn/metrics/tests/test_metrics.py
@@ -311,6 +311,16 @@
###############################################################################
# Utilities for testing
+METRICS_WITH_SAMPLE_WEIGHT = {
@arjoly

arjoly Jan 16, 2014

Owner

Please use a list with the appropriate names instead of adding a new dictionnary.

@arjoly

arjoly Jan 17, 2014

Owner

Thanks!

Coverage Status

Coverage remained the same when pulling a1d013b on ndawe:weighted_scores into de2be61 on scikit-learn:master.

Owner

mblondel commented Jan 17, 2014

Or maybe sample_weight is really the only additional vector one would need to pass along as part of the dataset

Sounds unlikely. For example, learning-to-rank methods need a query parameter which has size n_samples and must be selected along with the data when doing grid search.

Well, maybe there is a better way of passing around the X, y, ... dataset that is more easily extended to contain additional vectors (as least internally, as not to change the public API).

I also think that we could introduce a new fit_dataset(dataset) method or something that some estimators would implement. The dataset object would encapsulate X, y, sample_weight/query and would provide a way of retrieving the sliced data.

@ogrisel had some related ideas if I remember correctly.
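A purely hypothetical sketch of such a dataset object (not an existing scikit-learn API): it bundles X, y and any per-sample vectors, and indexing it returns a consistently sliced subset.

    import numpy as np

    class WeightedDataset:
        def __init__(self, X, y, **per_sample):
            self.X = np.asarray(X)
            self.y = np.asarray(y)
            self.per_sample = per_sample  # e.g. sample_weight, query_id

        def __getitem__(self, indices):
            sliced = {k: np.asarray(v)[indices]
                      for k, v in self.per_sample.items()}
            return WeightedDataset(self.X[indices], self.y[indices], **sliced)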

Owner

mblondel commented Jan 17, 2014

Sample weights are also very important for the research project I'm doing right now. For Ridge regression, you can just multiply X and y by np.sqrt(sample_weight), so I'm using this approach right now.
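A small sketch of that reweighting trick on synthetic data; with fit_intercept=False the rescaling is exactly equivalent to weighting each squared residual by sample_weight (the data here is made up for illustration).

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(20, 3)
    y = rng.randn(20)
    sample_weight = rng.rand(20)

    sw = np.sqrt(sample_weight)
    ridge = Ridge(alpha=1.0, fit_intercept=False)
    # Fitting on rescaled data minimizes sum_i w_i * (y_i - x_i beta)^2 + alpha * ||beta||^2
    ridge.fit(X * sw[:, np.newaxis], y * sw)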

Owner

ndawe commented Jan 17, 2014

@mblondel glad to hear sample weights are also important to others.

For Ridge regression, you can just multiply X and y by np.sqrt(sample_weight) so I'm using this approach right now.

Sadly there is no shortcut like this for classification.

Coverage Status

Coverage remained the same when pulling b94e4ec on ndawe:weighted_scores into de2be61 on scikit-learn:master.

Owner

mblondel commented Jan 20, 2014

Or maybe sample_weight is really the only additional vector one would need to pass along as part of the dataset

Or maybe since sample_weight is available in so many estimators, we should start to view it as a first class citizen, just like X and y. Coming up with abstractions is nice but if we don't have a clear vision of future use cases (e.g. learning to rank), there's always a risk that the abstraction fails to encompass some corner cases.

Owner

mblondel commented Feb 24, 2014

@ndawe Out of curiosity, how are sample weights computed in your field? In my case, I can compute them as a function of X and y: sw = func(X, y). So, a constructor parameter which accepts callables would do the job. But I guess this approach may be limiting in other applications.

Owner

ndawe commented Feb 24, 2014

@mblondel that would be very limiting. In general, no, the weights cannot be computed like that. If they can, then you can just use that function to create the sample_weight array. In general the sample weights can depend on many other values that never end up in X or y.

Owner

ndawe commented Feb 24, 2014

In my field, the sample weights are often a product of many weights. Some of the weights in this product are relatively expensive to compute and require many additional fields that we don't keep in the final X. We often compute the weights and reduce our data down to a reasonably-sized X on a worldwide grid of computing centres.

@jnothman jnothman commented on an outdated diff Feb 24, 2014

sklearn/metrics/tests/test_metrics.py
@@ -308,10 +308,50 @@
"macro_recall_score", "log_loss", "hinge_loss"
]
-###############################################################################
-# Utilities for testing
+# Metrics that support weighted samples
+METRICS_WITH_SAMPLE_WEIGHT = [
+ "accuracy_score", "unnormalized_accuracy_score",
+ "zero_one_loss", "unnormalized_zero_one_loss",
+ "precision_score", "average_precision_score",
+ "recall_score",
+ "f1_score", "f2_score", "f0.5_score",
+ "roc_auc_score",
+ "explained_variance_score",
+ "mean_squared_error",
+ "mean_absolute_error",
+ "r2_score",
@jnothman

jnothman Feb 24, 2014

Owner

please include the different weighted variants of P/R/F/ROC/AvgPrec

@jnothman jnothman commented on an outdated diff Feb 24, 2014

sklearn/metrics/tests/test_metrics.py
@@ -2515,3 +2555,82 @@ def test_averaging_multilabel_all_ones():
for name in METRICS_WITH_AVERAGING:
yield (check_averaging, name, y_true, y_true_binarize, y_pred,
y_pred_binarize, y_score)
+
+
+def _sample_weight_invariance(name, metric, y1, y2):
@jnothman

jnothman Feb 24, 2014

Owner

call this check_sample_weight_invariance

@jnothman jnothman and 1 other commented on an outdated diff Feb 24, 2014

sklearn/metrics/tests/test_metrics.py
+ # common factor
+ for scaling in [2, 0.3]:
+ assert_almost_equal(
+ weighted_score,
+ metric(y1, y2, sample_weight=sample_weight * scaling),
+ err_msg="%s sample_weight is not invariant "
+ "under scaling" % name)
+
+def test_sample_weight_invariance():
+ """Test weighted metrics"""
+ # generate some data
+ y1, y2, _ = make_prediction(binary=True)
+
+ for name in METRICS_WITH_SAMPLE_WEIGHT:
+ metric = ALL_METRICS[name]
+ _sample_weight_invariance(name, metric, y1, y2)
@jnothman

jnothman Feb 24, 2014

Owner

Use yield check_sample_weight_invariance, name, metric, y1, y2. It gives clearer output.

@jnothman jnothman commented on an outdated diff Feb 24, 2014

sklearn/metrics/tests/test_metrics.py
+ metric = ALL_METRICS[name]
+ _sample_weight_invariance(name, metric, y1, y2)
+
+ # multilabel
+ n_classes = 4
+ n_samples = 50
+ _, y1_multilabel = make_multilabel_classification(
+ n_features=1, n_classes=n_classes,
+ random_state=0, n_samples=n_samples)
+ _, y2_multilabel = make_multilabel_classification(
+ n_features=1, n_classes=n_classes,
+ random_state=1, n_samples=n_samples)
+
+ for name in MULTILABEL_METRICS_WITH_SAMPLE_WEIGHT:
+ metric = ALL_METRICS[name]
+ _sample_weight_invariance(name, metric, y1_multilabel, y2_multilabel)
@jnothman

jnothman Feb 24, 2014

Owner

yield here

@jnothman jnothman commented on an outdated diff Feb 24, 2014

sklearn/metrics/tests/test_metrics.py
+ random_state=0, n_samples=n_samples)
+ _, y2_multilabel = make_multilabel_classification(
+ n_features=1, n_classes=n_classes,
+ random_state=1, n_samples=n_samples)
+
+ for name in MULTILABEL_METRICS_WITH_SAMPLE_WEIGHT:
+ metric = ALL_METRICS[name]
+ _sample_weight_invariance(name, metric, y1_multilabel, y2_multilabel)
+
+ # multioutput
+ y1_multioutput = np.array([[1, 0, 0, 1], [0, 1, 1, 1], [1, 1, 0, 1]])
+ y2_multioutput = np.array([[0, 0, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1]])
+
+ for name in MULTIOUTPUT_METRICS_WITH_SAMPLE_WEIGHT:
+ metric = ALL_METRICS[name]
+ _sample_weight_invariance(name, metric, y1_multioutput, y2_multioutput)
@jnothman

jnothman Feb 24, 2014

Owner

yield here

@jnothman jnothman and 1 other commented on an outdated diff Feb 25, 2014

sklearn/cross_validation.py
cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
scorer = check_scoring(estimator, score_func=score_func, scoring=scoring)
# We clone the estimator to make sure that all the folds are
# independent, and that it is pickle-able.
parallel = Parallel(n_jobs=n_jobs, verbose=verbose,
pre_dispatch=pre_dispatch)
- scores = parallel(delayed(_fit_and_score)(clone(estimator), X, y, scorer,
- train, test, verbose, None,
- fit_params)
- for train, test in cv)
+ scores = parallel(delayed(_fit_and_score)(
+ clone(estimator), X, y, sample_weight, scorer,
+ train, test, verbose, None,
+ fit_params)
+ for train, test in cv)
@ndawe

ndawe Feb 25, 2014

Owner

It's clean.

@jnothman

jnothman Feb 25, 2014

Owner

oh. I find it hard to read.

Do you object to the following alternative?

    scores = parallel(delayed(_fit_and_score)(clone(estimator), X, y,
                                              sample_weight, scorer,
                                              train, test, verbose, None,
                                              fit_params)
                      for train, test in cv)
@ndawe

ndawe Feb 25, 2014

Owner

Sure, that looks fine. I just like to try to keep things as close to the left margin as possible to avoid constantly battling with the 80 character length limit.

@jnothman jnothman and 1 other commented on an outdated diff Feb 25, 2014

sklearn/cross_validation.py
@@ -1190,15 +1199,28 @@ def _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters,
start_time = time.time()
- X_train, y_train = _safe_split(estimator, X, y, train)
- X_test, y_test = _safe_split(estimator, X, y, test, train)
+ X_train, y_train, sample_weight_train = _safe_split(
+ estimator, X, y, sample_weight, train)
+ X_test, y_test, sample_weight_test = _safe_split(
+ estimator, X, y, sample_weight, test, train)
+
+ test_score_params = dict()
@jnothman

jnothman Feb 25, 2014

Owner

nitpick: could you use {} instead of dict? it is more conventional (and happens to be faster)

@ndawe

ndawe Feb 25, 2014

Owner

Interesting. Sure. I always use {} but saw in other scikit-learn code the use of dict() and list() so I thought that was preferred in scikit-learn.

@jnothman

jnothman Feb 25, 2014

Owner

There's probably some random variation in there...

dict(iterator) or dict(counter) is often the right choice. dict(a=1, b=2) can often look neater than {'a': 1, 'b': 2}. But for an empty dict, I think {} is clearest.


@jnothman jnothman commented on an outdated diff Feb 25, 2014

sklearn/metrics/metrics.py
@@ -132,6 +133,37 @@ def _check_clf_targets(y_true, y_pred):
return y_type, y_true, y_pred
+def average_and_variance(values, sample_weight=None):
@jnothman

jnothman Feb 25, 2014

Owner

make this _average_and_variance

Owner

jnothman commented Feb 25, 2014

This is looking really good!

I know it's a hassle, but testing of methods accepting sample_weight is still a bit deficient: So far for boosting, cross_val_score, and grid search you test that it will accept sample_weight, but don't test that it has the correct effect. It's hard to test for the correct effect, but checking that it has some effect would be worthwhile. And currently there is no test that RFECV, learning_curve and validation_curve even accept the parameter. The scorers are only tested indirectly, which means we've not ensured each type of scorer operates correctly.
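A minimal example of the kind of "has some effect" check meant here (illustrative, not the PR's tests): giving zero weight to one class must change what the fitted model predicts for that class's region.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.array([[0.], [1.], [2.], [3.]])
    y = np.array([0, 0, 1, 1])

    unweighted = DecisionTreeClassifier(random_state=0).fit(X, y)
    weighted = DecisionTreeClassifier(random_state=0).fit(
        X, y, sample_weight=np.array([1., 1., 0., 0.]))

    # With all the weight on class 0, the weighted tree never predicts class 1.
    assert unweighted.predict([[3.]])[0] == 1
    assert weighted.predict([[3.]])[0] == 0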

Coverage Status

Coverage remained the same when pulling b54fda9 on ndawe:weighted_scores into 69b7162 on scikit-learn:master.

Owner

mblondel commented Feb 25, 2014

In the grid search code, would it be conceivable to use a simple convention: that all fit_params that start with the prefix sample_ need to be masked (sliced per fold) before being passed to the estimator being tuned? This would allow handling of sample_weight and sample_group (for learning to rank).
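An illustrative sketch of that proposed convention (hypothetical helper, not merged code): any fit_params entry whose name starts with 'sample_' is sliced to the training indices before being passed on.

    import numpy as np

    def _slice_sample_params(fit_params, train):
        sliced = {}
        for key, value in fit_params.items():
            if key.startswith('sample_'):
                sliced[key] = np.asarray(value)[train]
            else:
                sliced[key] = value
        return sliced

    # e.g. estimator.fit(X[train], y[train],
    #                    **_slice_sample_params({'sample_weight': w,
    #                                            'sample_group': q}, train))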

Owner

jnothman commented Feb 25, 2014

In the grid search code, would it be conceivable to use a simple convention: that all fit_params that start with the prefix sample_ need to be masked (sliced per fold) before being passed to the estimator being tuned? This would allow handling of sample_weight and sample_group (for learning to rank).

The current convention in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L1184 is more general in assuming anything with length n_samples should be masked. The trickier part is differentially routing these to fit, decision_function and the metric.

Coverage Status

Coverage remained the same when pulling 85596d3 on ndawe:weighted_scores into 69b7162 on scikit-learn:master.

Owner

mblondel commented Feb 25, 2014

We can add annotations to scorers in order to query whether they have sample_* support. @ndawe seems to be adding sample weight support to all metrics, but there might be metrics where sample weight is ill-defined (I'm not sure), and sample group will definitely not be available in all metrics.

Owner

jnothman commented Feb 25, 2014

I don't think it's that simple. This PR doesn't add sample_weight for all metrics, afaik. What happens when one release has a metric without sample_weight support, and the next supports it? Does the score then change from version to version?


Owner

ndawe commented Feb 25, 2014

Well, if a score gains sample weight support, the score value without weights specified will remain the same, but the user should expect the score values to change when he/she specifies sample weights.

Owner

ndawe commented Feb 25, 2014

I'll add more test coverage.

Coverage Status

Coverage remained the same when pulling 6cf1291 on ndawe:weighted_scores into 69b7162 on scikit-learn:master.

Owner

mblondel commented Feb 26, 2014

We should just make sure that all metrics for which it's possible implement sample_weight before next release.

Owner

jnothman commented Feb 26, 2014

I think it should be possible (but not necessarily efficient) for all metrics, because if the sample weights are scaled to be integer, the metric should be equivalent to repeating those samples.
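A concrete check of that equivalence, using accuracy_score (one of the metrics gaining sample_weight in this PR): integer weights behave like repeating the corresponding samples.

    import numpy as np
    from sklearn.metrics import accuracy_score

    y_true = np.array([0, 1, 1, 0])
    y_pred = np.array([0, 1, 0, 0])
    w = np.array([1, 2, 3, 1])

    weighted = accuracy_score(y_true, y_pred, sample_weight=w)
    repeated = accuracy_score(np.repeat(y_true, w), np.repeat(y_pred, w))
    assert np.isclose(weighted, repeated)  # both equal 4/7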


Owner

mblondel commented Feb 26, 2014

Assuming the metric is linear w.r.t. n_samples (e.g., a sum of sample-wise losses).
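Concretely, for a metric that decomposes into per-sample losses, the weighted form implied by the repetition argument can be written as (generic notation, not taken from the PR):

    M_w(y, \hat{y}) = \frac{\sum_i w_i \, \ell(y_i, \hat{y}_i)}{\sum_i w_i}

With integer w_i this reduces to the unweighted metric computed on the samples repeated w_i times, which is exactly the invariance being tested.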

Owner

mblondel commented Feb 26, 2014

This should of course be done in another PR but we should really break down metrics.py and test_metrics.py into several parts.

Owner

jnothman commented Mar 17, 2014

@ndawe, your last comment was "I'll add more test coverage." I now can't recall what tests those might be, but please let me know when we want to push through with this.

Owner

ndawe commented Mar 19, 2014

I am adding tests that address your earlier comment:

I know it's a hassle, but testing of methods accepting sample_weight is still a bit deficient: So far for boosting, cross_val_score, and grid search you test that it will accept sample_weight, but don't test that it has the correct effect. It's hard to test for the correct effect, but checking that it has some effect would be worthwhile. And currently there is no test that RFECV, learning_curve and validation_curve even accept the parameter. The scorers are only tested indirectly, which means we've not ensured each type of scorer operates correctly.

Just working on it in time between other important tasks... It's coming though.

Owner

jnothman commented Mar 19, 2014

Ah. Those tests.


Owner

arjoly commented Mar 19, 2014

I have been staring at this very interesting pull request for a while. I am asking myself: "would it be easier to get it reviewed and merged if all contributions were divided into smaller and meaningful pull requests?"

Thank you @ndawe for working on this subject.

Owner

ndawe commented Mar 20, 2014

@arjoly I see your point, but in some sense this is "all or nothing" in a few places. Support for grid searching with sample weights could be a separate PR, yes. I'll try my best to finish this up ASAP.

Owner

jnothman commented Mar 20, 2014

Support for metric weights could be merged as is. If that's the path you want to take, I give it a big +1.

And the rest is fine, it just needs testing.


Owner

jnothman commented Mar 20, 2014

(I guess the problem is that you can change the metrics, but the scoring interface changes make more sense with the CV changes.)

Owner

arjoly commented Apr 1, 2014

@arjoly I see your point, but in some sense this is "all or nothing" in a few places. Support for grid searching with sample weights could be a separate PR, yes. I'll try my best to finish this up ASAP.

Would it make sense to follow this order:

  1. Metrics
  2. scorer interface + score in the base class

Then you could treat these modules separately, in any order:

  • cross_val_score
  • grid search
  • rfe
  • learning_curve
Owner

jnothman commented Apr 1, 2014

Well if you support the metrics changes, you can merge them if you want, @arjoly (they have my +1). I'm ambivalent either way. The pain is working out a sensible testing regime.

Coverage Status

Coverage remained the same when pulling 9696ecb on ndawe:weighted_scores into 070d7a6 on scikit-learn:master.

Owner

ndawe commented Apr 6, 2014

@jnothman @arjoly I am opening up a new PR with only the metrics changes and then the other components will follow as you suggested.

@ndawe ndawe changed the title from [MRG] sample_weight support to [WIP] sample_weight support Apr 9, 2014

Coverage Status

Changes unknown when pulling 0ce2180 on ndawe:weighted_scores into scikit-learn:master.

Owner

ndawe commented Apr 22, 2014

Cleaned up the history. Will now start a PR for sample_weight in the scorer interface.

Coverage Status

Coverage remained the same when pulling bf4cd4c on ndawe:weighted_scores into d62971d on scikit-learn:master.

Owner

vene commented Jul 16, 2014

The grid_search, rfe, learning_curve and cross_validation part of this PR should wait for #3340, which will conflict in many ways.

Owner

ndawe commented Jul 16, 2014

Thanks @vene. Sounds good. I've updated the description with a link to your #3401

Owner

vene commented Jul 16, 2014

Thanks Noel! I was thinking that the refactoring in #3340 might be a bit too big (and therefore slow to get in) while the remaining part of this PR are much more manageable, so I'd rather try to get #3340 broken into two and merge this faster. IMO these are important usability issues. We'll take this up tomorrow at the sprint. Ping @amueller, @arjoly and @GaelVaroquaux

Owner

vene commented Aug 1, 2014

In the grid search code, would it be conceivable to use a simple convention: that all fit_params that start with the prefix sample_ need to be masked (sliced per fold) before being passed to the estimator being tuned? This would allow handling of sample_weight and sample_group (for learning to rank).

The current convention in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L1184 is more general in assuming anything with length n_samples should be masked. The trickier part is differentially routing these to fit, decision_function and the metric.

@jnothman, @ndawe
Actually there's no question of routing: the convention only applies to the fit_params, which naturally should only be sent to fit. We would have a separate scorer_params or something: one for every function of an estimator that _fit_and_score can call.

I really need this right now because of sample groups (ie query_id).

I would also prefer the convention of giving them names that start with sample: sample_weights, sample_groups, etc. It's more explicit and less bug-prone (what if I make an off-by-one error when building the sample_weights vector?)
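A hypothetical sketch of that routing idea (the scorer_params name and this helper are assumptions, not merged scikit-learn API): per-sample entries in each dict are sliced to the fold they apply to; everything else is passed through untouched.

    import numpy as np

    def _route_params(params, indices, n_samples):
        routed = {}
        for key, value in (params or {}).items():
            if hasattr(value, '__len__') and len(value) == n_samples:
                routed[key] = np.asarray(value)[indices]
            else:
                routed[key] = value
        return routed

    # estimator.fit(X[train], y[train],
    #               **_route_params(fit_params, train, len(X)))
    # scorer(estimator, X[test], y[test],
    #        **_route_params(scorer_params, test, len(X)))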
