ENH annotate metrics to simplify populating SCORERS #1774

Closed

Conversation

Owner

jnothman commented Mar 14, 2013

This patch makes needs_threshold and greater_is_better attributes of each metric, rather than providing them when a Scorer is constructed from them. I am not sure these are the most appropriate names, but their use means the information isn't duplicated in scorer.py. (It would also allow composite Scorers to be constructed easily, assuming their support in #1768 is granted.)
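
To illustrate the idea (names here are only a sketch, not necessarily what the patch uses), the annotation and the resulting SCORERS population might look roughly like:

from sklearn.metrics import f1_score, hinge_loss, mean_squared_error


def annotate(greater_is_better=True, needs_threshold=False):
    """Attach scorer-relevant flags directly to a metric function."""
    def decorate(metric):
        metric.greater_is_better = greater_is_better
        metric.needs_threshold = needs_threshold
        return metric
    return decorate

annotate()(f1_score)                                   # score computed on predict() output
annotate(greater_is_better=False)(mean_squared_error)  # loss computed on predict() output
annotate(greater_is_better=False, needs_threshold=True)(hinge_loss)  # loss needing decision values

# scorer.py then only needs the metric itself; the flags are not repeated there.
# (Entries shown as plain dicts for illustration; the real registry would hold
# Scorer objects built from the annotated metrics.)
def scorer_entry(metric):
    return dict(score_func=metric,
                greater_is_better=getattr(metric, 'greater_is_better', True),
                needs_threshold=getattr(metric, 'needs_threshold', False))

SCORERS = dict(f1=scorer_entry(f1_score),
               mse=scorer_entry(mean_squared_error),
               hinge=scorer_entry(hinge_loss))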

Perhaps there should be other machine-readable annotations on our library of metrics:

  • lower and upper bound
  • is_symmetric
  • multiclass/binary/regression

but their application is not clear to me.

Owner

amueller commented Mar 14, 2013

I actually implemented exactly this and we decided against it.

Owner

amueller commented Mar 14, 2013

I am not entirely sure why you say the information is not duplicated. Why is it duplicated now? It is now stored in the scorer file, whereas you store it on the scoring functions.

Anyhow, I actually do like this approach, but it was voted against.

Owner

jnothman commented Mar 14, 2013

Oh, well. That's annoying. Can you point me to the reasons against?

I don't mean not duplicated. I mean not in multiple places.

Assuming #1768, note that with this patch we can do something like the following:

# Imports assume the names exported by sklearn.metrics at the time of this PR.
import six  # Python 2/3 compatibility (dict iteration below)

from sklearn.metrics import (Scorer, accuracy_score, confusion_matrix, f1_score,
                             hinge_loss, precision_recall_curve, precision_score,
                             recall_score, roc_curve)


class CompositeScorer(Scorer):
    def __init__(self, objective, **other_metrics):
        super(CompositeScorer, self).__init__(objective)
        # Group metrics into two buckets by their needs_threshold annotation:
        # index 0 -> works on predict() output, index 1 -> needs continuous scores.
        self.metrics = ([], [])
        other_metrics['score'] = objective
        for name, metric in six.iteritems(other_metrics):
            # _get_annotation reads the annotation off the metric, with a default.
            needs_threshold = self._get_annotation(metric, 'needs_threshold', False)
            self.metrics[int(needs_threshold)].append((name, metric))

    def calc_scores(self, estimator, X, y=None):
        for needs_threshold, metrics in enumerate(self.metrics):
            if not metrics:
                continue
            if needs_threshold:
                # Prefer decision_function; fall back to the positive-class probability.
                try:
                    y_pred = estimator.decision_function(X).ravel()
                except (NotImplementedError, AttributeError):
                    y_pred = estimator.predict_proba(X)[:, 1]
            else:
                y_pred = estimator.predict(X)
            for name, metric in metrics:
                yield name, metric(y, y_pred)


binary_debug_scorer = CompositeScorer(f1_score,
    precision=precision_score, recall=recall_score,
    accuracy=accuracy_score, hinge_loss=hinge_loss,
    pr_curve=precision_recall_curve, roc_curve=roc_curve,
    confusion=confusion_matrix)
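
Hypothetical usage, assuming both this patch's annotations and the calc_scores interface from #1768:

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=0)
clf = LinearSVC().fit(X, y)

# Each constituent metric is reported by name; predict and decision_function
# are each called at most once.
for name, value in binary_debug_scorer.calc_scores(clf, X, y):
    print(name, value)
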
Owner

amueller commented Mar 14, 2013

Hum, I never really looked into more than one score actually ;)

Owner

larsmans commented Mar 14, 2013

@amueller Where was the vote? I like this approach better than the current one too.

Owner

amueller commented Mar 14, 2013

In the three pull requests I sent in and the massive thread on the ML.
http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K4aU7Dj-%3De_jDxLNUAoByNNvO9caOAs%2BYQ8K65ctGceGg%40mail.gmail.com&forum_name=scikit-learn-general

#1381 #1198 #1014

Actually, the discussion there was about whether we should let the score functions know how they can be applied.
The decision not to do that was the reason to introduce the scorer objects.
If we do annotate them, I'm not entirely sure we need the scorer objects at all?
Well, some of the code that is now in the scorer objects would then live in GridSearchCV and cross_val_score.

I worked on this for quite a while and had endless discussions to get any working solution into sklearn, and I am a bit fed up with the issue. The current state is probably not the perfect final solution, though.
Talk to @GaelVaroquaux.

Owner

larsmans commented Mar 14, 2013

Alright. Shall we close this PR?

Owner

amueller commented Mar 14, 2013

Not sure ;)

Owner

jnothman commented Mar 14, 2013

Right, so this comes down to "what is Scorer for?". Currently it:

  • transforms a function over estimator output into one over estimator input, using needs_threshold to decide how;
  • notes whether the metric is a loss/score.

With this patch it still performs the first function, but is not necessary for the second.

With #1768 it also:

  • may output multiple scores by name (without necessarily having to call predict multiple times).

Things it does not do:

  • know how to aggregate its output across folds (*SearchCV assumes a sum, or an arithmetic mean weighted by the number of samples when iid=True)
  • decide whether one score is better than another (despite having the facility to do so in greater_is_better).

So, what's a Scorer for??

Owner

amueller commented Mar 14, 2013

The idea behind the scorer was that it can do more than just threshold.
@GaelVaroquaux didn't like the idea of building scoring objects just around classification, though this is the main area they are used.
The scorer interface makes it possible to build all kinds of estimator-specific scores, for example taking AIC, BIC, model complexity or the margin into account. These are not my ideas, so I can't really speak for them.

Have you read the other pull requests and the thread?
My first solution was basically to add a "needs threshold" decorator to the loss. That would have solved my problem.
@GaelVaroquaux (and maybe also @ogrisel?) said this was too classification-specific.

Now you say you want multiple scores. I don't really see the use case for that, I must admit (I'd actually like to have less and more aggregated views of the results, not more verbose ones). But if you can implement what you want with the scorer interface, that is nice :)
So maybe this is a useful abstraction after all.

Btw, if you want to write less for the CompositeScorer, you can just use the Scorers instead of the functions and all will be good ;)

Owner

jnothman commented Mar 15, 2013

The scorer interface makes it possible to build all kinds of estimator-specific scores

Yes of course it does... so it makes the metric not just a function of the input to the estimator, but of the estimator itself.

No, I haven't read up on the history yet. It seems the problem comes down to the domain of predict sometimes being integer classes, and sometimes float regression values. Perhaps the exceptional case is a metric that is categorical, not one that needs_threshold. I.e. if the metric is marked categorical, the Scorer should use predict; otherwise it tries decision_function, predict_proba and predict...
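
Roughly (purely illustrative, with a hypothetical categorical annotation):

def get_prediction(estimator, X, metric):
    """Pick the estimator output appropriate for the metric's annotation."""
    if getattr(metric, 'categorical', False):
        return estimator.predict(X)
    for method_name in ('decision_function', 'predict_proba', 'predict'):
        method = getattr(estimator, method_name, None)
        if method is None:
            continue
        try:
            output = method(X)
        except NotImplementedError:
            continue
        if method_name == 'decision_function':
            return output.ravel()
        if method_name == 'predict_proba':
            return output[:, 1]  # probability of the positive class
        return output
    raise ValueError("estimator has no usable prediction method")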

With the calc_scores idea I suggested in #1768, it can output multiple scores, which I think is at least useful for something like GridSearchCV to keep the objective and other diagnostic evaluations, but discard the models built along the way.

Yes, I considered using Scorers in the CompositeScorer, but assuming we're using something like calc_scores, it's trickier to provide an API to rename the scores output from the constituent scorers' calc_scores. And if the Scorer is already composite, introspecting to use the minimal number of calls to predict or decision_function is messy. So composing a scorer out of metrics makes some more sense, except that it doesn't handle score functions that depend on the estimator...

Owner

jnothman commented Mar 15, 2013

Or perhaps the metrics should be annotated directly with something like:

@applies_to('predict')
@applies_to('decision_function', 'predict_proba')

:s
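
Such a decorator could be as simple as (again, just a sketch):

def applies_to(*methods):
    """Record which estimator methods produce output this metric accepts."""
    def decorate(metric):
        metric.applies_to = methods
        return metric
    return decorate


@applies_to('decision_function', 'predict_proba')
def some_threshold_metric(y_true, y_score):
    """Placeholder for a metric that consumes continuous scores."""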

For the moment I am persuaded that this PR can be set aside until we decide whether we're interested in composite scorers as in #1768.

Owner

arjoly commented Jul 25, 2013

I think that this PR can be closed given the excellent work of many people (see #2123).

@arjoly closed this Jul 25, 2013
