option to return an array in metrics if multi-output #2200

Closed
mblondel opened this Issue Jul 24, 2013 · 22 comments

5 participants

@mblondel
scikit-learn member

Thanks to the work of @arjoly, regression metrics now support multiple outputs (2d Y). Currently, the metrics return a scalar. It would be nice to have an option to return an array of size n_outputs.

@mblondel
scikit-learn member

Also, multiple outputs are currently handled by flattening the 2d arrays and viewing them as 1d arrays. This corresponds to micro averaging. For my application, I would prefer macro averaging (averaging over outputs). For example, for the R^2 score that would be: np.mean([r2_score(Y_true[:, k], Y_pred[:, k]) for k in xrange(Y_true.shape[1])])
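A runnable version of that macro average, as a minimal sketch with toy data (using range instead of xrange for Python 3):

    import numpy as np
    from sklearn.metrics import r2_score

    # Macro average: score each output column separately, then average.
    Y_true = np.array([[3.0, -0.5], [2.0, 0.0], [7.0, 2.0]])
    Y_pred = np.array([[2.5, 0.0], [2.0, -0.5], [8.0, 2.5]])
    macro_r2 = np.mean([r2_score(Y_true[:, k], Y_pred[:, k])
                        for k in range(Y_true.shape[1])])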

CC @ogrisel

@mblondel
scikit-learn member

I'm not even sure micro averaging makes sense at all here. To be discussed...

@ogrisel
scikit-learn member
ogrisel commented Jul 24, 2013

Sounds like a reasonable request, although I don't have any practical experience with scoring multi-target / multi-output regression models myself.

@arjoly
scikit-learn member
arjoly commented Jul 25, 2013

This makes sense, as would weighting the outputs.

@MechCoder
scikit-learn member

@ogrisel, @mblondel Hi, I would like to work on this issue. I just skimmed through the metrics. Are you referring to something like this in the r2_score implementation?

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
r2_score(y_true, y_pred)
0.938                         # what it currently returns
[0.96551724, 0.91588785]      # what it should return instead?

Do correct me if I'm wrong.

@mblondel
scikit-learn member
mblondel commented Oct 3, 2013

@Manoj-Kumar-S Yep, exactly. Thanks!

@mblondel
scikit-learn member
mblondel commented Oct 3, 2013

It should be an option though (e.g., multi_output=True).

@MechCoder
scikit-learn member

Great. I'm on it. I'll hopefully come up with a PR in 2-3 days.

@arjoly
scikit-learn member
arjoly commented Oct 3, 2013

It should be an option though (e.g., multi_output=True).

I would use the keyword average to be consistent with the rest of the metrics.

@mblondel
scikit-learn member
mblondel commented Oct 7, 2013

average='micro'|'macro'|None|False

Currently only micro average (average over the samples) is implemented but IMO macro average (average over classes) makes more sense.

@arjoly
scikit-learn member
arjoly commented Oct 10, 2013

Also, multiple outputs are currently handled by flattening the 2d arrays and viewing them as 1d arrays. This corresponds to micro averaging. For my application, I would prefer macro averaging (averaging over outputs). For example, for the R^2 score that would be: np.mean([r2_score(Y_true[:, k], Y_pred[:, k]) for k in xrange(Y_true.shape[1])])

Is it really a micro-averaged r2 score? Here is a small experiment:

In [1]: import numpy as np
In [2]: y_true = np.random.rand(5, 3)
In [3]: y_pred = np.random.rand(5, 3)
In [4]: from sklearn.metrics import r2_score

# Current multi-output r2_score
In [5]: r2_score(y_true, y_pred)
Out[5]: -1.2018060998146924

# This would be the micro-r2 score
In [6]: r2_score(y_true.ravel(), y_pred.ravel())
Out[6]: -1.1395845816752996

In [7]: from sklearn.metrics import explained_variance_score

# Check that it's equal to r2_score in this case
In [8]: explained_variance_score(y_true.ravel(), y_pred.ravel()) 
Out[8]: -1.132385768714816

# r2-score with no averaging
In [9]: r2 = [r2_score(y_true[:, i], y_pred[:, i]) for i in range(y_true.shape[1])] 
In [10]: r2
Out[10]: [-1.0513131617660676, -1.2263410810199482, -1.2582117503263115]

# This would be the macro-r2 score
In [11]: np.mean(r2) 
Out[11]: -1.178621997704109

# For reproducibility
In [12]: y_true
Out[12]: 
array([[ 0.28481499,  0.34159449,  0.89364091],
       [ 0.08516499,  0.24426185,  0.58491767],
       [ 0.65374035,  0.78358486,  0.84892285],
       [ 0.12355558,  0.32354626,  0.02966046],
       [ 0.65858239,  0.59705347,  0.00573082]])

In [13]: y_pred
Out[13]: 
array([[ 0.32639174,  0.87657742,  0.23203866],
       [ 0.66826156,  0.06449232,  0.21180403],
       [ 0.19938095,  0.65445628,  0.13731781],
       [ 0.19451816,  0.10242323,  0.50932089],
       [ 0.95501124,  0.33805111,  0.61441609]])

@mblondel
scikit-learn member

Interesting... How is the returned value computed for In [5] then?

@arjoly
scikit-learn member
arjoly commented Oct 10, 2013

The denominator is computed differently.

    numerator = ((y_true - y_pred) ** 2).sum(dtype=np.float64)
    denominator = ((y_true - y_true.mean(axis=0)) ** 2).sum(dtype=np.float64)
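Putting those two lines together, here is a minimal sketch of the current multi-output computation (it reproduces Out[5] above):

    import numpy as np

    def r2_current(y_true, y_pred):
        # Denominator is centered per output (axis=0), but both sums are
        # global: neither the flattened (micro) nor the macro variant.
        numerator = ((y_true - y_pred) ** 2).sum(dtype=np.float64)
        denominator = ((y_true - y_true.mean(axis=0)) ** 2).sum(dtype=np.float64)
        return 1 - numerator / denominator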

@mblondel
scikit-learn member

I think that the above could be called a micro average in the sense that you compute the 2d array ((y_true - y_true.mean(axis=0)) ** 2) and then sum over it with axis=None. But this is indeed different from flattening the entire array.

But I'm starting to think that we should only support macro average, i.e., the average of the per-output scores.

@MechCoder
scikit-learn member

I am finally making sense of this discussion here.

Macro averaging is the same as doing np.mean(r2_score(array, average=None)) in my branch.
I'm a bit confused about micro averaging, though: does it mean you flatten the 2-D array into a 1-D array and then perform the calculation?
I think doing

    denominator = ((y_true - y_true.mean()) ** 2).sum(dtype=np.float64)

would do the trick, right? It is equivalent to flattening the array into a 1-D array (quick check below).
Is there any textbook definition for micro averaging?
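A quick check of that equivalence (a sketch; flattening leaves the numerator unchanged, so only the denominator matters):

    import numpy as np
    from sklearn.metrics import r2_score

    rng = np.random.RandomState(0)
    y_true = rng.rand(5, 3)
    y_pred = rng.rand(5, 3)

    # Global mean in the denominator == r2 on the flattened arrays.
    numerator = ((y_true - y_pred) ** 2).sum(dtype=np.float64)
    denominator = ((y_true - y_true.mean()) ** 2).sum(dtype=np.float64)
    assert np.isclose(1 - numerator / denominator,
                      r2_score(y_true.ravel(), y_pred.ravel()))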

@MechCoder
scikit-learn member

And what do you think would be the best thing to do in my PR right now? Just implement average=None for the multi-output case, and average="macro", which corresponds to the mean of the average=None case?

@mblondel
scikit-learn member

The concepts of micro and macro averages arise when computing metrics originally designed for binary classification (e.g., precision, recall) in the multiclass case: micro = average over instances, macro = average over classes.

Here, I think that macro average makes the most sense (average over outputs). "micro" average seems a bit ambiguous and ill-defined.

@MechCoder
scikit-learn member

Got it. So the best thing to do now, would be just to have a None and macro case?

@mblondel
scikit-learn member

Let's wait for other people's opinion. The macro case can be implemented recursively (e.g., by calling np.mean(r2_score(..., average=None)) inside r2_score). I don't think there's much to gain by vectorizing the operations.
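A sketch of that recursive pattern (the wrapper name and the average keyword reflect the proposal above, not an existing API):

    import numpy as np
    from sklearn.metrics import r2_score

    def r2_score_multioutput(y_true, y_pred, average='macro'):
        # Hypothetical sketch of the proposed option, not the final code.
        scores = np.array([r2_score(y_true[:, k], y_pred[:, k])
                           for k in range(y_true.shape[1])])
        if average is None:
            return scores          # one score per output
        if average == 'macro':
            return scores.mean()   # mean of the per-output scores
        raise ValueError("unsupported average: %r" % average)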

@arjoly
scikit-learn member
arjoly commented Jul 20, 2014

So as not to lose the discussion in #2493:

During the sprint, we (@eickenberg, @MechCoder, and I) discussed the blocking points of this pull request. It turns out that the difference between macro-averaging and the current implementation can be resolved by using output_weights properly.

The macro-r2 / macro-explained-variance scores correspond to a uniform output_weight (= 1 / n_outputs), while the current version uses an output_weight proportional to the fraction of variance explained by each output.

Thus we decided to keep both versions. I am also fine with changing the default to macro.
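As a sketch of that unification (output_weights as named above; uniform weights give the macro score, and weights proportional to each output's variance term recover the current global-sum score):

    import numpy as np

    def r2_weighted(y_true, y_pred, output_weights=None):
        # Per-output r2 combined with output weights (illustrative sketch).
        num = ((y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
        den = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0, dtype=np.float64)
        scores = 1 - num / den
        if output_weights is None:   # uniform weights -> macro average
            return scores.mean()
        return np.average(scores, weights=output_weights)

    # With weights proportional to den: sum(den * scores) / sum(den)
    # = 1 - sum(num) / sum(den), i.e. the current multi-output score.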

@amueller
scikit-learn member
amueller commented May 8, 2015

Closed by #4491, right?

@MechCoder
scikit-learn member

Yes, indeed.
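For reference, the option discussed in this thread is exposed in scikit-learn through the multioutput parameter of the regression metrics; a minimal usage example (return values omitted):

    from sklearn.metrics import r2_score

    y_true = [[0.5, 1], [-1, 1], [7, -6]]
    y_pred = [[0, 2], [-1, 2], [8, -5]]

    r2_score(y_true, y_pred, multioutput='raw_values')         # one score per output
    r2_score(y_true, y_pred, multioutput='uniform_average')    # macro-style average
    r2_score(y_true, y_pred, multioutput='variance_weighted')  # the weighting above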

@MechCoder MechCoder closed this May 8, 2015