Add Additional Forecasting Performance Metrics #672
Conversation
Thanks again @RNKuhns - happy to update the scikit-learn dependency so that we can reuse as much of their code as possible. Regarding unit testing, I suggest having tests for the input and output types, among other things.
Let's rename `y_test` to `y_true`.
The testing and the switch from y_test to y_true both sound good to me. On the scikit-learn dependency, last I checked the Anaconda main channel was still on scikit-learn version 0.23.2. To make things easiest on people using conda, it might be worth waiting to enforce that dependency until they roll the Anaconda main channel to 0.24. I've had some stuff pop up that got me a bit behind on this, but I expect to have time to work on it on Friday.
The y_test to y_true naming change is done and I'm wrapping up the tests. @mloning I've got the test to verify the output values completed, and I'm wrapping up the tests to compare the function outputs to the class outputs, but I've got a follow-up question on the first test you suggested (input and output types). I get how to check the output types, but what should I do to check the input types in terms of unit testing (I'm already checking types in the functions themselves)?
@RNKuhns for input types, I'd just check if appropriate errors are raised, e.g. if two series of unequal length or of the wrong types (e.g. lists instead of pandas Series) are passed; you can use pytest.raises for that.
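For what it's worth, here is a minimal sketch of that kind of input-validation test; the import path and the exact exception types are assumptions, not the final API:

```python
import numpy as np
import pandas as pd
import pytest

from sktime.performance_metrics.forecasting import mean_absolute_percentage_error


def test_raises_on_unequal_length():
    # Series of different lengths should be rejected by the input checks
    y_true = pd.Series(np.arange(1, 11, dtype=float))
    y_pred = pd.Series(np.arange(1, 6, dtype=float))
    with pytest.raises(ValueError):
        mean_absolute_percentage_error(y_true, y_pred)


def test_raises_on_wrong_input_type():
    # Plain lists instead of pandas Series should raise a validation error
    y_true = [1.0, 2.0, 3.0]
    y_pred = [1.1, 1.9, 3.2]
    with pytest.raises((TypeError, ValueError)):
        mean_absolute_percentage_error(y_true, y_pred)
```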
Other minor tweaks to forecasting functions that were identified as part of unit testing.
Fixed __init__ in all classes.
Verification applies to relative error functions.
Since the naming convention for the forecasting performance metrics was changed, the old references to things like sMAPE and smape_loss had to be updated.
@mloning I believe I've got this pretty close to finished. I've added the discussed unit tests, as well as a few more types of unit tests, and they are passing locally. Eventually it would be good to add unit tests for the multivariate functionality of the functions and also for the ability to provide forecast horizon weights to return weighted metrics. I'm using the same coding approach as sklearn for both, but I haven't written unit tests for that functionality yet (by default it isn't used). Is it fine to leave those for later, or do you want that functionality tested now too? I've got the tests in place for each of the new functions (and I'm testing equality with the performance metric classes), except for the two whose output depends on user choices (e.g. the loss function to use in relative loss, and in mean_asymmetric_error the asymmetric threshold and the functions to use on either side of that threshold). I'll add these to the test cases for one specific set of choices for the other arguments to each function. Note that I also haven't created a class corresponding to either of these functions; if a user wanted to use them, they'd need to use the function form directly. Can you let me know whether not having corresponding classes for these two functions, and only testing one specific set of arguments for each, will work? I've also gone through and changed any references to the old naming conventions (e.g. sMAPE, smape_loss, etc.) to the new naming conventions (e.g. SymmetricMeanAbsolutePercentageError and symmetric_mean_absolute_percentage_error). Note that I still need to finalize the docstrings for each of the new forecasting loss functions (they are close, but not finished yet).
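For illustration, the function-vs-class comparison test could look roughly like the sketch below, assuming the class ends up taking the symmetric flag in its constructor (as discussed later in this thread) and that both live in sktime.performance_metrics.forecasting:

```python
import numpy as np
import pandas as pd
import pytest

from sktime.performance_metrics.forecasting import (
    MeanAbsolutePercentageError,
    mean_absolute_percentage_error,
)


@pytest.mark.parametrize("symmetric", [True, False])
def test_function_and_class_outputs_agree(symmetric):
    # The callable class should return exactly what the underlying function returns
    rng = np.random.default_rng(42)
    y_true = pd.Series(rng.uniform(1, 10, size=20))
    y_pred = pd.Series(rng.uniform(1, 10, size=20))

    expected = mean_absolute_percentage_error(y_true, y_pred, symmetric=symmetric)
    metric = MeanAbsolutePercentageError(symmetric=symmetric)
    assert metric(y_true, y_pred) == pytest.approx(expected)
```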
sktime/forecasting/all/__init__.py (outdated)

```python
    MedianAbsoluteError,
    MedianSquaredError,
    RootMedianSquaredError,
    SymmetricMeanAbsolutePercentageError,
```
Ah, I see the problem with having abbreviations (sMAPE may refer to mean or median). Is there any other way to handle this? Or could we still include the abbreviations even if we just redefine them, e.g. `sMAPE = SymmetricMeanAbsolutePercentageError`, but duplication also isn't great ...
Hey @mloning can you elaborate on your comment?
I thought about using the abbreviations sMAPE and sMdAPE, but I think we should match the metric naming style of scikit-learn's regression metrics. Scikit-learn has functions that calculate MSE, MAPE and MdAE, named mean_squared_error, mean_absolute_percentage_error and median_absolute_error.
Also, in terms of duplication, do you mean having separate functions for the median and mean metrics, or something else?
I actually coded a `symmetric=False` default argument into the `mean_absolute_percentage_error` function and then have `symmetric_mean_absolute_percentage_error` as a wrapper around that implementation. I could go either way in terms of whether we exclude the `symmetric_mean_absolute_percentage_error` function and just let users toggle the `symmetric` argument in `mean_absolute_percentage_error` based on their preference.
In terms of the separate functions for mean and median, I considered whether to include an `aggregation_function` parameter that lets people choose {'max', 'mean', 'median'}. That would reduce the number of functions, but it would lead to more ambiguous names (we'd have to name the function something like `root_squared_error` and then let people choose whether they want the mean, median or max version). It would also move away from names that match the scikit-learn regression metric naming style.
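To make the wrapper relationship concrete, here is a simplified sketch of what is described above; it ignores horizon weights, multioutput handling, and input checks, so it is not the PR's actual implementation:

```python
import numpy as np


def mean_absolute_percentage_error(y_true, y_pred, symmetric=False):
    """Simplified MAPE with an optional symmetric (sMAPE) variant."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if symmetric:
        # sMAPE denominator averages the magnitudes of actuals and predictions
        denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    else:
        denominator = np.abs(y_true)
    return np.mean(np.abs(y_true - y_pred) / denominator)


def symmetric_mean_absolute_percentage_error(y_true, y_pred):
    """Thin wrapper that just toggles the symmetric flag."""
    return mean_absolute_percentage_error(y_true, y_pred, symmetric=True)
```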
Okay @RNKuhns, I'm happy with using the long names; it makes things more explicit.
We could simply add shorthands for common loss functions, e.g. `smape = symmetric_mean_absolute_percentage_error`, so that users can also import `smape` - but I'm not sure if the benefit of having shorter function names for common errors outweighs the disadvantage of introducing two ways of doing the same thing.
I agree, using arguments like {'max', 'mean', 'median'} isn't very intuitive.
There's one question which I'm not sure about: how to handle
We can already see in the unit tests that option 2 requires us to always make a case distinction. A similar distinction will be required in all functionality related to model evaluation. The loss classes already have a few attributes associated with them (e.g. greater_is_better).
@mloning I think I'm close to wrapping up the changes. Like we talked about, I am using a `symmetric` keyword argument. The last thing I'm trying to figure out is whether we should still have separate classes for the symmetric metrics. This is probably best illustrated with an example. With these changes the API for `mean_absolute_percentage_error` is:

```python
def mean_absolute_percentage_error(
    y_true, y_pred, horizon_weight=None, multioutput="uniform_average",
    symmetric=False
):
    ...
```

One option would be to create a single class as an object-oriented counterpart to the above function, which would look something like:

```python
class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super(MeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )
```

where the callability is inherited from the base class:

```python
class MetricFunctionWrapper:
    def __init__(self, fn, name=None, greater_is_better=False):
        self.fn = fn
        self.name = name if name is not None else fn.__name__
        self.greater_is_better = greater_is_better

    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, **kwargs)
```

To return the symmetric version of MeanAbsolutePercentageError, the call would have to be `MeanAbsolutePercentageError()(y_true, y_pred, symmetric=True)`, but that seems like it would be awkward for tuning/evaluation (or am I overthinking it?). The other two options I've thought of are:

```python
# Option 1: Have separate classes for symmetric and regular mean absolute percentage error
# Specify symmetric=True in __call__ defined within the class
class SymmetricMeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self):
        name = "SymmetricMeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super(SymmetricMeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )

    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=True, **kwargs)


# This would let a user just use the SymmetricMeanAbsolutePercentageError class and not think about it
smape = SymmetricMeanAbsolutePercentageError()
smape(y_true, y_pred)
```

```python
# Option 2: Have __init__ accept a symmetric kwarg and store it in a self._symmetric attribute
# Specify symmetric=self._symmetric in __call__ defined within the class
class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self, symmetric=False):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        self._symmetric = symmetric
        super(MeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )

    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=self._symmetric, **kwargs)


# A user would need to choose a value for the symmetric arg when instantiating the loss metric
smape = MeanAbsolutePercentageError(symmetric=True)
smape(y_true, y_pred)
```

I'm probably leaning toward option 2, since it mirrors how you'd do the same thing with the loss functions. What do you think works best with tune/evaluate and is also straightforward for users?
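For reference, with option 2 the configured metric object can be handed to the tuning machinery without any case distinction. A rough sketch, assuming sktime's `ForecastingGridSearchCV` accepts such a callable via its `scoring` argument (the forecaster, splitter, and grid below are placeholders):

```python
from sktime.forecasting.model_selection import (
    ForecastingGridSearchCV,
    SlidingWindowSplitter,
)
from sktime.forecasting.naive import NaiveForecaster

# Uses the option-2 class sketched above: all metric parameters are fixed at
# construction time, so the tuner only ever sees a (y_true, y_pred) callable.
smape = MeanAbsolutePercentageError(symmetric=True)

gscv = ForecastingGridSearchCV(
    forecaster=NaiveForecaster(),
    cv=SlidingWindowSplitter(window_length=24),
    param_grid={"strategy": ["last", "mean"]},
    scoring=smape,
)
```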
Hi @RNKuhns, another option may be to write custom classes with the arguments in the constructor:

```python
class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self, symmetric=True):
        self.symmetric = symmetric
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super(MeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )

    def __call__(...):
        return self.fn(..., symmetric=self.symmetric)
```

EDIT: Just realised this is your option 2 above!
Chipping in, I prefer your last option 2: metrics following a class template and being callable with a unified interface (no parameters in the call, always the same argument signature; parameters go in `__init__`). I would, however, suggest:
@mloning and @fkiraly I started working on option 2 from above since it seemed to match both of your expectations. I noted that I'd be re-using the same `__call__` logic across several classes, so I started down the road of having a couple of mix-ins to add the desired `__call__` behaviour. For example, the mix-in and MeanAbsolutePercentageError classes would look something like:

```python
class PercentageErrorMixIn:
    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=self.symmetric, **kwargs)


class MeanAbsolutePercentageError(PercentageErrorMixIn, MetricFunctionWrapper):
    def __init__(self, symmetric):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better)
        self.symmetric = symmetric
```

This could be extended to have a BaseMetricFunctionWrapper class and then additional classes that account for different types of metrics (e.g. ones that take a symmetric parameter or ones that take a square_root parameter, etc.). For example:

```python
class BaseMetric:
    def __init__(self, fn, name=None, greater_is_better=False):
        self.fn = fn
        self.name = name if name is not None else fn.__name__
        self.greater_is_better = greater_is_better


class PercentageErrorMixIn:
    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=self.symmetric, **kwargs)


class PercentageMetricFunctionWrapper(PercentageErrorMixIn, BaseMetric):
    def __init__(self, fn, name=None, greater_is_better=False, symmetric=False):
        self.symmetric = symmetric
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better)


class MeanAbsolutePercentageError(PercentageMetricFunctionWrapper):
    def __init__(self, symmetric):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super().__init__(
            fn=fn, name=name, greater_is_better=greater_is_better, symmetric=symmetric
        )
```

But then I realized that another option is to just keep a single `MetricFunctionWrapper` whose `__call__` dispatches the extra kwargs from a list of known attribute names:

```python
CALL_KWARG_ATTRS = ['symmetric', 'square_root']


class MetricFunctionWrapper:
    def __init__(self, fn, name=None, greater_is_better=False):
        self.fn = fn
        self.name = name if name is not None else fn.__name__
        self.greater_is_better = greater_is_better

    def __call__(self, y_true, y_pred):
        kwargs = {}
        for attr_str in CALL_KWARG_ATTRS:
            self._add_kwarg(kwargs, attr_str)
        return self.fn(y_true, y_pred, **kwargs)

    def _add_kwarg(self, kwargs, attr_str):
        if not isinstance(attr_str, str):
            raise TypeError('Parameter `attr_str` must be a string')
        attr_val = getattr(self, attr_str) if hasattr(self, attr_str) else None
        kwargs[attr_str] = attr_val
        return kwargs
```

This would yield a definition for a MeanAbsolutePercentageError class as:

```python
class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self, symmetric):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better)
        self.symmetric = symmetric
```

Interested to hear your thoughts again to help me learn more about the design choice side of things.
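To make the trade-off concrete, here is a hypothetical sketch of how the mixin route would extend to the square_root case mentioned above, building on the MetricFunctionWrapper base sketched earlier in this thread (the class and parameter names are illustrative only, not part of the PR):

```python
class SquaredErrorMixIn:
    """Dispatches __call__ for metrics that expose a square_root option."""

    def __call__(self, y_true, y_pred, *args, **kwargs):
        # Assumes the wrapped function accepts a square_root keyword argument
        return self.fn(y_true, y_pred, *args, square_root=self.square_root, **kwargs)


class MeanSquaredError(SquaredErrorMixIn, MetricFunctionWrapper):
    def __init__(self, square_root=False):
        name = "MeanSquaredError"
        fn = mean_squared_error  # function version proposed in this PR
        greater_is_better = False
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better)
        self.square_root = square_root


# Usage: an RMSE scorer then behaves like any other metric object
rmse = MeanSquaredError(square_root=True)
# rmse(y_true, y_pred)
```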
@mloning I prefer the mixin approach over the kwarg approach. I also wouldn't mind having some duplication if it keeps things simpler.
I actually think no kwargs should be passed to `__call__`. For the dispatch behaviour, I'd prefer mixins or inheritance, agreeing with @mloning. I also think you should inherit from `BaseEstimator`.
Inheriting from BaseEstimator may cause problems. We may have to write our own base class. For example, CV objects aren't base estimators either. But try it out to see what happens!
Hm, why is that? What kind of problems, @mloning?
@fkiraly yes, I'm thinking about testing and the functionality around the estimator registry (e.g. all_estimators).
Sounds good. I'll give it a go and let you know later tonight if I'm seeing any issues.
Fixes #671 by adding requested functionality.
This is a draft pull request. I need to finish the docstrings (I'm adding examples to the docstrings) and also write unit tests, but I wanted to give you a heads-up on the functionality.
The pull request adds new forecasting performance metrics (see list below) that can each work with multivariate series (but default to univariate forecasts). It also changes naming conventions to align with scikit-learn (e.g. mean_absolute_error rather than mae or mae_loss).
The end result would be the following performance metric functions (each with a corresponding class version for scoring); a short usage sketch follows the list:
relative_error,
mean_asymmetric_error,
mean_absolute_scaled_error,
median_absolute_scaled_error,
mean_squared_scaled_error,
median_squared_scaled_error,
mean_absolute_error,
mean_squared_error,
median_absolute_error,
median_squared_error,
mean_absolute_percentage_error,
median_absolute_percentage_error,
mean_squared_percentage_error,
median_squared_percentage_error,
mean_relative_absolute_error,
median_relative_absolute_error,
geometric_mean_relative_absolute_error,
geometric_mean_relative_squared_error
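A quick usage sketch of the renamed functions with the proposed signature (the values are made up and the exact import path is an assumption):

```python
import numpy as np
import pandas as pd

from sktime.performance_metrics.forecasting import mean_absolute_percentage_error

y_true = pd.Series([3.0, 4.5, 5.0, 6.2])
y_pred = pd.Series([2.8, 4.9, 5.4, 6.0])

# scikit-learn-style name instead of the old mape/smape_loss-style names
mean_absolute_percentage_error(y_true, y_pred)

# symmetric variant and optional horizon weights, per the signature discussed above
mean_absolute_percentage_error(
    y_true, y_pred, horizon_weight=np.array([0.1, 0.2, 0.3, 0.4]), symmetric=True
)
```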
Does your contribution introduce a new dependency? If yes, which one?
Not as coded. But several of the functions could be replaced by wrappers around scikit-learn functions. However, this would require changing the scikit-learn dependency to >=0.24, which may not be desirable yet.
As coded, I reuse their code (I need to make sure I add references to the places where I do that) to avoid the dependency.
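For context, if the dependency were eventually bumped to scikit-learn>=0.24, the overlapping metrics could become thin wrappers along these lines; this is only a sketch, and the sktime version would still add its own input checks and the symmetric option:

```python
from sklearn.metrics import mean_absolute_percentage_error as _sklearn_mape


def mean_absolute_percentage_error(
    y_true, y_pred, horizon_weight=None, multioutput="uniform_average"
):
    # Delegate the core computation to scikit-learn (>=0.24); horizon_weight
    # maps onto scikit-learn's sample_weight argument.
    return _sklearn_mape(
        y_true, y_pred, sample_weight=horizon_weight, multioutput=multioutput
    )
```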
What should a reviewer concentrate their feedback on?
Let me know whether the naming convention changes to the forecasting metrics make sense. To me it makes sense to keep this aligned with scikit-learn's metrics.
Note that I also added a few tweaks to the validation functions to support validating the input data to these metric functions.
Any other comments?
PR checklist
For all contributions
For new estimators