
Add Additional Forecasting Performance Metrics #672

Closed
wants to merge 29 commits into from

Conversation

RNKuhns
Contributor

@RNKuhns RNKuhns commented Feb 11, 2021

Fixes #671 by adding requested functionality.

This is a draft pull request. I still need to finish the docstrings (I'm adding examples to them) and write unit tests, but I wanted to give you a heads-up on the functionality.

The pull request adds new forecasting performance metrics (see the list below), each of which can work with multivariate series (but defaults to univariate forecasts). It also changes the naming conventions to align with scikit-learn (e.g. mean_absolute_error rather than mae or mae_loss).

The end result would be the following performance metric functions (each with a corresponding class version for scoring):

relative_error,
mean_asymmetric_error,
mean_absolute_scaled_error,
median_absolute_scaled_error,
mean_squared_scaled_error,
median_squared_scaled_error,
mean_absolute_error,
mean_squared_error,
median_absolute_error,
median_squared_error,
mean_absolute_percentage_error,
median_absolute_percentage_error,
mean_squared_percentage_error,
median_squared_percentage_error,
mean_relative_absolute_error,
median_relative_absolute_error,
geometric_mean_relative_absolute_error,
geometric_mean_relative_squared_error
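
For reference, a purely illustrative sketch of what the "scaled" variants compute (standard non-seasonal MASE; not this PR's implementation, which also handles horizon weights and multivariate/multioutput input):

import numpy as np

def mean_absolute_scaled_error(y_true, y_pred, y_train):
    # out-of-sample MAE scaled by the in-sample MAE of a naive one-step-ahead forecast
    mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    naive_mae = np.mean(np.abs(np.diff(np.asarray(y_train))))
    return float(mae / naive_mae)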

Does your contribution introduce a new dependency? If yes, which one?

Not as coded. But several of the functions could be replaced by wrappers around scikit-learn functions. However, that would require bumping the scikit-learn dependency to >=0.24, which may not be warranted.

As coded, I reuse their code instead (I still need to add references at the places where I do that) to avoid the dependency.
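
For illustration, the wrapper alternative might look roughly like this (scikit-learn >= 0.24 is what added e.g. mean_absolute_percentage_error; the parameter names below follow this PR's function signatures):

from sklearn.metrics import mean_absolute_error as _sklearn_mae

def mean_absolute_error(y_true, y_pred, horizon_weight=None, multioutput="uniform_average"):
    # horizon_weight maps onto scikit-learn's sample_weight
    return _sklearn_mae(y_true, y_pred, sample_weight=horizon_weight, multioutput=multioutput)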

What should a reviewer concentrate their feedback on?

Let me know if the naming convention changes to the forecasting metrics make sense. To me it makes sense to keep this aligned with scikit-learn's metrics.

Note that I also made a few tweaks to the validation functions to support validating the input data passed to these metrics.

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • I've added unit tests and made sure they pass locally.
For new estimators
  • I've added the estimator to the online documentation.
  • I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

@RNKuhns RNKuhns marked this pull request as draft February 11, 2021 22:22
@mloning
Contributor

mloning commented Feb 12, 2021

Thanks again @RNKuhns - happy to update the scikit-learn dependency so that we can reuse as much of their code as possible.

Regarding unit testing, I suggest having (see the sketch after this list):

  • a general test for all functions/classes that checks the interface, e.g. input/output types (two pd.Series in, single float value out) and whether user-friendly error messages are raised if wrong types are passed
  • specific tests for each function that check the output value
  • a general test for all classes checking that they compute the same as the corresponding function
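
A rough pytest sketch of these three kinds of tests (the import path and class/function names are assumptions based on the PR description, not a confirmed API):

import numpy as np
import pandas as pd
import pytest

# assumed import location for the new metrics
from sktime.performance_metrics.forecasting import (
    MeanAbsoluteError,
    mean_absolute_error,
)

y_true = pd.Series(np.arange(1.0, 11.0))
y_pred = y_true + 0.5

def test_interface_series_in_float_out():
    # two pd.Series in, single float value out
    assert isinstance(mean_absolute_error(y_true, y_pred), float)

def test_output_value():
    # check a specific expected value
    assert mean_absolute_error(y_true, y_pred) == pytest.approx(0.5)

def test_class_equals_function():
    # class version computes the same as the corresponding function
    assert MeanAbsoluteError()(y_true, y_pred) == mean_absolute_error(y_true, y_pred)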

@mloning
Contributor

mloning commented Feb 14, 2021

Let's rename y_test to y_true in all metrics (see #684)

@RNKuhns
Contributor Author

RNKuhns commented Feb 17, 2021

The testing plan and the switch from y_test to y_true both sound good to me. On the scikit-learn dependency, last I checked the Anaconda main channel was still on scikit-learn 0.23.2. To make things easiest for people using conda, it might be worth waiting to enforce that dependency until the Anaconda main channel rolls to 0.24.

I've had some stuff pop up that got me a bit behind on this, but I'm expecting to have time to work on this on Friday.

@RNKuhns
Contributor Author

RNKuhns commented Feb 19, 2021

The y_test to y_true naming change is done and I'm wrapping up the tests.

@mloning I've got the tests that verify the output values completed, and I'm wrapping up the tests that compare the function outputs to the class outputs, but I've got a follow-up question on the first test you suggested (input and output types).

I get how to check the output types, but what should I do to check the input types in terms of unit testing (I'm already checking types in the functions themselves)?

@mloning
Contributor

mloning commented Feb 19, 2021

@RNKuhns for input types, I'd just check that appropriate errors are raised, e.g. if two series of unequal length or of the wrong types (e.g. lists instead of pandas Series) are passed; you can use pytest.raises for these tests.
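
For example, a minimal sketch of such tests (the exact exception types depend on the validation added in this PR, and the import path is assumed):

import pandas as pd
import pytest

from sktime.performance_metrics.forecasting import mean_absolute_error  # assumed import path

def test_unequal_length_raises():
    with pytest.raises(ValueError):
        mean_absolute_error(pd.Series([1.0, 2.0, 3.0]), pd.Series([1.0, 2.0]))

def test_wrong_input_type_raises():
    # whichever of TypeError/ValueError the input validation raises
    with pytest.raises((TypeError, ValueError)):
        mean_absolute_error([1.0, 2.0, 3.0], "not a series")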

@RNKuhns
Contributor Author

RNKuhns commented Feb 23, 2021

@mloning I believe I've got this pretty close to finished. I've added the discussed unit tests, as well as a few more types of unit tests, and they are passing locally. Eventually it would be good to also add unit tests for the multivariate functionality of the functions and for the ability to provide forecast horizon weights to return weighted metrics. I'm using the same coding approach as scikit-learn for both, but I haven't written unit tests for that functionality yet (by default it isn't used). Is it fine to leave those for later, or do you want that functionality tested now too?

I've got tests in place for each of the new functions (including tests that the performance metric classes compute the same values) for all of the functions except relative_loss() and mean_asymmetric_error().

The output of both of these depends on user choices (e.g. the loss function to use in relative_loss, and, for mean_asymmetric_error, the asymmetry threshold and the functions to apply on either side of that threshold). I'll add these to the test cases with a specific set of choices for the other arguments to each function.

Note that I also haven't created a class corresponding to either of these functions. If users wanted to use them, they'd need to use make_forecasting_scorer and supply either function with the desired arguments.

Can you let me know whether not having corresponding classes for these two functions, and only testing one specific set of arguments for each of them, will work?

I've also gone through and changed any references to the old naming conventions (e.g. sMAPE, smape_loss, etc) to the new naming conventions (e.g. SymmetricMeanAbsolutePercentageError and symmetric_mean_absolute_percentage_error).

Note that I still need to finalize the docstrings for each of the new forecasting loss functions (they are close but they are not finished yet).

MedianAbsoluteError,
MedianSquaredError,
RootMedianSquaredError,
SymmetricMeanAbsolutePercentageError,
Contributor


Ah, I see the problem of having abbreviations (sMAPE may refer to mean or median). Is there any other way to handle this? Or could we still include the abbreviations even if we just redefine them (sMAPE = SymmetricMeanAbsolutePercentageError), but duplication also isn't great ...

Contributor Author


Hey @mloning, can you elaborate on your comment?

I thought about using the abbreviations sMAPE and sMdAPE, but I think we should match the metric naming style of scikit-learn's regression metrics. Scikit-learn has functions that calculate MSE, MAPE and MdAE, and they are named mean_squared_error, mean_absolute_percentage_error and median_absolute_error.

Also, in terms of duplication, do you mean having separate functions for the median and mean metrics, or something else?

I actually coded a symmetric=False default argument into the mean_absolute_percentage_error function and then have symmetric_mean_absolute_percentage_error as a wrapper around that implementation. I could go either way on whether we exclude the symmetric_mean_absolute_percentage_error function and just let users toggle the symmetric argument in mean_absolute_percentage_error based on their preference.
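
A rough sketch of that relationship (illustrative only; it ignores the horizon_weight and multioutput handling in the actual functions):

import numpy as np

def mean_absolute_percentage_error(y_true, y_pred, symmetric=False):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2 if symmetric else np.abs(y_true)
    return float(np.mean(np.abs(y_true - y_pred) / denom))

def symmetric_mean_absolute_percentage_error(y_true, y_pred):
    # thin wrapper that just toggles the symmetric argument
    return mean_absolute_percentage_error(y_true, y_pred, symmetric=True)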

In terms of the separate functions for mean and median, I considered whether to include an aggregation_function parameter that lets people choose {'max', 'mean', 'median'}. That would reduce the number of functions, but it would lead to more ambiguous names (we'd have to name the function something like root_squared_error and then let people choose whether they want the mean, median or max version). It would also move away from names that match the scikit-learn regression metric naming style.

Contributor


Okay @RNKuhns, I'm happy with using the long names; it makes things more explicit.

Contributor

@mloning mloning Feb 25, 2021


We could simply add shorthands for common loss functions, e.g. smape = symmetric_mean_absolute_percentage_error, so that users can also import smape - but I'm not sure whether the benefit of having shorter function names for common errors outweighs the disadvantage of introducing two ways of doing the same thing.

Contributor


I agree, using arguments like {'max', 'mean', 'median'} isn't very intuitive.

@mloning
Contributor

mloning commented Feb 25, 2021

There's one question which I'm not sure about: how to handle y_train? I see two options:

  1. All loss functions accept y_train as an optional input argument, but only those that need it use it (e.g. the scaled errors),
  2. Only those that need it have it as a required input argument.

We can see already in the unit tests that option 2 requires us to always make a case distinction. A similar distinction will be required in all functionality related to model evaluation (e.g. ForecastingGridSearchCV).

The loss classes already have a few attributes associated with them (e.g. greater_is_better) to handle evaluation. If we go with option 2, my suggestion would be to add a requires_y_train attribute to make the case distinctions more straightforward.
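
A minimal sketch of how such an attribute could drive the case distinction (names and structure are illustrative, not the final API):

import numpy as np

class MeanAbsoluteError:
    requires_y_train = False

    def __call__(self, y_true, y_pred):
        return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

class MeanAbsoluteScaledError:
    requires_y_train = True

    def __call__(self, y_true, y_pred, y_train):
        mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
        naive_mae = np.mean(np.abs(np.diff(np.asarray(y_train))))  # in-sample naive MAE
        return float(mae / naive_mae)

def evaluate(metric, y_true, y_pred, y_train=None):
    # evaluation code makes the case distinction once, via the attribute
    if getattr(metric, "requires_y_train", False):
        return metric(y_true, y_pred, y_train)
    return metric(y_true, y_pred)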

Thoughts @fkiraly @aiwalter @RNKuhns?

@RNKuhns
Contributor Author

RNKuhns commented Apr 6, 2021

@mloning I think I'm close to wrapping up the changes. As we discussed, I'm using a square_root boolean parameter in the squared error functions and a symmetric boolean parameter in the percentage error functions.

The last thing I'm trying to figure out is whether we should still have separate classes in sktime/sktime/performance_metrics/_classes.py for the loss metrics handled by the square_root and symmetric arguments.

This is probably best illustrated with an example. With these changes the API for mean_absolute_percentage_error looks like the following:

def mean_absolute_percentage_error(
    y_true, y_pred, horizon_weight=None, multioutput="uniform_average", 
    symmetric=False
):
    ... 

One option would be to create a single class as an object-oriented counterpart to the above function, that would look something like:

class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super(MeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )

Where the callability is inherited from the base MetricFunctionWrapper object:

class MetricFunctionWrapper:
    def __init__(self, fn, name=None, greater_is_better=False):
        self.fn = fn
        self.name = name if name is not None else fn.__name__
        self.greater_is_better = greater_is_better

    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, **kwargs)

To return the symmetric version of MeanAbsolutePercentageError, the call would have to be:

MeanAbsolutePercentageError()(y_true, y_pred, symmetric=True)

but that seems like it would be awkward for tuning/evaluation (or am I overthinking it?). The other two options I've thought of are:

# Option 1: Have separate classes for symmetric and regular mean absolute percentage error.
#           Specify symmetric=True in __call__ defined within the class.
class SymmetricMeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self):
        name = "SymmetricMeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super(SymmetricMeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )

    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=True, **kwargs)

# This would let a user just use the SymmetricMeanAbsolutePercentageError class and not think about it
smape = SymmetricMeanAbsolutePercentageError()
smape(y_true, y_pred)

# Option 2: Have __init__ accept a symmetric kwarg and store it in a self._symmetric attribute.
#           Specify symmetric=self._symmetric in __call__ defined within the class.
class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self, symmetric=False):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        self._symmetric = symmetric
        super(MeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )

    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=self._symmetric, **kwargs)

# A user would need to choose a value for the symmetric arg when instantiating the loss metric
smape = MeanAbsolutePercentageError(symmetric=True)
smape(y_true, y_pred)

I'm probably leaning toward option 2, since it mirrors how you'd do the same thing with the loss functions. What do you think works best with tune/evaluate and is also straightforward for users?

@mloning
Contributor

mloning commented Apr 6, 2021

Hi @RNKuhns, another option may be to write custom classes with the arguments in the constructor:

class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self, symmetric=True):
        self.symmetric = symmetric
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super(MeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )
    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=self.symmetric, **kwargs)

EDIT: Just realised this is your option 2 above!

@fkiraly
Collaborator

fkiraly commented Apr 6, 2021

Chipping in, I prefer your last option 2: metrics following a class template and being callable with a unified interface (no parameters in the call, always the same argument signature; parameters go in __init__). This would be most consistent with the "universal interface" principle (see section 6.2 of https://arxiv.org/pdf/2101.04938.pdf)

I would, however, suggest:

  • additionally inheriting from sklearn's BaseEstimator. Perhaps best that a BaseMetric inherits from BaseEstimator, and metrics inherit from the BaseMetric template. That way, you can tune parameters of a metric used within a tuning wrapper (otherwise you can't easily); see the sketch after this list.
  • thinking carefully about how to handle different call signatures across tasks. Do we need different base classes for metrics serving different tasks?
  • considering whether to create metrics where the aggregation method is a parameter, e.g. AbsolutePercentageError with a parameter agg that can be mean, median, etc.
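
A sketch of the first suggestion (class names and bodies are illustrative; the point is that storing constructor arguments sklearn-style makes get_params/set_params work for tuning):

import numpy as np
from sklearn.base import BaseEstimator

class BaseMetric(BaseEstimator):
    greater_is_better = False

class MeanAbsolutePercentageError(BaseMetric):
    def __init__(self, symmetric=False):
        self.symmetric = symmetric  # stored under the constructor-argument name, per sklearn convention

    def __call__(self, y_true, y_pred):
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        denom = (np.abs(y_true) + np.abs(y_pred)) / 2 if self.symmetric else np.abs(y_true)
        return float(np.mean(np.abs(y_true - y_pred) / denom))

mape = MeanAbsolutePercentageError()
print(mape.get_params())           # {'symmetric': False} -- inspectable by tuning wrappers
mape.set_params(symmetric=True)    # and settable, e.g. from within a grid search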

@fkiraly
Collaborator

fkiraly commented Apr 6, 2021

Hi @RNKuhns, another option may be to write custom classes with the arguments in the constructor

@mloning, what's the advantage of that? Not sure I see one.

@mloning
Contributor

mloning commented Apr 6, 2021

@fkiraly I didn't read it properly - it's the same as @RNKuhns's option 2 above

@RNKuhns
Contributor Author

RNKuhns commented Apr 8, 2021

@mloning and @fkiraly I started working on option 2 from above since it seemed to match both of your expectations.

I noted that I'd be re-using the same __call__ definitions for the squared errors and percentage errors multiple times (I think @fkiraly was alluding to this).

I started down the road of having a couple of mix-ins to add the desired __call__ functionality.

For example, the mix-in and MeanAbsolutePercentageError classes would look something like:

class PercentageErrorMixIn:
    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=self.symmetric, **kwargs)

class MeanAbsolutePercentageError(PercentageErrorMixIn, MetricFunctionWrapper):
    def __init__(self, symmetric):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better)
        self.symmetric = symmetric

This could be extended to have a BaseMetric class and then additional wrapper classes that account for different types of metrics (e.g. ones that take a symmetric parameter, ones that take a square_root parameter, etc.).

For example:

class BaseMetric:
    def __init__(self, fn, name=None, greater_is_better=False):
        self.fn = fn
        self.name = name if name is not None else fn.__name__
        self.greater_is_better = greater_is_better

class PercentageErrorMixIn:
    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=self.symmetric, **kwargs)

class PercentageMetricFunctionWrapper(PercentageErrorMixIn, BaseMetric):
    def __init__(self, fn, name=None, greater_is_better=False, symmetric=False):
        self.symmetric = symmetric
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better)

class MeanAbsolutePercentageError(PercentageMetricFunctionWrapper):
    def __init__(self, symmetric):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better, symmetric=symmetric)

But then I realized that another option is to just use the same __call__ definition in MetricFunctionWrapper for all the loss function classes, and have that __call__ definition build a kwargs dict to pass to the underlying function based on the attributes the class has (e.g. if hasattr(self, "symmetric"), then pass symmetric to the function in the __call__ return).

CALL_KWARG_ATTRS = ["symmetric", "square_root"]

class MetricFunctionWrapper:
    def __init__(self, fn, name=None, greater_is_better=False):
        self.fn = fn
        self.name = name if name is not None else fn.__name__
        self.greater_is_better = greater_is_better

    def __call__(self, y_true, y_pred):
        kwargs = {}
        for attr_name in CALL_KWARG_ATTRS:
            self._add_kwarg(kwargs, attr_name)
        return self.fn(y_true, y_pred, **kwargs)

    def _add_kwarg(self, kwargs, attr_name):
        if not isinstance(attr_name, str):
            raise TypeError("Parameter `attr_name` must be a string")
        # only forward the keyword if the subclass actually defines the attribute
        if hasattr(self, attr_name):
            kwargs[attr_name] = getattr(self, attr_name)

This would yield a definition for a MeanAbsolutePercentageError class as:

class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self, symmetric):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better)
        self.symmetric = symmetric
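
Usage would then look the same as option 2 above, for example:

smape = MeanAbsolutePercentageError(symmetric=True)
smape(y_true, y_pred)  # __call__ builds kwargs == {"symmetric": True} and forwards them to the function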

Interested to hear your thoughts again to help me learn more about the design choice side of things.

@mloning
Contributor

mloning commented Apr 9, 2021

@RNKuhns I prefer the mixin approach over the kwarg approach. I also wouldn't mind having some duplication if it keeps things simpler.

@fkiraly
Collaborator

fkiraly commented Apr 9, 2021

I actually think no kwargs should be passed to __call__; the signature should be fixed. Any modifying or parametric behaviour should be passed through __init__.

For the dispatch behaviour, I'd prefer mixins or inheritance, agreeing with @mloning.

I also think you should inherit from BaseEstimator for parameter handling (see above).

@mloning
Contributor

mloning commented Apr 9, 2021

Inheriting from BaseEstimator may cause problems. We may have to write our own base class. For example, CV objects aren't base estimators either. But try it out to see what happens!

@fkiraly
Collaborator

fkiraly commented Apr 9, 2021

Inheriting from BaseEstimator may cause problems.

Hm, why that, what kind of problems, @mloning?
In my private projects I keep inheriting from BaseEstimator for nested parameter inspection in inhomogeneous composites (even those with interfaces wildly different from fit/predict), and have not encountered any serious issues.
Perhaps testing or lookup?

@mloning
Contributor

mloning commented Apr 9, 2021

@fkiraly yes, I'm thinking about testing and functionality around estimator registry (e.g. all_estimators(...)), let's see what happens - if it works, we should also let our CV classes inherit from BaseEstimator I guess.

@RNKuhns
Contributor Author

RNKuhns commented Apr 9, 2021

Sounds good. I'll give it a go and let you know later tonight if I'm seeing any issues.

@mloning
Contributor

mloning commented Apr 11, 2021

Hi @RNKuhns - there seem to be quite a few merge conflicts, possibly related to the renaming of the master branch (#644).

This should fix your local clone and fork:

git branch -m master main
git fetch origin
git branch -u origin/main main

Let me know if you need help resolving the conflicts.

@RNKuhns RNKuhns closed this Apr 12, 2021
@RNKuhns RNKuhns deleted the master branch April 12, 2021 12:05
@RNKuhns RNKuhns restored the master branch April 12, 2021 12:53
@RNKuhns RNKuhns reopened this Apr 12, 2021
@RNKuhns RNKuhns closed this Apr 12, 2021
@RNKuhns RNKuhns deleted the master branch April 12, 2021 13:00
@RNKuhns RNKuhns mentioned this pull request Apr 12, 2021
Development

Successfully merging this pull request may close these issues.

Add Additional Forecasting Evaluation Functions/Classes