
Add Additional Forecasting Performance Metrics #672

Closed
wants to merge 29 commits into from

Conversation

RNKuhns
Contributor

@RNKuhns RNKuhns commented Feb 11, 2021

Fixes #671 by adding requested functionality.

This is a draft pull request. I still need to finish the docstrings (I'm adding examples to them) and write unit tests, but I wanted to give you a heads-up on the functionality.

The pull request adds new forecasting performance metrics (see the list below), each of which can work with multivariate series (but defaults to univariate forecasts). It also changes the naming conventions to align with scikit-learn (e.g. mean_absolute_error rather than mae or mae_loss).

The end result would be the following performance metric functions (each with a corresponding class version for scoring):

relative_error,
mean_asymmetric_error,
mean_absolute_scaled_error,
median_absolute_scaled_error,
mean_squared_scaled_error,
median_squared_scaled_error,
mean_absolute_error,
mean_squared_error,
median_absolute_error,
median_squared_error,
mean_absolute_percentage_error,
median_absolute_percentage_error,
mean_squared_percentage_error,
median_squared_percentage_error,
mean_relative_absolute_error,
median_relative_absolute_error,
geometric_mean_relative_absolute_error,
geometric_mean_relative_squared_error
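
For reference, a purely illustrative sketch of what the "scaled" variants compute (standard non-seasonal MASE; not this PR's implementation, which also handles horizon weights and multivariate/multioutput input):

import numpy as np

def mean_absolute_scaled_error(y_true, y_pred, y_train):
    # out-of-sample MAE scaled by the in-sample MAE of a naive one-step-ahead forecast
    mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    naive_mae = np.mean(np.abs(np.diff(np.asarray(y_train))))
    return float(mae / naive_mae)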

Does your contribution introduce a new dependency? If yes, which one?

Not as coded. But several of the functions could be replaced by wrappers around scikit-learn functions. However, that would require bumping the scikit-learn dependency to >=0.24, which may not be warranted.

As coded, I reuse their code instead (I still need to add references at the places where I do that) to avoid the dependency.
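
For illustration, the wrapper alternative might look roughly like this (scikit-learn >= 0.24 is what added e.g. mean_absolute_percentage_error; the parameter names below follow this PR's function signatures):

from sklearn.metrics import mean_absolute_error as _sklearn_mae

def mean_absolute_error(y_true, y_pred, horizon_weight=None, multioutput="uniform_average"):
    # horizon_weight maps onto scikit-learn's sample_weight
    return _sklearn_mae(y_true, y_pred, sample_weight=horizon_weight, multioutput=multioutput)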

What should a reviewer concentrate their feedback on?

Let me know if the naming convention changes to the forecasting metrics make sense. To me it makes sense to keep this aligned with scikit-learn's metrics.

Note that I also made a few tweaks to the validation functions to support validating the input data passed to these metrics.

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • I've added unit tests and made sure they pass locally.
For new estimators
  • I've added the estimator to the online documentation.
  • I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

@RNKuhns RNKuhns marked this pull request as draft February 11, 2021 22:22
@mloning
Contributor

mloning commented Feb 12, 2021

Thanks again @RNKuhns - happy to update the scikit-learn dependency so that we can reuse as much of their code as possible.

Regarding unit testing, I suggest having (see the sketch after this list):

  • a general test for all functions/classes that checks the interface, e.g. input/output types (two pd.Series in, single float value out) and whether user-friendly error messages are raised if wrong types are passed
  • specific tests for each function that check the output value
  • a general test for all classes checking that they compute the same as the corresponding function
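
A rough pytest sketch of these three kinds of tests (the import path and class/function names are assumptions based on the PR description, not a confirmed API):

import numpy as np
import pandas as pd
import pytest

# assumed import location for the new metrics
from sktime.performance_metrics.forecasting import (
    MeanAbsoluteError,
    mean_absolute_error,
)

y_true = pd.Series(np.arange(1.0, 11.0))
y_pred = y_true + 0.5

def test_interface_series_in_float_out():
    # two pd.Series in, single float value out
    assert isinstance(mean_absolute_error(y_true, y_pred), float)

def test_output_value():
    # check a specific expected value
    assert mean_absolute_error(y_true, y_pred) == pytest.approx(0.5)

def test_class_equals_function():
    # class version computes the same as the corresponding function
    assert MeanAbsoluteError()(y_true, y_pred) == mean_absolute_error(y_true, y_pred)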

@mloning
Contributor

mloning commented Feb 14, 2021

Let's rename y_test to y_true in all metrics (see #684)

@RNKuhns
Contributor Author

RNKuhns commented Feb 17, 2021

The testing plan and the switch from y_test to y_true both sound good to me. On the scikit-learn dependency, last I checked the Anaconda main channel was still on scikit-learn 0.23.2. To make things easiest for people using conda, it might be worth waiting to enforce that dependency until the Anaconda main channel rolls to 0.24.

I've had some stuff pop up that got me a bit behind on this, but I'm expecting to have time to work on this on Friday.

@RNKuhns
Contributor Author

RNKuhns commented Feb 19, 2021

The y_test to y_true naming change is done and I'm wrapping up the tests.

@mloning I've got the tests that verify the output values completed, and I'm wrapping up the tests that compare the function outputs to the class outputs, but I've got a follow-up question on the first test you suggested (input and output types).

I get how to check the output types, but what should I do to check the input types in terms of unit testing (I'm already checking types in the functions themselves)?

@mloning
Contributor

mloning commented Feb 19, 2021

@RNKuhns for input types, I'd just check that appropriate errors are raised, e.g. if two series of unequal length or of the wrong types (e.g. lists instead of pandas Series) are passed; you can use pytest.raises for these tests.
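
For example, a minimal sketch of such tests (the exact exception types depend on the validation added in this PR, and the import path is assumed):

import pandas as pd
import pytest

from sktime.performance_metrics.forecasting import mean_absolute_error  # assumed import path

def test_unequal_length_raises():
    with pytest.raises(ValueError):
        mean_absolute_error(pd.Series([1.0, 2.0, 3.0]), pd.Series([1.0, 2.0]))

def test_wrong_input_type_raises():
    # whichever of TypeError/ValueError the input validation raises
    with pytest.raises((TypeError, ValueError)):
        mean_absolute_error([1.0, 2.0, 3.0], "not a series")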

@RNKuhns
Contributor Author

RNKuhns commented Feb 23, 2021

@mloning I believe I've got this pretty close to finished. I've added the discussed unit tests, as well as a few more types of unit tests, and they are passing locally. Eventually it would be good to also add unit tests for the multivariate functionality of the functions and for the ability to provide forecast horizon weights to return weighted metrics. I'm using the same coding approach as scikit-learn for both, but I haven't written unit tests for that functionality yet (by default it isn't used). Is it fine to leave those for later, or do you want that functionality tested now too?

I've got tests in place for each of the new functions (including tests that the performance metric classes compute the same values) for all of the functions except relative_loss() and mean_asymmetric_error().

The output of both of these depends on user choices (e.g. the loss function to use in relative_loss, and, for mean_asymmetric_error, the asymmetry threshold and the functions to apply on either side of that threshold). I'll add these to the test cases with a specific set of choices for the other arguments to each function.

Note that I also haven't created a class corresponding to either of these functions. If users wanted to use them, they'd need to use make_forecasting_scorer and supply either function with the desired arguments.

Can you let me know whether not having corresponding classes for these two functions, and only testing one specific set of arguments for each of them, will work?

I've also gone through and changed any references to the old naming conventions (e.g. sMAPE, smape_loss, etc) to the new naming conventions (e.g. SymmetricMeanAbsolutePercentageError and symmetric_mean_absolute_percentage_error).

Note that I still need to finalize the docstrings for each of the new forecasting loss functions (they are close but they are not finished yet).

MedianAbsoluteError,
MedianSquaredError,
RootMedianSquaredError,
SymmetricMeanAbsolutePercentageError,
Contributor


Ah, I see the problem of having abbreviations (sMAPE may refer to mean or median). Is there any other way to handle this? Or could we still include the abbreviations even if we just redefine them (sMAPE = SymmetricMeanAbsolutePercentageError), but duplication also isn't great ...

Contributor Author


Hey @mloning, can you elaborate on your comment?

I thought about using the abbreviations sMAPE and sMdAPE, but I think we should match the metric naming style of scikit-learn's regression metrics. Scikit-learn has functions that calculate MSE, MAPE and MdAE, and they are named mean_squared_error, mean_absolute_percentage_error and median_absolute_error.

Also, in terms of duplication, do you mean having separate functions for the median and mean metrics, or something else?

I actually coded a symmetric=False default argument into the mean_absolute_percentage_error function and then have symmetric_mean_absolute_percentage_error as a wrapper around that implementation. I could go either way on whether we exclude the symmetric_mean_absolute_percentage_error function and just let users toggle the symmetric argument in mean_absolute_percentage_error based on their preference.
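
A rough sketch of that relationship (illustrative only; it ignores the horizon_weight and multioutput handling in the actual functions):

import numpy as np

def mean_absolute_percentage_error(y_true, y_pred, symmetric=False):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2 if symmetric else np.abs(y_true)
    return float(np.mean(np.abs(y_true - y_pred) / denom))

def symmetric_mean_absolute_percentage_error(y_true, y_pred):
    # thin wrapper that just toggles the symmetric argument
    return mean_absolute_percentage_error(y_true, y_pred, symmetric=True)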

In terms of the separate functions for mean and median, I considered whether to include an aggregation_function parameter that lets people choose {'max', 'mean', 'median'}. That would reduce the number of functions, but it would lead to more ambiguous names (we'd have to name the function something like root_squared_error and then let people choose whether they want the mean, median or max version). It would also move away from names that match the scikit-learn regression metric naming style.

Contributor


Okay @RNKuhns, I'm happy with using the long names; it makes things more explicit.

Contributor

@mloning mloning Feb 25, 2021


We could simply add shorthands for common loss functions, e.g. smape = symmetric_mean_absolute_percentage_error, so that users can also import smape - but I'm not sure whether the benefit of having shorter function names for common errors outweighs the disadvantage of introducing two ways of doing the same thing.

Contributor


I agree, using arguments like {'max', 'mean', 'median'} isn't very intuitive.

@mloning
Contributor

mloning commented Feb 25, 2021

There's one question which I'm not sure about: how to handle y_train? I see two options:

  1. All loss functions accept y_train as an optional input argument, but only those that need it use it (e.g. the scaled errors),
  2. Only those that need it have it as a required input argument.

We can see already in the unit tests that option 2 requires us to always make a case distinction. A similar distinction will be required in all functionality related to model evaluation (e.g. ForecastingGridSearchCV).

The loss classes already have a few attributes associated with them (e.g. greater_is_better) to handle evaluation. If we go with option 2, my suggestion would be to add a requires_y_train attribute to make the case distinctions more straightforward.
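
A minimal sketch of how such an attribute could drive the case distinction (names and structure are illustrative, not the final API):

import numpy as np

class MeanAbsoluteError:
    requires_y_train = False

    def __call__(self, y_true, y_pred):
        return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

class MeanAbsoluteScaledError:
    requires_y_train = True

    def __call__(self, y_true, y_pred, y_train):
        mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
        naive_mae = np.mean(np.abs(np.diff(np.asarray(y_train))))  # in-sample naive MAE
        return float(mae / naive_mae)

def evaluate(metric, y_true, y_pred, y_train=None):
    # evaluation code makes the case distinction once, via the attribute
    if getattr(metric, "requires_y_train", False):
        return metric(y_true, y_pred, y_train)
    return metric(y_true, y_pred)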

Thoughts @fkiraly @aiwalter @RNKuhns?

@RNKuhns
Contributor Author

RNKuhns commented Apr 6, 2021

@mloning I think I'm close to wrapping up the changes. As we discussed, I'm using a square_root boolean parameter in the squared error functions and a symmetric boolean parameter in the percentage error functions.

The last thing I'm trying to figure out is whether we should still have separate classes in sktime/sktime/performance_metrics/_classes.py for the loss metrics handled by the square_root and symmetric arguments.

This is probably best illustrated with an example. With these changes the API for mean_absolute_percentage_error looks like the following:

def mean_absolute_percentage_error(
    y_true, y_pred, horizon_weight=None, multioutput="uniform_average", 
    symmetric=False
):
    ... 

One option would be to create a single class as an object-oriented counterpart to the above function, that would look something like:

class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super(MeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )

Where the callability is inherited from the base MetricFunctionWrapper object:

class MetricFunctionWrapper:
    def __init__(self, fn, name=None, greater_is_better=False):
        self.fn = fn
        self.name = name if name is not None else fn.__name__
        self.greater_is_better = greater_is_better

    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, **kwargs)

To return the symmetric version of MeanAbsolutePercentageError, the call would have to be:

MeanAbsolutePercentageError()(y_true, y_pred, symmetric=True)

but that seems like it would be awkward for tuning/evaluation (or am I overthinking it?). The other two options I've thought of are:

# Option 1: Have separate classes for symmetric and regular mean absolute percentage error.
#           Specify symmetric=True in __call__ defined within the class.
class SymmetricMeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self):
        name = "SymmetricMeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super(SymmetricMeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )

    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=True, **kwargs)

# This would let a user just use the SymmetricMeanAbsolutePercentageError class and not think about it
smape = SymmetricMeanAbsolutePercentageError()
smape(y_true, y_pred)

# Option 2: Have __init__ accept a symmetric kwarg and store it in a self._symmetric attribute.
#           Specify symmetric=self._symmetric in __call__ defined within the class.
class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self, symmetric=False):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        self._symmetric = symmetric
        super(MeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )

    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=self._symmetric, **kwargs)

# A user would need to choose a value for the symmetric arg when instantiating the loss metric
smape = MeanAbsolutePercentageError(symmetric=True)
smape(y_true, y_pred)

I'm probably leaning toward option 2, since it mirrors how you'd do the same thing with the loss functions. What do you think works best with tune/evaluate and is also straightforward for users?

@mloning
Contributor

mloning commented Apr 6, 2021

Hi @RNKuhns, another option may be to write custom classes with the arguments in the constructor:

class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self, symmetric=True):
        self.symmetric = symmetric
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super(MeanAbsolutePercentageError, self).__init__(
            fn=fn, name=name, greater_is_better=greater_is_better
        )
    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=self.symmetric, **kwargs)

EDIT: Just realised this is your option 2 above!

@fkiraly
Collaborator

fkiraly commented Apr 6, 2021

Chipping in, I prefer your last option 2: metrics following a class template and being callable with a unified interface (no parameters in the call, always the same argument signature; parameters go in __init__). This would be most consistent with the "universal interface" principle (see section 6.2 of https://arxiv.org/pdf/2101.04938.pdf)

I would, however, suggest:

  • additionally inheriting from sklearn's BaseEstimator. Perhaps best that a BaseMetric inherits from BaseEstimator, and metrics inherit from the BaseMetric template. That way, you can tune parameters of a metric used within a tuning wrapper (otherwise you can't easily); see the sketch after this list.
  • thinking carefully about how to handle different call signatures across tasks. Do we need different base classes for metrics serving different tasks?
  • considering whether to create metrics where the aggregation method is a parameter, e.g. AbsolutePercentageError with a parameter agg that can be mean, median, etc.
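
A sketch of the first suggestion (class names and bodies are illustrative; the point is that storing constructor arguments sklearn-style makes get_params/set_params work for tuning):

import numpy as np
from sklearn.base import BaseEstimator

class BaseMetric(BaseEstimator):
    greater_is_better = False

class MeanAbsolutePercentageError(BaseMetric):
    def __init__(self, symmetric=False):
        self.symmetric = symmetric  # stored under the constructor-argument name, per sklearn convention

    def __call__(self, y_true, y_pred):
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        denom = (np.abs(y_true) + np.abs(y_pred)) / 2 if self.symmetric else np.abs(y_true)
        return float(np.mean(np.abs(y_true - y_pred) / denom))

mape = MeanAbsolutePercentageError()
print(mape.get_params())           # {'symmetric': False} -- inspectable by tuning wrappers
mape.set_params(symmetric=True)    # and settable, e.g. from within a grid search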

@fkiraly
Collaborator

fkiraly commented Apr 6, 2021

Hi @RNKuhns, another option may be to write custom classes with the arguments in the constructor

@mloning, what's the advantage of that? Not sure I see one.

@mloning
Contributor

mloning commented Apr 6, 2021

@fkiraly I didn't read it properly - it's the same as @RNKuhns's option 2 above

@RNKuhns
Contributor Author

RNKuhns commented Apr 8, 2021

@mloning and @fkiraly I started working on option 2 from above since it seemed to match both of your expectations.

I noted that I'd be re-using the same __call__ definitions for the squared errors and percentage errors multiple times (I think @fkiraly was alluding to this).

I started down the road of having a couple of mix-ins to add the desired __call__ functionality.

For example, the mix-in and MeanAbsolutePercentageError classes would look something like:

class PercentageErrorMixIn:
    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=self.symmetric, **kwargs)

class MeanAbsolutePercentageError(PercentageErrorMixIn, MetricFunctionWrapper):
    def __init__(self, symmetric):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better)
        self.symmetric = symmetric

This could be extended to have a BaseMetric class and then additional wrapper classes that account for different types of metrics (e.g. ones that take a symmetric parameter, ones that take a square_root parameter, etc.).

For example:

class BaseMetric:
    def __init__(self, fn, name=None, greater_is_better=False):
        self.fn = fn
        self.name = name if name is not None else fn.__name__
        self.greater_is_better = greater_is_better

class PercentageErrorMixIn:
    def __call__(self, y_true, y_pred, *args, **kwargs):
        return self.fn(y_true, y_pred, *args, symmetric=self.symmetric, **kwargs)

class PercentageMetricFunctionWrapper(PercentageErrorMixIn, BaseMetric):
    def __init__(self, fn, name=None, greater_is_better=False, symmetric=False):
        self.symmetric = symmetric
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better)

class MeanAbsolutePercentageError(PercentageMetricFunctionWrapper):
    def __init__(self, symmetric):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better, symmetric=symmetric)

But then I realized that another option is to just use the same __call__ definition in MetricFunctionWrapper for all the loss function classes, and have that __call__ definition build a kwargs dict to pass to the underlying function based on the attributes the class has (e.g. if hasattr(self, "symmetric"), then pass symmetric to the function in the __call__ return).

CALL_KWARG_ATTRS = ["symmetric", "square_root"]

class MetricFunctionWrapper:
    def __init__(self, fn, name=None, greater_is_better=False):
        self.fn = fn
        self.name = name if name is not None else fn.__name__
        self.greater_is_better = greater_is_better

    def __call__(self, y_true, y_pred):
        kwargs = {}
        for attr_name in CALL_KWARG_ATTRS:
            self._add_kwarg(kwargs, attr_name)
        return self.fn(y_true, y_pred, **kwargs)

    def _add_kwarg(self, kwargs, attr_name):
        if not isinstance(attr_name, str):
            raise TypeError("Parameter `attr_name` must be a string")
        # only forward the keyword if the subclass actually defines the attribute
        if hasattr(self, attr_name):
            kwargs[attr_name] = getattr(self, attr_name)

This would yield a definition for a MeanAbsolutePercentageError class as:

class MeanAbsolutePercentageError(MetricFunctionWrapper):
    def __init__(self, symmetric):
        name = "MeanAbsolutePercentageError"
        fn = mean_absolute_percentage_error
        greater_is_better = False
        super().__init__(fn=fn, name=name, greater_is_better=greater_is_better)
        self.symmetric = symmetric
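
Usage would then look the same as option 2 above, for example:

smape = MeanAbsolutePercentageError(symmetric=True)
smape(y_true, y_pred)  # __call__ builds kwargs == {"symmetric": True} and forwards them to the function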

Interested to hear your thoughts again to help me learn more about the design choice side of things.

@mloning
Contributor

mloning commented Apr 9, 2021

@RNKuhns I prefer the mixin approach over the kwarg approach. I also wouldn't mind having some duplication if it keeps things simpler.

@fkiraly
Collaborator

fkiraly commented Apr 9, 2021

I actually think no kwargs should be passed to __call__; the signature should be fixed. Any modifying or parametric behaviour should be passed through __init__.

For the dispatch behaviour, I'd prefer mixins or inheritance, agreeing with @mloning.

I also think you should inherit from BaseEstimator for parameter handling (see above).

@mloning
Contributor

mloning commented Apr 9, 2021

Inheriting from BaseEstimator may cause problems. We may have to write our own base class. For example, CV objects aren't base estimators either. But try it out to see what happens!

@fkiraly
Collaborator

fkiraly commented Apr 9, 2021

Inheriting from BaseEstimator may cause problems.

Hm, why that, what kind of problems, @mloning?
In my private projects I keep inheriting from BaseEstimator for nested parameter inspection in inhomogeneous composites (even those with interfaces wildly different from fit/predict), and have not encountered any serious issues.
Perhaps testing or lookup?

@mloning
Contributor

mloning commented Apr 9, 2021

@fkiraly yes, I'm thinking about testing and functionality around estimator registry (e.g. all_estimators(...)), let's see what happens - if it works, we should also let our CV classes inherit from BaseEstimator I guess.

@RNKuhns
Contributor Author

RNKuhns commented Apr 9, 2021

Sounds good. I'll give it a go and let you know later tonight if I'm seeing any issues.

@mloning
Contributor

mloning commented Apr 11, 2021

Hi @RNKuhns - there seem to be quite a few merge conflicts, possibly related to the renaming of the master branch (#644).

This should fix your local clone and fork:

git branch -m master main
git fetch origin
git branch -u origin/main main

Let me know if you need help resolving the conflicts.

@RNKuhns RNKuhns closed this Apr 12, 2021
@RNKuhns RNKuhns deleted the master branch April 12, 2021 12:05
@RNKuhns RNKuhns restored the master branch April 12, 2021 12:53
@RNKuhns RNKuhns reopened this Apr 12, 2021
@RNKuhns RNKuhns closed this Apr 12, 2021
@RNKuhns RNKuhns deleted the master branch April 12, 2021 13:00
@RNKuhns RNKuhns mentioned this pull request Apr 12, 2021
Development

Successfully merging this pull request may close these issues.

Add Additional Forecasting Evaluation Functions/Classes