Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] further improvements on BaseDistribution and tests #21

Open
fkiraly opened this issue Aug 23, 2023 · 8 comments
Open

[ENH] further improvements on BaseDistribution and tests #21

fkiraly opened this issue Aug 23, 2023 · 8 comments
Labels
enhancement implementing framework Implementing or improving framework for learning tasks, e.g., base class functionality module:probability&simulation probability distributions and simulators

Comments

@fkiraly
Copy link
Collaborator

fkiraly commented Aug 23, 2023

From implementing a few distributions, some insights on potential improvements:

  • the broadcasting logic could/should be abstracted to apply to a certain subset of parameters rather than be copy-pasted in every class
  • there is repeated boilerplate, this could be resolved via split into public/private methods such as pdf/_pdf, similar to sktime's fit/_fit
  • we should add logical consistency tests, e.g., cdf being inverse to ppf, or pdf and log_pdf being compatible, or more sophisticated tests such as subtracting the mean shifting functions (or not affecting them) accordingly
  • worth thinking about: logical consistency tests of MC approximations against exact implementation, although that can eat a lot of runtime (so perhaps not?)
  • some distributions, such as empirical or composites, need careful thinking about the subsetting logic - so an extension contract for iloc and loc indexing needs to be thought out so it can affect parameters
  • row/column subsetting via loc, iloc, should be tested
@fkiraly fkiraly added enhancement module:probability&simulation probability distributions and simulators implementing framework Implementing or improving framework for learning tasks, e.g., base class functionality labels Aug 23, 2023
@fkiraly
Copy link
Collaborator Author

fkiraly commented Aug 23, 2023

FYI @Alex-JG3 - your feedback would be very much appreciated, given that you implemented a distrubtion.
What could be refactored, what could be improved for the implementer? What would make a better user experience?

@Alex-JG3
Copy link
Contributor

@fkiraly, generally I found it quite clear how to adapt what has been done already to the t-distribution. The implementation of the normal distribution is a great example. Some things that were less clear to me:

  • It wasn't obvious to me that methods that are not implemented are estimated numerically but this is a very cool feature!
  • I don't quite understand the benefit using pandas-like syntax as opposed to numpy-like indexing. I see that you can pass your own index/columns but I myself would probably avoid doing that. It will be cool to see how this plays in some worked examples.

It might be nice to be able to pass data structures other than pandas dataframes (link numpy arrays or lists) into pdf, log_pdf, cdf etc.

@fkiraly
Copy link
Collaborator Author

fkiraly commented Aug 23, 2023

I don't quite understand the benefit using pandas-like syntax as opposed to numpy-like indexing.

I see - imo this is crucial to use it in an sktime-style framework, where you might have multiple named variables, and hierarchical index on the rows (shop/product group/product).
Doing this in numpy would make it very hard to track.

Subsetting by name is also a bonus, e.g., in the hierarchical setting, a hierarchical time series to an interval of days.

It will be cool to see how this plays in some worked examples.

Have you tried using predict_proba of any sktime forecaster? That's one of the primary use cases. These already return BaseDistirbution children (most forecasters make normal forecasts).

The other primary use case is probabilistic regression as in this (at the moment very rudimentary) notebook:
https://github.com/sktime/skpro/blob/main/examples/01_skpro_intro.ipynb

@fkiraly
Copy link
Collaborator Author

fkiraly commented Aug 23, 2023

It might be nice to be able to pass data structures other than pandas dataframes (link numpy arrays or lists) into pdf, log_pdf, cdf etc.

Yes, good idea! I also thought about that initially and decided against, as it would incur boilerplate.

However, if we go with the private/public split, we can sandwich boilerplate for conversions.

@fkiraly
Copy link
Collaborator Author

fkiraly commented Aug 23, 2023

It wasn't obvious to me that methods that are not implemented are estimated numerically but this is a very cool feature!

Ah, yes - we should point that out in an extension template.
I opened an issue to track that it still needs to be written and referenced your input there:

#23

@Alex-JG3
Copy link
Contributor

@fkiraly, can I take a look at this one?

  • the broadcasting logic could/should be abstracted to apply to a certain subset of parameters rather than be copy-pasted in every class.

Just to make sure I know what you want, in the classes that inherit the BaseDistribution class, they have to copy-paste the _get_bc_params method. This method should be part of the BaseDistribution class to avoid the need for copy-pasting.

It looks like we could make the _get_bc_params function part of the BaseDistribution class and have it accept a tuple of distribution parameters that it needs to make broadcastable. This function could then be called from the __init__ method in the child class.

Does that sound like a good solution or did you have something else in mind?

@fkiraly
Copy link
Collaborator Author

fkiraly commented Aug 27, 2023

Does that sound like a good solution or did you have something else in mind?

Yes, exactly what I thought!

The task is more general, as in finding the best design for this - but I think your idea is a very good solution. It is how I would have started.

This function could then be called from the __init__ method in the child class.

Or, you could call it in in the parent __init__, and write to self, using self.get_params, but that sounds more convoluted. Not sure whether iti s a good idea.

or did you have something else in mind?

A subtle problem is, if you go through all the examples, it turns out that not always all parameters have to be broadcast, even if you ignore index and columns.

I would solve that with having a dedicated tag that lists all the params that broadcast.

@fkiraly
Copy link
Collaborator Author

fkiraly commented Aug 27, 2023

@fkiraly, can I take a look at this one?

Yes, much obliged!

I will then focus on splitting private/public, we can then put the too together and they should not interact too much.

fkiraly pushed a commit that referenced this issue Sep 9, 2023
Closes #55, see also #21.

1. Adds a test to check whether the log of the `pdf` function is similar
to the `log_pdf` function for each distribution.
2. Adds a test to check that the `ppf` is the inverse of the `cdf`
function. This test required the change described in 1 to use
`np.allclose`.
3. Adds a test to check that `pandas`outputs of methods have all
`numeric` `dtype`-s

Also fixes one instance that would have failed 2, 3:

* Changes the method for converting datatypes in the `_apply_per_ix`
function of the `Empirical` distribution. Switches to use `to_numeric`
instead of `convert_dtypes`.
fkiraly pushed a commit to sktime/sktime that referenced this issue Sep 9, 2023
…stribution param broadcasting in base class (#5176)

Mirror of
[sktime/skpro#21

Moves the `_get_bc_params` method from child distributions to the parent
distribtion class, `BaseDistribution`.

This implementation is quite simple and doesn't use `_tags` and still
means the child distributions have to call the `_get_bc_params` method.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement implementing framework Implementing or improving framework for learning tasks, e.g., base class functionality module:probability&simulation probability distributions and simulators
Projects
None yet
Development

No branches or pull requests

2 participants