[ENH] roadmap of probability distributions to implement #22

Open
11 of 35 tasks
fkiraly opened this issue Aug 23, 2023 · 30 comments
Labels
feature request New feature or request good first issue Good for newcomers implementing algorithms Implementing algorithms, estimators, objects native to skpro module:probability&simulation probability distributions and simulators

Comments

@fkiraly
Collaborator

fkiraly commented Aug 23, 2023

It would be great to have a basic set of probability distributions implemented.

Umbrella issue for implementing sktime probability distributions.

Recipe: use the extension_templates/distribution.py extension template.
Examples:

  • Normal, for de-novo implementations or manual interfaces
  • Fisk, for interfacing scipy distributions - this is much easier than using the full template

High priority:

mid priority:

low priority:

lower priority:

  • alpha
  • binomial
  • burr III
  • burr XII
  • erlang
  • f
  • fatigue-life
  • generalized Pareto
  • gamma
  • geometric
  • half-cauchy
  • half-normal
  • half-logistic
  • levy
  • log-gamma
  • log-laplace
  • negative binomial
  • pareto
  • skellam
  • truncated normal
  • truncated pareto

list of many more (lowest priority)
https://docs.scipy.org/doc/scipy/reference/stats.html#probability-distributions - can be interfaced via _ScipyDist adapter easily!
https://en.wikipedia.org/wiki/File:ProbOnto2.5.jpg

Mirrors sktime/sktime#4518
(for high and mid priority)

Contributions can be made to either repository, and should be copied over to the other once approved/merged, until the modules are merged into one.

@fkiraly fkiraly added good first issue Good for newcomers module:probability&simulation probability distributions and simulators implementing algorithms Implementing algorithms, estimators, objects native to skpro feature request New feature or request labels Aug 23, 2023
fkiraly added a commit that referenced this issue Aug 25, 2023
Adds empirical distribution.

Towards #22.

Mirror of sktime/sktime#5094
fkiraly added a commit that referenced this issue Aug 25, 2023
Implements mixture of distributions.

Towards #22, and required for
ensemble regressor.

Also adds a default implementation for `ppf` in the `BaseDistribution`,
using the bisection method to invert a `cdf`, if present.
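
A minimal sketch of inverting a `cdf` by bisection, to illustrate the idea behind such a default `ppf`. This is not the code from the commit: the function name `ppf_by_bisection`, the bracket `[-1e6, 1e6]` and the tolerance are assumptions made for illustration.

```python
import math


def ppf_by_bisection(cdf, p, lower=-1e6, upper=1e6, tol=1e-10, max_iter=200):
    """Invert a monotone cdf at probability p by bisection.

    Hypothetical helper, not the BaseDistribution implementation: the
    bracket and tolerance defaults are assumptions for illustration.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    lo, hi = lower, upper
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid  # quantile lies to the right of mid
        else:
            hi = mid  # quantile lies at or to the left of mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Closed-form normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


median = ppf_by_bisection(normal_cdf, 0.5)
q975 = ppf_by_bisection(normal_cdf, 0.975)
print(median, q975)
```

The sketch only needs the `cdf` to be non-decreasing and the quantile to lie inside the bracket; a distribution with a closed-form quantile function would override it.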
fkiraly pushed a commit that referenced this issue Aug 27, 2023

#### Reference Issues/PRs

Mirror of `sktime` sktime/sktime#5050. Towards #22


#### What does this implement/fix? Explain your changes.

Add student's t-distribution.
@fkiraly fkiraly changed the title [ENH] (wish)list of probability distributions to implement [ENH] roadmap & (wish)list of probability distributions to implement Sep 13, 2023
@fkiraly fkiraly pinned this issue Sep 13, 2023
@fkiraly fkiraly changed the title [ENH] roadmap & (wish)list of probability distributions to implement [ENH] roadmap of probability distributions to implement Sep 13, 2023
@bhavikar04
Contributor

Hi, I'm interested in taking this up. Would you say priority of the distributions aligns with the difficulty in implementation? I'd like to do either multivariate normal or uniform continuous.

@fkiraly
Collaborator Author

fkiraly commented Mar 12, 2024

Hmmm, I'd say it is currently actually the opposite way. That is, the remaining low priority ones are easier to get started with, than the remaining high priority ones - simply since the easy higher priority ones are already done.

So, uniform continuous then? Parameterized by lower and upper.

I don't have a reference for energy and squared norm integrals, but these should not be too difficult to obtain. Let me know if you need input there, we can always start with the more common methods.

@an20805
Contributor

an20805 commented Mar 12, 2024

Hey @fkiraly, I have implemented uniform continuous distribution in my local branch. How do I proceed further?
I would also love to implement other distributions.

@fkiraly
Collaborator Author

fkiraly commented Mar 12, 2024

@an20805, nice! Let's not duplicate then, @bhavikar04 - how about beta?

The next step would be making a pull request to this repository, and a review cycle, then merge.

@fkiraly
Collaborator Author

fkiraly commented Mar 13, 2024

Re energy, for $X, Y\sim Unif(a, b)$, I get:

$\mathbb{E}[|X - y|] = |y - \frac{b+a}{2} |$ if $y$ lies outside $[a, b]$,
and $\mathbb{E}[|X - y|] = \frac{(b-y)^2}{2(b-a)}+ \frac{(a-y)^2}{2(b-a)}$ if inside,

and

$\mathbb{E}[|X - Y|] = \frac{1}{3} (b-a)$ - double checking appreciated.
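
One way to do the double check is a quick Monte Carlo comparison. The sketch below uses only the Python standard library; the function names, the interval `[1, 4]` and the sample size are assumptions chosen for illustration, not part of any existing implementation.

```python
import random


def cross_energy_uniform(y, a, b):
    """E|X - y| for X ~ Unif(a, b), using the formulas quoted above."""
    if y < a or y > b:
        return abs(y - 0.5 * (a + b))
    return ((b - y) ** 2 + (a - y) ** 2) / (2.0 * (b - a))


def self_energy_uniform(a, b):
    """E|X - Y| for independent X, Y ~ Unif(a, b)."""
    return (b - a) / 3.0


def mc_cross_energy(y, a, b, n, rng):
    """Monte Carlo estimate of E|X - y| from n uniform draws."""
    return sum(abs(rng.uniform(a, b) - y) for _ in range(n)) / n


def mc_self_energy(a, b, n, rng):
    """Monte Carlo estimate of E|X - Y| from n independent pairs."""
    return sum(abs(rng.uniform(a, b) - rng.uniform(a, b)) for _ in range(n)) / n


rng = random.Random(0)
a, b, n = 1.0, 4.0, 100_000
for y in (0.0, 2.5, 6.0):  # left of, inside, and right of [a, b]
    print(y, cross_energy_uniform(y, a, b), mc_cross_energy(y, a, b, n, rng))
print(self_energy_uniform(a, b), mc_self_energy(a, b, n, rng))
```

With this many samples, the closed-form and sampled values should agree to roughly two decimal places.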

@bhavikar04
Contributor

In that case I'll take up log normal distribution then.

@fkiraly
Collaborator Author

fkiraly commented Mar 13, 2024

Pinging @Alex-JG3 and @ivarzap who most recently implemented distributions, in case you have any general starter advice.

@bhavikar04
Contributor

bhavikar04 commented Mar 15, 2024

Hey,

So I'm a little unsure on what the energy will be for the log normal distribution and can't find much online, is there any literature you can point me to?

@fkiraly
Collaborator Author

fkiraly commented Mar 15, 2024

@bhavikar04, Appendix A.2 of "evaluating forecasts with scoringRules" has a few explicit formulae for the energy, including the log-normal distribution. The expression is hard to track in implementation, so I would advise comparing against the Monte-Carlo default if you implement it.

I would also suggest you try it on paper; there's a good chance of errors in rare calculations like these.
Wolfram Alpha might also help, whereas ChatGPT and the like typically produce plausible garbage.
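
A sketch of what such a comparison against Monte Carlo could look like for the log-normal self-term $\mathbb{E}|X - X'|$. The candidate closed form used here is an assumption that is not taken from the thread or the paper: it follows from the log-normal Gini coefficient $\mathrm{erf}(\sigma/2)$ and the relation "mean difference = 2 · mean · Gini", and should itself be double checked. The function names are hypothetical.

```python
import math
import random


def self_energy_lognormal(mu, sigma):
    """Candidate closed form for E|X - X'|, X, X' iid LogNormal(mu, sigma).

    Assumption: mean difference = 2 * mean * Gini, with Gini = erf(sigma / 2).
    """
    mean = math.exp(mu + 0.5 * sigma**2)
    return 2.0 * mean * math.erf(0.5 * sigma)


def mc_self_energy_lognormal(mu, sigma, n, seed=0):
    """Monte Carlo estimate of E|X - X'| from n independent pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.lognormvariate(mu, sigma)
        x_prime = rng.lognormvariate(mu, sigma)
        total += abs(x - x_prime)
    return total / n


mu, sigma = 0.0, 0.5
closed = self_energy_lognormal(mu, sigma)
approx = mc_self_energy_lognormal(mu, sigma, n=200_000)
print(closed, approx)
```

If the two numbers disagree by much more than the Monte Carlo error, the closed form (or its implementation) is the first suspect.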

@bhavikar04
Contributor

Hey thank you so much, I'll try to chalk out a suitable implementation soon. ChatGPT was humble enough to admit it doesn't know enough ;)

@fkiraly
Collaborator Author

fkiraly commented Mar 16, 2024

Yes, I admit I also tried as computing integrals can get tiring: https://xkcd.com/2117/
Wolfram is not bad, it makes sense to double check though. As said, there is a default Monte Carlo implementation, so if you set the number of samples high, the matrices should be similar.

@sukjingitsit
Contributor

sukjingitsit commented Mar 23, 2024

I would like to work on implementing the chi-square distribution. To confirm, we have to follow the template of Laplace and Normal, where we implement the mean, var, pdf, cdf, logpdf and ppf alongside the energy, right?
To characterise chi-square, I assume, as standard practice, we will use the degrees of freedom, right?

@sukjingitsit
Contributor

sukjingitsit commented Mar 23, 2024

The current implementation of ppf wraps a scipy function directly, since there is no closed mathematical form. Similarly, while the cross-energy can be derived mathematically, the self-energy is difficult to solve in closed form (nor could I find literature on it); the best options for that are integration or sampling. Thus, energy hasn't been implemented yet.
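
A sketch of the integration option, using the standard identity $\mathbb{E}|X - X'| = 2\int F(x)\,(1 - F(x))\,dx$ for a distribution with cdf $F$. To stay within the standard library, the example uses the chi-squared distribution with 2 degrees of freedom, whose cdf is $1 - e^{-x/2}$; the helper names, integration bounds and step count are assumptions for illustration.

```python
import math


def self_energy_by_integration(cdf, lower, upper, n_steps=200_000):
    """Approximate E|X - X'| = 2 * integral of F(x) * (1 - F(x)) dx.

    Uses the composite trapezoidal rule on [lower, upper]; the bounds must
    cover essentially all of the probability mass.
    """
    h = (upper - lower) / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        f = cdf(lower + i * h)
        weight = 0.5 if i in (0, n_steps) else 1.0  # trapezoid end points
        total += weight * f * (1.0 - f)
    return 2.0 * h * total


def chi2_cdf_two_dof(x):
    """Chi-squared cdf for 2 degrees of freedom (closed form)."""
    return 1.0 - math.exp(-0.5 * x) if x > 0 else 0.0


estimate = self_energy_by_integration(chi2_cdf_two_dof, 0.0, 100.0)
print(estimate)
```

For 2 degrees of freedom the integral can be done by hand and equals 2, which gives a reference value; for general degrees of freedom the same routine would need a numerical cdf, e.g., the scipy one.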

@an20805
Contributor

an20805 commented Mar 25, 2024

> @an20805, would you be so kind to open a PR with your partial work on uniform distribution, even if not finished? Would be a shame if it got lost.

Hello @fkiraly, I am sorry for delaying the PR this much. I had an accident and wasn't able to work for more than a week. I am all good now and got back yesterday. I have opened a PR #223. I am having some issues, would like to get some help.

@fkiraly
Collaborator Author

fkiraly commented Mar 25, 2024

sorry to hear, @an20805. Good to hear you are back! I'll reply on #223.

@malikrafsan
Contributor

Hi @fkiraly 👋 I want to start contributing here, I noticed that some of the unchecked tasks are already being assigned/reviewed. Do you have any recommendations on what distributions should I try to implement? I would be more than happy to contribute to this project, Thank you!

@fkiraly
Collaborator Author

fkiraly commented Mar 28, 2024

@malikrafsan, welcome!

The ones that no one has claimed in the discussion are still available.
The ProbOnto link has more.

We could also look into integer valued, e.g., Binomial, Poisson - these are common for GLM.

@malikrafsan
Contributor

Hi @fkiraly I am very interested in implementing Logistic/Weibull Distribution. However, I failed to find the formula for the energy of those two distributions. Can you help me with this? Or is it better if I raise PRs first? Thank you!

@fkiraly
Collaborator Author

fkiraly commented Apr 6, 2024

> However, I failed to find the formula for the energy of those two distributions. Can you help me with this? Or is it better if I raise PRs first?

There is an approximative (Monte Carlo) default if this is not easy to obtain - in any case it can be done later (or not at all).

Thanks for contributing!

@malikrafsan
Contributor

Ahh, I see, thank you so much for your guidance @fkiraly! Then my PRs are ready to be reviewed. I would very much love to hear your feedback. However, if you have any reference on the energy formulas of those two distributions, I would still very much like to implement them. Thank you so much!

@fkiraly
Collaborator Author

fkiraly commented Apr 7, 2024

> if you have any reference on energy formula of those two distributions

have you checked in the paper above? If not there, one would have to derive it.

@malikrafsan
Contributor

Do you mean this paper?

> Appendix A.2 of "evaluating forecasts with scoringRules" has a few explicit formulae for the energy, including the log-normal distribution.

Yes, I have checked the paper but I cannot find the formula. I can only find CRPS and CDF formulas. Does CRPS mean the energy? If so, I think I misunderstood your previous statements

@fkiraly
Collaborator Author

fkiraly commented Apr 9, 2024

Yes, CRPS is closely related, it is the cross-term minus half the self-term (compare definitions).

The unfortunate bit about the paper is that it only gives CRPS, but not the self-term or cross-term in isolation. However, it should not be too hard to back these out, using that shifting the distribution location by a constant leaves the self-term unchanged, but not the cross-term.

@fkiraly
Collaborator Author

fkiraly commented Apr 9, 2024

More precisely, a useful formula to use is

$$\lim_{y \rightarrow \infty} \mbox{CRPS}(y) - y = -\mathbb{E}[X] - \frac{1}{2}\mathbb{E}|X - X'|,$$

i.e., you can obtain the self-term $\mathbb{E}|X - X'|$ via taking a limit, if you know the expressions for CRPS and the expectation already; the cross-term then follows as CRPS plus half the self-term.

(the equation follows from observing that the absolute value in $\mathbb{E}\left|y - X \right|$ disappears in the limit)
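
A sketch of this backing-out procedure for the normal distribution, where everything can be checked against known results. The limit isolates the self-term $\mathbb{E}|X - X'|$; the cross-term then follows as CRPS plus half the self-term. The closed-form normal CRPS is the standard one; the helper names and the choice of "large" $y$ (12 standard deviations above the mean) are assumptions for illustration.

```python
import math


def normal_crps(y, mu, sigma):
    """Closed-form CRPS of a Normal(mu, sigma) forecast at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def self_term_from_crps(crps, mean, y_large):
    """Back out E|X - X'| from CRPS(y) - y evaluated at a large y."""
    return -2.0 * (crps(y_large) - y_large + mean)


mu, sigma = 1.5, 2.0
recovered = self_term_from_crps(
    lambda y: normal_crps(y, mu, sigma), mean=mu, y_large=mu + 12.0 * sigma
)
known = 2.0 * sigma / math.sqrt(math.pi)  # standard self-term for the normal

# cross-term E|X - y| at y = 0, as CRPS plus half the self-term
cross_at_0 = normal_crps(0.0, mu, sigma) + 0.5 * recovered
print(recovered, known, cross_at_0)
```

For a distribution without a known self-term, the same two steps apply, with the limit taken symbolically or evaluated numerically at a sufficiently large $y$.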

fkiraly pushed a commit that referenced this issue Apr 17, 2024
Towards #22

#### What does this implement/fix? Explain your changes.
Weibull probability distribution
fkiraly pushed a commit that referenced this issue Apr 18, 2024
Towards #22 

#### What does this implement/fix? Explain your changes.
Lognormal probability distribution
fkiraly pushed a commit that referenced this issue Apr 18, 2024
Towards #22

#### What does this implement/fix? Explain your changes.
Logistic probability distribution
fkiraly pushed a commit that referenced this issue Apr 25, 2024
Implemented Uniform Continuous Probability Distribution, towards
#22
fkiraly pushed a commit that referenced this issue Apr 25, 2024
Addresses #22 for chi-squared case
@malikrafsan malikrafsan mentioned this issue May 4, 2024
6 tasks
fkiraly pushed a commit that referenced this issue May 7, 2024
Towards #22

This PR implements a Beta distribution based on the Scipy Adapter
fkiraly pushed a commit that referenced this issue May 24, 2024
Implements Gamma distribution. Towards #22
5 participants