
Univariate | Scipy scaffolding #204

Closed
gbonomib opened this issue Jan 15, 2021 · 3 comments
Labels
resolution:duplicate This issue or pull request already exists

Comments

@gbonomib (Contributor) commented Jan 15, 2021

Hello guys,

I have noticed that many of the univariate distributions in Copulas are wrappers around scipy implementations. This makes sense, as it allows most of the maintenance duties to be offloaded to the scipy devs. I am wondering whether it would be appropriate to write some kind of wrapper to bulk-import all (or an opinionated subset) of the continuous probability distributions from scipy, so that we do not have to re-write a wrapper for every single univariate distribution we want to include in Copulas.

Note that all scipy continuous distributions inherit from the rv_continuous class, which automagically implements many useful methods, including fitting (fit) and sampling (rvs). In addition, all scipy distributions are parametrized as a location-scale family, which is convenient for tractability.

This change would have the benefit of getting "for free" new univariate distributions as they are released in scipy.
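
For concreteness, here is a minimal sketch of what such a generic wrapper could look like (class and method names are just illustrative placeholders, not the existing Copulas API):

from scipy import stats


class ScipyUnivariate:
    """Sketch: wrap an arbitrary continuous distribution from scipy.stats."""

    def __init__(self, distribution_name):
        self._distribution = getattr(stats, distribution_name)
        self._params = None

    def fit(self, X):
        # rv_continuous.fit returns a tuple: (shape params..., loc, scale)
        self._params = self._distribution.fit(X)

    def sample(self, n_samples=1):
        return self._distribution.rvs(*self._params, size=n_samples)

    def cumulative_distribution(self, X):
        return self._distribution.cdf(X, *self._params)


# any name from scipy.stats would work without a dedicated wrapper, e.g.
# ScipyUnivariate('loglaplace'), ScipyUnivariate('gamma'), ScipyUnivariate('beta'), ...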

I can have a look into that, let me know if this also makes sense to you.

Cheers
gab

@csala (Contributor) commented Jan 22, 2021

Thanks for the proposal @gbonomib

This would be excellent, and in a way the current implementation already tries to approach your proposal by making the integration of new scipy distributions as simple as possible. However, there are a few challenges that may prevent us from automating this even more:

  1. We need to properly capture the distribution parameters inside self._params so to_dict and from_dict work as expected (see the sketch after this list).
  2. We need to be able to encode the constant value scenario within self._params, and later on be able to decode it.
  3. This is possibly the most important one: the parameters that we use to encode the constant values scenario should be equal to the limit of the parameters as the data approaches the constant scenario. So, in other words: fitting the distribution to an array of constant values should produce parameters which are very close to the parameters produced when fitting to the same array with just very small variations.
  4. We need to flag the type of distribution (unbounded, semi-bounded or bounded).
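
To make point 1 a bit more concrete, here is roughly the kind of round trip we would need to support for an arbitrary scipy distribution (the helper names and the 'type' key are only illustrative, not the current Copulas format):

from scipy import stats


def params_to_dict(distribution_name, fitted_args):
    # rv_continuous.shapes is a comma-separated string, or None when there are no shape params
    distribution = getattr(stats, distribution_name)
    shape_names = distribution.shapes.split(', ') if distribution.shapes else []
    return {'type': distribution_name, **dict(zip(shape_names + ['loc', 'scale'], fitted_args))}


def params_from_dict(params):
    params = dict(params)
    distribution = getattr(stats, params.pop('type'))
    return distribution(**params)  # frozen distribution, ready for rvs/cdf/ppf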

Having said that, if we find a way to deduce this information (or scipy exposes it in some way), we can definitely consider automating this more as you are suggesting. Any ideas in that sense?

@gbonomib (Contributor, Author)

Hi @csala

A very basic proof of concept: keep in mind that every continuous/discrete distribution in scipy inherits from the rv_continuous/rv_discrete class.

import numpy as np
from scipy import stats

np.random.seed(123)
distr_name = 'loglaplace'  # any distribution in scipy.stats
distr = getattr(stats, distr_name)

  1. We need to properly capture the distribution parameters inside self._params so to_dict and from_dict work as expected.

You can expose the shape arguments via the .shapes attribute:

>>> distr.shapes
'c'

You can build the _params dictionary by fitting distr and then using a dictionary comprehension:

>>> args = distr.fit(distr.rvs(1, size=100))
>>> {shape: arg for shape, arg in zip([distr.shapes, 'loc', 'scale'], args)}
{'c': 1.301364908200763, 'loc': 0.022224374785521517, 'scale': 0.9915819258549647}
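
Note that zip([distr.shapes, 'loc', 'scale'], args) only works when there is a single shape parameter; for distributions with several shapes (e.g. stats.beta, whose .shapes is 'a, b'), splitting the string keeps the same pattern working:

>>> shape_names = distr.shapes.split(', ') if distr.shapes else []
>>> params = dict(zip(shape_names + ['loc', 'scale'], args))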

  2. We need to be able to encode the constant value scenario within self._params, and later on be able to decode it.

If you fit a constant array while imposing loc = 1 and scale = 0 (via floc and fscale), the other parameters of the distribution stay unchanged even for small perturbations of the constant vector:

>>> args_constant = distr.fit(np.linspace(1, 1, 100), floc=1, fscale=0)
>>> {shape: arg for shape, arg in zip([distr.shapes, 'loc', 'scale'], args_constant)}
{'c': 1.0, 'loc': 1, 'scale': 0}
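
Decoding would then just be a matter of checking for scale == 0 (a rough sketch under that convention; the helper names are made up):

def is_constant_params(params):
    # under the floc/fscale convention above, scale == 0 marks the degenerate case
    return params.get('scale') == 0


def sample_constant(params, size):
    # a "constant distribution" simply repeats the loc value
    return np.full(size, params['loc'])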

  3. This is possibly the most important one: the parameters that we use to encode the constant values scenario should be equal to the limit of the parameters as the data approaches the constant scenario. So, in other words: fitting the distribution to an array of constant values should produce parameters which are very close to the parameters produced when fitting to the same array with just very small variations.

Is this really the strongest constraint? This condition on asymptotic behavior does not seem to be satisfied even for the univariate distributions already implemented in Copulas, e.g. in loglaplace:

def _fit_constant(self, X):
    self._params = {
        'c': 2.0,
        'loc': np.unique(X)[0],
        'scale': 0.0,
    } 

Yet - unless I am mistaken - when you try to fit a loglaplace distribution on constant values, e.g. np.linspace(1, 1, 5), the above {c: 2, loc: 1, scale: 0} does not seem to correspond to the asymptotic values (which do not really exist, unless you fix n-1 of the arguments).

What one could do - as mentioned in 2. - is to enforce floc and fscale when fitting a constant vector, so that all the other parameters are self-determined and independent from small fluctuations of the constant vector. However, this does not provide any guarantee on the asymptotic values of the parameters.
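
For what it's worth, the condition can be checked empirically by comparing the fit on a constant array against the fit on a slightly perturbed copy (a quick sketch; the exact numbers will depend on the scipy version and on the size of the perturbation):

constant = np.ones(100)
perturbed = constant + np.random.normal(0, 1e-6, size=100)

params_constant = distr.fit(constant, floc=1, fscale=0)
params_perturbed = distr.fit(perturbed)

# if the asymptotic condition held, the two tuples would be close;
# in practice the fitted shape and scale of the perturbed array need not approach the hard-coded values
print(params_constant)
print(params_perturbed)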


  4. We need to flag the type of distribution (unbounded, semi-bounded or bounded).

You can determine the boundaries of any distribution by inspecting .a and .b. Simple logic based on these allows flagging the distribution type, as sketched below.

# lower bound
>>> distr.a
0.0
# upper bound
>>> distr.b
inf
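
A small helper along those lines (the function name is illustrative, and keep in mind that for some distributions the actual support also depends on the shape parameters):

def distribution_type(distr):
    # classify by the finiteness of the standardized support bounds
    lower_finite = np.isfinite(distr.a)
    upper_finite = np.isfinite(distr.b)
    if lower_finite and upper_finite:
        return 'bounded'        # e.g. stats.beta (a=0, b=1)
    if lower_finite or upper_finite:
        return 'semi-bounded'   # e.g. stats.loglaplace (a=0, b=inf)
    return 'unbounded'          # e.g. stats.norm (a=-inf, b=inf)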

@npatki commented Aug 10, 2023

Hi all, I recognize this issue is over 2 years old now. The team is actively thinking about enabling this functionality for our users. We have a new issue #357 that we are using to track this.

I'm linking this issue so that we don't lose track of prior discussions as we figure out the best way to add this functionality, and I'm marking it as a duplicate. For any future discussions, please refer to #357.

@npatki npatki closed this as completed Aug 10, 2023
@npatki npatki added the resolution:duplicate This issue or pull request already exists label Aug 10, 2023