[ENH] Histogram distribution #335

Closed
wants to merge 31 commits into from

Conversation

ShreeshaM07
Contributor

Reference Issues/PRs

fixes #323

What does this implement/fix? Explain your changes.

Implements the histogram distribution using bin_width and bin_density as the parameters.

Does your contribution introduce a new dependency? If yes, which one?

No

What should a reviewer concentrate their feedback on?

Whether the chosen parameters are suitable to be used.

Did you add any tests for the change?

No

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
    See here for full badge reference
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

@ShreeshaM07
Contributor Author

I have implemented only minimal code so far. Would bin_width and bin_density be appropriate as the parameters?

Extend the range of the input array by 0.1% * (max(x) - min(x)) so that both end points are included in the histogram.

bin_width

  • If it is an int, divide the range into that many equal bins, compute the bin boundary values from x, and store them in the array self.bins.
  • If it is a list, divide x according to the respective bin widths, compute the boundary points, and store them in the array self.bins.

bin_density

  • It is essentially the pdf at a point x and lies in the interval [0,1].
  • To find the cdf, multiply it by the bin width and sum up to x.

Does this idea seem fine?

@fkiraly
Collaborator

fkiraly commented May 16, 2024

Hm, I do not quite get the parametrization here.

We will have to say - let's say, in the scalar case only - where the bins start/end, and how much mass is in them. How is this achieved given the parameters?

I'm not sure I understand, explanation would be appreciated!

An example might help - let's say I have two bins, one from 0.5 to 2, and one from 2 to 7. The first bin has mass 0.3, the second 0.7. How would I construct the distribution?

@fkiraly fkiraly added enhancement module:probability&simulation probability distributions and simulators implementing algorithms Implementing algorithms, estimators, objects native to skpro labels May 16, 2024
@ShreeshaM07
Contributor Author

An example might help - let's say I have two bins, one from 0.5 to 2, and one from 2 to 7. The first bin has mass 0.3, the second 0.7. How would I construct the distribution?

So in this example, bin_width would be passed as [1.5,5] and bin_density as [0.3,0.7], together with the x values, which are a 1D array/dataframe. To find the values that split the axis into bins, I would use min(x) and max(x).

But I think this approach has a flaw. It would be better to use the bin edges themselves, so bins = [0.5, 2, 7] and bin_density = [0.3,0.7], and then use these to find the pdf at all the passed x values. I think that would make more sense.

I will work on this tomorrow. I am not working today, so I will not attend the stand-up; please excuse me.

@fkiraly
Collaborator

fkiraly commented May 17, 2024

But I think there will be a flaw in this way of doing it.

Agreed: this way, you do not know where the first bin starts, i.e., at 0.5.

I also think your new suggestion is better!

@ShreeshaM07
Contributor Author

I have done it that way, but I am still not sure whether we can handle a single int input for bins.
The code so far gives the expected output when running

x=np.array([1,0.75,1.8,2.5,3,5,6,6.5])
hist = Histogram(bins=[0.5,2,7],bin_density=[0.3,0.7],index=pd.Index(np.arange(3)),columns=pd.Index(np.arange(2)))
pdf = hist._pdf(x)
print(pdf)

Output:

[0.2  0.2  0.2  0.14 0.14 0.14 0.14 0.14]

Does this parameterization make sense?

@fkiraly
Collaborator

fkiraly commented May 17, 2024

Yes, I think it does make sense!

I would rename the parameter bin_density to bin_mass though - as a user would understand the value "density" to be the function value in the interval.

bin_density = np.array(self.bin_density.copy())
bins = self.bins
pdf = []
if isinstance(bins, list):
Collaborator

Main comment: this looks correct, but it is quite inefficient due to the use of loops.

I would strongly advise using numpy methods for everything.

For instance, for bin widths, use diff.

To find the bin in which the x-value falls, you could use cumsum and np.where with >.
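
The vectorized lookup suggested here could look roughly like the following. This is only a sketch, not the PR's implementation: the function name `hist_pdf` is hypothetical, `np.searchsorted` is swapped in for the `cumsum` / `np.where` lookup mentioned in the comment, and treating bins as left-closed intervals is an assumption.

```python
import numpy as np


def hist_pdf(x, bins, bin_mass):
    """Piecewise-constant pdf of a single histogram distribution (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)

    # density in each bin = mass / width; widths come from np.diff
    density = bin_mass / np.diff(bins)

    # index of the bin containing each x (assumed convention: left-closed bins)
    idx = np.searchsorted(bins, x, side="right") - 1
    inside = (x >= bins[0]) & (x < bins[-1])

    # clip so that indexing is safe, then zero out points outside all bins
    idx = np.clip(idx, 0, len(density) - 1)
    return np.where(inside, density[idx], 0.0)


x = np.array([1, 0.75, 1.8, 2.5, 3, 5, 6, 6.5])
pdf = hist_pdf(x, bins=[0.5, 2, 7], bin_mass=[0.3, 0.7])
```

With the inputs from the earlier example in this thread, this gives 0.3 / 1.5 in the first bin and 0.7 / 5 in the second, without any Python-level loop.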

"""
bins = self.bins
# 1 is the cumulative sum of all bin_mass
return 1 / (max(bins) - min(bins))
Collaborator

that's not correct? Also, you need to be careful about the different cases of bins being int or iterable.

Contributor Author

I will take care of the bins cases. But when bins holds the bin edges, shouldn't the mean be mean = sum(bin_width*bin_height)/sum(bin_width)? The numerator is the area under the histogram, which is 1, and the sum of bin_width is the range of the bins values, i.e., max(bins) - min(bins). Is that incorrect?

Contributor Author

@ShreeshaM07 ShreeshaM07 May 20, 2024

Could you please review whether what I have used for the mean and var is correct, or do I have to use $E[X] = \mu = \int_{-\infty}^{\infty} x \cdot \mathrm{pdf}(x)\,dx$ across the different pdfs of the different bins?

Collaborator

@fkiraly fkiraly May 21, 2024

I think your formula is simply incorrect.
The correct formula for mean is:

Let $b_0, \dots, b_n$ be the bin boundaries, and $m_i, i = 1,\dots, n$ the mass in the bin $[b_{i-1}, b_i]$.

The mean of the histogram distribution is then

$$\mu =\frac{1}{2} \sum_{i=1}^n (b_i + b_{i-1})\cdot m_i$$

which you can obtain by applying np.dot and a shifted sum.

(this is obtained if you substitute pdf(x) into your formula and carry out the integration correctly)
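
For reference, a sketch of that integration, assuming the density is constant at $m_i / (b_i - b_{i-1})$ on each bin and zero outside the bins:

$$\mu = \int_{-\infty}^{\infty} x \cdot \mathrm{pdf}(x)\,dx = \sum_{i=1}^n \frac{m_i}{b_i - b_{i-1}} \int_{b_{i-1}}^{b_i} x\,dx = \sum_{i=1}^n \frac{m_i}{b_i - b_{i-1}} \cdot \frac{b_i^2 - b_{i-1}^2}{2} = \frac{1}{2} \sum_{i=1}^n (b_i + b_{i-1})\cdot m_i$$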

Collaborator

@fkiraly fkiraly May 21, 2024

for "easy" computation of the mean and variance, you can use that the histogram distribution is the same as the two-step conditional where you first sample which bin you are in, with probabilities $m_i$, and then from the uniform within the bin.

Use the conditional formulae for mean and variance on this idea - this also shows why the mean has the above form, as the weighted mean of means of uniform distributions on the bin intervals.
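
A minimal sketch of this two-step idea, assuming a single (scalar) histogram: the mean is the mass-weighted mean of the bin midpoints, and the variance follows from the law of total variance, using that a uniform distribution on a bin of width $w$ has variance $w^2/12$. The name `hist_mean_var` is hypothetical and not taken from the PR.

```python
import numpy as np


def hist_mean_var(bins, bin_mass):
    """Mean and variance of a single histogram distribution (hypothetical helper)."""
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)

    # per-bin uniform distributions: mean is the midpoint, variance is width^2 / 12
    mids = 0.5 * (bins[1:] + bins[:-1])
    widths = np.diff(bins)

    # law of total expectation: weighted mean of the bin means
    mean = np.dot(bin_mass, mids)

    # law of total variance: E[Var(X | bin)] + Var(E[X | bin])
    within = np.dot(bin_mass, widths**2 / 12.0)
    between = np.dot(bin_mass, (mids - mean) ** 2)
    return mean, within + between


# two bins, 0.5 to 2 with mass 0.3 and 2 to 7 with mass 0.7
mean, var = hist_mean_var([0.5, 2, 7], [0.3, 0.7])
```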

Contributor Author

Oh yes, that is correct. I made a mistake and will correct it now. Thanks for the help.

@ShreeshaM07
Contributor Author

Another way to input bins would be a tuple (float, float, int) representing (bins start, bins end, number of bins). Would this be better than a single integer, since with a single integer we would not know the start and end points of the bins?

@fkiraly
Collaborator

fkiraly commented May 20, 2024

Will this be a better idea as compared to only a single integer as then we wouldn't be aware of the start and end points of the bins.

Yes, I think that would be better than just int.

Though, how would we distinguish this from an iterable of length 3, i.e., the other convention on giving the boundaries? Let's say we have start = 0, end = 1, number of bins = 2. This could also be two bins, 0 to 1 and 1 to 2.

@ShreeshaM07
Contributor Author

ShreeshaM07 commented May 20, 2024

Though, how would we distinguish this from an iterable of length 3, i.e., the other convention on giving the boundaries? Let's say we have start = 0, end = 1, number of bins = 2. This could also be two bins, 0 to 1 and 1 to 2.

We can use isinstance(bins,tuple) to distinguish as it is a different data structure altogether.
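
As a rough illustration of that distinction, the sketch below turns either form into an array of bin edges. The helper name `resolve_bins` is hypothetical, and `np.linspace` is an assumed way of spacing the edges; it is consistent with the `(start, end, number of bins)` reading, since n bins need n + 1 edges.

```python
import numpy as np


def resolve_bins(bins):
    """Return bin edges as a float array (hypothetical helper).

    A tuple is read as (start, end, n_bins); any other iterable is read
    as the bin edges themselves.
    """
    if isinstance(bins, tuple):
        start, end, n_bins = bins
        # n_bins equal-width bins need n_bins + 1 edges
        return np.linspace(start, end, int(n_bins) + 1)
    return np.asarray(bins, dtype=float)


edges_from_tuple = resolve_bins((0, 1, 2))  # two bins between 0 and 1
edges_from_list = resolve_bins([0, 1, 2])   # two bins: 0 to 1 and 1 to 2
```

The cost of this design is that a user who passes edges as a tuple gets silently different behaviour, so the convention would need to be documented clearly.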

Also, any idea what I should do to make the CI tests pass?
If I run the code

x = np.array([-1, 0, 0.2, 0.4, 1.1, 1.8, 2, 2.2, 3.5, 5])
hist = Histogram(
    bins=[0, 1, 2, 3, 4],
    bin_mass=[0.1, 0.2, 0, 0.7],
    index=pd.Index(np.arange(3)),
    columns=pd.Index(np.arange(2)),
)
pdf = hist._pdf(x)
print(pdf)
cdf = hist._cdf(x)
print(cdf)
mean = hist._mean()
print(mean)
var = hist._var()
print(var)
p = np.array([-1, 0, 0.02, 0.04, 0.12, 0.26, 0.3, 0.3, 0.8, 1])
ppf = hist._ppf(p)
print(ppf)

without the _tags it works giving expected output

[0.  0.1 0.1 0.1 0.2 0.2 0.  0.  0.7 0. ]
[0.   0.   0.02 0.04 0.12 0.26 0.3  0.3  0.65 1.  ]
0.25
0.07249999999999998
[       nan 0.         0.2        0.4        1.1        1.8
 2.         2.         3.71428571 4.        ]

but if I run it with the _tags it gives error

Traceback (most recent call last):
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/histogram.py", line 202, in <module>
   hist = Histogram(
          ^^^^^^^^^^
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/histogram.py", line 42, in __init__
   super().__init__(index=index, columns=columns)
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/base/_base.py", line 61, in __init__
   self._init_shape_bc(index=index, columns=columns)
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/base/_base.py", line 81, in _init_shape_bc
   bc_params, shape, is_scalar = self._get_bc_params_dict(return_shape=True)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/base/_base.py", line 564, in _get_bc_params_dict
   bc = np.broadcast_arrays(*args_as_np)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/shreesha/anaconda3/envs/sktime-dev/lib/python3.11/site-packages/numpy/lib/stride_tricks.py", line 540, in broadcast_arrays
   shape = _broadcast_shape(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/shreesha/anaconda3/envs/sktime-dev/lib/python3.11/site-packages/numpy/lib/stride_tricks.py", line 422, in _broadcast_shape
   b = np.broadcast(*args[:32])
       ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (4,) and arg 1 with shape (5,).

Is it perhaps related to the size of bins being 1 more than size of bin_mass?
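
The same mismatch can be reproduced with plain numpy, outside skpro, which points in that direction: n bins need n + 1 edges, so the two parameters never share a shape. A minimal sketch, with array values copied from the example above and hypothetical variable names:

```python
import numpy as np

bin_mass = np.array([0.1, 0.2, 0, 0.7])  # shape (4,): one mass per bin
bins = np.array([0, 1, 2, 3, 4])         # shape (5,): one more edge than bins

try:
    np.broadcast_arrays(bin_mass, bins)
    message = None
except ValueError as err:
    # shapes (4,) and (5,) are incompatible under numpy broadcasting rules
    message = str(err)
```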

@fkiraly
Collaborator

fkiraly commented May 21, 2024

The CI results indicate that you ought to set the capabilities:exact tag - if you're not sure, just copy them from Normal.

@ShreeshaM07
Contributor Author

The CI results indicate that you ought to set the capabilities:exact tag - if you're not sure, just copy them from Normal.

I have done that, but when I enable broadcast_init: "on" and run check_estimator, it produces the same shape mismatch error, with a traceback identical to the one in my previous comment.

Instead if I disable it and then run check_estimator it gives

FAILED: test_log_pdf_and_pdf[Histogram-0]
FAILED: test_log_pdf_and_pdf[Histogram-1]
FAILED: test_methods_p[Histogram-0-ppf-0]
FAILED: test_methods_p[Histogram-0-ppf-1]
FAILED: test_methods_p[Histogram-1-ppf-0]
FAILED: test_methods_p[Histogram-1-ppf-1]
FAILED: test_methods_scalar[Histogram-0-mean-1]
FAILED: test_methods_scalar[Histogram-0-var-1]
FAILED: test_methods_scalar[Histogram-0-energy-0]
FAILED: test_methods_scalar[Histogram-0-energy-1]
FAILED: test_methods_scalar[Histogram-1-energy-0]
FAILED: test_methods_scalar[Histogram-1-energy-1]
FAILED: test_methods_x[Histogram-0-energy-0]
FAILED: test_methods_x[Histogram-0-energy-1]
FAILED: test_methods_x[Histogram-0-pdf-0]
FAILED: test_methods_x[Histogram-0-pdf-1]
FAILED: test_methods_x[Histogram-0-log_pdf-0]
FAILED: test_methods_x[Histogram-0-log_pdf-1]
FAILED: test_methods_x[Histogram-0-cdf-0]
FAILED: test_methods_x[Histogram-0-cdf-1]
FAILED: test_methods_x[Histogram-1-energy-0]
FAILED: test_methods_x[Histogram-1-energy-1]
FAILED: test_methods_x[Histogram-1-pdf-0]
FAILED: test_methods_x[Histogram-1-pdf-1]
FAILED: test_methods_x[Histogram-1-log_pdf-0]
FAILED: test_methods_x[Histogram-1-log_pdf-1]
FAILED: test_methods_x[Histogram-1-cdf-0]
FAILED: test_methods_x[Histogram-1-cdf-1]
FAILED: test_ppf_and_cdf[Histogram-0]
FAILED: test_ppf_and_cdf[Histogram-1]
FAILED: test_quantile[Histogram-0-0]
FAILED: test_quantile[Histogram-0-1]
FAILED: test_quantile[Histogram-1-0]
FAILED: test_quantile[Histogram-1-1]
FAILED: test_sample[Histogram-0-0]
FAILED: test_sample[Histogram-0-1]
FAILED: test_sample[Histogram-1-0]
FAILED: test_sample[Histogram-1-1]

From some debugging, I found that it stops working when d.sample() is called in all the failing tests (e.g. test_sample, test_log_pdf_and_pdf) in test_all_distrs.py.

Any suggestions?

@fkiraly
Collaborator

fkiraly commented May 22, 2024

This is probably because scalar histogram distributions have dimension 1 parameters - so the default broadcasting will not work.

We either need to broadcast manually, or separate bin start/step/end into separate parameters.

I would recommend going back to the drawing board quickly, and think about the parameterization in the case of array distributions.

The closest distribution is perhaps Empirical, but that one is a bit complex.

@ShreeshaM07
Contributor Author

ShreeshaM07 commented May 23, 2024

From what I have discovered so far on running the check_estimator(Histogram) the code stops working when d.sample() is called in the following methods

FAILED: test_log_pdf_and_pdf
FAILED: test_methods_scalar
FAILED: test_methods_p
FAILED: test_methods_x
FAILED: test_ppf_and_cdf
FAILED: test_quantile
FAILED: test_sample

upon changing the sample() method's definition in _base.py to

    def sample(self, n_samples=None):
        """Sample from the distribution.

        Parameters
        ----------
        n_samples : int, optional, default = None

        Returns
        -------
        if `n_samples` is `None`:
        returns a sample that contains a single sample from `self`,
        in `pd.DataFrame` mtype format convention, with `index` and `columns` as `self`
        if n_samples is `int`:
        returns a `pd.DataFrame` that contains `n_samples` i.i.d. samples from `self`,
        in `pd-multiindex` mtype format convention, with same `columns` as `self`,
        and `MultiIndex` that is product of `RangeIndex(n_samples)` and `self.index`
        """

        def gen_unif():
            np_unif = np.random.uniform(size=self.shape)
            if self.ndim > 0:
                return pd.DataFrame(np_unif, index=self.index, columns=self.columns)
            return np_unif

        # if ppf is implemented, we use inverse transform sampling
        if self._has_implementation_of("_ppf") or self._has_implementation_of("ppf"):
            if n_samples is None:
                # debug prints added for this investigation
                print(self)
                gen_u = gen_unif()
                print('gen_unif():', gen_u)
                print('self.ppf(gen_u):', self.ppf(gen_u))
                print()
                return self.ppf(gen_unif())
            # else, we generate n_samples i.i.d. samples
            pd_smpl = [self.ppf(gen_unif()) for _ in range(n_samples)]
            if self.ndim > 0:
                df_spl = pd.concat(pd_smpl, keys=range(n_samples))
            else:
                df_spl = pd.DataFrame(pd_smpl)
            return df_spl

        raise NotImplementedError(self._method_error_msg("sample", "error"))

it gives

Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.144533
2  0.672831
3  0.982381
4  0.305598
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.32752392371688177
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.070632
2  0.548879
3  0.493867
4  0.654325
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([4, 3, 1, 2], dtype='int64'))
gen_unif():           a
4  0.753837
3  0.056467
1  0.205197
2  0.449398
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.796151
2  0.442715
3  0.236078
4  0.358719
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([4, 1, 3, 2], dtype='int64'))
gen_unif():           a
4  0.227878
1  0.883603
3  0.342239
2  0.765612
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.739030
2  0.250487
3  0.293729
4  0.366346
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([4, 3, 1, 2], dtype='int64'))
gen_unif():           a
4  0.952818
3  0.899638
1  0.436710
2  0.589254
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.303695
2  0.844856
3  0.557284
4  0.221021
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([3, 1, 4, 2], dtype='int64'))
gen_unif():           a
3  0.309147
1  0.001904
4  0.453861
2  0.737935
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.09178005866116512
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.27242645031451473
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.605223999771608
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.6458844744373595
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.9866219935713674
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.8918519764423243
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.6018637248858842
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.690184451970329
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.917552
2  0.003464
3  0.575790
4  0.394365
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.7186864234668554
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.542048
2  0.321631
3  0.099083
4  0.305769
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([3, 1, 2, 4], dtype='int64'))
gen_unif():           a
3  0.174807
1  0.935915
2  0.176642
4  0.215476
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.44581638590690353
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.28130372318262364

From this we can see that nothing is returned or printed when self.ppf(gen_u) is called.
But if I replace self.ppf(gen_u) with self._ppf(gen_u) (i.e., the private ppf function of the Histogram class), it prints empty lists [ ]. That is better than before, since something is returned, but the result still should not be empty.

Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.802980
2  0.995196
3  0.031613
4  0.030497
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.9239385922214584
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.835061
2  0.387336
3  0.567604
4  0.065929
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([1, 4, 2, 3], dtype='int64'))
gen_unif():           a
1  0.513867
4  0.679504
2  0.565602
3  0.563634
self._ppf(gen_u): []

Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.980712
2  0.887288
3  0.627308
4  0.326347
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([3, 4, 1, 2], dtype='int64'))
gen_unif():           a
3  0.365653
4  0.317035
1  0.846843
2  0.631869
self._ppf(gen_u): []

Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.503843
2  0.170024
3  0.759543
4  0.453832
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([2, 4, 3, 1], dtype='int64'))
gen_unif():           a
2  0.442809
4  0.440365
3  0.146921
1  0.903347
self._ppf(gen_u): []

Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.476502
2  0.419956
3  0.227457
4  0.814254
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([2, 4, 1, 3], dtype='int64'))
gen_unif():           a
2  0.176053
4  0.832150
1  0.010264
3  0.462891
self._ppf(gen_u): []

Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.7366702569308315
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.06264155909374769
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.18038786389504946
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.9392302166128812
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.3397618213564747
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.5139319866966213
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.661692092938294
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.8788848884514379
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.422318
2  0.953866
3  0.260746
4  0.588994
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.5487312831264081
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.164937
2  0.933906
3  0.957186
4  0.499551
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([4, 3, 2, 1], dtype='int64'))
gen_unif():           a
4  0.755289
3  0.112055
2  0.202795
1  0.688959
self._ppf(gen_u): []

Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.9179033059913552
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.8715215723161311

If we can figure out why self.ppf(gen_unif()) is not returning anything, we will most likely have an approach for how to proceed with this issue.

Maybe we should try writing a new sample() definition for Histogram, as is done for Empirical.
Thoughts?

@fkiraly
Collaborator

fkiraly commented May 23, 2024

Yes, I think a custom sample might be a good way to go.

However, I also think _ppf and the other methods may have to be updated - in fact, the entire distribution does not define properly what it should be doing in the 2D array case. That's why I'd suggest to write a short design for the init params in that case.
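
One possible shape for such a custom sampler, for the single-histogram case only, is the two-step scheme described earlier in this thread: draw a bin with probabilities `bin_mass`, then draw uniformly within that bin. This is a sketch under those assumptions, not skpro's API; `sample_histogram` and its `rng` argument are hypothetical, and the 2D array case is left open.

```python
import numpy as np


def sample_histogram(bins, bin_mass, n_samples, rng=None):
    """Draw i.i.d. samples from a single histogram distribution (hypothetical helper)."""
    rng = np.random.default_rng(rng)
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)

    # step 1: choose a bin index for every sample, with probabilities bin_mass
    idx = rng.choice(len(bin_mass), size=n_samples, p=bin_mass)

    # step 2: sample uniformly between the lower and upper edge of the chosen bin
    return rng.uniform(bins[idx], bins[idx + 1])


samples = sample_histogram([0, 1, 2, 3, 4], [0.1, 0.2, 0, 0.7], n_samples=1000, rng=0)
```

Unlike inverse transform sampling, this does not call `ppf` at all, so it is unaffected by bins with zero mass.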

@ShreeshaM07
Contributor Author

Can you please suggest how I can identify whether it is a single array distribution?

  • case 1: By single array distribution I mean that the parameters passed look like this
bins = [0 , 1 ,2 ,3 , 4]
bin_mass = [0.1 , 0.2, 0.3, 0.4]

This is technically equivalent to scalar input in other distributions, like Normal(mu=5,sigma=4), where only one Histogram distribution is produced, so there is no broadcasting here anyway.

  • case 2: The 2D distribution, which I have implemented using lists that store numpy arrays, for example when the input is this
bins = [
    [ [0,1,2,3,4], [5,5.5,5.8,6.5,7,7.5] ],
    [ (2,12,5), [0,1,2,3,4] ],
    [ [1.5,2.5,3.1,4,5.4], [-4,-2,-1.5,5,10] ]
]
bin_mass = [
    [ [0.1,0.2,0.3,0.4], [0.25,0.1,0,0.4,0.25] ],
    [ [0.1,0.2,0.4,0.2,0.1], [0.4,0.3,0.2,0.1] ],
    [ [0.06,0.15,0.09,0.7], [0.4,0.15,0.325,0.125] ]
]
index = pd.Index(np.arange(3))
columns = pd.Index(np.arange(2))

it converts it in _get_bc_params_dict and stores it like

bins = [
    [ array([0,1,2,3,4]), array([5,5.5,5.8,6.5,7,7.5]) ],
    [ array([2,4,6,8,10,12]), array([0,1,2,3,4]) ],
    [ array([1.5,2.5,3.1,4,5.4]), array([-4,-2,-1.5,5,10]) ]
]
bin_mass = [
    [ array([0.1,0.2,0.3,0.4]), array([0.25,0.1,0,0.4,0.25]) ],
    [ array([0.1,0.2,0.4,0.2,0.1]), array([0.4,0.3,0.2,0.1]) ],
    [ array([0.06,0.15,0.09,0.7]), array([0.4,0.15,0.325,0.125]) ]
]
index = pd.Index(np.arange(3))
columns = pd.Index(np.arange(2))

Currently I have hardcoded the inputs to always be 2D lists containing numpy arrays. This works, but I cannot use numpy.ndim to find the dimensions, as the data is stored as a list and not as numpy arrays, and I cannot convert it to numpy.array because it contains inhomogeneous parts.

So can you please suggest how I can efficiently check whether it is a single array distribution (case 1) or a multi-array distribution (case 2) with 2D inputs?
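
One way this check could be done without converting the ragged input to a numpy array is to inspect only the top-level elements: in case 1 they are all numbers, in case 2 they are rows of lists. The sketch below assumes exactly the two input layouts described above plus the `(start, end, n_bins)` tuple; the function name `is_single_histogram` is hypothetical.

```python
from numbers import Real

import numpy as np


def is_single_histogram(bins):
    """Return True if bins describes one histogram, False for a 2D layout (hypothetical helper)."""
    # a top-level tuple is read as (start, end, n_bins) for a single histogram
    if isinstance(bins, tuple):
        return True
    # case 1: a flat sequence of numeric bin edges; case 2: a sequence of rows
    return all(isinstance(b, Real) for b in bins)


case1 = [0, 1, 2, 3, 4]
case2 = [
    [[0, 1, 2, 3, 4], [5, 5.5, 5.8, 6.5, 7, 7.5]],
    [(2, 12, 5), [0, 1, 2, 3, 4]],
    [[1.5, 2.5, 3.1, 4, 5.4], [-4, -2, -1.5, 5, 10]],
]
single_flags = (is_single_histogram(case1), is_single_histogram(case2))
```

Because only `isinstance` is used, the inhomogeneous inner lengths in case 2 never reach numpy and cannot trigger a conversion error.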

@ShreeshaM07
Contributor Author

Thanks for quickly spotting the ppf error. It was indeed a missing addition of bins[0] in one of the if conditions; now it works for all values.
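
For illustration, a vectorized cdf and ppf that keep the lower-edge offset could be sketched as below. This is not the PR's code: `hist_cdf` and `hist_ppf` are hypothetical names, the single-histogram case is assumed, and returning nan for p outside [0, 1] mirrors the output shown earlier in this thread.

```python
import numpy as np


def hist_cdf(x, bins, bin_mass):
    """Piecewise-linear cdf: interpolate the cumulative mass between bin edges."""
    bins = np.asarray(bins, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(bin_mass)))
    # np.interp clamps to cum[0] below the first edge and cum[-1] above the last
    return np.interp(x, bins, cum)


def hist_ppf(p, bins, bin_mass):
    """Inverse cdf: locate the bin by cumulative mass, then interpolate inside it."""
    p = np.asarray(p, dtype=float)
    bins = np.asarray(bins, dtype=float)
    mass = np.asarray(bin_mass, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(mass)))

    # first bin whose cumulative mass at the upper edge reaches p
    idx = np.clip(np.searchsorted(cum, p, side="left") - 1, 0, len(mass) - 1)

    # fraction of the bin to traverse; guarded against division by a zero mass
    frac = np.divide(p - cum[idx], mass[idx], out=np.zeros_like(p), where=mass[idx] > 0)

    # lower edge of the bin plus the offset inside it
    x = bins[idx] + frac * (bins[idx + 1] - bins[idx])
    return np.where((p < 0) | (p > 1), np.nan, x)


p = np.array([-1, 0, 0.02, 0.04, 0.12, 0.26, 0.3, 0.3, 0.8, 1])
ppf = hist_ppf(p, bins=[0, 1, 2, 3, 4], bin_mass=[0.1, 0.2, 0, 0.7])
```

The term `bins[idx]` is the lower edge of the selected bin; leaving it out shifts every quantile to the left by that edge.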

@ShreeshaM07
Contributor Author

@fkiraly, in test_all_distr, specifically in test_methods_ppf, it was taking the object_instance from before shuffling when calling getattr, which is incorrect when shuffled is True. I have rectified that now. I am surprised it did not get flagged with other distributions.

@ShreeshaM07 ShreeshaM07 marked this pull request as ready for review June 10, 2024 20:41
@ShreeshaM07
Contributor Author

ShreeshaM07 commented Jun 11, 2024

All the checks have passed except Test / run-tests-all-extras (3.11, macOS-13) (pull_request). It seems to be a hashing error while installing some packages; I'm not sure it is even related to my code. Could you please take a look at it? Apart from that, the Histogram distribution is complete.

I will make another PR for distributions.rst and __init__.py, as they have changed in skpro since I started working on my branch and cause merge conflicts when I merge. It will be better to merge this PR first, then pull the updated main, and then make another PR to include it in readthedocs.

@ShreeshaM07 ShreeshaM07 deleted the dev2 branch June 11, 2024 17:04
@ShreeshaM07 ShreeshaM07 mentioned this pull request Jun 11, 2024
5 tasks
@ShreeshaM07
Contributor Author

@fkiraly There were some merge conflicts in __init__.py in distributions and in distributions.rst that were causing some trouble, so I decided to make a new PR #382 instead, so there are no conflicts.

fkiraly pushed a commit that referenced this pull request Jun 22, 2024
fixes #323 

PR is a new one with updated main merged with #335 as there were some
merge conflicts in `__init__.py` in distributions and also
`distributions.rst` in `api_reference`.


#### What does this implement/fix? Explain your changes.
Implements the histogram distribution using bins and bin_mass as the
parameters.
Labels
enhancement implementing algorithms Implementing algorithms, estimators, objects native to skpro module:probability&simulation probability distributions and simulators
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[ENH] histogram distribution
2 participants