[ENH] Histogram distribution #335

ShreeshaM07 · 2024-05-16T19:00:35Z

Reference Issues/PRs

fixes #323

What does this implement/fix? Explain your changes.

Implements the histogram distribution using bin_width and bin_density as the parameters.

Does your contribution introduce a new dependency? If yes, which one?

No

What should a reviewer concentrate their feedback on?

Whether the chosen parameters are suitable to be used.

Did you add any tests for the change?

No

PR checklist

For all contributions

I've added myself to the list of contributors with any new badges I've earned :-)
How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR). maintenance - CI, test framework, release.
See here for full badge reference
The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

For new estimators

I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
dependency isolation, see the estimator dependencies guide.

ShreeshaM07 · 2024-05-16T19:09:43Z

I have implemented only minimal code so far. Would bin_width and bin_density be appropriate parameters?

Extend the range of the input array by 0.1% * (max(x) - min(x)) so that both end points fall inside the histogram.

bin_width

If it is an int, divide the range of x into that many equal bins and store the bin boundary values in the array self.bins.
If it is a list, divide x into bins of the respective widths and store the boundary points in the array self.bins.

bin_density

It is essentially the pdf at a point x and lies in the interval [0, 1].
To find the cdf, we need to multiply it by the bin width and sum up to x.

Does this idea seem fine?

fkiraly · 2024-05-16T22:46:08Z

Hm, I do not quite get the parametrization here.

We will have to say - let's say, in the scalar case only - where the bins start/end, and how much mass is in them. How is this achieved given the parameters?

I'm not sure I understand, explanation would be appreciated!

An example might help - let's say I have two bins, one from 0.5 to 2, and one from 2 to 7. The first bin has mass 0.3, the second 0.7. How would I construct the distribution?

ShreeshaM07 · 2024-05-17T07:33:35Z

An example might help - let's say I have two bins, one from 0.5 to 2, and one from 2 to 7. The first bin has mass 0.3, the second 0.7. How would I construct the distribution?

So in this example, bin_width would be passed as [1.5, 5] and bin_density as [0.3, 0.7], with the x values passed as a 1D array or DataFrame. To find the values that split the axis into bins, I would use min(x) and max(x).

But I think this approach has a flaw. It would be better to use the bin edges directly, i.e., bins = [0.5, 2, 7] and bin_density = [0.3, 0.7], and then apply these to find the pdf at all the passed x values. I think that would make more sense.

I will work on this tomorrow. I am not working today, so I will not attend the stand-up; please excuse me.

fkiraly · 2024-05-17T10:28:17Z

But I think there will be a flaw in this way of doing it.

Agreed: this way, you do not know where the first bin starts, i.e., at 0.5.

I also think your new suggestion is better!

ShreeshaM07 · 2024-05-17T12:00:01Z

I have done it that way, but I am still not sure whether we can handle a single int input for bins.
The code so far gives the expected output when running

x=np.array([1,0.75,1.8,2.5,3,5,6,6.5])
hist = Histogram(bins=[0.5,2,7],bin_density=[0.3,0.7],index=pd.Index(np.arange(3)),columns=pd.Index(np.arange(2)))
pdf = hist._pdf(x)
print(pdf)

Output:

[0.2  0.2  0.2  0.14 0.14 0.14 0.14 0.14]

Does this parameterization make sense?

fkiraly · 2024-05-17T22:06:34Z

Yes, I think it does make sense!

I would rename the parameter bin_density to bin_mass though - as a user would understand the value "density" to be the function value in the interval.

fkiraly · 2024-05-17T22:10:19Z

skpro/distributions/histogram.py

+        bin_density = np.array(self.bin_density.copy())
+        bins = self.bins
+        pdf = []
+        if isinstance(bins, list):


Main comment: this looks correct, but it is quite inefficient due to the use of loops.

I would strongly advise using numpy methods for everything.

For instance, for bin widths, use diff.

To find the bin in which the x-value falls, you could use cumsum and np.where with >.
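A loop-free version of this advice might look like the sketch below. It is not the code from this PR: `histogram_pdf` is a hypothetical standalone function, and it uses `np.searchsorted` for the bin lookup in place of the `cumsum` and `np.where` combination mentioned above. It assumes `bins` holds sorted bin edges and that each bin is closed on the left and open on the right.

```python
import numpy as np


def histogram_pdf(x, bins, bin_mass):
    """Evaluate a histogram density at x without Python loops (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)

    # height of each bin = mass / width, with widths from np.diff
    heights = bin_mass / np.diff(bins)

    # index i of the bin [bins[i], bins[i + 1]) that contains each x
    idx = np.searchsorted(bins, x, side="right") - 1

    # points outside [bins[0], bins[-1]) get density 0
    inside = (idx >= 0) & (idx < len(heights))
    safe_idx = np.clip(idx, 0, len(heights) - 1)
    return np.where(inside, heights[safe_idx], 0.0)


x = np.array([1, 0.75, 1.8, 2.5, 3, 5, 6, 6.5])
pdf = histogram_pdf(x, bins=[0.5, 2, 7], bin_mass=[0.3, 0.7])
print(pdf)
```

On the example from earlier in the thread, the values are approximately 0.2 in the first bin and 0.14 in the second, i.e., mass divided by bin width.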

fkiraly · 2024-05-20T01:01:44Z

skpro/distributions/histogram.py

+        """
+        bins = self.bins
+        # 1 is the cumulative sum of all bin_mass
+        return 1 / (max(bins) - min(bins))


that's not correct? Also, you need to be careful about the different cases of bins being int or iterable.

I will take care of the bins cases. But in the case where bins holds the bin edges, shouldn't the mean be sum(bin_width * bin_height) / sum(bin_width)? The numerator is the area under the histogram, which is 1, and the sum of bin_width is the range of the bins values, i.e., max(bins) - min(bins). Is that incorrect?

Could you please check whether what I have used for the mean and var is correct, or do I have to use $E[X] = \mu = \int_{-\infty}^{\infty} x \cdot \mathrm{pdf}(x)\,dx$ across the different pdfs of the different bins?

I think your formula is simply incorrect.
The correct formula for mean is:

Let $b_0, \dots, b_n$ be the bin boundaries, and $m_i, i = 1,\dots, n$ the mass in the bin $[b_{i-1}, b_i]$.

The mean of the histogram distribution is then
$$\mu =\frac{1}{2} \sum_{i=1}^n (b_i + b_{i-1})\cdot m_i$$
which you can obtain by applying np.dot and a shifted sum.

(this is obtained if you substitute pdf(x) into your formula and carry out the integration correctly)

for "easy" computation of the mean and variance, you can use that the histogram distribution is the same as the two-step conditional where you first sample which bin you are in, with probabilities $m_i$, and then from the uniform within the bin.

Use the conditional formulae for mean and variance on this idea - this also shows why the mean has the above form, as the weighted mean of means of uniform distributions on the bin intervals.
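The two-step view translates into numpy fairly directly. The sketch below is an illustration rather than the PR's implementation: `histogram_mean_var` is a hypothetical helper that applies the law of total variance, using the facts that a uniform distribution on $[b_{i-1}, b_i]$ has mean $(b_{i-1} + b_i)/2$ and variance $(b_i - b_{i-1})^2/12$.

```python
import numpy as np


def histogram_mean_var(bins, bin_mass):
    """Mean and variance of a histogram distribution (hypothetical helper)."""
    bins = np.asarray(bins, dtype=float)
    m = np.asarray(bin_mass, dtype=float)

    mids = 0.5 * (bins[1:] + bins[:-1])  # mean of the uniform on each bin
    widths = np.diff(bins)

    # weighted mean of the per-bin uniform means
    mean = np.dot(mids, m)

    # law of total variance: E[Var(X | bin)] + Var(E[X | bin])
    within = np.dot(widths**2 / 12.0, m)
    between = np.dot(mids**2, m) - mean**2
    return mean, within + between


# two bins, 0.5 to 2 with mass 0.3 and 2 to 7 with mass 0.7
mean, var = histogram_mean_var([0.5, 2, 7], [0.3, 0.7])
print(mean, var)
```

For this two-bin example the mean works out to 0.5 * (2.5 * 0.3 + 9 * 0.7) = 3.525, matching the formula for $\mu$ above.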

Oh yes, that is correct. I made a mistake and will correct it now. Thanks for the help.

ShreeshaM07 · 2024-05-20T08:18:35Z

A suggestion for another way to input bins: a tuple (float, float, int) representing (bins start, bins end, number of bins). Would this be a better idea than only a single integer, since with a single integer we wouldn't be aware of the start and end points of the bins?

fkiraly · 2024-05-20T15:53:11Z

Will this be a better idea as compared to only a single integer as then we wouldn't be aware of the start and end points of the bins.

Yes, I think that would be better than just int.

Though, how would we distinguish this from an iterable of length 3, i.e., the other convention on giving the boundaries? Let's say we have start = 0, end = 1, number of bins = 2. This could also be two bins, 0 to 1 and 1 to 2.

ShreeshaM07 · 2024-05-20T17:29:21Z

Though, how would we distinguish this from an iterable of length 3, i.e., the other convention on giving the boundaries? Let's say we have start = 0, end = 1, number of bins = 2. This could also be two bins, 0 to 1 and 1 to 2.

We can use isinstance(bins, tuple) to distinguish them, as a tuple is a different data structure altogether.
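As a rough sketch of that convention, the helper below turns either form into an array of bin edges. `bins_to_edges` is a hypothetical name, not a function from skpro, and the use of `np.linspace` is an assumption about how the tuple form would be expanded.

```python
import numpy as np


def bins_to_edges(bins):
    """Return bin edges as a 1D float array (hypothetical helper)."""
    if isinstance(bins, tuple):
        # tuple convention: (bins start, bins end, number of bins)
        start, stop, n_bins = bins
        # n_bins bins need n_bins + 1 equally spaced edges
        return np.linspace(start, stop, int(n_bins) + 1)
    # any other iterable is taken to be the bin edges themselves
    return np.asarray(bins, dtype=float)


print(bins_to_edges((0, 1, 2)))  # two bins between 0 and 1
print(bins_to_edges([0, 1, 2]))  # two bins, 0 to 1 and 1 to 2
```

One consequence of this design is that a user who passes explicit edges as a tuple of length 3 would be silently interpreted under the (start, end, number of bins) convention, so the distinction would need to be documented clearly.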

Also, any idea what I should do to make the CI tests pass?
If I run the code

x = np.array([-1, 0, 0.2, 0.4, 1.1, 1.8, 2, 2.2, 3.5, 5])
hist = Histogram(
    bins=[0, 1, 2, 3, 4],
    bin_mass=[0.1, 0.2, 0, 0.7],
    index=pd.Index(np.arange(3)),
    columns=pd.Index(np.arange(2)),
)
pdf = hist._pdf(x)
print(pdf)
cdf = hist._cdf(x)
print(cdf)
mean = hist._mean()
print(mean)
var = hist._var()
print(var)
p = np.array([-1, 0, 0.02, 0.04, 0.12, 0.26, 0.3, 0.3, 0.8, 1])
ppf = hist._ppf(p)
print(ppf)

without the _tags, it works and gives the expected output

[0.  0.1 0.1 0.1 0.2 0.2 0.  0.  0.7 0. ]
[0.   0.   0.02 0.04 0.12 0.26 0.3  0.3  0.65 1.  ]
0.25
0.07249999999999998
[       nan 0.         0.2        0.4        1.1        1.8
 2.         2.         3.71428571 4.        ]

but if I run it with the _tags, it gives an error

Traceback (most recent call last):
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/histogram.py", line 202, in <module>
   hist = Histogram(
          ^^^^^^^^^^
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/histogram.py", line 42, in __init__
   super().__init__(index=index, columns=columns)
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/base/_base.py", line 61, in __init__
   self._init_shape_bc(index=index, columns=columns)
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/base/_base.py", line 81, in _init_shape_bc
   bc_params, shape, is_scalar = self._get_bc_params_dict(return_shape=True)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/base/_base.py", line 564, in _get_bc_params_dict
   bc = np.broadcast_arrays(*args_as_np)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/shreesha/anaconda3/envs/sktime-dev/lib/python3.11/site-packages/numpy/lib/stride_tricks.py", line 540, in broadcast_arrays
   shape = _broadcast_shape(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/shreesha/anaconda3/envs/sktime-dev/lib/python3.11/site-packages/numpy/lib/stride_tricks.py", line 422, in _broadcast_shape
   b = np.broadcast(*args[:32])
       ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (4,) and arg 1 with shape (5,).

Is it perhaps related to the size of bins being 1 more than size of bin_mass?
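The error message can be reproduced outside skpro with plain numpy, which suggests the answer is yes. The snippet below is a minimal sketch: it only calls `np.broadcast_arrays`, the function named in the traceback, on two arrays with the shapes from the error message, and does not reproduce the skpro internals.

```python
import numpy as np

bin_mass = np.array([0.1, 0.2, 0.0, 0.7])  # shape (4,)
bins = np.array([0, 1, 2, 3, 4])           # shape (5,)

try:
    np.broadcast_arrays(bin_mass, bins)
    broadcast_ok = True
except ValueError as err:
    # shapes (4,) and (5,) are not broadcast-compatible
    broadcast_ok = False
    print(err)

# two parameters of equal length broadcast without error
left, right = np.broadcast_arrays(bin_mass, bins[:-1])
print(left.shape, right.shape)
```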

fkiraly · 2024-05-21T12:29:20Z

The CI results indicate that you ought to set the capabilities:exact tag - if you're not sure, just copy them from Normal.

ShreeshaM07 · 2024-05-21T20:43:32Z

The CI results indicate that you ought to set the capabilities:exact tag - if you're not sure, just copy them from Normal.

I have done that, but when I enable broadcast_init: "on" and run check_estimator, it starts producing the shape mismatch error.

Traceback (most recent call last):
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/histogram.py", line 202, in <module>
   hist = Histogram(
          ^^^^^^^^^^
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/histogram.py", line 42, in __init__
   super().__init__(index=index, columns=columns)
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/base/_base.py", line 61, in __init__
   self._init_shape_bc(index=index, columns=columns)
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/base/_base.py", line 81, in _init_shape_bc
   bc_params, shape, is_scalar = self._get_bc_params_dict(return_shape=True)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/shreesha/sktime_shreesha/skpro/skpro/distributions/base/_base.py", line 564, in _get_bc_params_dict
   bc = np.broadcast_arrays(*args_as_np)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/shreesha/anaconda3/envs/sktime-dev/lib/python3.11/site-packages/numpy/lib/stride_tricks.py", line 540, in broadcast_arrays
   shape = _broadcast_shape(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/shreesha/anaconda3/envs/sktime-dev/lib/python3.11/site-packages/numpy/lib/stride_tricks.py", line 422, in _broadcast_shape
   b = np.broadcast(*args[:32])
       ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (4,) and arg 1 with shape (5,).

If I instead disable it and then run check_estimator, it gives

FAILED: test_log_pdf_and_pdf[Histogram-0]
FAILED: test_log_pdf_and_pdf[Histogram-1]
FAILED: test_methods_p[Histogram-0-ppf-0]
FAILED: test_methods_p[Histogram-0-ppf-1]
FAILED: test_methods_p[Histogram-1-ppf-0]
FAILED: test_methods_p[Histogram-1-ppf-1]
FAILED: test_methods_scalar[Histogram-0-mean-1]
FAILED: test_methods_scalar[Histogram-0-var-1]
FAILED: test_methods_scalar[Histogram-0-energy-0]
FAILED: test_methods_scalar[Histogram-0-energy-1]
FAILED: test_methods_scalar[Histogram-1-energy-0]
FAILED: test_methods_scalar[Histogram-1-energy-1]
FAILED: test_methods_x[Histogram-0-energy-0]
FAILED: test_methods_x[Histogram-0-energy-1]
FAILED: test_methods_x[Histogram-0-pdf-0]
FAILED: test_methods_x[Histogram-0-pdf-1]
FAILED: test_methods_x[Histogram-0-log_pdf-0]
FAILED: test_methods_x[Histogram-0-log_pdf-1]
FAILED: test_methods_x[Histogram-0-cdf-0]
FAILED: test_methods_x[Histogram-0-cdf-1]
FAILED: test_methods_x[Histogram-1-energy-0]
FAILED: test_methods_x[Histogram-1-energy-1]
FAILED: test_methods_x[Histogram-1-pdf-0]
FAILED: test_methods_x[Histogram-1-pdf-1]
FAILED: test_methods_x[Histogram-1-log_pdf-0]
FAILED: test_methods_x[Histogram-1-log_pdf-1]
FAILED: test_methods_x[Histogram-1-cdf-0]
FAILED: test_methods_x[Histogram-1-cdf-1]
FAILED: test_ppf_and_cdf[Histogram-0]
FAILED: test_ppf_and_cdf[Histogram-1]
FAILED: test_quantile[Histogram-0-0]
FAILED: test_quantile[Histogram-0-1]
FAILED: test_quantile[Histogram-1-0]
FAILED: test_quantile[Histogram-1-1]
FAILED: test_sample[Histogram-0-0]
FAILED: test_sample[Histogram-0-1]
FAILED: test_sample[Histogram-1-0]
FAILED: test_sample[Histogram-1-1]

From some debugging I have done, it stops working when d.sample() is called in all the failing methods (e.g., test_sample, test_log_pdf_and_pdf) in test_all_distrs.py.

Any suggestions?

fkiraly · 2024-05-22T00:23:31Z

This is probably because scalar histogram distributions have dimension 1 parameters - so the default broadcasting will not work.

We either need to broadcast manually, or separate bin start/step/end into separate parameters.

I would recommend going back to the drawing board quickly, and think about the parameterization in the case of array distributions.

The closest distribution is perhaps Empirical, but that one is a bit complex.

ShreeshaM07 · 2024-05-23T09:18:42Z

From what I have discovered so far, on running check_estimator(Histogram) the code stops working when d.sample() is called in the following tests

FAILED: test_log_pdf_and_pdf
FAILED: test_methods_scalar
FAILED: test_methods_p
FAILED: test_methods_x
FAILED: test_ppf_and_cdf
FAILED: test_quantile
FAILED: test_sample

Upon changing the sample() method's definition in _base.py to

    def sample(self, n_samples=None):
        """Sample from the distribution.

        Parameters
        ----------
        n_samples : int, optional, default = None

        Returns
        -------
        if `n_samples` is `None`:
        returns a sample that contains a single sample from `self`,
        in `pd.DataFrame` mtype format convention, with `index` and `columns` as `self`
        if n_samples is `int`:
        returns a `pd.DataFrame` that contains `n_samples` i.i.d. samples from `self`,
        in `pd-multiindex` mtype format convention, with same `columns` as `self`,
        and `MultiIndex` that is product of `RangeIndex(n_samples)` and `self.index`
        """

        def gen_unif():
            np_unif = np.random.uniform(size=self.shape)
            if self.ndim > 0:
                return pd.DataFrame(np_unif, index=self.index, columns=self.columns)
            return np_unif

        # if ppf is implemented, we use inverse transform sampling
        if self._has_implementation_of("_ppf") or self._has_implementation_of("ppf"):
            if n_samples is None:
->              print(self)
->              gen_u = gen_unif()
->              print('gen_unif():',gen_u)
->              print('self.ppf(gen_u):',self.ppf(gen_u))
->              print()
                return self.ppf(gen_unif())
            # else, we generate n_samples i.i.d. samples
            pd_smpl = [self.ppf(gen_unif()) for _ in range(n_samples)]
            if self.ndim > 0:
                df_spl = pd.concat(pd_smpl, keys=range(n_samples))
            else:
                df_spl = pd.DataFrame(pd_smpl)
            return df_spl

        raise NotImplementedError(self._method_error_msg("sample", "error"))

it gives

Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.144533
2  0.672831
3  0.982381
4  0.305598
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.32752392371688177
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.070632
2  0.548879
3  0.493867
4  0.654325
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([4, 3, 1, 2], dtype='int64'))
gen_unif():           a
4  0.753837
3  0.056467
1  0.205197
2  0.449398
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.796151
2  0.442715
3  0.236078
4  0.358719
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([4, 1, 3, 2], dtype='int64'))
gen_unif():           a
4  0.227878
1  0.883603
3  0.342239
2  0.765612
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.739030
2  0.250487
3  0.293729
4  0.366346
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([4, 3, 1, 2], dtype='int64'))
gen_unif():           a
4  0.952818
3  0.899638
1  0.436710
2  0.589254
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.303695
2  0.844856
3  0.557284
4  0.221021
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([3, 1, 4, 2], dtype='int64'))
gen_unif():           a
3  0.309147
1  0.001904
4  0.453861
2  0.737935
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.09178005866116512
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.27242645031451473
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.605223999771608
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.6458844744373595
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.9866219935713674
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.8918519764423243
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.6018637248858842
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.690184451970329
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.917552
2  0.003464
3  0.575790
4  0.394365
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.7186864234668554
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.542048
2  0.321631
3  0.099083
4  0.305769
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([3, 1, 2, 4], dtype='int64'))
gen_unif():           a
3  0.174807
1  0.935915
2  0.176642
4  0.215476
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.44581638590690353
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.28130372318262364

Observing this, we can see that nothing is returned or printed when self.ppf(gen_u) is called.
But if I replace self.ppf(gen_u) with self._ppf(gen_u) (i.e., the private ppf function of the Histogram class), it prints empty lists [ ]. That is better than before, as it returns something, but the result still should not be empty.

Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.802980
2  0.995196
3  0.031613
4  0.030497
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.9239385922214584
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.835061
2  0.387336
3  0.567604
4  0.065929
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([1, 4, 2, 3], dtype='int64'))
gen_unif():           a
1  0.513867
4  0.679504
2  0.565602
3  0.563634
self._ppf(gen_u): []

Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.980712
2  0.887288
3  0.627308
4  0.326347
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([3, 4, 1, 2], dtype='int64'))
gen_unif():           a
3  0.365653
4  0.317035
1  0.846843
2  0.631869
self._ppf(gen_u): []

Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.503843
2  0.170024
3  0.759543
4  0.453832
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([2, 4, 3, 1], dtype='int64'))
gen_unif():           a
2  0.442809
4  0.440365
3  0.146921
1  0.903347
self._ppf(gen_u): []

Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.476502
2  0.419956
3  0.227457
4  0.814254
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([2, 4, 1, 3], dtype='int64'))
gen_unif():           a
2  0.176053
4  0.832150
1  0.010264
3  0.462891
self._ppf(gen_u): []

Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.7366702569308315
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.06264155909374769
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.18038786389504946
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.9392302166128812
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.3397618213564747
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.5139319866966213
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.661692092938294
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.8788848884514379
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.422318
2  0.953866
3  0.260746
4  0.588994
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.5487312831264081
Histogram(bin_mass=[0.1, 0.2, 0.3, 0.4], bins=[0, 1, 2, 3, 4],
          columns=Index(['a'], dtype='object'),
          index=Index([1, 2, 3, 4], dtype='int64'))
gen_unif():           a
1  0.164937
2  0.933906
3  0.957186
4  0.499551
Histogram(bin_mass=array([0.1, 0.2, 0.3, 0.4]),
          bins=array([0., 1., 2., 3., 4.]),
          columns=Index(['a'], dtype='object'),
          index=Index([4, 3, 2, 1], dtype='int64'))
gen_unif():           a
4  0.755289
3  0.112055
2  0.202795
1  0.688959
self._ppf(gen_u): []

Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.9179033059913552
Histogram(bin_mass=[0.1, 0.2, 0, 0.7], bins=(0, 4, 4))
gen_unif(): 0.8715215723161311

If we can figure out why self.ppf(gen_unif()) is not returning anything, we will most likely have an approach for how to proceed with this issue.

Maybe we should try writing a new sample() definition for Histogram, as is done for Empirical.
Thoughts?

fkiraly · 2024-05-23T22:38:23Z

Yes, I think a custom sample might be a good way to go.

However, I also think _ppf and the other methods may have to be updated - in fact, the entire distribution does not define properly what it should be doing in the 2D array case. That's why I'd suggest to write a short design for the init params in that case.
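For the scalar case, a custom sampler could follow the two-step description given earlier in the review: first pick a bin according to its mass, then draw uniformly within that bin. The sketch below is only an illustration under that assumption. `sample_histogram` is a hypothetical standalone function, it ignores `index` and `columns`, and it is not the `sample` method of skpro's base class.

```python
import numpy as np


def sample_histogram(bins, bin_mass, n_samples, rng=None):
    """Draw i.i.d. samples from one histogram distribution (hypothetical helper)."""
    rng = np.random.default_rng(rng)
    bins = np.asarray(bins, dtype=float)
    m = np.asarray(bin_mass, dtype=float)

    # step 1: pick a bin index with probability equal to its (normalized) mass
    idx = rng.choice(len(m), size=n_samples, p=m / m.sum())

    # step 2: draw uniformly between the edges of the chosen bin
    return rng.uniform(bins[idx], bins[idx + 1])


draws = sample_histogram([0, 1, 2, 3, 4], [0.1, 0.2, 0.0, 0.7], 10_000, rng=0)
print(draws.mean())
```

Because bins with zero mass are never selected in step 1, this approach avoids any division by a zero bin mass, which inverse transform sampling through a ppf has to guard against.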

ShreeshaM07 · 2024-06-06T14:36:13Z

Can you please suggest how I can identify whether it is a single array distribution?

Case 1: By single array distribution, I mean the case where the parameters passed look like this

bins = [0 , 1 ,2 ,3 , 4]
bin_mass = [0.1 , 0.2, 0.3, 0.4]

This is technically equivalent to scalar input in other distributions, such as Normal(mu=5, sigma=4), where only one Histogram distribution is produced, so there is no broadcasting here anyway.

Case 2: The 2D distribution, which I have implemented using lists that store numpy array distributions, for example when the input is this

bins = [
    [ [0,1,2,3,4], [5,5.5,5.8,6.5,7,7.5] ],
    [ (2,12,5), [0,1,2,3,4] ],
    [ [1.5,2.5,3.1,4,5.4], [-4,-2,-1.5,5,10] ]
]
bin_mass = [
    [ [0.1,0.2,0.3,0.4], [0.25,0.1,0,0.4,0.25] ],
    [ [0.1,0.2,0.4,0.2,0.1], [0.4,0.3,0.2,0.1] ],
    [ [0.06,0.15,0.09,0.7], [0.4,0.15,0.325,0.125] ]
]
index = pd.Index(np.arange(3))
columns = pd.Index(np.arange(2))

it converts it in _get_bc_params_dict and stores it like

bins = [
    [ array([0,1,2,3,4]), array([5,5.5,5.8,6.5,7,7.5]) ],
    [ array([2,4,6,8,10,12]), array([0,1,2,3,4]) ],
    [ array([1.5,2.5,3.1,4,5.4]), array([-4,-2,-1.5,5,10]) ]
]
bin_mass = [
    [ array([0.1,0.2,0.3,0.4]), array([0.25,0.1,0,0.4,0.25]) ],
    [ array([0.1,0.2,0.4,0.2,0.1]), array([0.4,0.3,0.2,0.1]) ],
    [ array([0.06,0.15,0.09,0.7]), array([0.4,0.15,0.325,0.125]) ]
]
index = pd.Index(np.arange(3))
columns = pd.Index(np.arange(2))

Currently I have hardcoded the inputs to always be 2D lists containing numpy arrays. This works, but I can't use numpy.ndim to find the dimensions, since the data is stored as a list rather than a numpy array, and I cannot convert it with numpy.array because it contains inhomogeneous parts.

So can you please suggest how I can efficiently check whether it is a single array distribution (case 1) or a multi-array distribution (case 2) with 2D inputs?
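One possible way to tell the two cases apart, sketched here as an assumption rather than as what the PR ended up doing, is to look at the element types: a single histogram is a flat sequence of numbers, while the nested case holds sequences. Ragged cells can also be stored in a 2D numpy array of dtype `object`, which makes `.shape` and `.ndim` available again. Both `is_single_histogram` and `to_object_grid` are hypothetical helper names.

```python
import numbers

import numpy as np


def is_single_histogram(bins):
    """True for case 1 (one histogram), False for nested 2D input (case 2)."""
    # a flat list of edges or a (start, end, number of bins) tuple holds only numbers
    return all(isinstance(b, numbers.Real) for b in bins)


def to_object_grid(nested):
    """Store ragged per-cell arrays in a 2D object array.

    Assumes every cell already holds explicit values; tuple-style bins
    would need to be expanded to edges before calling this.
    """
    n_rows, n_cols = len(nested), len(nested[0])
    grid = np.empty((n_rows, n_cols), dtype=object)
    for i in range(n_rows):
        for j in range(n_cols):
            grid[i, j] = np.asarray(nested[i][j], dtype=float)
    return grid


single = [0, 1, 2, 3, 4]
nested = [
    [[0, 1, 2, 3, 4], [5, 5.5, 5.8, 6.5, 7, 7.5]],
    [[2, 4, 6, 8, 10, 12], [0, 1, 2, 3, 4]],
]
grid = to_object_grid(nested)
print(is_single_histogram(single), is_single_histogram(nested), grid.shape)
```

The loops here run once per cell at construction time only; the per-cell arrays remain ordinary numpy arrays, so the numerical work inside each cell can stay vectorized.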

ShreeshaM07 · 2024-06-07T11:23:12Z

Thanks for quickly spotting the ppf error. It was indeed a missing addition of bins[0] in one of the if conditions; now it works perfectly for all values.
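For illustration, a vectorized ppf that always offsets by the left edge of the selected bin, including bins[0] for the first bin, could be sketched as follows. This is not the code from the PR: `histogram_ppf` is a hypothetical standalone function for a single histogram, and returning NaN for p outside [0, 1] is an assumption based on the output shown earlier in the thread.

```python
import numpy as np


def histogram_ppf(p, bins, bin_mass):
    """Quantile function of one histogram distribution (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    bins = np.asarray(bins, dtype=float)
    m = np.asarray(bin_mass, dtype=float)

    # cdf value at every bin edge: 0 at bins[0], approximately 1 at bins[-1]
    cum = np.concatenate([[0.0], np.cumsum(m)])

    # bin i whose cdf range (cum[i], cum[i + 1]] contains p
    idx = np.clip(np.searchsorted(cum, p, side="left") - 1, 0, len(m) - 1)

    # fraction of the way through that bin; left at 0 where the bin has no mass
    frac = np.divide(p - cum[idx], m[idx], out=np.zeros_like(p), where=m[idx] > 0)

    # offset by the left edge of the selected bin
    out = bins[idx] + frac * (bins[idx + 1] - bins[idx])
    return np.where((p >= 0) & (p <= 1), out, np.nan)


p = np.array([0.02, 0.12, 0.8])
ppf = histogram_ppf(p, bins=[0, 1, 2, 3, 4], bin_mass=[0.1, 0.2, 0, 0.7])
print(ppf)
```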

ShreeshaM07 · 2024-06-10T08:46:00Z

@fkiraly, in test_all_distr, specifically in test_methods_ppf, getattr was being called on the object_instance from before shuffling, which is incorrect when shuffled is True. I have rectified that now. I am surprised it did not get flagged with other distributions.

ShreeshaM07 · 2024-06-11T06:55:17Z

All the checks have passed except Test / run-tests-all-extras (3.11, macOS-13) (pull_request). It seems to be a hashing error while installing some packages; I'm not sure it is even related to my code. Could you please take a look at it? Apart from that, the Histogram distribution is complete.

I will make another PR for distributions.rst and __init__.py, as they have changed in skpro since I started working on my branch, which seems to cause merge conflicts when I merge. It will be better to merge this PR first, then pull the updated main, and then make another PR to include it in readthedocs.

ShreeshaM07 · 2024-06-11T17:33:54Z

@fkiraly There were some merge conflicts in the init.py in distributions and in distributions.rst that were a little troublesome, so I decided to make a new PR, #382, instead, so there are no conflicts.

fixes #323

PR is a new one with updated main merged with #335, as there were some merge conflicts in `__init__.py` in distributions and also `distributions.rst` in `api_reference`.

What does this implement/fix? Explain your changes.

Implements the histogram distribution using bins and bin_mass as the parameters.

[ENH] Histogram distribution

7af0ff3

bin_width when list

e676a1e

fkiraly added the enhancement, module:probability&simulation, and implementing algorithms labels May 16, 2024

parameterization different

b401a8f

fkiraly reviewed May 17, 2024

ShreeshaM07 added 3 commits May 19, 2024 12:59

pdf using nnp.where

cdd0e64

cdf implemented

181e5b9

mean and var implemented

1e70a02

fkiraly reviewed May 20, 2024

ppf implementation

cc8c030

ShreeshaM07 added 2 commits May 21, 2024 15:02

Tuple input for bins

3f205e6

params2 modified

79c4a6c

ShreeshaM07 added 2 commits May 21, 2024 23:55

Rectified mean and variance using E[X] and E[(X-mu)^2]

e266710

energy when x is outside the possible X

9ccbcac

ShreeshaM07 added 4 commits June 7, 2024 02:10

introduced single arr distr along with pre-existing 2D arr dist

5ae4754

plot() made to work and resolved test_sample and some other failing CIs

bc4dec9

mean and var for single arr distr

a75dcf8

ppf corrected when P is in 1st Bin

30ff70f

ShreeshaM07 added 3 commits June 7, 2024 19:21

BaseArrayDistribution inherits BaseDistribution

8357ac5

0 values in bin_mass caught

bbbfd5f

0 val

87af806

ShreeshaM07 mentioned this pull request Jun 9, 2024

[ENH] proba regression: reduction to multiclass classification #378

Closed

solved subsetting issue now shuffle_distr and loc again for true shuffle

ad83f2e

fkiraly mentioned this pull request Jun 9, 2024

[ENH] Multivariate normal probability distribution #375

Open

1 task

test_ppf when shuffled modified & shuffle distr made to work for arra…

0101fdc

…y distr

ShreeshaM07 added 7 commits June 10, 2024 17:01

plot() works now

21ab3f4

energy_x implemented

86bf070

np.floating

41a02f1

energy_self

3589f2f

merged skpro changed files

17f9836

removed distributions.rst and init.py

e665322

Revert "merged skpro changed files"

a92722c

This reverts commit 17f9836.

ShreeshaM07 marked this pull request as ready for review June 10, 2024 20:41

ShreeshaM07 closed this Jun 11, 2024

ShreeshaM07 deleted the dev2 branch June 11, 2024 17:04

ShreeshaM07 mentioned this pull request Jun 11, 2024

[ENH] Histogram distribution #382

Merged

5 tasks

ShreeshaM07 mentioned this pull request Jun 11, 2024

[ENH] histogram distribution #323

Closed

ShreeshaM07 mentioned this pull request Jun 23, 2024

[ENH] Improve efficiency of Histogram Distribution #405

Open
