ENH: stats: Histogram distribution #6801

thomaskeck · 2016-11-19T19:40:19Z

histogram_gen can be used to create a scipy distribution out of a given histogram.
This is often used to incorporate distributions that have been measured or simulated. The technique is sometimes called a "template fit". For reference: ROOT implements it in a class called RooHistPdf:
https://root.cern.ch/doc/master/classRooHistPdf.html

rv_arbitrary in #6466 provides similar functionality.
However, I think a specialized implementation just for histograms does make sense.

This was split off from another PR, #6795, which contained several other distributions. The class was originally named template_gen.
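For readers who want to try the idea, here is a minimal sketch. It uses the name the class is given later in this thread, rv_histogram (the milestone for this PR is SciPy 0.19.0). The seed, sample size, and bin count are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Stand-in for a measured or simulated sample.
rng = np.random.RandomState(1234)
data = rng.normal(loc=0.0, scale=1.5, size=100000)

# np.histogram returns (counts, bin_edges), which is the expected input.
hist = np.histogram(data, bins=100)
hist_dist = stats.rv_histogram(hist)

# The object can be evaluated like any other continuous distribution.
x = np.linspace(-5.0, 5.0, 11)
pdf_values = hist_dist.pdf(x)
cdf_values = hist_dist.cdf(x)
```

Because the distribution is built from a finite sample, its pdf and cdf only approximate those of the underlying normal distribution.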

ev-br

Overall looks good!

I left several minor comments. Bigger comments:

An instance will need to be added to a test loop,
https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L73 and https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L206
Moments, possibly also entropy, would need a special-cased implementation, cf https://gist.github.com/ev-br/85fe13b6d17fb51a2196da9da7b3ad2f
A better name, without a _gen would be welcome. Also this should be advertised better than being hidden in a wall of particular distribution instances. Maybe around https://github.com/scipy/scipy/blob/master/scipy/stats/__init__.py#L19

ev-br · 2016-11-19T20:02:26Z

scipy/stats/_continuous_distns.py

+    -----
+    There are no additional shape parameters except for the loc and scale.
+    The pdf and cdf are defined as stepwise functions from the provided histogram.
+    In particular the cdf is not interpolated between bin boundaries and not differentiable.


This comment is stale, the cdf is piecewise linear.
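A small check of that statement, sketched with the merged class name rv_histogram and a made-up three-bin histogram: the pdf is constant inside a bin, so the cdf rises linearly there.

```python
import numpy as np
from scipy import stats

# Hypothetical histogram: three unit-width bins on [0, 3].
counts = np.array([1.0, 3.0, 2.0])
edges = np.array([0.0, 1.0, 2.0, 3.0])
dist = stats.rv_histogram((counts, edges))

# cdf at the left edge, midpoint, and right edge of the second bin.
left, mid, right = dist.cdf([1.0, 1.5, 2.0])

# The pdf takes the same value everywhere inside that bin.
pdf_inside = dist.pdf([1.2, 1.8])
```

Here the midpoint value of the cdf is exactly halfway between the two edge values, which is what piecewise linear means in practice.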

ev-br · 2016-11-19T20:03:48Z

scipy/stats/_continuous_distns.py

+
+    data = scipy.stats.norm.rvs(size=100000, loc=0, scale=1.5)
+    hist = np.histogram(data, bins=100)
+    template = scipy.stats.histogram_gen(hist)


Might want to ditch the generic example (%(example)) and finish this one, with leading >>>, plotting and all.

ev-br · 2016-11-19T20:07:39Z

scipy/stats/_continuous_distns.py

+        Create a new distribution using the given histogram
+        @param histogram the return value of np.histogram
+        """
+        self.histogram = histogram


I think it's better to add a leading underscore to private attributes (as much as anything's private in python)

histogram is a user-facing parameter, hence it needs to be documented in the class docstring, in the numpydoc format. It's assumed to be a 2-tuple of 1D array-likes, where the first one is one longer than the second one, correct?
Also need to validate the input a bit. At least check the lengths, wrap pdf, bins into np.asarray in case they are lists etc.

I added a leading _ to all the attributes,
and added numpydoc style description of the histogram parameter
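The kind of validation requested above could be sketched as follows. The helper name validate_histogram and the exact error messages are hypothetical, not taken from the PR. Note that np.histogram returns (counts, bin_edges), so it is the second array, the edges, that is one element longer.

```python
import numpy as np


def validate_histogram(histogram):
    """Hypothetical helper: check and convert a (counts, bin_edges) pair."""
    if len(histogram) != 2:
        raise ValueError("Expected a (counts, bin_edges) tuple of length 2.")
    # Accept lists and other array-likes by converting up front.
    counts = np.asarray(histogram[0], dtype=float)
    edges = np.asarray(histogram[1], dtype=float)
    if counts.ndim != 1 or edges.ndim != 1:
        raise ValueError("counts and bin_edges must be one-dimensional.")
    if len(edges) != len(counts) + 1:
        raise ValueError("bin_edges must have one more element than counts.")
    if np.any(np.diff(edges) <= 0):
        raise ValueError("bin_edges must be strictly increasing.")
    return counts, edges


# Plain Python lists are accepted and converted to arrays.
counts, edges = validate_histogram(([1, 2, 1], [0, 1, 2, 3]))
```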

ev-br · 2016-11-19T20:14:46Z

scipy/stats/_continuous_distns.py

+        """
+        self.histogram = histogram
+        pdf, bins = self.histogram
+        bin_widths = (np.roll(bins, -1) - bins)[:-1]


this is equivalent to bins[1:] - bins[:-1], correct?

Yes, I changed it, bins[1:] - bins[:-1] is more obvious
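The equivalence is easy to confirm on a small example (the bin edges below are made up); np.diff is a third spelling of the same computation.

```python
import numpy as np

# Hypothetical, unevenly spaced bin edges.
bins = np.array([0.0, 0.5, 2.0, 3.0, 7.0])

# Original form: shift left by one, subtract, drop the wrapped-around entry.
widths_roll = (np.roll(bins, -1) - bins)[:-1]

# Slicing form suggested in the review.
widths_slice = bins[1:] - bins[:-1]

# Same result via the dedicated NumPy function.
widths_diff = np.diff(bins)
```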

ev-br · 2016-11-19T20:15:30Z

scipy/stats/_continuous_distns.py

+        pdf, bins = self.histogram
+        bin_widths = (np.roll(bins, -1) - bins)[:-1]
+        pdf = pdf / float(np.sum(pdf * bin_widths))
+        cdf = np.cumsum(pdf * bin_widths)[:-1]


You chop off the last element, and then prepend a one three lines below?

Correct, this was unnecessary.
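One way to build the cdf at the bin edges without slicing and re-appending is sketched below. It follows the quoted code in treating the first tuple element as unnormalized bin heights. The numbers are made up, and this is not necessarily what the merged implementation does.

```python
import numpy as np

# Hypothetical bin heights and edges (the last bin is twice as wide).
heights = np.array([1.0, 2.0, 1.0])
bins = np.array([0.0, 1.0, 2.0, 4.0])

widths = bins[1:] - bins[:-1]

# Normalize so that the area under the step function is one.
pdf = heights / np.sum(heights * widths)

# cdf evaluated at every bin edge: starts at 0 and ends at 1.
cdf = np.concatenate(([0.0], np.cumsum(pdf * widths)))
```

With one cdf value per bin edge, np.interp(x, bins, cdf) can be used directly.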

ev-br · 2016-11-19T20:18:53Z

scipy/stats/_continuous_distns.py

+        """
+        PDF of the histogram
+        """
+        return self.template_pdf[np.digitize(x, bins=self.template_bins)]


Just to double-check, this does not need special treatment for x=a or x=b? I seem to remember interp1d based search had to be corrected for an edge case. This will need to be explicitly tested, too.

This will return 0 for x=b. (so b is already part of the "overflow" bin)
In my opinion this is fine for a continuous distribution.
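The edge behaviour can be seen in a small sketch. It assumes the pdf array is padded with one zero on each side for the underflow and overflow bins, which the template_pdf[1:-1] slice in the quoted _rvs suggests. The values are made up.

```python
import numpy as np

bins = np.array([0.0, 1.0, 2.0, 3.0])
heights = np.array([0.2, 0.5, 0.3])

# Index 0 is the underflow bin, index len(bins) the overflow bin.
padded = np.concatenate(([0.0], heights, [0.0]))

x = np.array([-0.5, 0.0, 1.0, 2.999, 3.0, 3.5])

# With the default right=False, bins[i-1] <= x < bins[i] maps to index i.
idx = np.digitize(x, bins=bins)
values = padded[idx]
```

The left edge x=a falls into the first real bin, while the right edge x=b lands in the overflow bin and gets a pdf of 0.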

ev-br · 2016-11-19T20:19:47Z

scipy/stats/_continuous_distns.py

+        """
+        CDF calculated from the histogram
+        """
+        return np.interp(x, self.template_bins, self.template_cdf)


can also provide _ppf, by interpolating the inverse.
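A sketch of that suggestion, using standalone functions rather than the class methods and made-up edge values: swapping the roles of the two arrays in np.interp gives the inverse. This assumes every bin has a positive height, so that the cdf is strictly increasing.

```python
import numpy as np

# Hypothetical bin edges and the cdf evaluated at those edges.
bins = np.array([0.0, 1.0, 2.0, 4.0])
cdf = np.array([0.0, 0.2, 0.6, 1.0])


def cdf_func(x):
    # Piecewise-linear cdf between bin edges.
    return np.interp(x, bins, cdf)


def ppf_func(q):
    # Inverse of the cdf: interpolate with the axes swapped.
    return np.interp(q, cdf, bins)


q = np.array([0.0, 0.1, 0.4, 0.8, 1.0])
x = ppf_func(q)
```

Empty bins would create flat stretches in the cdf, where the inverse is not unique and needs extra care.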

ev-br · 2016-11-19T20:20:37Z

scipy/stats/_continuous_distns.py

+        probabilities = self.template_pdf[1:-1]
+        choices = np.random.choice(len(self.template_pdf) - 2, size=self._size, p=probabilities / probabilities.sum())
+        uniform = np.random.uniform(size=self._size)
+        return self.template_bins[choices] + uniform * self.template_bin_widths[choices]


this can be implemented in terms of _ppf (automatically)

Is there an advantage to using the automatically provided _rvs implementation?
If I find some time I'll do a speed comparison; otherwise I would just keep the current implementation.

Implementing the _ppf method is worthwhile in itself. Once _ppf is written, there is no point in also having a custom _rvs, because the default one is well tested.

I implemented _ppf, but I thought my _rvs implementation would be faster than the default one, so I wanted to keep it.
However, I tested it: the default implementation is twice as fast, and the quality of the random numbers is comparable :-)

So I removed the custom _rvs.
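The generic approach amounts to inverse transform sampling: draw uniform numbers and push them through the ppf. A standalone sketch with made-up edge values follows; the seed and sample size are arbitrary.

```python
import numpy as np

# Hypothetical bin edges and the cdf evaluated at those edges.
bins = np.array([0.0, 1.0, 2.0, 4.0])
cdf = np.array([0.0, 0.2, 0.6, 1.0])

rng = np.random.RandomState(0)

# Uniform draws on [0, 1) mapped through the interpolated inverse cdf.
u = rng.uniform(size=10000)
samples = np.interp(u, cdf, bins)

# Fraction of samples in the first bin; should be close to cdf[1].
fraction_first_bin = np.mean(samples < 1.0)
```

This needs a single vectorized interpolation per batch instead of a weighted choice plus a uniform draw, which may be part of why the default turned out faster here.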

ev-br · 2016-11-19T20:23:47Z

scipy/stats/tests/test_distributions.py

+        assert_almost_equal(self.template.cdf(8.0), 22.0/25.0)
+        assert_almost_equal(self.template.cdf(8.5), 23.5/25.0)
+        assert_almost_equal(self.template.cdf(9.0), 25.0/25.0)
+        assert_almost_equal(self.template.cdf(10.0), 25.0/25.0)


Would be nice to fold this into a single call with a list of values and a list of results. Which would also check vectorized evaluations.
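A folded version might look like the sketch below. It assumes the fixture is built from the 25-value sample with bins=8 that appears later in this thread for test_continuous_basic.py. The expected values quoted above are consistent with that histogram.

```python
import numpy as np
from numpy.testing import assert_allclose
from scipy import stats

# Assumed fixture data: 25 values, 8 unit-width bins on [1, 9].
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5,
        6, 6, 6, 6, 7, 7, 7, 8, 8, 9]
template = stats.rv_histogram(np.histogram(data, bins=8))

# One vectorized call replaces four scalar assertions.
x = [8.0, 8.5, 9.0, 10.0]
expected = [22.0 / 25.0, 23.5 / 25.0, 25.0 / 25.0, 25.0 / 25.0]
assert_allclose(template.cdf(x), expected)
```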

thomaskeck · 2016-11-27T20:49:52Z

I renamed the class to rv_histogram (as in the gist mentioned above).

Implemented _munp, _entropy, _ppf.
Added more unittests
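For a piecewise-constant pdf the moments and the entropy have closed forms, which is what makes special-casing them attractive. The sketch below uses standalone functions with hypothetical names and a made-up histogram, not the PR's code, and compares the results with rv_histogram.

```python
import numpy as np
from scipy import stats

counts = np.array([1.0, 2.0, 1.0])
bins = np.array([0.0, 1.0, 2.0, 3.0])
widths = np.diff(bins)
pdf = counts / np.sum(counts * widths)


def raw_moment(n):
    # Integral of x**n * pdf over each bin, summed over all bins.
    upper = bins[1:] ** (n + 1)
    lower = bins[:-1] ** (n + 1)
    return np.sum(pdf * (upper - lower) / (n + 1))


def differential_entropy():
    # -integral of pdf * log(pdf); empty bins contribute nothing.
    nz = pdf > 0
    return -np.sum(pdf[nz] * widths[nz] * np.log(pdf[nz]))


mean = raw_moment(1)
var = raw_moment(2) - mean ** 2
ent = differential_entropy()

# Reference values from the distribution class.
dist = stats.rv_histogram((counts, bins))
```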

Added a test instance to test_continuous_basic.py.
Some tests in test_continuous_basic.py fail; I have to investigate what went wrong here.

It seems there is a test that tries to use complex numbers. How do I declare that the distribution does not support complex numbers?

thomaskeck · 2016-11-27T20:56:16Z

scipy/stats/_continuous_distns.py

+
+    Behaves like an ordinary scipy rv_continuous distribution
+    >>> hist_dist.pdf(1.0)
+    3.5


This number is a placeholder.
I haven't found out yet how to run the docstring tests. :-)

The incantation is $ python runtests.py --refguide-check -s stats. Obvious, isn't it :-)

ev-br · 2016-11-27T21:33:51Z

It seems there is a test that tries to use complex numbers. How do I declare that the distribution does not support complex numbers?

https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L60

ev-br · 2016-11-28T10:28:06Z

scipy/stats/tests/test_continuous_basic.py

-                   'vonmises', 'vonmises_line',])
+                   'vonmises', 'vonmises_line', 'test_histogram_instance'])
+
+stats.test_histogram_instance = stats.rv_histogram(np.histogram([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7,7,8,8,9], bins=8))


Lines should be less than 80 chars please.

ev-br · 2017-01-18T21:29:56Z

A couple of style fixes: https://github.com/ev-br/scipy/tree/pr/6801, the relevant commit is 2723ff6

This now looks good to me.
@andyfaff would you be able to have a look at it? Since it directly relates to #6466, I guess it'd be great if you could either sign it off or veto it.

thomaskeck · 2017-01-18T22:19:53Z

@ev-br I rebased my branch to yours to include your changes in the pull request.
I hope this was the correct way of doing this.

ev-br · 2017-01-18T22:32:41Z

@thomaskeck As you're familiar with rebasing, it'd be good to squash your three commits and rewrite the commit message to start with "ENH" (cf numpy dev guide for these prefixes)

josef-pkt · 2017-01-18T23:11:16Z

LGTM too

ev-br · 2017-01-18T23:33:54Z

Actually, test failures seem real, and I somehow did not get those when testing locally. So yeah, status is back to needs-work, would you be able to look at those @thomaskeck

thomaskeck · 2017-01-18T23:44:58Z

Last time I checked, I wasn't able to reproduce the failures locally either.
But I'll give it another try on the weekend.

josef-pkt · 2017-01-19T00:39:18Z

I can get an error message like that with a 2-D array in digitize, but I don't know why the distribution code doesn't ravel (or squeeze).

I also have np.__version__ '1.10.4', where np.digitize works for 2-D arrays.

There might be a distribution internal code path that assumes that ._pdf also works for 2-D. To avoid going through the full wrapper .pdf treatment which AFAIR always converts to 1-D, it might be better just to vectorize the _pdf in the histogram distribution for older numpy. AFAIK, this is the first time we run into this problem.
(IIRC, the distributions use np.vectorize internally, which might extend the private, distribution-specific methods to 2-D as well.)

>>> np.__version__
'1.9.2rc1'
>>> np.digitize(np.sin(10 * np.linspace(0, 1, 10).reshape(2, -1)), bins=np.linspace(-1, 1, 5))

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    np.digitize(np.sin(10 * np.linspace(0, 1, 10).reshape(2, -1)), bins=np.linspace(-1, 1, 5))
ValueError: object too deep for desired array
>>> np.digitize(np.sin(10 * np.linspace(0, 1, 10)), bins=np.linspace(-1, 1, 5))
array([3, 4, 4, 2, 1, 1, 3, 4, 4, 1])
>>> np.digitize(np.sin(10 * np.linspace(0, 1, 10)[:,None]), bins=np.linspace(-1, 1, 5))

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    np.digitize(np.sin(10 * np.linspace(0, 1, 10)[:,None]), bins=np.linspace(-1, 1, 5))
ValueError: object too deep for desired array

ev-br · 2017-01-19T11:52:15Z

https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.digitize.html:

x : array_like
Input array to be binned. Prior to Numpy 1.10.0, this array had to be 1-dimensional, but can now have any shape.

Maybe the easiest thing to do here is to just use np.searchsorted and np.where instead of digitize.
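A sketch of the replacement, reusing the input from the console session above: for increasing bins, np.searchsorted with side='right' returns the same indices as np.digitize with its default right=False, and it accepts input of any shape.

```python
import numpy as np

bins = np.linspace(-1, 1, 5)
x = np.sin(10 * np.linspace(0, 1, 10))

# 1-D input: same indices as np.digitize(x, bins).
idx_1d = np.searchsorted(bins, x, side='right')

# 2-D input: the result keeps the shape of x.
idx_2d = np.searchsorted(bins, x.reshape(2, -1), side='right')
```

The 1-D result reproduces the array printed by np.digitize in the session above.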

andyfaff · 2017-01-19T16:53:25Z

My original thought in this space was that the stats module has an opportunity to support arbitrary continuous distributions (I haven't thought about discrete ones). Those arbitrary distributions could be specified in one of three related ways:

1. supplying Python function(s) that return the PDF (and CDF/PPF for speed).
2. supplying a sampled function, specified as x/PDF pairs. In this case the PDF would probably be linearly interpolated between points.
3. supplying a histogram, such as that specified in this PR.

In a WIP a while ago I tried to make a class that would address all 3; one of the comments was that it might be a bit too complicated to put them together.
It's worth noting that by only implementing (1) then (2) can be achieved in user-space, by interpolating the data with a piece-wise UnivariateSpline.
(3) and (2) are similar if the histogram is sampled at high frequency.
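To make the second option concrete, a sampled x/PDF specification could be evaluated roughly as sketched below. This is not part of this PR; the sample points and the name sampled_pdf are invented for illustration.

```python
import numpy as np

# Hypothetical sampled density: x/PDF pairs, not yet normalized.
x_samples = np.array([0.0, 1.0, 2.0])
y_samples = np.array([0.0, 2.0, 0.0])

# Area under the piecewise-linear curve (trapezoid rule, exact here).
segment_means = 0.5 * (y_samples[1:] + y_samples[:-1])
area = np.sum(segment_means * np.diff(x_samples))


def sampled_pdf(x):
    # Linear interpolation between samples, zero outside their range.
    y = np.interp(x, x_samples, y_samples, left=0.0, right=0.0)
    return y / area


values = sampled_pdf(np.array([-1.0, 0.5, 1.0, 1.5, 3.0]))
```

The histogram case differs only in holding the pdf constant within each bin instead of interpolating between points.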

Before this PR is merged I just wanted to check that the scope/names are right between #6466 and this. Does the similarity between (3) and (2) mean that (2) should be included here? If not, then I think this PR is probably good to go (although I've not reviewed it in detail). Given that I was only intending to cover (1) in #6466, that'll leave (2) without an implementation. Perhaps there needs to be a sampled_gen as well?

@josef-pkt.

josef-pkt · 2017-01-19T17:28:47Z

@andyfaff In my opinion adding an explicit histogram distribution does not prevent also adding your more general solution.
As I said before, a histogram distribution is relatively simple and can be fast, or is easier to optimize, and has less overhead than a generic class. Also, the name is obvious, and it might find general use if there is not a huge number of segments.
(Some api advantage here is that it is frozen by design, i.e. no args, kwargs in the methods.)

The more general #6466 would still be useful if we want an approximation to a smooth density function, e.g. continuous function similar to the existing distributions, or either linear or smooth interpolation with a fine grid or many support points. One problem is that this is more difficult to design.
(Related aside: in statsmodels we ran into design problems for kernel density classes that wanted to have exact smooth and fine grid solution for speed at the same time.)

thomaskeck · 2017-01-22T17:01:28Z

I updated the PR to resolve merge conflicts with the current master and
replaced np.digitize with np.searchsorted (the unit tests work locally; I have to wait for the automatic checks from Travis CI to see whether it works there).

I also went through the PR checklist (which I should have done earlier):

added versionadded to rv_histogram and argus_gen
added myself to THANKS.txt

ev-br · 2017-01-25T15:41:29Z

Ugh, GH online conflict resolution tool is nice, especially for trivial conflicts like here (one line conflict in THANKS.txt), but autogenerated commit message is misleading, 9d1b540
We certainly do not recommend merging master into feature branches.

Edit: It did merge master into the feature branch.

ev-br · 2017-01-25T15:57:45Z

OK, we seem to have converged that this PR does not conflict with gh-6466, and that various forms of interpolated pdfs belong over there, not to this PR. Both Andrew and Josef are in favor, so am I. Merging, thank you @thomaskeck, @josef-pkt @andyfaff

thomaskeck mentioned this pull request Nov 19, 2016

ENH: stats: Additional distributions for scipy.stats commonly used in High Energy Physics #6795

Closed

ev-br requested changes Nov 19, 2016

ev-br added the scipy.stats and enhancement labels Nov 19, 2016

thomaskeck commented Nov 27, 2016

ev-br reviewed Nov 28, 2016

apbard mentioned this pull request Dec 17, 2016

Percentile Linear returns incorrect value. numpy/numpy#7875

Closed

pv added the needs-work label Dec 21, 2016

thomaskeck force-pushed the Histogram branch from e670f46 to 2723ff6 Compare January 18, 2017 22:15

ev-br approved these changes Jan 18, 2017

thomaskeck force-pushed the Histogram branch from 2723ff6 to 0d1e096 Compare January 18, 2017 22:38

thomaskeck and others added 2 commits January 22, 2017 16:54

ENH: Implemented rv_histogram, used to create a scipy distribution out of a given histogram.

93770e7

MAINT: stats: style changes in rv_histogram

a435a76

thomaskeck force-pushed the Histogram branch from 0d1e096 to a435a76 Compare January 22, 2017 15:56

thomaskeck added 2 commits January 22, 2017 18:43

Replaced np.digitize with np.searchsorted to avoid problems with 2d arrays in unittests, added versionadded to rv_histogram and argus_gen Fixed pyflake issue.

7e42614

Added myself to THANKS.txt

6e4b6f4

thomaskeck force-pushed the Histogram branch from 3d0b734 to 6e4b6f4 Compare January 22, 2017 17:44

Merge branch 'master' into Histogram

9d1b540

ev-br merged commit 63a2f2f into scipy:master Jan 25, 2017

ev-br removed the needs-work label Jan 25, 2017

ev-br added this to the 0.19.0 milestone Jan 25, 2017

thomaskeck mentioned this pull request Jan 26, 2017

ENH: stats: Mixture distribution #6800

Closed

andyfaff mentioned this pull request Apr 4, 2017

ENH: add rv_scatter distribution #7257

Closed
