Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats: Histogram distribution #6801

Merged
merged 5 commits into from Jan 25, 2017
Merged

ENH: stats: Histogram distribution #6801

merged 5 commits into from Jan 25, 2017

Conversation

thomaskeck
Copy link
Contributor

histogram_gen can be used to create a scipy distribution out of a given histogram.
This is often used to incorporate distributions which can be measured or simulated. The techniques is sometimes called "template fit". For reference: ROOT implements it using a class called RooHistPdf
https://root.cern.ch/doc/master/classRooHistPdf.html

rv_arbitrary in #6466 provides similar functionality.
However, I think a specialized implementation just for histograms does make sense.

This is a split of, of another PR which contained several other distributions #6795 and was originally named template_gen

Copy link
Member

@ev-br ev-br left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good!

I left several minor comments. Bigger comments:

  1. An instance will need to be added to a test loop,
    https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L73 and https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L206
  2. Moments, possibly also entropy, would need a special-cased implementation, cf https://gist.github.com/ev-br/85fe13b6d17fb51a2196da9da7b3ad2f
  3. A better name, without a _gen would be welcome. Also this should be advertised better than being hidden in a wall of particular distribution instances. Maybe around https://github.com/scipy/scipy/blob/master/scipy/stats/__init__.py#L19

-----
There are no additional shape parameters except for the loc and scale.
The pdf and cdf are defined as stepwise functions from the provided histogram.
In particular the cdf is not interpolated between bin boundaries and not differentiable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is stale, the cdf is piecewise linear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


data = scipy.stats.norm.rvs(size=100000, loc=0, scale=1.5)
hist = np.histogram(data, bins=100)
template = scipy.stats.histogram_gen(hist)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to ditch the generic example (%(example)) and finish this one, with leading >>>, plotting and all.

Create a new distribution using the given histogram
@param histogram the return value of np.histogram
"""
self.histogram = histogram
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to add a leading underscore to private attributes (as much as anything's private in python)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

histogram is a user-facing parameter, hence it needs to be documented in the class docstring, in the numpydoc format. It's assumed to be a 2-tuple of 1D array-likes, where the first one is one longer then the second one, correct?
Also need to validate the input a bit. At least check the lengths, wrap pdf, bins into np.asarray in case they are lists etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a leading _ to all the attributes,
and added numpydoc style description of the histogram parameter

"""
self.histogram = histogram
pdf, bins = self.histogram
bin_widths = (np.roll(bins, -1) - bins)[:-1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is equivalent to bins[1:] - bins[:-1], correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I changed it, bins[1:] - bins[:-1] is more obvious

pdf, bins = self.histogram
bin_widths = (np.roll(bins, -1) - bins)[:-1]
pdf = pdf / float(np.sum(pdf * bin_widths))
cdf = np.cumsum(pdf * bin_widths)[:-1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you chop up the last element, and then prepend a one three lines below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, this was unnecessary.

"""
PDF of the histogram
"""
return self.template_pdf[np.digitize(x, bins=self.template_bins)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double-check, this does not need special treatment for x=a or x=b? I seem to remember interp1d based search had to be corrected for an edge case. This will need to be explicitly tested, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will return 0 for x=b. (so b is already part of the "overflow" bin)
In my opinion this is fine for a continuous distribution.

"""
CDF calculated from the histogram
"""
return np.interp(x, self.template_bins, self.template_cdf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also provide _ppf, by interpolating the inverse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

probabilities = self.template_pdf[1:-1]
choices = np.random.choice(len(self.template_pdf) - 2, size=self._size, p=probabilities / probabilities.sum())
uniform = np.random.uniform(size=self._size)
return self.template_bins[choices] + uniform * self.template_bin_widths[choices]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be implemented in terms of _ppf (automatically)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an advantage using the automatically provided _rvs implementation?
if I find some time I do a speed comparison, otherwise I would just keep the current implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementing the _ppf method is worthwhile in itself. Once the _ppf method is written then there is no point in also having _rvs because it's well tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented _ppf, but I thought that my _rvs implementation is faster than using the default _rvs implementation, thus I wanted to keep it.
However, I tested it and the default implementation is twice as fast, and the quality of the random numbers is comparable :-)

So I remove the _rvs.

assert_almost_equal(self.template.cdf(8.0), 22.0/25.0)
assert_almost_equal(self.template.cdf(8.5), 23.5/25.0)
assert_almost_equal(self.template.cdf(9.0), 25.0/25.0)
assert_almost_equal(self.template.cdf(10.0), 25.0/25.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to fold this into a single call with a list of values and a list of results. Which would also check vectorized evaluations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ev-br ev-br added scipy.stats enhancement A new feature or improvement labels Nov 19, 2016
@thomaskeck
Copy link
Contributor Author

I renamed the class to rv_histogram (as in the gist mentioned above).

Implemented _munp, _entropy, _ppf.
Added more unittests

Added a test instance to test_continuous_basics.py.
Some tests in test_continuous_basics.py fail, I have to investigate what wents wrong here.

It seems there is a test, which tries to use complex numbers. How do I define that the distributions does not support complex numbers?


Behaves like an ordinary scipy rv_continuous distribution
>>> hist_dist.pdf(1.0)
3.5
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This number is a placeholder,
I didn't find out yet, howto run the docstring tests. :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The incantation is $ python runtests.py --refguide-check -s stats. Obvious, isn't it :-)

@ev-br
Copy link
Member

ev-br commented Nov 27, 2016

It seems there is a test, which tries to use complex numbers. How do I define that the distributions does not support complex numbers?

https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L60

'vonmises', 'vonmises_line',])
'vonmises', 'vonmises_line', 'test_histogram_instance'])

stats.test_histogram_instance = stats.rv_histogram(np.histogram([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7,7,8,8,9], bins=8))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines should be less than 80 chars please.

@pv pv added the needs-work Items that are pending response from the author label Dec 21, 2016
@ev-br
Copy link
Member

ev-br commented Jan 18, 2017

A couple of style fixes: https://github.com/ev-br/scipy/tree/pr/6801, the relevant commit is 2723ff6

This now looks good to me.
@andyfaff would you be able to have a look at it? Since it directly relates to #6466, I guess it'd be great if you could either sign it off or veto it.

@thomaskeck
Copy link
Contributor Author

@ev-br I rebased my branch to yours to include your changes in the pull request.
I hope this was the correct way of doing this.

@ev-br
Copy link
Member

ev-br commented Jan 18, 2017

@thomaskeck As you're familiar with rebasing, it'd be good to squash your three commits and rewrite the commit message to start with "ENH" (cf numpy dev guide for these prefixes)

@josef-pkt
Copy link
Member

LGTM too

@ev-br
Copy link
Member

ev-br commented Jan 18, 2017

Actually, test failures seem real, and I somehow did not get those when testing locally. So yeah, status is back to needs-work, would you be able to look at those @thomaskeck

@thomaskeck
Copy link
Contributor Author

Last time I checked I wasn't able to reproduce the failures locally as well.
But I give it another try on the weekend.

@josef-pkt
Copy link
Member

josef-pkt commented Jan 19, 2017

I can get an error message like that with a 2-D array in digitize, but I don't know why the distribution code doesn't ravel (or squeeze).

I have also np.__version__ '1.10.4' where np.digitize works for 2-D arrays

There might be a distribution internal code path that assumes that ._pdf also works for 2-D. To avoid going through the full wrapper .pdf treatment which AFAIR always converts to 1-D, it might be better just to vectorize the _pdf in the histogram distribution for older numpy. AFAIK, this is the first time we run into this problem.
(IIRC, the distribution use np.vectorize internally which might extend the private, distribution specific methods also to 2-D.)

>>> np.__version__
'1.9.2rc1'
>>> np.digitize(np.sin(10 * np.linspace(0, 1, 10).reshape(2, -1)), bins=np.linspace(-1, 1, 5))

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    np.digitize(np.sin(10 * np.linspace(0, 1, 10).reshape(2, -1)), bins=np.linspace(-1, 1, 5))
ValueError: object too deep for desired array
>>> np.digitize(np.sin(10 * np.linspace(0, 1, 10)), bins=np.linspace(-1, 1, 5))
array([3, 4, 4, 2, 1, 1, 3, 4, 4, 1])
>>> np.digitize(np.sin(10 * np.linspace(0, 1, 10)[:,None]), bins=np.linspace(-1, 1, 5))

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    np.digitize(np.sin(10 * np.linspace(0, 1, 10)[:,None]), bins=np.linspace(-1, 1, 5))
ValueError: object too deep for desired array

@ev-br
Copy link
Member

ev-br commented Jan 19, 2017

https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.digitize.html:

x : array_like
Input array to be binned. Prior to Numpy 1.10.0, this array had to be 1-dimensional, but can now have any shape.

Maybe the easiest thing to do here is to just use np.searchsorted and np.where instead of digitize.

@andyfaff
Copy link
Contributor

My original thoughts in this space were that the stats module has the opportunity for arbitrary continuous (I haven't thought about discrete) distributions. Those arbitrary distributions could be specified in one of three related way:

  1. supplying a python function(s) that would return the PDF (CDF/PPF for speed).
  2. supplying a sampled function, specified as x/PDF pairs. In this case the PDF would probably be linearly interpolated between points.
  3. supplying a histogram, such as that specified in this PR.

In a WIP a while ago I tried to make a class that would address all 3, one of the comments was that it have been a bit too complicated trying to put them together.
It's worth noting that by only implementing (1) then (2) can be achieved in user-space, by interpolating the data with a piece-wise UnivariateSpline.
(3) and (2) are similar if the histogram is sampled at high frequency.

Before this PR is merged I just wanted to check that the scope/names are right between #6466 and this. Does the similarity between (3) and (2) mean that (2) should be included here? If not then I think I think this PR is probably good to go (although I've not reviewed it in detail). Given that I was only intending to cover (1) in #6466 that'll leave (2) without an implementation. Perhaps there needs to be a sampled_gen as well?

@josef-pkt.

@josef-pkt
Copy link
Member

@andyfaff In my opinion adding an explicit histogram distribution does not prevent also adding your more general solution.
As I said before, a histogram distribution is relatively simple and can be fast, or is easier to optimize and has less overhead than a generic class. Also the name is obvious, and might find general use if there are not a huge amount of segments.
(Some api advantage here is that it is frozen by design, i.e. no args, kwargs in the methods.)

The more general #6466 would still be useful if we want an approximation to a smooth density function, e.g. continuous function similar to the existing distributions, or either linear or smooth interpolation with a fine grid or many support points. One problem is that this is more difficult to design.
(Related aside: in statsmodels we ran into design problems for kernel density classes that wanted to have exact smooth and fine grid solution for speed at the same time.)

@thomaskeck
Copy link
Contributor Author

thomaskeck commented Jan 22, 2017

I updated the PR to resolv merge conflicts with the current master,
replaced np.digitize with np.searchsorted (unittests work locally, I have to wait for the automatic checks from travis ci to check if it works).

I also went throught PR checklist (which I should have done earlier)

  • added versionadded to rv_histogram, and argus_gen
  • added myself to thanks.txt

…rrays in unittests, added versionadded to rv_histogram and argus_gen

Fixed pyflake issue.
@ev-br
Copy link
Member

ev-br commented Jan 25, 2017

Ugh, GH online conflict resolution tool is nice, especially for trivial conflicts like here (one line conflict in THANKS.txt), but autogenerated commit message is misleading, 9d1b540
We certainly do not recommend merging master into feature branches.

Edit: It did merge master into the feature branch.

@ev-br ev-br merged commit 63a2f2f into scipy:master Jan 25, 2017
@ev-br
Copy link
Member

ev-br commented Jan 25, 2017

OK, we seem to have converged that this PR does not conflict with gh-6466, and that various forms of interpolated pdfs belong over there, not to this PR. Both Andrew and Josef are in favor, so am I. Merging, thank you @thomaskeck, @josef-pkt @andyfaff

@ev-br ev-br removed the needs-work Items that are pending response from the author label Jan 25, 2017
@ev-br ev-br added this to the 0.19.0 milestone Jan 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants