New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: stats: Histogram distribution #6801
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good!
I left several minor comments. Bigger comments:
- An instance will need to be added to a test loop,
https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L73 and https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L206 - Moments, possibly also entropy, would need a special-cased implementation, cf https://gist.github.com/ev-br/85fe13b6d17fb51a2196da9da7b3ad2f
- A better name, without a
_gen
would be welcome. Also this should be advertised better than being hidden in a wall of particular distribution instances. Maybe around https://github.com/scipy/scipy/blob/master/scipy/stats/__init__.py#L19
----- | ||
There are no additional shape parameters except for the loc and scale. | ||
The pdf and cdf are defined as stepwise functions from the provided histogram. | ||
In particular the cdf is not interpolated between bin boundaries and not differentiable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is stale, the cdf
is piecewise linear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
data = scipy.stats.norm.rvs(size=100000, loc=0, scale=1.5) | ||
hist = np.histogram(data, bins=100) | ||
template = scipy.stats.histogram_gen(hist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to ditch the generic example (%(example)
) and finish this one, with leading >>>
, plotting and all.
Create a new distribution using the given histogram | ||
@param histogram the return value of np.histogram | ||
""" | ||
self.histogram = histogram |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to add a leading underscore to private attributes (as much as anything's private in python)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
histogram
is a user-facing parameter, hence it needs to be documented in the class docstring, in the numpydoc format. It's assumed to be a 2-tuple of 1D array-likes, where the first one is one longer then the second one, correct?
Also need to validate the input a bit. At least check the lengths, wrap pdf
, bins
into np.asarray
in case they are lists etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a leading _ to all the attributes,
and added numpydoc style description of the histogram parameter
""" | ||
self.histogram = histogram | ||
pdf, bins = self.histogram | ||
bin_widths = (np.roll(bins, -1) - bins)[:-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is equivalent to bins[1:] - bins[:-1]
, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I changed it, bins[1:] - bins[:-1] is more obvious
pdf, bins = self.histogram | ||
bin_widths = (np.roll(bins, -1) - bins)[:-1] | ||
pdf = pdf / float(np.sum(pdf * bin_widths)) | ||
cdf = np.cumsum(pdf * bin_widths)[:-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you chop up the last element, and then prepend a one three lines below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, this was unnecessary.
""" | ||
PDF of the histogram | ||
""" | ||
return self.template_pdf[np.digitize(x, bins=self.template_bins)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to double-check, this does not need special treatment for x=a
or x=b
? I seem to remember interp1d
based search had to be corrected for an edge case. This will need to be explicitly tested, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will return 0 for x=b. (so b is already part of the "overflow" bin)
In my opinion this is fine for a continuous distribution.
""" | ||
CDF calculated from the histogram | ||
""" | ||
return np.interp(x, self.template_bins, self.template_cdf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can also provide _ppf
, by interpolating the inverse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
probabilities = self.template_pdf[1:-1] | ||
choices = np.random.choice(len(self.template_pdf) - 2, size=self._size, p=probabilities / probabilities.sum()) | ||
uniform = np.random.uniform(size=self._size) | ||
return self.template_bins[choices] + uniform * self.template_bin_widths[choices] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be implemented in terms of _ppf
(automatically)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an advantage using the automatically provided _rvs implementation?
if I find some time I do a speed comparison, otherwise I would just keep the current implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementing the _ppf
method is worthwhile in itself. Once the _ppf
method is written then there is no point in also having _rvs
because it's well tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented _ppf, but I thought that my _rvs implementation is faster than using the default _rvs implementation, thus I wanted to keep it.
However, I tested it and the default implementation is twice as fast, and the quality of the random numbers is comparable :-)
So I remove the _rvs.
assert_almost_equal(self.template.cdf(8.0), 22.0/25.0) | ||
assert_almost_equal(self.template.cdf(8.5), 23.5/25.0) | ||
assert_almost_equal(self.template.cdf(9.0), 25.0/25.0) | ||
assert_almost_equal(self.template.cdf(10.0), 25.0/25.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to fold this into a single call with a list of values and a list of results. Which would also check vectorized evaluations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
I renamed the class to rv_histogram (as in the gist mentioned above). Implemented _munp, _entropy, _ppf. Added a test instance to test_continuous_basics.py. It seems there is a test, which tries to use complex numbers. How do I define that the distributions does not support complex numbers? |
|
||
Behaves like an ordinary scipy rv_continuous distribution | ||
>>> hist_dist.pdf(1.0) | ||
3.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This number is a placeholder,
I didn't find out yet, howto run the docstring tests. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The incantation is $ python runtests.py --refguide-check -s stats
. Obvious, isn't it :-)
https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L60 |
'vonmises', 'vonmises_line',]) | ||
'vonmises', 'vonmises_line', 'test_histogram_instance']) | ||
|
||
stats.test_histogram_instance = stats.rv_histogram(np.histogram([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7,7,8,8,9], bins=8)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines should be less than 80 chars please.
A couple of style fixes: https://github.com/ev-br/scipy/tree/pr/6801, the relevant commit is 2723ff6 This now looks good to me. |
@ev-br I rebased my branch to yours to include your changes in the pull request. |
@thomaskeck As you're familiar with rebasing, it'd be good to squash your three commits and rewrite the commit message to start with "ENH" (cf numpy dev guide for these prefixes) |
LGTM too |
Actually, test failures seem real, and I somehow did not get those when testing locally. So yeah, status is back to needs-work, would you be able to look at those @thomaskeck |
Last time I checked I wasn't able to reproduce the failures locally as well. |
I can get an error message like that with a 2-D array in digitize, but I don't know why the distribution code doesn't ravel (or squeeze). I have also There might be a distribution internal code path that assumes that
|
https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.digitize.html:
Maybe the easiest thing to do here is to just use |
My original thoughts in this space were that the stats module has the opportunity for arbitrary continuous (I haven't thought about discrete) distributions. Those arbitrary distributions could be specified in one of three related way:
In a WIP a while ago I tried to make a class that would address all 3, one of the comments was that it have been a bit too complicated trying to put them together. Before this PR is merged I just wanted to check that the scope/names are right between #6466 and this. Does the similarity between (3) and (2) mean that (2) should be included here? If not then I think I think this PR is probably good to go (although I've not reviewed it in detail). Given that I was only intending to cover (1) in #6466 that'll leave (2) without an implementation. Perhaps there needs to be a |
@andyfaff In my opinion adding an explicit histogram distribution does not prevent also adding your more general solution. The more general #6466 would still be useful if we want an approximation to a smooth density function, e.g. continuous function similar to the existing distributions, or either linear or smooth interpolation with a fine grid or many support points. One problem is that this is more difficult to design. |
I updated the PR to resolv merge conflicts with the current master, I also went throught PR checklist (which I should have done earlier)
|
…rrays in unittests, added versionadded to rv_histogram and argus_gen Fixed pyflake issue.
Ugh, GH online conflict resolution tool is nice, especially for trivial conflicts like here (one line conflict in THANKS.txt), but autogenerated commit message is misleading, 9d1b540 Edit: It did merge master into the feature branch. |
OK, we seem to have converged that this PR does not conflict with gh-6466, and that various forms of interpolated pdfs belong over there, not to this PR. Both Andrew and Josef are in favor, so am I. Merging, thank you @thomaskeck, @josef-pkt @andyfaff |
histogram_gen can be used to create a scipy distribution out of a given histogram.
This is often used to incorporate distributions which can be measured or simulated. The techniques is sometimes called "template fit". For reference: ROOT implements it using a class called RooHistPdf
https://root.cern.ch/doc/master/classRooHistPdf.html
rv_arbitrary in #6466 provides similar functionality.
However, I think a specialized implementation just for histograms does make sense.
This is a split of, of another PR which contained several other distributions #6795 and was originally named template_gen