Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram (revisited) #9

Merged
merged 146 commits into from
Nov 18, 2018

Conversation

LukeMathWalker
Copy link
Member

Based on our discussion in #8, I have revised the implementation.

It needs a testing suite and a couple more helper methods on HistogramCounts but the skeleton it's there. Let me know your thoughts @jturner314

@jturner314
Copy link
Member

jturner314 commented Oct 22, 2018

I think this is a good approach.

A few thoughts:

  • Edges needs to be careful about repeated elements. There are two possible strategies:

    1. Edges can make sure all of the elements are unique, in addition to being in order. The constructor could do this by removing duplicates or by returning an error when there are duplicates. This would make the rest of the implementation simpler but would be less convenient for the user.

    2. Alternatively, it needs to be careful about identifying the correct bin. For example, the current implementation of .indexes() doesn't correctly handle repeated edges, since in the case of repeated elements, .binary_search() can return the index of any of those elements.

      Assuming this strategy isn't much more difficult to implement, I prefer it (to make things simple for the user). It's also worth noting that NumPy uses this strategy.

  • I'd recommend removing the ndim member from HistogramCounts and instead adding a .ndim() method that calls self.counts.ndim(). (This means there are fewer things to keep in-sync.)

    Edit: In the .ndim() method, you could also have a line debug_assert_eq!(counts.ndim(), bins.len());. (Even though this probably isn't necessary, I like using debug assertions like this because they're zero-cost in release mode and may catch bugs as things change in the future.)

  • In the future, it would be worth adding the dimension of counts to the type of HistogramCounts instead of always using ArrayD. This isn't necessary for a first implementation, though.

  • In Histogram #8, I suggested separate HistogramCounts and HistogramDensity types. On further reflection, I don't think a HistogramDensity type is worth adding. Instead, I'd suggest adding a .density() method on HistogramCounts that returns an array of densities. This does mean that if all you want is density, there's a conversion cost from counts -> densities once all the data have been counted, but I think the simplicity is worth the cost (and this approach may in-fact be cheaper anyway). (This also means that we can rename HistogramCounts to just Histogram.)

@LukeMathWalker
Copy link
Member Author

I managed to address all comments (and I fixed the bug you spotted) - they were all on point, thanks for taking the time to go through the PR with that level of attention!

I have also reduced the number of types parameter on BinsBuildingStrategy from 2 to 1 using the strategy you suggested (associated type on trait).

With respect to the issue of bin boundaries I proceeded as follows: all strategies now provide an optimal bin width (either directly through the rule they specify or indirectly recasting the optimal number of bins to an optimal bin width) and we make sure that the last edge is strictly greater than the maximum of the array that has been passed to the builder.
This basically means that we might add an extra bin to the right if it is required to account for the maximum value, but we don't modify the grid parameters to achieve it. What do you think?

@jturner314
Copy link
Member

I managed to address all comments (and I fixed the bug you spotted) - they were all on point, thanks for taking the time to go through the PR with that level of attention!

You're welcome. Thanks for working on this! Would you like me to review the updated version, or do you think it's ready to merge?

This basically means that we might add an extra bin to the right if it is required to account for the maximum value, but we don't modify the grid parameters to achieve it. What do you think?

I think that's fine.

For future versions, I think it would be worth investigating how Julia's StatsBase does things, because StatsBase is similar to our implementation in using all half-open intervals (unlike NumPy, which considers the last bin to be a closed interval).

@LukeMathWalker
Copy link
Member Author

LukeMathWalker commented Nov 14, 2018 via email

src/quantile.rs Outdated Show resolved Hide resolved
@LukeMathWalker LukeMathWalker merged commit fcbe35a into rust-ndarray:master Nov 18, 2018
@LukeMathWalker LukeMathWalker deleted the histogram-w-edges branch November 18, 2018 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants