Histogram (revisited) #9

LukeMathWalker · 2018-10-18T07:41:03Z

Based on our discussion in #8, I have revised the implementation.

It still needs a test suite and a couple more helper methods on HistogramCounts, but the skeleton is there. Let me know your thoughts, @jturner314.

jturner314 · 2018-10-22T20:47:16Z

I think this is a good approach.

A few thoughts:

Edges needs to be careful about repeated elements. There are two possible strategies:
1. Edges can make sure all of the elements are unique, in addition to being in order. The constructor could do this by removing duplicates or by returning an error when there are duplicates. This would make the rest of the implementation simpler but would be less convenient for the user.
2. Alternatively, Edges can allow repeated elements, in which case the implementation needs to be careful about identifying the correct bin. For example, the current implementation of .indexes() doesn't correctly handle repeated edges, since in the case of repeated elements, .binary_search() can return the index of any of those elements.
  
  Assuming this second strategy isn't much more difficult to implement, I prefer it (to make things simple for the user). It's also worth noting that NumPy uses this strategy.
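
To make the two options concrete, here is a minimal sketch using only the standard library. The helper names `unique_sorted_edges` and `bin_index` are hypothetical and are not the PR's `Edges` API; the sketch follows the first strategy (sort, then drop duplicates), which is also what a later commit in this PR, "Edges are now duplicates-free", suggests was chosen. Once the edges are unique, `binary_search` has at most one match and the bin lookup becomes unambiguous.

```rust
/// Sort the edges and drop duplicates (strategy 1 above).
/// Hypothetical helper, not the PR's actual `Edges` constructor.
fn unique_sorted_edges(mut edges: Vec<i64>) -> Vec<i64> {
    edges.sort();
    edges.dedup();
    edges
}

/// Index of the half-open bin `[edges[i], edges[i + 1])` that contains
/// `value`, or `None` if `value` falls outside the grid.
/// Assumes `edges` is sorted and duplicate-free. With duplicates,
/// `binary_search` may return the index of any of the equal elements,
/// so the `Ok(i)` arm below would no longer be reliable.
fn bin_index(edges: &[i64], value: i64) -> Option<usize> {
    if edges.len() < 2 {
        return None;
    }
    match edges.binary_search(&value) {
        // `value` sits exactly on an edge: it opens bin `i`, unless it is
        // the last edge, which is excluded because bins are right-open.
        Ok(i) if i + 1 < edges.len() => Some(i),
        Ok(_) => None,
        // `value` would be inserted at position `i`, so it lies strictly
        // between edges `i - 1` and `i` when both exist.
        Err(i) if i > 0 && i < edges.len() => Some(i - 1),
        Err(_) => None,
    }
}
```

Supporting repeated edges instead (the second strategy) would need a search that always lands on a well-defined end of a run of equal edges rather than on an arbitrary match.
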
I'd recommend removing the ndim member from HistogramCounts and instead adding a .ndim() method that calls self.counts.ndim(). (This means there are fewer things to keep in sync.)

Edit: In the .ndim() method, you could also have a line debug_assert_eq!(counts.ndim(), bins.len());. (Even though this probably isn't necessary, I like using debug assertions like this because they're zero-cost in release mode and may catch bugs as things change in the future.)
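
A rough illustration of the suggestion, assuming a simplified stand-in for HistogramCounts: the counts are held in a flat `Vec` plus a `shape` instead of an `ArrayD`, and the constructor and accessor shown here are hypothetical. The point is only that `ndim()` is derived from data the struct already owns, with the debug assertion guarding the invariant.

```rust
/// Simplified stand-in for the PR's `HistogramCounts` (assumption: a flat
/// buffer plus a shape replaces the `ArrayD` used in the real code).
struct HistogramCounts {
    counts: Vec<usize>,
    shape: Vec<usize>,
    /// One sorted edge list per dimension.
    bins: Vec<Vec<f64>>,
}

impl HistogramCounts {
    /// Hypothetical constructor: `n` edges delimit `n - 1` bins per dimension.
    fn new(bins: Vec<Vec<f64>>) -> Self {
        let shape: Vec<usize> = bins.iter().map(|e| e.len().saturating_sub(1)).collect();
        let n_bins: usize = shape.iter().product();
        HistogramCounts { counts: vec![0; n_bins], shape, bins }
    }

    /// Computed on demand, so there is no separate `ndim` field that could
    /// drift out of sync with the counts.
    fn ndim(&self) -> usize {
        // Checked only in debug builds; compiled out in release mode.
        debug_assert_eq!(self.shape.len(), self.bins.len());
        self.shape.len()
    }

    /// Read-only access to the flattened counts.
    fn counts(&self) -> &[usize] {
        &self.counts
    }
}
```
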
In the future, it would be worth adding the dimension of counts to the type of HistogramCounts instead of always using ArrayD. This isn't necessary for a first implementation, though.
In Histogram #8, I suggested separate HistogramCounts and HistogramDensity types. On further reflection, I don't think a HistogramDensity type is worth adding. Instead, I'd suggest adding a .density() method on HistogramCounts that returns an array of densities. This does mean that if all you want is density, there's a conversion cost from counts to densities once all the data have been counted, but I think the simplicity is worth the cost (and this approach may in fact be cheaper anyway). (This also means that we can rename HistogramCounts to just Histogram.)
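
As a sketch of what such a counts-to-densities conversion could look like in one dimension: the thread does not define "density", so this assumes the usual convention (the one NumPy uses for `density=True`), where each count is divided by the total count times the bin width so that the densities integrate to 1. The free function `density` is hypothetical and works on plain slices rather than on the PR's types.

```rust
/// Convert 1-D counts into densities: `count / (total * bin_width)`.
/// `edges` must be sorted and hold exactly one more element than `counts`.
/// Assumption: "density" means a histogram normalised to integrate to 1.
fn density(counts: &[usize], edges: &[f64]) -> Vec<f64> {
    assert_eq!(edges.len(), counts.len() + 1, "n bins need n + 1 edges");
    let total: usize = counts.iter().sum();
    if total == 0 {
        // Nothing has been counted yet: report zero density everywhere.
        return vec![0.0; counts.len()];
    }
    counts
        .iter()
        .zip(edges.windows(2))
        .map(|(&c, w)| c as f64 / (total as f64 * (w[1] - w[0])))
        .collect()
}
```

The conversion is a single pass over the bins, which is the one-off cost mentioned above.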

LukeMathWalker · 2018-11-13T08:01:22Z

I managed to address all comments (and I fixed the bug you spotted) - they were all on point, thanks for taking the time to go through the PR with that level of attention!

I have also reduced the number of type parameters on BinsBuildingStrategy from 2 to 1 using the strategy you suggested (an associated type on the trait).
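
For readers unfamiliar with the pattern, a minimal sketch of how an associated type removes a type parameter. The trait `BinStrategy`, the struct `SqrtRule`, and the function `suggested_bins` are hypothetical names chosen for illustration and do not reproduce the signatures in this PR; the square-root rule (number of bins = ceil(sqrt(n))) is used only because it is short.

```rust
use std::marker::PhantomData;

/// Hypothetical strategy trait: the element type is an associated type,
/// so each strategy fixes it once instead of every caller naming it.
trait BinStrategy {
    type Elem;
    fn from_data(data: &[Self::Elem]) -> Self;
    fn n_bins(&self) -> usize;
}

/// Square-root rule: ceil(sqrt(number of observations)) bins.
struct SqrtRule<T> {
    n_bins: usize,
    _elem: PhantomData<T>,
}

impl<T> BinStrategy for SqrtRule<T> {
    type Elem = T;

    fn from_data(data: &[T]) -> Self {
        let n_bins = (data.len() as f64).sqrt().ceil() as usize;
        SqrtRule { n_bins, _elem: PhantomData }
    }

    fn n_bins(&self) -> usize {
        self.n_bins
    }
}

/// Generic code needs a single type parameter; the element type is
/// recovered through `S::Elem`.
fn suggested_bins<S: BinStrategy>(data: &[S::Elem]) -> usize {
    S::from_data(data).n_bins()
}
```

A call then names only the strategy, for example `suggested_bins::<SqrtRule<f64>>(&data)`.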

With respect to the issue of bin boundaries, I proceeded as follows: all strategies now provide an optimal bin width (either directly through the rule they specify, or indirectly by recasting the optimal number of bins as an optimal bin width), and we make sure that the last edge is strictly greater than the maximum of the array that has been passed to the builder.
This basically means that we might add an extra bin to the right if it is required to account for the maximum value, but we don't modify the grid parameters to achieve it. What do you think?
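
The rule described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the function `build_edges` is hypothetical, and it assumes the grid starts at the minimum of the data and steps by the strategy's bin width until an edge strictly exceeds the maximum.

```rust
/// Build equispaced edges starting at `min` with step `bin_width`, stopping
/// at the first edge that is strictly greater than `max`. The width is never
/// adjusted; an extra bin is appended on the right when needed.
fn build_edges(min: f64, max: f64, bin_width: f64) -> Vec<f64> {
    assert!(min.is_finite() && max.is_finite(), "bounds must be finite");
    assert!(bin_width > 0.0, "bin width must be positive");
    assert!(min <= max, "min must not exceed max");
    let mut edges = vec![min];
    let mut k: usize = 1;
    loop {
        // Multiply instead of accumulating to avoid rounding drift.
        let edge = min + k as f64 * bin_width;
        edges.push(edge);
        if edge > max {
            break;
        }
        k += 1;
    }
    edges
}
```

With `min = 0.0`, `max = 10.0` and a width of `2.5`, the edge at `10.0` is not strictly greater than the maximum, so one more edge at `12.5` is added; with `max = 9.0` the grid already ends at `10.0` and no extra bin is needed.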

jturner314 · 2018-11-13T19:21:27Z

> I managed to address all comments (and I fixed the bug you spotted) - they were all on point, thanks for taking the time to go through the PR with that level of attention!

You're welcome. Thanks for working on this! Would you like me to review the updated version, or do you think it's ready to merge?

> This basically means that we might add an extra bin to the right if it is required to account for the maximum value, but we don't modify the grid parameters to achieve it. What do you think?

I think that's fine.

For future versions, I think it would be worth investigating how Julia's StatsBase does things, because StatsBase is similar to our implementation in using all half-open intervals (unlike NumPy, which considers the last bin to be a closed interval).
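
The practical difference between the two conventions shows up only for a value equal to the last edge. A small sketch, with a hypothetical `bin_of` function and a flag that switches between the all-half-open convention and a closed last bin:

```rust
/// Find the bin containing `x`. Every bin `i` is `[edges[i], edges[i + 1])`;
/// when `last_bin_closed` is true, the last bin also includes its right edge.
/// Assumes `edges` is sorted and duplicate-free.
fn bin_of(edges: &[f64], x: f64, last_bin_closed: bool) -> Option<usize> {
    let n_bins = edges.len().checked_sub(1)?;
    if n_bins == 0 {
        return None;
    }
    if last_bin_closed && x == edges[n_bins] {
        return Some(n_bins - 1);
    }
    // Linear scan for clarity; a binary search would serve for long grids.
    edges.windows(2).position(|w| w[0] <= x && x < w[1])
}
```

With edges `[0.0, 1.0, 2.0]`, the value `2.0` is dropped under the all-half-open convention and lands in the last bin when that bin is closed, which is why the half-open convention needs the extra right-hand bin discussed earlier in this thread.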

LukeMathWalker · 2018-11-14T09:58:04Z

Let me add a couple of tests and then I think we are good to go!

LukeMathWalker and others added 28 commits October 9, 2018 08:01

Adding skeleton provided by jturner

03a3d08

Added some IDE-related files

d1e20bf

Adding missing pieces - now it compiles

3cdb3ba

Reusing code of from<Vec> in from<Array1>

3eedbbc

Fixed bugs, better method names, exported methods needed for doc tests.

bbaee7d

Reorganized code in a submodule

7b19061

Created Bins struct - split code between Bins and Edges

acea0a6

Added get method to Bins

b8624a0

Implemented IntoIterator for Edges

cfc59c8

Added doc tests for all methods of Edges

e8535a4

Fixed typos

837cb67

All Bins' methods have been documented

082ed40

Fixed typo

3897b70

Better formulation in docs

6f8d28f

Fixed typo, better wording

9b19ad6

Added short docstring to BinNotFound

d93c5d0

Improved docstring for get

f7f9dc7

HistogramExt trait has been added with a minimal signature

6419da2

Removed trait parameter D from HistogramExt trait signature

94d71a2

Added docstrings to histogram method

41cc373

Implemented histogram method; renamed edges to bins in HistogramCounts constructor

0365480

Exporting HistogramExt trait

6420b3f

Added ndim field to HistogramCounts to implement dimensionality check in add_dimensions

f728a63

Improving docstring

62c46e8

Added docstring to Bins::new

068c893

Removed trailing white line at the end of the file

6d0f7b6

Checked Edges::from methods

c03945e

Checked right-exclusiveness and left-inclusiveness

71c58ff

Edges are now duplicates-free

992f969

LukeMathWalker and others added 15 commits November 12, 2018 08:57

Silence compiler warning

3ed4d3a

Using ? syntax

d265ea0

Add Ok(())

10a5b46

as_view => counts

42f086a

Fixed doc tests

3019d9b

Bumped Rust version to 1.30

8c6411e

Reuse quantile_axis_mut implementation

f09720c

Convert as_slice to as_array_view

3f089a4

Fixed broken tests

bddf12c

Added expected grid

9e432de

Fixed FD

f42dcd7

Fixed doctest

f6b5dbd

Refactored bin strategies - one extra bin is now added to the right to make sure the maximum does not get dropped

004585f

Added explanation for extra bin to the docs

74723f7

Using an associated type for BinsBuildingStrategy - reduced type parameters from 2 to 1

a3fc2de

Added panics conditions.

5cb53b8

jturner314 reviewed Nov 14, 2018

src/quantile.rs (outdated)

jturner314 and others added 8 commits November 18, 2018 16:16

Update src/quantile.rs

b3c8e0b

Co-Authored-By: LukeMathWalker <LukeMathWalker@users.noreply.github.com>

Added test for panic condition

af4a198

For strategies, ask for a reference instead of a view

45cea6a

Test panics for Sqrt

fe943e5

Test Rice panics

bd2570b

Test Sturges panics

9cfc013

Test FreedmanDiaconis panics

77be941

Tested Auto panics

e66f8a6

LukeMathWalker merged commit fcbe35a into rust-ndarray:master Nov 18, 2018

LukeMathWalker deleted the histogram-w-edges branch November 18, 2018 16:46
