
Generalize binning #466

Merged
merged 18 commits into scikit-beam:master on Sep 21, 2016

Conversation

@chrisvam (Contributor) commented Sep 11, 2016

As I discussed previously with Tom, I realized that the binning specification in RPhiBinnedStatistic and RadialBinnedStatistic was not as flexible as the underlying scipy binned_statistic code. This PR tries to address that. In the course of doing this, it became clear that my inheritance wasn't structured correctly, so I tried to fix that.

There are three things that seem non-ideal to me:

  • The way I implement masking (which is not available in scipy.binned_statistic) is imperfect, but it's the best I can see at the moment; see comments in the code.
  • There are about 10 lines of code duplicated in both RPhiBinnedStatistic and RadialBinnedStatistic. I might be able to fix this with some inheritance or fancy python tricks.
  • I should test RPhiBinnedStatistic using scipy.binned_statistic_2d.

But this PR improves the interface, which is important.

p.s. this time I started out with a fresh fork/clone and did the "magic 4 commands" that Tom sent me to create the feature-branch. So hopefully there will not be git issues this time. Learning.

@codecov-io commented Sep 11, 2016

Current coverage is 87.74% (diff: 96.15%)

Merging #466 into master will increase coverage by 0.62%

@@             master       #466   diff @@
==========================================
  Files            37         36     -1   
  Lines          2748       2742     -6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           2394       2406    +12   
+ Misses          354        336    -18   
  Partials          0          0          

Powered by Codecov. Last update 403fda4...1c27939

@jrmlhermitte (Contributor)

I just became aware of this code, and it's definitely useful for us too :-). (I might also try to carefully review and give feedback tonight if I can.)

Maybe mention Tom and Dan using '@' so they see it?
@tacaswell, @danielballan

@@ -72,6 +72,9 @@ def __init__(self, sample, statistic='mean',
minimum and maximum values along each dimension.

"""
if mask is None:
    mask = np.ones_like(sample)
Member:

We need to document someplace whether the mask convention is '1-for-good-pixels' or '1-for-bad-pixels', and make sure we are consistent about it.

attn @CJ-Wright

@CJ-Wright (Member) commented Sep 13, 2016

This is already documented in the threshold mask you wrote, here

The current convention is 1 is good pixel, 0 is bad. This way img[mask] will result in all the good pixels. FYI, pyFAI has the exact opposite convention.

chrisvam (Contributor, author):

I have updated the documentation for mask specifying that 0-values are ignored. Interesting that pyFAI does it the other way. To add to what CJ wrote: our convention also allows img*mask to do something natural. I'd be interested to hear what the pyFAI people have to say about the other convention. Maybe we can ask them on our next phone call.
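To make the convention concrete, here is a minimal sketch (the 2x2 image and mask are made up, not from the PR):

    import numpy as np

    img = np.array([[1.0, 2.0],
                    [3.0, 4.0]])
    mask = np.array([[True, False],
                     [True, True]])   # True (1) = good pixel, False (0) = bad

    good = img[mask]     # array([1., 3., 4.]): indexing keeps the good pixels
    zeroed = img * mask  # bad pixels become 0, shape is preserved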


def __call__(self, values):
    return super(BinnedStatistic2D, self).__call__(values)


def get_r_phi(shape, origin):
Contributor:

I would use utils.radial_grid and utils.angle_grid rather than add a new function. The drawback is that you need two function calls.
Another option is to integrate this into utils somehow; perhaps create an angle_radial_grid? As a placeholder, it could just call the two functions for now and eventually be changed into something more optimized. My vote is the two separate function calls for now. The overhead is small for an object that shouldn't be initialized often.
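For reference, a sketch of the suggested two-call approach (origin and shape stand in for the constructor's arguments):

    from skbeam.core.utils import angle_grid, radial_grid

    # one grid of per-pixel radii and one of per-pixel angles
    rpix = radial_grid(origin, shape)
    phipix = angle_grid(origin, shape)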

"""
if mask is None:
    mask = np.ones_like(sample)
Contributor:

I don't understand (but I may be wrong! :-) ). sample holds the bin coordinates. How can mask be the same shape as sample? (i.e. for 2D bins, the shape of sample will be (2, dimx, dimy), where dimx, dimy are the image dimensions)
How are you testing the code? Are you using real data? I think for something like this, we should have a working example visually tested and posted before accepting this, regardless of the tests.

I'm also happy to write a notebook later using your code, see if it works, if it helps :-)

Contributor:

I also think that mask is data dependent. I would vote for keeping the mask at a higher level (i.e. binnedstatistic2D)

chrisvam (Contributor, author):

Hi Julien,

I think you're correct that my mask shape was wrong. If sample has shape (N, D) then I think mask should have shape (N,). e.g. if one is producing a projection onto 2D r-phi, the "bad pixel" mask should be the same for both r and phi. I was testing the 1D mask (with help from Tom) by comparing the output with scipy's binned_statistic, but I wasn't testing the 2D case. I will try to add that, comparing with binned_statistic_2d.

Thanks for all the ideas. Appreciated...
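As a shape illustration (made-up arrays, not the PR's test code): for N points binned in D = 2 dimensions, sample is (N, 2) and the single shared mask is (N,):

    import numpy as np

    N = 1000
    r = np.random.rand(N)                       # radial coordinate per pixel
    phi = np.random.uniform(-np.pi, np.pi, N)   # angular coordinate per pixel
    sample = np.stack([r, phi], axis=1)         # shape (N, 2): one row per point
    mask = np.ones(N, dtype=int)                # shape (N,): 1 = good, 0 = bad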

chrisvam (Contributor, author):

I also meant to add: I started out trying to keep the mask at a higher level, but when I generalized the binning to behave more like the underlying scipy code, it became more difficult. Even now the way I'm doing the mask is not ideal, but at least the interface is better.

chrisvam (Contributor, author):

I think the latest commit tests the 2D RPhiBinnedStatistic case.

Contributor:

Yeah, I agree with you and thought about it later. The masking probably makes the most sense at the lower level, since it could be generalized to anything (and BinnedStatistic expects the incoming data to match the shape of the bins, so masking follows naturally there).

I also want to test your mask with my code later (it might help me better understand the code), when I have access to my machine; maybe I'll have some advice to add (but can't now). Yeah, we're all busy multitasking :-).

@tacaswell (Member)

> there are about 10 lines of code that are duplicated in both RPhiBinnedStatistic and RadialBinnedStatistic. I might be able to fix this with some inheritance or fancy python tricks

I would take code repetition over getting too fancy.

@@ -23,51 +24,112 @@ def setup(self):
colarr = np.arange(colsize)
rowgrid, colgrid = np.meshgrid(rowarr, colarr, indexing='ij')
self.rgrid = np.sqrt(rowgrid**2 + colgrid**2)
self.phigrid = np.arctan2(colgrid, rowgrid)
Contributor:

Actually, I think the test should also use radial_grid and angle_grid. Change this to:

    from skbeam.core.utils import angle_grid, radial_grid
    self.rgrid = radial_grid((0,0),(rowsize, colsize))
    self.phigrid = angle_grid((0,0),(rowsize, colsize))

I think once you do this, the previous suggestion of using angle_grid and radial_grid should work. I fetched your code and made those two changes and it worked for me :-). I think this will help us keep the internal functions self-consistent. What do you think?

PS: all suggested import statements should be moved to the top of the file as well. I think it helps better understand the dependencies. (@tacaswell, unless you think otherwise? I don't want to suggest anything incompatible with the general philosophy.)

chrisvam (Contributor, author):

Hmm. I'll have to think about this. That suggests that using radial_grid/angle_grid fundamentally changes the grid compared to what I had previously. Ideally I would like to use your calls but keep the grid identical to what it was before, so that the definitions of phi/r don't change, for example. But perhaps I'm missing something, so feel free to disagree...

Contributor:

As far as I can tell, it seems to be just a numerical precision error (I checked the arrays produced during the test); the difference is in the 16th decimal digit or so. For example, replacing the test with:

    assert_array_almost_equal(ref, binned, decimal=10)

will pass.

So under the hood, I would be tempted to say that the calculation is not exactly the same in terms of numerical operations, but the general API is not broken and what we expect it to do has not changed. So I think one option could be to leave the code as is and use assert_array_almost_equal in the test. What do you think?

Scratch this comment; I was testing the wrong version of the code. It indeed fails. I need to think more carefully about why.

rrange = (rpix.min(), rpix.max())
if phirange is None:
    phirange = (-np.pi, np.pi)
self.expected_shape = shape
if mask is not None:
    if mask.shape != self.expected_shape:
        raise ValueError('"mask" has incorrect shape. '
                         ' Expected: ' + str(self.expected_shape) +
                         ' Received: ' + str(mask.shape))
Contributor:

in the if statement put:

    mask = mask.reshape(-1)

bins=(rbins, phibins),
range=(rrange, phirange))
bins=bins,
mask=mask.reshape(-1),
Contributor:

change this to:

    mask=mask

@jrmlhermitte (Contributor)

Ok, I ran a quick test on the 2D binning using scipy and the new routine here. I fetched @chrisvam's code first. I had to make a slight modification on the line mentioned above to make it work when mask=None. (There may be better ways to do this.)

The results agree with scipy without a mask but not with a mask. See the two notebooks I tried (we will be using this code, so we're testing it already :-) ):
no mask:
https://github.com/ordirules/notebooks/blob/master/test_BinnedStatistic2D_nomask.ipynb

with mask:
https://github.com/ordirules/notebooks/blob/master/test_BinnedStatistic2D_withmask.ipynb

You won't be able to run the notebooks, but can you see the results?
I'm not sure what the best way of testing/presenting results is, but uploading a notebook to my own git repo and sharing the link seems to work.

Going through it now, and I might have some other suggestions for masking, @chrisvam. I could send you a pull request with suggested changes and you could review them and decide whether to merge; would that be okay? This is great code and a very useful and needed feature for scikit-beam, thanks for putting this in!

@chrisvam (Contributor, author)

Hi Julien,

Yeah, that would be great if you had a PR that I could incorporate.

I think it makes sense that 'mean' doesn't agree with scipy, because the way I'm implementing the mask currently is to "move" the masked pixel into an overflow bin (this is not ideal, using the bincount weights argument would probably be better). But no matter how the mask is implemented, it will change the result of 'mean' vs. scipy, because some pixels have been moved: we are doing something fundamentally different. 'sum' however should work, since for that operation setting a pixel to zero is the same as removing it entirely. That's why in my test code I only compare 'sum' to scipy when using the mask. Would that explain the difference you see?
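A toy illustration of that point (the numbers are made up):

    import numpy as np

    values = np.array([1.0, 2.0, 3.0, 4.0])
    mask = np.array([1, 1, 0, 1])   # 1 = good, 0 = bad

    # 'sum': zeroing the bad pixel is the same as dropping it entirely
    np.sum(values * mask)       # 7.0
    values[mask == 1].sum()     # 7.0

    # 'mean': the zeroed pixel still counts toward the denominator
    np.mean(values * mask)      # 1.75
    values[mask == 1].mean()    # 2.333...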

@jrmlhermitte (Contributor)

@chrisvam never mind, I got them to agree (the way I did it, mean should have worked). The issue was something silly: I was no longer specifying all the r, phi coordinates, only the parts that were not masked, which effectively changed the bin_edges that scipy's binned_statistic computed. To fix that, rather than giving bin numbers, I supplied the same bin edges calculated by BinnedStatistic to scipy's binned_statistic. See here:
https://github.com/ordirules/notebooks/blob/master/test_BinnedStatistic2D_withmask.ipynb
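Roughly, the comparison fix looks like this sketch (r, phi, img, mask, and stat are hypothetical names; stat stands for the constructed BinnedStatistic-style object and is assumed to expose its bin edges):

    from scipy.stats import binned_statistic_2d

    # feed scipy the same bin edges the skbeam object computed, so dropping
    # the masked points cannot shift the automatically chosen edges
    keep = mask == 1
    ref, r_edge, phi_edge, _ = binned_statistic_2d(
        r[keep], phi[keep], img[keep],
        statistic='mean', bins=stat.edges)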

Anyway, as far as I can tell, they agree exactly, so your trick works just fine and makes the code look intuitive. Personally, I feel it might be better to save the mask indices (just to reduce computation time) and only compute binned_statistic on those selected values. But given the complexity of the code, perhaps we could take it one step at a time: first work on merging this, then run timing/performance tests on different versions of mask indexing later?

I will send a suggested PR a little later and feel free to accept/reject the changes (it may be better to incorporate them manually one by one than just clicking merge). I look forward to using this code, it's much needed in the skbeam community!

@chrisvam (Contributor, author)

That all sounds good, Julien. If you send me a PR with your proposed changes, that would be awesome. Thank you very much for looking at this ... the code is a little bit fancy, so great to have a second person looking at it.

@jrmlhermitte (Contributor) commented Sep 18, 2016

Sounds good. I started looking at it again now; just one question before I submit a PR to you. I found out why get_r_phi didn't agree with radial_grid and angle_grid: the problem lies in angle_grid (not radial_grid). You can test this by using radial_grid but not angle_grid (and get_r_phi for the latter).

We make two different fundamental assumptions here:

self.phigrid = np.arctan2(colgrid, rowgrid)

I would argue that this should be the other way around: the row grid (going up and down in the array, assuming rows go up and down and columns go left and right, of course) should be the 'y' and the column grid should be 'x'. Would this be a suitable change of assumptions? I think accepting the change to:

self.phigrid = np.arctan2(rowgrid, colgrid)

would make the code compatible with angle_grid in a nice way. What do you think?
(Agreed, this code is involved; nice to have multiple people stabbing at it!)
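A quick sanity check of that claim (a sketch assuming angle_grid's default unit pixel size):

    import numpy as np
    from skbeam.core.utils import angle_grid

    shape = (5, 7)
    rowgrid, colgrid = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]),
                                   indexing='ij')
    # row as 'y', column as 'x', centered at (0, 0) to match get_r_phi
    assert np.allclose(np.arctan2(rowgrid, colgrid),
                       angle_grid((0, 0), shape))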

@chrisvam (Contributor, author)

I would have one request, Julien, if it's not too much trouble: I'm trying to get this PR in so that I can start to really "sell" skbeam to LCLS users, so they get used to the idea and hopefully excited about it (along with the histogramming code we worked on). So if you could get me your PR soon-ish (within some number of days?) I would be grateful. Thanks!

@jrmlhermitte (Contributor)

Done. Take a look, and let us know when you're done and what you agree or disagree with. I've tested what I sent you; results here:
https://github.com/ordirules/notebooks/blob/master/test_BinnedStatistic2D_withmask.ipynb

Then we can summarize for everyone else. Anyway, it works perfectly and the API is great, so making the code finer/faster (if at all possible) can be left to another PR.

thanks for submitting this, and I hope it sells well!

@chrisvam (Contributor, author)

Hi Julien,

Tom/Dan/Eric and I discussed the x/y row/col convention previously. As I understand it, the advantage of having x/y = col/row is that x/y appear natural in a typical imshow plot. The advantage of having x/y = row/col is that coding is more natural with arr[x][y] (in my original PR I gave the user the ability to choose between the two, but Tom argued that was too complex, and I agree).

I think the argument at the time was that imshow behavior could be changed with the "origin" keyword, so we opted for the second convention. I have some preference for that second convention, since arr[y][x] feels unintuitive, but I don't feel super strongly about it. Whatever we choose, however, we should be consistent across all skbeam software. Perhaps @danielballan or @tacaswell wants to chime in with a convention decision?
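For context, the imshow point amounts to this minimal matplotlib sketch:

    import matplotlib.pyplot as plt
    import numpy as np

    img = np.arange(12).reshape(3, 4)
    # origin='lower' puts row 0 at the bottom, so a row/col-indexed array
    # still displays with 'y' increasing upward, as in a conventional plot
    plt.imshow(img, origin='lower')
    plt.show()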

@jrmlhermitte (Contributor)

I apologize for running in circles on this topic, and thanks for bringing it into focus. I also came from an [x,y] convention in the past (yorick, http://yorick.sourceforge.net/). However, in my case the fastest varying index in the array was still x (i.e. a[x,y] in yorick was equivalent to a[y,x] in numpy for the same physical structure in memory; both are still accessed as [x + width*y] when the array is collapsed to 1D).
So I would vote for x being the fastest varying index, hence the notation [y,x] here. The notation changes when moving to numpy, but I feel it makes more intuitive sense. I think thinking of it in terms of the fastest varying index (and documenting it that way) could help with sharing the data across platforms (C).

Any other thoughts? I could be wrong about this and I also would like some input as to what is a good way to think about this. Thanks!

# Apply mask in a non-ideal way by setting value outside range.
# Would be better to do this using bincount "weights", perhaps.
thissample = sample[:, i]
thissample[mask == 0] = (self.edges[i][0] -
Member:

Suggest putting the mask test here. If there is no mask, you will do a whole bunch of work to not do any work; checking if mask is not None first is cheaper.

chrisvam (Contributor, author):

Agreed. I have added this in the most recent commit. Also included @ordirules suggestion of using radial_grid/angle_grid and updated phi-definition documentation accordingly.
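The guard amounts to something like this sketch (a standalone rewrite for illustration; the helper name and the sentinel offset of 1 are assumed):

    import numpy as np

    def _apply_mask(sample, mask, edges):
        """Push masked points below the lowest edge of each dimension so
        they land in the overflow bin and drop out of every statistic."""
        if mask is None:       # cheap test first: no mask means no work
            return sample
        for i in range(sample.shape[1]):
            thissample = sample[:, i]
            thissample[mask == 0] = edges[i][0] - 1
        return sample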

@tacaswell (Member)

@ordirules I do not follow you about C vs numpy. Both are row-major and have the fastest axis last in the [] notation.

These sorts of x/y vs i/j vs row/col issues have been a constant headache for me 😞

It is entirely possible that angle_grid is wrong, I will need to look at plots of them to check.

@jrmlhermitte (Contributor) commented Sep 19, 2016

@tacaswell I'm sorry, my mistake; too many languages. Indeed C and numpy share the same convention (and matlab, as far as I can see online). However, yorick has yet another convention: its fastest varying index is actually the leftmost index! Yet more evidence to support the headache :-( (see here, second line)

Sounds good; angle_grid looks fine to me, but it's definitely worth scrutiny. In the meantime, I think I know what @danielballan's vote might be, since git blame says he's the author of this ;-) (in the angle_grid definition):

    # row is y, column is x. "so say we all. amen."
    x, y = np.meshgrid(pixel_size[1] * (np.arange(shape[1]) -
                                        center[1]),
                       pixel_size[0] * (np.arange(shape[0]) -
                                        center[0]))
    return np.arctan2(y, x)

@danielballan (Member)

Thanks, @chrisvam. @tacaswell and @ordirules and I are meeting in person and are happy with this. Thanks for going through the revisions.

@danielballan merged commit 4e47503 into scikit-beam:master on Sep 21, 2016
@@ -434,36 +458,26 @@ def __init__(self, rbins, phibins, rowsize, colsize,
will be called on the values in each bin. Empty bins will be
represented by function([]), or NaN if this returns an error.
"""
if origin is None:
    origin = shape[0]//2, shape[1]//2
Contributor:

@chrisvam @danielballan @tacaswell

sorry to reference a merged PR, but I forgot to mention this. I think this should rather be:

    origin = shape[0]/2., shape[1]/2.

the reason being that origin may be a float (so we can exactly center on the image). In either case, it's just a default (and I guess a bikeshed).

Would there be objections to changing this default? (I would submit a new PR, of course.)

Member:

Do we have detectors with odd numbers of rows or columns?

Contributor:

I was thinking more for simulations. I generate lots of simulations to predict/confirm certain results and I do play with the parity of pixel number just to test for bugs.

In either case, I can just manually supply the center. I figured I'd ask in case others would agree with this convention.

Member:

I mean, in general I don't have any issue with it; I was mostly curious how frequently that issue would be hit.

nit: you don't need the . after the 2, since we import division from __future__, which makes division with / always do float division.

chrisvam (Contributor, author):

I think it can't hurt, so it would be fine with me. People in principle could run this algorithm on a region-of-interest which had an odd number of rows/cols, for example. I'm happy to do a PR or you can. Totally up to you...

Contributor:

Sure then, let's do it before it becomes used routinely. If you could submit it, that would be great!
Correction: I actually meant this (if an index refers to the pixel center, with 0-based indexing the true center should be):

    origin = (shape[0]-1)/2, (shape[1]-1)/2
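The arithmetic behind that correction, for 0-based indices at pixel centers:

    # 5 pixels have centers at indices 0..4, so the true center is
    # (5 - 1) / 2 = 2.0; 4 pixels (centers 0..3) give 1.5, exactly
    # between the two middle pixels.
    shape = (5, 4)
    origin = ((shape[0] - 1) / 2, (shape[1] - 1) / 2)   # (2.0, 1.5)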

@ericdill thanks for the info, nice seeing you on skbeam :-)
