
Generalize binning #466

Merged
merged 18 commits into scikit-beam:master on Sep 21, 2016

Conversation

@chrisvam (Contributor) commented Sep 11, 2016

As I discussed previously with Tom, I realized that the binning specification in RPhiBinnedStatistic and RadialBinnedStatistic was not as flexible as the underlying scipy binned_statistic code. This PR tries to address that. In the course of doing this, it became clear that my inheritance wasn't structured correctly, so I tried to fix that.

There are three things that seem non-ideal to me:

  • The way I implement masking (which is not available in scipy.binned_statistic) is imperfect, but it's the best I can see at the moment; see comments in the code.
  • There are about 10 lines of code duplicated in both RPhiBinnedStatistic and RadialBinnedStatistic. I might be able to fix this with some inheritance or fancy python tricks.
  • I should test RPhiBinnedStatistic using scipy.binned_statistic_2d.

But this PR improves the interface, which is important.

p.s. this time I started out with a fresh fork/clone and did the "magic 4 commands" that Tom sent me to create the feature-branch. So hopefully there will not be git issues this time. Learning.

@codecov-io commented Sep 11, 2016

Current coverage is 87.74% (diff: 96.15%)

Merging #466 into master will increase coverage by 0.62%

@@             master       #466   diff @@
==========================================
  Files            37         36     -1   
  Lines          2748       2742     -6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           2394       2406    +12   
+ Misses          354        336    -18   
  Partials          0          0          

Powered by Codecov. Last update 403fda4...1c27939

@jrmlhermitte (Contributor)

I just became aware of this code, and it's definitely useful for us too :-). (I might also try to carefully review and give feedback tonight if I can.)

Maybe mention Tom and Dan using '@' so they see it?
@tacaswell, @danielballan

@@ -72,6 +72,9 @@ def __init__(self, sample, statistic='mean',
minimum and maximum values along each dimension.

"""
if mask is None:
    mask = np.ones_like(sample)
Member:

We need to document someplace whether the mask convention is '1-for-good-pixels' or '1-for-bad-pixels', and make sure we are consistent about it.

attn @CJ-Wright

@CJ-Wright (Member) commented Sep 13, 2016

This is already documented in the threshold mask you wrote, here

The current convention is 1 is good pixel, 0 is bad. This way img[mask] will result in all the good pixels. FYI, pyFAI has the exact opposite convention.

chrisvam (Contributor, author):

I have updated the documentation for mask specifying that 0-values are ignored. Interesting that pyFAI does it the other way. To add to what CJ wrote: our convention also allows img*mask to do something natural. I'd be interested to hear what the pyFAI people have to say about the other convention. Maybe we can ask them on our next phone call.
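To make the convention concrete, here is a minimal sketch (the 2x2 image and mask are made up, not from the PR):

    import numpy as np

    img = np.array([[1.0, 2.0],
                    [3.0, 4.0]])
    mask = np.array([[True, False],
                     [True, True]])   # True (1) = good pixel, False (0) = bad

    good = img[mask]     # array([1., 3., 4.]): indexing keeps the good pixels
    zeroed = img * mask  # bad pixels become 0, shape is preserved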


def __call__(self, values):
    return super(BinnedStatistic2D, self).__call__(values)


def get_r_phi(shape, origin):
Contributor:

I would use utils.radial_grid and utils.angle_grid rather than add a new function. The drawback is that you need two function calls.
Another option is to integrate this into utils somehow; perhaps create an angle_radial_grid? As a placeholder, it could just call the two functions for now and eventually be changed into something more optimized. My vote is the two separate function calls for now. The overhead is small for an object that shouldn't be initialized often.
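For reference, a sketch of the suggested two-call approach (origin and shape stand in for the constructor's arguments):

    from skbeam.core.utils import angle_grid, radial_grid

    # one grid of per-pixel radii and one of per-pixel angles
    rpix = radial_grid(origin, shape)
    phipix = angle_grid(origin, shape)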

"""
if mask is None:
    mask = np.ones_like(sample)
Contributor:

I don't understand (but I may be wrong! :-) ). sample holds the bin coordinates. How can mask be the same shape as sample? (i.e. for 2D bins, the shape of sample will be (2, dimx, dimy), where dimx, dimy are the image dimensions)
How are you testing the code? Are you using real data? I think for something like this, we should have a working example visually tested and posted before accepting this, regardless of the tests.

I'm also happy to write a notebook later using your code, see if it works, if it helps :-)

Contributor:

I also think that mask is data dependent. I would vote for keeping the mask at a higher level (i.e. binnedstatistic2D)

chrisvam (Contributor, author):

Hi Julien,

I think you're correct that my mask shape was wrong. If sample has shape (N, D) then I think mask should have shape (N,). e.g. if one is producing a projection onto 2D r-phi, the "bad pixel" mask should be the same for both r and phi. I was testing the 1D mask (with help from Tom) by comparing the output with scipy's binned_statistic, but I wasn't testing the 2D case. I will try to add that, comparing with binned_statistic_2d.

Thanks for all the ideas. Appreciated...
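As a shape illustration (made-up arrays, not the PR's test code): for N points binned in D = 2 dimensions, sample is (N, 2) and the single shared mask is (N,):

    import numpy as np

    N = 1000
    r = np.random.rand(N)                       # radial coordinate per pixel
    phi = np.random.uniform(-np.pi, np.pi, N)   # angular coordinate per pixel
    sample = np.stack([r, phi], axis=1)         # shape (N, 2): one row per point
    mask = np.ones(N, dtype=int)                # shape (N,): 1 = good, 0 = bad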

chrisvam (Contributor, author):

I also meant to add: I started out trying to keep the mask at a higher level, but when I generalized the binning to behave more like the underlying scipy code, it became more difficult. Even now the way I'm doing the mask is not ideal, but at least the interface is better.

chrisvam (Contributor, author):

I think the latest commit tests the 2D RPhiBinnedStatistic case.

Contributor:

Yeah, I agree with you and thought about it later. The masking probably makes the most sense at the lower level, since it could be generalized to anything (and BinnedStatistic expects the incoming data to match the shape of the bins, so masking follows naturally there).

I also want to test your mask with my code later (it might help me better understand the code), when I have access to my machine; maybe I'll have some advice to add (but can't now). Yeah, we're all busy multitasking :-).

@tacaswell (Member)

> there are about 10 lines of code that are duplicated in both RPhiBinnedStatistic and RadialBinnedStatistic. I might be able to fix this with some inheritance or fancy python tricks

I would take code repetition over getting too fancy.

@@ -23,51 +24,112 @@ def setup(self):
colarr = np.arange(colsize)
rowgrid, colgrid = np.meshgrid(rowarr, colarr, indexing='ij')
self.rgrid = np.sqrt(rowgrid**2 + colgrid**2)
self.phigrid = np.arctan2(colgrid, rowgrid)
Contributor:

Actually, I think the test should also use radial_grid and angle_grid. Change this to:

    from skbeam.core.utils import angle_grid, radial_grid
    self.rgrid = radial_grid((0,0),(rowsize, colsize))
    self.phigrid = angle_grid((0,0),(rowsize, colsize))

I think once you do this, the previous suggestion of using angle_grid and radial_grid should work. I fetched your code and made those two changes and it worked for me :-). I think this will help us keep the internal functions self-consistent. What do you think?

PS: all suggested import statements should be moved to the top of the file as well. I think it helps better understand the dependencies. (@tacaswell, unless you think otherwise? I don't want to suggest anything incompatible with the general philosophy.)

chrisvam (Contributor, author):

Hmm. I'll have to think about this. That suggests that using radial_grid/angle_grid fundamentally changes the grid compared to what I had previously. Ideally I would like to use your calls but keep the grid identical to what it was before, so that the definitions of phi/r don't change, for example. But perhaps I'm missing something, so feel free to disagree...

Contributor:

As far as I can tell, it seems to be just a numerical precision error (I checked the arrays produced during the test); the difference is in the 16th decimal digit or so. For example, replacing the test with:

    assert_array_almost_equal(ref, binned, decimal=10)

will pass.

So under the hood, I would be tempted to say that the calculation is not exactly the same in terms of numerical operations, but the general API is not broken and what we expect it to do has not changed. So I think one option could be to leave the code as is and use assert_array_almost_equal in the test. What do you think?

Scratch this comment; I was testing the wrong version of the code. It indeed fails. I need to think more carefully about why.

rrange = (rpix.min(), rpix.max())
if phirange is None:
    phirange = (-np.pi, np.pi)
self.expected_shape = shape
if mask is not None:
    if mask.shape != self.expected_shape:
        raise ValueError('"mask" has incorrect shape. '
                         ' Expected: ' + str(self.expected_shape) +
                         ' Received: ' + str(mask.shape))
Contributor:

in the if statement put:

    mask = mask.reshape(-1)

bins=(rbins, phibins),
range=(rrange, phirange))
bins=bins,
mask=mask.reshape(-1),
Contributor:

change this to:

    mask=mask

@jrmlhermitte (Contributor)

Ok, I ran a quick test on the 2D binning using scipy and the new routine here. I fetched @chrisvam's code first. I had to make a slight modification on the line mentioned above to make it work when mask=None. (There may be better ways to do this.)

The results agree with scipy without a mask but not with a mask. See the two notebooks I tried (we will be using this code, so we're testing it already :-) ):
no mask:
https://github.com/ordirules/notebooks/blob/master/test_BinnedStatistic2D_nomask.ipynb

with mask:
https://github.com/ordirules/notebooks/blob/master/test_BinnedStatistic2D_withmask.ipynb

You won't be able to run the notebooks, but can you see the results?
I'm not sure what the best way of testing/presenting results is, but uploading a notebook to my own git repo and sharing the link seems to work.

Going through it now, and I might have some other suggestions for masking, @chrisvam. I could send you a pull request with suggested changes and you could review them and decide whether to merge; would that be okay? This is great code and a very useful and needed feature for scikit-beam, thanks for putting this in!

@chrisvam (Contributor, author)

Hi Julien,

Yeah, that would be great if you had a PR that I could incorporate.

I think it makes sense that 'mean' doesn't agree with scipy, because the way I'm implementing the mask currently is to "move" the masked pixel into an overflow bin (this is not ideal, using the bincount weights argument would probably be better). But no matter how the mask is implemented, it will change the result of 'mean' vs. scipy, because some pixels have been moved: we are doing something fundamentally different. 'sum' however should work, since for that operation setting a pixel to zero is the same as removing it entirely. That's why in my test code I only compare 'sum' to scipy when using the mask. Would that explain the difference you see?
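A toy illustration of that point (the numbers are made up):

    import numpy as np

    values = np.array([1.0, 2.0, 3.0, 4.0])
    mask = np.array([1, 1, 0, 1])   # 1 = good, 0 = bad

    # 'sum': zeroing the bad pixel is the same as dropping it entirely
    np.sum(values * mask)       # 7.0
    values[mask == 1].sum()     # 7.0

    # 'mean': the zeroed pixel still counts toward the denominator
    np.mean(values * mask)      # 1.75
    values[mask == 1].mean()    # 2.333...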

@jrmlhermitte (Contributor)

@chrisvam never mind, I got them to agree (the way I did it, mean should have worked). The issue was something silly: I was no longer specifying all the r, phi coordinates, only the parts that were not masked, which effectively changed the bin_edges that scipy's binned_statistic computed. To fix that, rather than giving bin numbers, I supplied the same bin edges calculated by BinnedStatistic to scipy's binned_statistic. See here:
https://github.com/ordirules/notebooks/blob/master/test_BinnedStatistic2D_withmask.ipynb
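Roughly, the comparison fix looks like this sketch (r, phi, img, mask, and stat are hypothetical names; stat stands for the constructed BinnedStatistic-style object and is assumed to expose its bin edges):

    from scipy.stats import binned_statistic_2d

    # feed scipy the same bin edges the skbeam object computed, so dropping
    # the masked points cannot shift the automatically chosen edges
    keep = mask == 1
    ref, r_edge, phi_edge, _ = binned_statistic_2d(
        r[keep], phi[keep], img[keep],
        statistic='mean', bins=stat.edges)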

Anyway, as far as I can tell, they agree exactly, so your trick works just fine and makes the code look intuitive. Personally, I feel it might be better to save the mask indices (just to reduce computation time) and only compute binned_statistic on those selected values. But given the complexity of the code, perhaps we could take it one step at a time: first work on merging this, then run timing/performance tests on different versions of mask indexing later?

I will send a suggested PR a little later and feel free to accept/reject the changes (it may be better to incorporate them manually one by one than just clicking merge). I look forward to using this code, it's much needed in the skbeam community!

@chrisvam (Contributor, author)

That all sounds good, Julien. If you send me a PR with your proposed changes, that would be awesome. Thank you very much for looking at this ... the code is a little bit fancy, so great to have a second person looking at it.

@jrmlhermitte (Contributor) commented Sep 18, 2016

Sounds good. I started looking at it again now; just one question before I submit a PR to you. I found out why get_r_phi didn't agree with radial_grid and angle_grid: the problem lies in angle_grid (not radial_grid). You can test this by using radial_grid but not angle_grid (and get_r_phi for the latter).

We make two different fundamental assumptions here:

self.phigrid = np.arctan2(colgrid, rowgrid)

I would argue that this should be the other way around: the row grid (going up and down in the array, assuming rows go up and down and columns go left and right, of course) should be the 'y' and the column grid should be 'x'. Would this be a suitable change of assumptions? I think accepting the change to:

self.phigrid = np.arctan2(rowgrid, colgrid)

would make the code compatible with angle_grid in a nice way. What do you think?
(Agreed, this code is involved; nice to have multiple people stabbing at it!)
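A quick sanity check of that claim (a sketch assuming angle_grid's default unit pixel size):

    import numpy as np
    from skbeam.core.utils import angle_grid

    shape = (5, 7)
    rowgrid, colgrid = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]),
                                   indexing='ij')
    # row as 'y', column as 'x', centered at (0, 0) to match get_r_phi
    assert np.allclose(np.arctan2(rowgrid, colgrid),
                       angle_grid((0, 0), shape))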

@chrisvam (Contributor, author)

I would have one request, Julien, if it's not too much trouble: I'm trying to get this PR in so that I can start to really "sell" skbeam to LCLS users, so they get used to the idea and hopefully excited about it (along with the histogramming code we worked on). So if you could get me your PR soon-ish (within some number of days?) I would be grateful. Thanks!

@jrmlhermitte (Contributor)

Done. Take a look, and let us know when you're done and what you agree or disagree with. I've tested what I sent you; results here:
https://github.com/ordirules/notebooks/blob/master/test_BinnedStatistic2D_withmask.ipynb

Then we can summarize for everyone else. Anyway, it works perfectly and the API is great, so making the code finer/faster (if at all possible) can be left to another PR.

thanks for submitting this, and I hope it sells well!

@chrisvam (Contributor, author)

Hi Julien,

Tom/Dan/Eric and I discussed the x/y row/col convention previously. As I understand it, the advantage of having x/y = col/row is that x/y appear natural in a typical imshow plot. The advantage of having x/y = row/col is that coding is more natural with arr[x][y] (in my original PR I gave the user the ability to choose between the two, but Tom argued that was too complex, and I agree).

I think the argument at the time was that imshow behavior could be changed with the "origin" keyword, so we opted for the second convention. I have some preference for that second convention, since arr[y][x] feels unintuitive, but I don't feel super strongly about it. Whatever we choose, however, we should be consistent across all skbeam software. Perhaps @danielballan or @tacaswell wants to chime in with a convention decision?
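For context, the imshow point amounts to this minimal matplotlib sketch:

    import matplotlib.pyplot as plt
    import numpy as np

    img = np.arange(12).reshape(3, 4)
    # origin='lower' puts row 0 at the bottom, so a row/col-indexed array
    # still displays with 'y' increasing upward, as in a conventional plot
    plt.imshow(img, origin='lower')
    plt.show()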

@jrmlhermitte (Contributor)

I apologize for running in circles on this topic, and thanks for bringing it into focus. I also came from an [x,y] convention in the past (yorick, http://yorick.sourceforge.net/). However, in my case the fastest varying index in the array was still x (i.e. a[x,y] in yorick was equivalent to a[y,x] in numpy for the same physical structure in memory; both are still accessed as [x + width*y] when the array is collapsed to 1D).
So I would vote for x being the fastest varying index, hence the notation [y,x] here. The notation changes when moving to numpy, but I feel it makes more intuitive sense. I think thinking of it in terms of the fastest varying index (and documenting it that way) could help with sharing the data across platforms (C).

Any other thoughts? I could be wrong about this and I also would like some input as to what is a good way to think about this. Thanks!

# Apply mask in a non-ideal way by setting value outside range.
# Would be better to do this using bincount "weights", perhaps.
thissample = sample[:, i]
thissample[mask == 0] = (self.edges[i][0] -
Member:

Suggest putting the mask test here. If there is no mask, you will do a whole bunch of work to not do any work; checking if mask is not None first is cheaper.

chrisvam (Contributor, author):

Agreed. I have added this in the most recent commit. Also included @ordirules suggestion of using radial_grid/angle_grid and updated phi-definition documentation accordingly.
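The guard amounts to something like this sketch (a standalone rewrite for illustration; the helper name and the sentinel offset of 1 are assumed):

    import numpy as np

    def _apply_mask(sample, mask, edges):
        """Push masked points below the lowest edge of each dimension so
        they land in the overflow bin and drop out of every statistic."""
        if mask is None:       # cheap test first: no mask means no work
            return sample
        for i in range(sample.shape[1]):
            thissample = sample[:, i]
            thissample[mask == 0] = edges[i][0] - 1
        return sample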

@tacaswell (Member)

@ordirules I do not follow you about C vs numpy. Both are row-major and have the fastest axis last in the [] notation.

These sorts of x/y vs i/j vs row/col issues have been a constant headache for me 😞

It is entirely possible that angle_grid is wrong, I will need to look at plots of them to check.

@jrmlhermitte (Contributor) commented Sep 19, 2016

@tacaswell I'm sorry, my mistake; too many languages. Indeed C and numpy share the same convention (and matlab, as far as I can see online). However, yorick has yet another convention: its fastest varying index is actually the leftmost index! Yet more evidence to support the headache :-( (see here, second line)

Sounds good; angle_grid looks fine to me, but it's definitely worth scrutiny. In the meantime, I think I know what @danielballan's vote might be, since git blame says he's the author of this ;-) (in the angle_grid definition):

    # row is y, column is x. "so say we all. amen."
    x, y = np.meshgrid(pixel_size[1] * (np.arange(shape[1]) -
                                        center[1]),
                       pixel_size[0] * (np.arange(shape[0]) -
                                        center[0]))
    return np.arctan2(y, x)

@danielballan (Member)

Thanks, @chrisvam. @tacaswell and @ordirules and I are meeting in person and are happy with this. Thanks for going through the revisions.

@danielballan merged commit 4e47503 into scikit-beam:master on Sep 21, 2016
@@ -434,36 +458,26 @@ def __init__(self, rbins, phibins, rowsize, colsize,
will be called on the values in each bin. Empty bins will be
represented by function([]), or NaN if this returns an error.
"""
if origin is None:
    origin = shape[0]//2, shape[1]//2
Contributor:

@chrisvam @danielballan @tacaswell

sorry to reference a merged PR, but I forgot to mention this. I think this should rather be:

    origin = shape[0]/2., shape[1]/2.

the reason being that origin may be a float (so we can exactly center on the image). In either case, it's just a default (and I guess a bikeshed).

Would there be objections to changing this default? (I would submit a new PR, of course.)

Member:

Do we have detectors with odd numbers of rows or columns?

Contributor:

I was thinking more for simulations. I generate lots of simulations to predict/confirm certain results and I do play with the parity of pixel number just to test for bugs.

In either case, I can just manually supply the center. I figured I'd ask in case others would agree with this convention.

Member:

I mean, in general I don't have any issue with it; I was mostly curious how frequently that issue would be hit.

nit: you don't need the . after the 2, since we import division from __future__, which makes division with / always do float division.

chrisvam (Contributor, author):

I think it can't hurt, so it would be fine with me. People in principle could run this algorithm on a region-of-interest which had an odd number of rows/cols, for example. I'm happy to do a PR or you can. Totally up to you...

Contributor:

Sure then, let's do it before it becomes used routinely. If you could submit it, that would be great!
Correction: I actually meant this (if an index refers to the pixel center, with 0-based indexing the true center should be):

    origin = (shape[0]-1)/2, (shape[1]-1)/2
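The arithmetic behind that correction, for 0-based indices at pixel centers:

    # 5 pixels have centers at indices 0..4, so the true center is
    # (5 - 1) / 2 = 2.0; 4 pixels (centers 0..3) give 1.5, exactly
    # between the two middle pixels.
    shape = (5, 4)
    origin = ((shape[0] - 1) / 2, (shape[1] - 1) / 2)   # (2.0, 1.5)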

@ericdill thanks for the info, nice seeing you on skbeam :-)
