Generalize binning #466
Conversation
Current coverage is 87.74% (diff: 96.15%)

@@            master     #466   diff @@
=======================================
  Files           37       36     -1
  Lines         2748     2742     -6
  Methods          0        0
  Messages         0        0
  Branches         0        0
=======================================
+ Hits          2394     2406    +12
+ Misses         354      336    -18
  Partials         0        0
I just became aware of this code, and it is definitely useful for us too :-). (I might also try to carefully review and give feedback tonight if I can.) Maybe mention Tom and Dan using '@' so they see it?
@@ -72,6 +72,9 @@ def __init__(self, sample, statistic='mean',
        minimum and maximum values along each dimension.
        """
        if mask is None:
            mask = np.ones_like(sample)
We need to document whether the mask convention is '1-for-good-pixels' or '1-for-bad-pixels' somewhere, and make sure we are consistent about it.
attn @CJ-Wright
This is already documented in the threshold mask you wrote, here. The current convention is that 1 is a good pixel and 0 is a bad one. This way img[mask] will result in all the good pixels. FYI, pyFAI has the exact opposite convention.
I have updated the documentation for mask specifying that 0-values are ignored. Interesting that pyFAI does it the other way. To add to what CJ wrote: our convention also allows img*mask to do something natural. I'd be interested to hear what the pyFAI people have to say about the other convention. Maybe we can ask them on our next phone call.
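To make the convention concrete, here is a minimal numpy sketch (the image values and bad-pixel location are made up for illustration):

```python
import numpy as np

# Hypothetical 3x3 detector image with one bad pixel at (1, 1).
img = np.arange(9, dtype=float).reshape(3, 3)
mask = np.ones_like(img, dtype=bool)
mask[1, 1] = False  # 0/False marks a bad pixel (the convention above)

good = img[mask]     # 1D array of the 8 good pixel values
zeroed = img * mask  # bad pixels zeroed, image shape preserved
```

Both idioms rely on 1 meaning "good"; under pyFAI's opposite convention one would have to invert the mask first.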
    def __call__(self, values):
        return super(BinnedStatistic2D, self).__call__(values)
def get_r_phi(shape, origin):
I would use utils.radial_grid and utils.angle_grid rather than add a new function. The drawback is that you need two function calls. Another option is to integrate this into utils somehow; perhaps create an angle_radial_grid? As a placeholder, it could just call the two functions for now and eventually be changed into something more optimized. My vote is the two separate function calls for now. The overhead is small for an object that shouldn't be initialized often.
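For reference, a numpy-only sketch of what such a pair of grid functions computes; the names and the (origin, shape) calling pattern here are assumptions for illustration, not skbeam's actual implementation:

```python
import numpy as np

def radial_grid(origin, shape):
    # Distance of each pixel from the origin (sketch, not skbeam's code).
    rowgrid, colgrid = np.meshgrid(np.arange(shape[0]) - origin[0],
                                   np.arange(shape[1]) - origin[1],
                                   indexing='ij')
    return np.hypot(rowgrid, colgrid)

def angle_grid(origin, shape):
    # Azimuthal angle of each pixel, treating row as y and column as x.
    rowgrid, colgrid = np.meshgrid(np.arange(shape[0]) - origin[0],
                                   np.arange(shape[1]) - origin[1],
                                   indexing='ij')
    return np.arctan2(rowgrid, colgrid)

r = radial_grid((0, 0), (4, 5))
phi = angle_grid((0, 0), (4, 5))
```

The two separate calls each build their own meshgrid, which is the small overhead mentioned above.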
        """
        if mask is None:
            mask = np.ones_like(sample)
I don't understand (but I may be wrong! :-) ). sample is the bins. How can mask be the same shape as sample? (i.e. for 2D bins, the shape of sample will be (2, dimx, dimy), where dimx, dimy are the dimensions of the sample.)
How are you testing the code? Are you using real data? I think for something like this, we should have a working example visually tested and posted before accepting this, regardless of the tests.
I'm also happy to write a notebook later using your code, to see if it works, if that helps :-)
I also think that the mask is data dependent. I would vote for keeping the mask at a higher level (i.e. BinnedStatistic2D).
Hi Julien,
I think you're correct that my mask shape was wrong. If sample has shape (N, D), then I think mask should have shape (N,). E.g. if one is producing a projection onto 2D r-phi, the "bad pixel" mask should be the same for both r and phi. I was testing the 1D mask (with help from Tom) by comparing the output with scipy's binned_statistic, but I wasn't testing the 2D case. I will try to add that, comparing with binned_statistic_2d.
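A small sketch of that shape relationship, with made-up values:

```python
import numpy as np

# sample holds N points in D=2 coordinates (e.g. r and phi per pixel),
# so a single per-pixel mask of shape (N,) covers both coordinates.
N = 6
r = np.linspace(0.0, 1.0, N)
phi = np.linspace(-np.pi, np.pi, N)
sample = np.stack([r, phi], axis=1)  # shape (N, 2)

mask = np.ones(N, dtype=bool)
mask[2] = False                      # one bad pixel masks both r and phi

masked_sample = sample[mask]         # shape (N - 1, 2)
```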
Thanks for all the ideas. Appreciated...
I also meant to add: I started out trying to keep the mask at a higher level, but when I generalized the binning to behave more like the underlying scipy code, it became more difficult. Even now the way I'm doing the mask is not ideal, but at least the interface is better.
I think the latest commit tests the 2D RPhiBinnedStatistic case.
Yeah, I agree with you and thought about it later. The masking probably makes the most sense at the lower level, since it could be generalized to anything (and binnedstatistic expects the incoming data to match the shape of the bins, so masking then follows naturally there).
I want to also test your mask with my code later (it might help me better understand the code) when I have access to my machine; maybe I'll have some advice to add (but can't now). Yeah, we're all busy multitasking :-).
I would take code repetition over getting too fancy.
@@ -23,51 +24,112 @@ def setup(self):
        colarr = np.arange(colsize)
        rowgrid, colgrid = np.meshgrid(rowarr, colarr, indexing='ij')
        self.rgrid = np.sqrt(rowgrid**2 + colgrid**2)
        self.phigrid = np.arctan2(colgrid, rowgrid)
Actually, I think the test should also use radial_grid and angle_grid. Change this to:

from skbeam.core.utils import angle_grid, radial_grid
self.rgrid = radial_grid((0, 0), (rowsize, colsize))
self.phigrid = angle_grid((0, 0), (rowsize, colsize))

I think once you do this, the previous suggestion of using angle_grid and radial_grid should work. I fetched your code and made those two changes and it worked for me :-). I think this will help us keep the internal functions self-consistent. What do you think?
PS: all suggested import statements should be moved to the top of the file as well. I think it helps to better understand the dependencies. (@tacaswell, unless you think otherwise? I don't want to suggest anything incompatible with the general philosophy.)
Hmm. I'll have to think about this. That suggests that using the radial_grid/phi_grid is fundamentally changing the grid compared to what I had previously. Ideally I would like to use your calls, but keep the grid identical to what it was previously so that the definition of phi/r don't change, for example. But perhaps I'm missing something, so feel free to disagree...
As far as I can tell, it seems to be just a numerical precision error (I checked the arrays that resulted during the test); the difference is in the 16th decimal digit or so. For example, replacing the test with:

assert_array_almost_equal(ref, binned, decimal=10)

will pass. So under the hood, I would be tempted to say that the calculation is not exactly the same in terms of numerical operations, but the general API is not broken and what we expect it to do has not changed. So I think one option could be to leave the code as is and use assert_array_almost_equal in the test. What do you think?
scratch out this comment. I was testing the wrong version of the code. It indeed fails, need to think more carefully why
            rrange = (rpix.min(), rpix.max())
        if phirange is None:
            phirange = (-np.pi, np.pi)
        self.expected_shape = shape
        if mask is not None:
            if mask.shape != self.expected_shape:
                raise ValueError('"mask" has incorrect shape. '
                                 ' Expected: ' + str(self.expected_shape) +
                                 ' Received: ' + str(mask.shape))
in the if statement put:
mask = mask.reshape(-1)
                bins=(rbins, phibins),
                range=(rrange, phirange))
                bins=bins,
                mask=mask.reshape(-1),
change this to
mask=mask
OK, I ran a quick test on the 2D binning using scipy and the new routine here. I fetched @chrisvam's code first; I had to make a slight modification on the line mentioned for it to work. The results agree with scipy without a mask, but not with a mask. See the two notebooks I tried (we will be using this code, so we're testing already :-) ). You won't be able to run the notebooks, but can you see the results?

Going through it now, and I might have some other suggestions for masking. @chrisvam, I could possibly send you a pull request with suggested changes, and you can review them and decide to merge. Would that be okay?

This is great code and a very useful and needed feature for scikit-beam, thanks for putting this in!
Hi Julien,

Yeah, it would be great if you had a PR that I could incorporate. I think it makes sense that 'mean' doesn't agree with scipy, because the way I'm implementing the mask currently is to "move" the masked pixel into an overflow bin (this is not ideal; using the bincount weights argument would probably be better). But no matter how the mask is implemented, it will change the result of 'mean' vs. scipy, because some pixels have been moved: we are doing something fundamentally different. 'sum', however, should work, since for that operation setting a pixel to zero is the same as removing it entirely. That's why in my test code I only compare 'sum' to scipy when using the mask. Would that explain the difference you see?
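The sum-vs-mean point can be checked with a tiny numpy-only example (the names here are ad hoc, not skbeam's API):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
bins = np.array([0, 0, 1, 1])              # bin index of each value
mask = np.array([True, False, True, True])  # one masked value in bin 0

# 'sum': zeroing a masked value is the same as removing it entirely.
sum_zeroed = np.bincount(bins, weights=values * mask, minlength=2)
sum_removed = np.bincount(bins[mask], weights=values[mask], minlength=2)

# 'mean': the per-bin counts differ, so the two approaches disagree
# in any bin that contains a masked value (bin 0 here).
mean_zeroed = sum_zeroed / np.bincount(bins, minlength=2)
mean_removed = sum_removed / np.bincount(bins[mask], minlength=2)
```

Here the two sums agree, while mean_zeroed[0] is 0.5 versus 1.0 for mean_removed[0], because the denominator still counts the masked value in the first case.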
@chrisvam never mind, I got them to agree the way I did it. Anyway, as far as I can tell, they agree exactly, so your trick works just fine and makes the code look intuitive.

Personally, I feel it might be better to save the mask indexes (just to reduce computation time) and only compute binned_statistic on those selected values. But given the complexity of the code, perhaps we could take it one step at a time: first work on merging this, and run timing/performance tests on different versions of mask indexing later?

I will send a suggested PR a little later; feel free to accept/reject the changes (it may be better to incorporate them manually one by one than just clicking merge). I look forward to using this code, it's much needed in the skbeam community!
That all sounds good, Julien. If you send me a PR with your proposed changes, that would be awesome. Thank you very much for looking at this ... the code is a little bit fancy, so it's great to have a second person looking at it.
Sounds good. I started looking at it again just now; one question before I submit a PR to you. I found the problem. We make two different fundamental assumptions here:

self.phigrid = np.arctan2(colgrid, rowgrid)

I would argue that this should be the other way around: the row grid (going up and down in the array, assuming the row goes up and down and the column goes left and right, of course) should be the 'y', and the column grid should be the 'x'. Would this be a suitable change of assumptions? I think accepting the change to:

self.phigrid = np.arctan2(rowgrid, colgrid)

would make the code compatible with it.
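A quick numpy check of how the two argument orders differ (illustrative values only):

```python
import numpy as np

rowgrid, colgrid = np.meshgrid(np.arange(3), np.arange(3), indexing='ij')

phi_old = np.arctan2(colgrid, rowgrid)  # column treated as y (current)
phi_new = np.arctan2(rowgrid, colgrid)  # row treated as y (proposed)

# For points with positive coordinates the two angles are reflections
# about the 45-degree line, i.e. phi_old + phi_new == pi/2 there.
```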
I would have one request, Julien, if it's not too much trouble: I'm trying to get this PR in so that I can start to really "sell" skbeam to LCLS users, so they start getting used to the idea and hopefully excited about it (along with the histogramming code we worked on). So if you can get me your PR soon-ish (within some number of days?) I would be grateful. Thanks!
Done. Take a look, and let us know when you're done what you agree or disagree with. I've tested what I sent you; results here. Then we can summarize for everyone else. Anyway, it works perfectly and the API is great, so making the code finer/faster (if at all possible) could be left to another PR. Thanks for submitting this, and I hope it sells well!
Hi Julien,

Tom/Dan/Eric and I discussed the x/y row/col convention previously. As I understand it, the advantage of having x/y = col/row is that x/y appear natural in a typical imshow plot. The advantage of having x/y = row/col is that coding is more natural with arr[x][y] (in my original PR I gave the user the ability to choose between the two, but Tom argued that was too complex, and I agree). I think the argument at the time was that imshow behavior could be changed with the "origin" keyword, so at the time we opted for the second convention. I have some preference for that second convention, since arr[y][x] feels unintuitive, but I don't feel super strongly about it. Whatever we choose, however, we should be consistent across all skbeam software. Perhaps @danielballan or @tacaswell wants to chime in with a convention decision?
I apologize for running in circles on this topic, and thanks for bringing it into focus. I also came from an [x, y] convention in the past (yorick) [http://yorick.sourceforge.net/]. However, in my case, the fastest-varying index in the array was still x (i.e. a[x,y] in yorick was equivalent to a[y,x] in numpy for the same physical structure in memory, so both are still accessed as [x + width*y] when the array is collapsed to 1D). Any other thoughts? I could be wrong about this, and I would also like some input as to what is a good way to think about this. Thanks!
            # Apply mask in a non-ideal way by setting value outside range.
            # Would be better to do this using bincount "weights", perhaps.
            thissample = sample[:, i]
            thissample[mask == 0] = (self.edges[i][0] -
Suggest putting the mask test here. If there is no mask, you will do a whole bunch of work to not do any work. if mask is not None is cheaper.
Agreed. I have added this in the most recent commit. Also included @ordirules suggestion of using radial_grid/angle_grid and updated phi-definition documentation accordingly.
@ordirules I do not follow you about C vs numpy. Both are row-major and have the fastest axes last.

These sorts of x/y vs i/j vs row/col issues have been a constant headache for me 😞 It is entirely possible that angle_grid is wrong; I will need to look at plots of them to check.
…uggested x=col y=row convention
@tacaswell I'm sorry, my mistake, too many languages. Indeed, C and numpy share the same convention (and matlab, as far as I can see online). However, yorick has yet another convention: its fastest-varying index is actually the leftmost index! Yet more evidence to support the headache :-( (see here, second line)

Sounds good,

# row is y, column is x. "so say we all. amen."
x, y = np.meshgrid(pixel_size[1] * (np.arange(shape[1]) -
center[1]),
pixel_size[0] * (np.arange(shape[0]) -
center[0]))
return np.arctan2(y, x) |
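To illustrate the row-major point above: with numpy's default C order the last index varies fastest, so a[i, j] sits at flat offset i*ncols + j:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # C-ordered, the numpy default
flat = a.ravel()                # row-major flattening

# a[1, 2] is at flat position 1*3 + 2 == 5 (last index fastest),
# which is the opposite of yorick's leftmost-fastest convention.
```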
Thanks, @chrisvam. @tacaswell and @ordirules and I are meeting in person and are happy with this. Thanks for going through the revisions. |
@@ -434,36 +458,26 @@ def __init__(self, rbins, phibins, rowsize, colsize,
        will be called on the values in each bin. Empty bins will be
        represented by function([]), or NaN if this returns an error.
        """
        if origin is None:
            origin = shape[0]//2, shape[1]//2
@chrisvam @danielballan @tacaswell sorry to reference a merged PR, but I forgot to mention this. I think this should rather be:

origin = shape[0]/2., shape[1]/2.

the reason being that origin may be a float (so we can exactly center on the image). In either case, it's just a default (and I guess a BikeShed). Would there be objections to changing this default? (I would submit a new PR, of course.)
Do we have detectors with odd numbers of rows or columns?
I was thinking more for simulations. I generate lots of simulations to predict/confirm certain results and I do play with the parity of pixel number just to test for bugs.
In either case, I can just manually supply the center. I figured I'd ask in case others would agree with this convention.
I mean in general I don't have any issue with it. I was mostly curious as to how frequently that issue would be hit.
nit: you don't need the . after the 2, since we are importing division from __future__, which makes division with / always do float division
I think it can't hurt, so it would be fine with me. People could in principle run this algorithm on a region of interest with an odd number of rows/cols, for example. I'm happy to do a PR, or you can. Totally up to you...
sure then, let's do it before it becomes used routinely. If you could submit that would be great!
Correction: I also meant (if an index refers to the pixel center, for 0-based indexing, then the true center should be this):

origin = (shape[0]-1)/2, (shape[1]-1)/2
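As a sketch of that proposed default (the helper name is hypothetical, not part of skbeam):

```python
def default_origin(shape):
    # Geometric center of the pixel grid, where integer index i refers
    # to the center of pixel i (hence the -1). Python 3 / future-division
    # semantics assumed, so / is float division.
    return ((shape[0] - 1) / 2, (shape[1] - 1) / 2)
```

For an even-sized detector the center falls between pixels; for an odd-sized one it lands exactly on the middle pixel.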
@ericdill thanks for the info, nice seeing you on skbeam :-)
As I discussed previously with Tom, I realized that the binning specification in the RPhiBinnedStatistic and RadialBinnedStatistic was not as flexible as what was underneath in the scipy binned_statistic code. This PR tries to address that. Over the course of doing this, it became clear that my inheritance wasn't structured correctly, so I tried to fix that.
There are three things that seem non-ideal to me:
But this PR improves the interface, which is important.
p.s. this time I started out with a fresh fork/clone and did the "magic 4 commands" that Tom sent me to create the feature-branch. So hopefully there will not be git issues this time. Learning.