New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix incorrect propagation of dtype in Series normalize and other methods #46
Conversation
@@ -306,7 +312,7 @@ def correlate(self, signal, var='s'): | |||
var : str | |||
Variable name if loading from a MAT file | |||
""" | |||
|
|||
# TODO: how to handle dtype here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can assume correlation coefficient(s) should always end up as float64s, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - at the time I added that comment, I was momentarily confused by something dumb. Just updated the PR to specify this as float64.
Left a bunch of a comments on this one. It's definitely something we need, but I had some suggestions for clarification, and at least one idea about refactoring (by moving |
I'm happy with moving
Am with you on the extra complexificationality all this adds to the Series computation methods (and the Images methods as well, though there are few / none of those at present). Here are the alternatives I can think of:
I can take a quick look at the Pandas code. Fundamentally though the problem we've got here is kind of an unusual one - we've got an attribute that we'd like to have set, but looking up the correct value for it is costly (requires a |
Ok, the commits I just added to this PR make a couple big changes:
For binary series files, there were two choices for where to enforce this - either before the files are saved out, so that series binary files (at least, those created by a standard mechanism) will always have floating point data; or at loading time, so that if int data is coming up off disk we'll cast it to a float. I went with the first option, saying that float data is what gets written to disk, on the grounds that by the time the data gets persisted it really is a Series, not some weird intermediate "ints that will get cast to floats" representation. This also means that existing int-valued Series binary data will continue to behave as previously, without any casting slipping in. One can still explicitly cast a Series object back to an int data type using astype - I figure if you explicitly do this, you probably know what you're doing, and wouldn't then be surprised if normalize() and friends give unexpected results. I'm not totally thrilled by this, since there is still a very real potential for weirdness downstream if via some nonstandard execution path one ends up with an int-valued Series object, but the assumption that "all Series are floats" does simplify the implementation of the Series methods. |
Thinking about this a little more... I'm not happy leaving the back door open to int-valued Series like that. If we're going to say "Series have floating-point values", and base method implementations on that assumption, then we should make every effort to ensure that this assumption is valid. In this interpretation, a "binary Series file" with integer values isn't really a valid Series at all. However, we could, as a courtesy and to aid in bringing old data forward to work with newer versions of the code, automatically convert such now-invalid Series data to an equivalent, valid format by casting it to floats. This breaks the assumption that a round trip from disk to in-memory object to disk should generate the same files (since the second time writing to disk we're now writing floats), but if we look at that as a "conversion of old data" step rather than a "this is just what we do when loading integer values" step then I'm more okay will it. Will add another commit later today handling the conversion of ints to floats as they come up off disk. Might also consider prohibiting using "astype" to convert Series data back down to ints, although the default casting="safe" flag should prevent that anyway. |
@industrial-sloth thanks for the updates, I think this is moving in the right direction! I think I have a different take on one top-level issue though. What I was imagining in this plan is that we wouldn't change anything about how we save data to disk, and instead do all the casting of Re: your last point, I think we should allow casting Series data to ints because the |
@freeman-lab yeah, sounds fine to me. I'm obviously a little torn about how best to implement this, and if you've got a strong instinct here I'm happy to go with that. :) |
…nst explicit casting series to non-float
@@ -523,7 +551,7 @@ def blockToBinarySeries(kviter): | |||
overwrite=overwrite) | |||
|
|||
def saveFromStack(self, datapath, outputdirpath, dims, ext="stack", blockSize="150M", datatype='int16', | |||
startidx=None, stopidx=None, overwrite=False): | |||
newdtype='smallfloat', casting='safe', startidx=None, stopidx=None, overwrite=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this write binary series data as floats by default, whereas we were going to stick with ints when writing stack data to series (so long as they are ints in the first place)? Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, that right there is a garden-variety screwup. Will fix. Thanks for the catch!
@industrial-sloth LGTM, except for one question I left about types during one of the saving paths that confused me. After checking that, I'll merge it in, but will then do a fair bit of Q&A with our various workflows. |
@freeman-lab ok, have gone back over the SeriesLoader class and made sure it Does The Right Thing wrt data types whether loading straight into memory or saving out binary files. I think I just forgot about the SeriesLoader "save" path earlier when removing the cast to float on writing out series files - I got the equivalent Images pathway, but just missed this one. Anyway, have also added a bunch of data type assertions to tests, both the unit tests here and my quasi-automated conversion test script, and so am reasonably confident this should be solid at this point. Famous last words right? |
Sorry for the delay, LGTM! Great work figuring all this out, I think we've landed on a good solution. |
fix incorrect propagation of dtype in Series normalize and other methods
This PR addresses a bug in
Series.normalize()
and other methods, where thedtype
attribute of the output was being set incorrectly to thedtype
of the input RDD.After this patch, the default behavior for
apply()
and most other methods that can potentially produce output with a dtype different from the input will be to leave this attribute unset, to be lazily determined as needed by making an implicit call tofirst()
when thedtype
attribute is requested.For
normalize()
in particular, the output dtype will now be the smallest floating-point type that can safely store the data without over/underflow, as determined bycommons.smallest_float_type()
. This will be properly set on the output Series, so that no implicitfirst()
call is needed.@freeman-lab, if you're okay with this PR, I can take care of merging it into master from the 0.4.x branch.