Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix incorrect propagation of dtype in Series normalize and other methods #46

Merged
merged 9 commits into from Nov 15, 2014

Conversation

industrial-sloth
Copy link
Contributor

This PR addresses a bug in Series.normalize() and other methods, where the dtype attribute of the output was being set incorrectly to the dtype of the input RDD.

After this patch, the default behavior for apply() and most other methods that can potentially produce output with a dtype different from the input will be to leave this attribute unset, to be lazily determined as needed by making an implicit call to first() when the dtype attribute is requested.

For normalize() in particular, the output dtype will now be the smallest floating-point type that can safely store the data without over/underflow, as determined by commons.smallest_float_type(). This will be properly set on the output Series, so that no implicit first() call is needed.

@freeman-lab, if you're okay with this PR, I can take care of merging it into master from the 0.4.x branch.

@@ -306,7 +312,7 @@ def correlate(self, signal, var='s'):
var : str
Variable name if loading from a MAT file
"""

# TODO: how to handle dtype here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can assume correlation coefficient(s) should always end up as float64s, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - at the time I added that comment, I was momentarily confused by something dumb. Just updated the PR to specify this as float64.

@freeman-lab
Copy link
Member

Left a bunch of a comments on this one. It's definitely something we need, but I had some suggestions for clarification, and at least one idea about refactoring (by moving apply to Data instead of Series). I'm a little worried that it adds overhead for anyone wanting to contribute new Series methods, even very simple ones, but probably worth that cost. We might want to look again at how Pandas handles this same scenario.

@industrial-sloth
Copy link
Contributor Author

I'm happy with moving apply() up to Data, and can get that taken care of. Other things to address:

  • I'll see if I can clear up the nopropagate docstring comment regarding underscores, and will correct the "from self" / "to self" confusion there.
  • I do not yet love any alternatives for the name of the current expectedDtype parameter on apply(), including expectedDtype itself. Were it not for the implication that it would lead to an explicit casting of the underlying numpy data, I'd say just call it dtype... not sure what to do here.
  • for things like seriesMean and the other series stats functions, correlate, some others, we can go ahead and propagate through the output dtype, rather than leaving it lazily set as at present. This is modestly aggravating but not difficult.

Am with you on the extra complexificationality all this adds to the Series computation methods (and the Images methods as well, though there are few / none of those at present). Here are the alternatives I can think of:

  • drop the __finalize__ logic from methods where a dtype may change, and instead just pass all attributes explicitly in the constructor. This I think is a nonstarter, because any new attributes added in a subclass wouldn't be propagated forward.
  • Leave the _dtype attribute unset in general, to be lazily determined by a first() call. This ends up looking much like the code in this current PR, which does exactly this in most cases. The additional complexity is limited to a somewhat obscure additional nopropagate=('_dtype',) argument.
  • Something else that I haven't thought of. :)

I can take a quick look at the Pandas code. Fundamentally though the problem we've got here is kind of an unusual one - we've got an attribute that we'd like to have set, but looking up the correct value for it is costly (requires a first() call). So we either defer this cost via lazy resolution, or we add in some additional logic to figure out / enforce what the value should be (casting to smallest float, etc). Basically I guess it's an efficiency versus complexity tradeoff.

@industrial-sloth
Copy link
Contributor Author

Ok, the commits I just added to this PR make a couple big changes:

  • as we discussed, the type checking and casting within the series methods has been removed;
  • instead, we now enforce that Series will be floating-point valued at the time of Series generation.

For binary series files, there were two choices for where to enforce this - either before the files are saved out, so that series binary files (at least, those created by a standard mechanism) will always have floating point data; or at loading time, so that if int data is coming up off disk we'll cast it to a float. I went with the first option, saying that float data is what gets written to disk, on the grounds that by the time the data gets persisted it really is a Series, not some weird intermediate "ints that will get cast to floats" representation. This also means that existing int-valued Series binary data will continue to behave as previously, without any casting slipping in.

One can still explicitly cast a Series object back to an int data type using astype - I figure if you explicitly do this, you probably know what you're doing, and wouldn't then be surprised if normalize() and friends give unexpected results.

I'm not totally thrilled by this, since there is still a very real potential for weirdness downstream if via some nonstandard execution path one ends up with an int-valued Series object, but the assumption that "all Series are floats" does simplify the implementation of the Series methods.

@industrial-sloth
Copy link
Contributor Author

Thinking about this a little more... I'm not happy leaving the back door open to int-valued Series like that. If we're going to say "Series have floating-point values", and base method implementations on that assumption, then we should make every effort to ensure that this assumption is valid.

In this interpretation, a "binary Series file" with integer values isn't really a valid Series at all. However, we could, as a courtesy and to aid in bringing old data forward to work with newer versions of the code, automatically convert such now-invalid Series data to an equivalent, valid format by casting it to floats. This breaks the assumption that a round trip from disk to in-memory object to disk should generate the same files (since the second time writing to disk we're now writing floats), but if we look at that as a "conversion of old data" step rather than a "this is just what we do when loading integer values" step then I'm more okay will it.

Will add another commit later today handling the conversion of ints to floats as they come up off disk. Might also consider prohibiting using "astype" to convert Series data back down to ints, although the default casting="safe" flag should prevent that anyway.

@freeman-lab
Copy link
Member

@industrial-sloth thanks for the updates, I think this is moving in the right direction! I think I have a different take on one top-level issue though. What I was imagining in this plan is that we wouldn't change anything about how we save data to disk, and instead do all the casting of Series data to floats at load time. Why? Storing data is expensive, and the gain of storing integer valued Series data as floats, that it matches the in-memory definition, doesn't seem worth the significant cost. Again, examining all usage patterns thus far, the input data for Series have been integers or floats (and it's nice to store the former compactly), but they almost always need to be turned into floats for any analysis, so we're now just doing that at load time.

Re: your last point, I think we should allow casting Series data to ints because the casting='safe' should prevent it from happening unless the user really knows what she's doing.

@industrial-sloth
Copy link
Contributor Author

@freeman-lab yeah, sounds fine to me. I'm obviously a little torn about how best to implement this, and if you've got a strong instinct here I'm happy to go with that. :)
Am getting another commit cued up now that removes the casting to floats as Series go out to disk, and removes the active "throw an error if you're trying to cast a Series to not-float" check.

@@ -523,7 +551,7 @@ def blockToBinarySeries(kviter):
overwrite=overwrite)

def saveFromStack(self, datapath, outputdirpath, dims, ext="stack", blockSize="150M", datatype='int16',
startidx=None, stopidx=None, overwrite=False):
newdtype='smallfloat', casting='safe', startidx=None, stopidx=None, overwrite=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this write binary series data as floats by default, whereas we were going to stick with ints when writing stack data to series (so long as they are ints in the first place)? Or am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, that right there is a garden-variety screwup. Will fix. Thanks for the catch!

@freeman-lab
Copy link
Member

@industrial-sloth LGTM, except for one question I left about types during one of the saving paths that confused me. After checking that, I'll merge it in, but will then do a fair bit of Q&A with our various workflows.

@industrial-sloth
Copy link
Contributor Author

@freeman-lab ok, have gone back over the SeriesLoader class and made sure it Does The Right Thing wrt data types whether loading straight into memory or saving out binary files. I think I just forgot about the SeriesLoader "save" path earlier when removing the cast to float on writing out series files - I got the equivalent Images pathway, but just missed this one. Anyway, have also added a bunch of data type assertions to tests, both the unit tests here and my quasi-automated conversion test script, and so am reasonably confident this should be solid at this point. Famous last words right?

@freeman-lab
Copy link
Member

Sorry for the delay, LGTM! Great work figuring all this out, I think we've landed on a good solution.

freeman-lab added a commit that referenced this pull request Nov 15, 2014
fix incorrect propagation of dtype in Series normalize and other methods
@freeman-lab freeman-lab merged commit 92f44d2 into thunder-project:0.4.x Nov 15, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants