fix incorrect propagation of dtype in Series normalize and other methods #46

industrial-sloth · 2014-11-06T22:21:54Z

This PR addresses a bug in Series.normalize() and other methods, where the dtype attribute of the output was being set incorrectly to the dtype of the input RDD.

After this patch, the default behavior for apply() and most other methods that can potentially produce output with a dtype different from the input will be to leave this attribute unset, to be lazily determined as needed by making an implicit call to first() when the dtype attribute is requested.

For normalize() in particular, the output dtype will now be the smallest floating-point type that can safely store the data without over/underflow, as determined by commons.smallest_float_type(). This will be properly set on the output Series, so that no implicit first() call is needed.

@freeman-lab, if you're okay with this PR, I can take care of merging it into master from the 0.4.x branch.

freeman-lab · 2014-11-07T07:02:55Z

python/thunder/rdds/series.py

@@ -306,7 +312,7 @@ def correlate(self, signal, var='s'):
        var : str
            Variable name if loading from a MAT file
        """
-
+        # TODO: how to handle dtype here?


I think we can assume correlation coefficient(s) should always end up as float64s, right?

Yeah - at the time I added that comment, I was momentarily confused by something dumb. Just updated the PR to specify this as float64.

freeman-lab · 2014-11-07T07:21:30Z

Left a bunch of a comments on this one. It's definitely something we need, but I had some suggestions for clarification, and at least one idea about refactoring (by moving apply to Data instead of Series). I'm a little worried that it adds overhead for anyone wanting to contribute new Series methods, even very simple ones, but probably worth that cost. We might want to look again at how Pandas handles this same scenario.

industrial-sloth · 2014-11-07T12:59:40Z

I'm happy with moving apply() up to Data, and can get that taken care of. Other things to address:

I'll see if I can clear up the nopropagate docstring comment regarding underscores, and will correct the "from self" / "to self" confusion there.
I do not yet love any alternatives for the name of the current expectedDtype parameter on apply(), including expectedDtype itself. Were it not for the implication that it would lead to an explicit casting of the underlying numpy data, I'd say just call it dtype... not sure what to do here.
for things like seriesMean and the other series stats functions, correlate, some others, we can go ahead and propagate through the output dtype, rather than leaving it lazily set as at present. This is modestly aggravating but not difficult.

Am with you on the extra complexificationality all this adds to the Series computation methods (and the Images methods as well, though there are few / none of those at present). Here are the alternatives I can think of:

drop the __finalize__ logic from methods where a dtype may change, and instead just pass all attributes explicitly in the constructor. This I think is a nonstarter, because any new attributes added in a subclass wouldn't be propagated forward.
Leave the _dtype attribute unset in general, to be lazily determined by a first() call. This ends up looking much like the code in this current PR, which does exactly this in most cases. The additional complexity is limited to a somewhat obscure additional nopropagate=('_dtype',) argument.
Something else that I haven't thought of. :)

I can take a quick look at the Pandas code. Fundamentally though the problem we've got here is kind of an unusual one - we've got an attribute that we'd like to have set, but looking up the correct value for it is costly (requires a first() call). So we either defer this cost via lazy resolution, or we add in some additional logic to figure out / enforce what the value should be (casting to smallest float, etc). Basically I guess it's an efficiency versus complexity tradeoff.

…hod docstring changes

industrial-sloth · 2014-11-09T16:32:01Z

Ok, the commits I just added to this PR make a couple big changes:

as we discussed, the type checking and casting within the series methods has been removed;
instead, we now enforce that Series will be floating-point valued at the time of Series generation.

For binary series files, there were two choices for where to enforce this - either before the files are saved out, so that series binary files (at least, those created by a standard mechanism) will always have floating point data; or at loading time, so that if int data is coming up off disk we'll cast it to a float. I went with the first option, saying that float data is what gets written to disk, on the grounds that by the time the data gets persisted it really is a Series, not some weird intermediate "ints that will get cast to floats" representation. This also means that existing int-valued Series binary data will continue to behave as previously, without any casting slipping in.

One can still explicitly cast a Series object back to an int data type using astype - I figure if you explicitly do this, you probably know what you're doing, and wouldn't then be surprised if normalize() and friends give unexpected results.

I'm not totally thrilled by this, since there is still a very real potential for weirdness downstream if via some nonstandard execution path one ends up with an int-valued Series object, but the assumption that "all Series are floats" does simplify the implementation of the Series methods.

industrial-sloth · 2014-11-09T16:57:58Z

Thinking about this a little more... I'm not happy leaving the back door open to int-valued Series like that. If we're going to say "Series have floating-point values", and base method implementations on that assumption, then we should make every effort to ensure that this assumption is valid.

In this interpretation, a "binary Series file" with integer values isn't really a valid Series at all. However, we could, as a courtesy and to aid in bringing old data forward to work with newer versions of the code, automatically convert such now-invalid Series data to an equivalent, valid format by casting it to floats. This breaks the assumption that a round trip from disk to in-memory object to disk should generate the same files (since the second time writing to disk we're now writing floats), but if we look at that as a "conversion of old data" step rather than a "this is just what we do when loading integer values" step then I'm more okay will it.

Will add another commit later today handling the conversion of ints to floats as they come up off disk. Might also consider prohibiting using "astype" to convert Series data back down to ints, although the default casting="safe" flag should prevent that anyway.

…nt via astype

freeman-lab · 2014-11-09T20:43:11Z

@industrial-sloth thanks for the updates, I think this is moving in the right direction! I think I have a different take on one top-level issue though. What I was imagining in this plan is that we wouldn't change anything about how we save data to disk, and instead do all the casting of Series data to floats at load time. Why? Storing data is expensive, and the gain of storing integer valued Series data as floats, that it matches the in-memory definition, doesn't seem worth the significant cost. Again, examining all usage patterns thus far, the input data for Series have been integers or floats (and it's nice to store the former compactly), but they almost always need to be turned into floats for any analysis, so we're now just doing that at load time.

Re: your last point, I think we should allow casting Series data to ints because the casting='safe' should prevent it from happening unless the user really knows what she's doing.

industrial-sloth · 2014-11-09T21:12:25Z

@freeman-lab yeah, sounds fine to me. I'm obviously a little torn about how best to implement this, and if you've got a strong instinct here I'm happy to go with that. :)
Am getting another commit cued up now that removes the casting to floats as Series go out to disk, and removes the active "throw an error if you're trying to cast a Series to not-float" check.

…nst explicit casting series to non-float

freeman-lab · 2014-11-11T19:46:09Z

python/thunder/rdds/fileio/seriesloader.py

@@ -523,7 +551,7 @@ def blockToBinarySeries(kviter):
                          overwrite=overwrite)

    def saveFromStack(self, datapath, outputdirpath, dims, ext="stack", blockSize="150M", datatype='int16',
-                      startidx=None, stopidx=None, overwrite=False):
+                      newdtype='smallfloat', casting='safe', startidx=None, stopidx=None, overwrite=False):


Doesn't this write binary series data as floats by default, whereas we were going to stick with ints when writing stack data to series (so long as they are ints in the first place)? Or am I missing something?

Nope, that right there is a garden-variety screwup. Will fix. Thanks for the catch!

freeman-lab · 2014-11-11T19:50:17Z

@industrial-sloth LGTM, except for one question I left about types during one of the saving paths that confused me. After checking that, I'll merge it in, but will then do a fair bit of Q&A with our various workflows.

industrial-sloth · 2014-11-11T23:32:25Z

@freeman-lab ok, have gone back over the SeriesLoader class and made sure it Does The Right Thing wrt data types whether loading straight into memory or saving out binary files. I think I just forgot about the SeriesLoader "save" path earlier when removing the cast to float on writing out series files - I got the equivalent Images pathway, but just missed this one. Anyway, have also added a bunch of data type assertions to tests, both the unit tests here and my quasi-automated conversion test script, and so am reasonably confident this should be solid at this point. Famous last words right?

freeman-lab · 2014-11-15T02:14:33Z

Sorry for the delay, LGTM! Great work figuring all this out, I think we've landed on a good solution.

fix incorrect propagation of dtype in Series normalize and other methods

industrial-sloth added 2 commits November 6, 2014 17:04

fix incorrect propagation of dtype in Series normalize and other methods

a68bb18

Merge remote-tracking branch 'upstream/0.4.x' into 0.4.x

7925a2b

freeman-lab reviewed Nov 7, 2014
View reviewed changes

industrial-sloth added 4 commits November 7, 2014 08:43

move apply method up to Data from Images and Series; __finalize__ met…

f4573c1

…hod docstring changes

specify float64 output from Series correlate

688cd6d

enforce float-valued Series data, remove casting in Series operations

a9ef01f

update test expectations for new Series data types

5876e5c

cast series binary data to float on loading; prevent series cast to i…

a71869a

…nt via astype

remove cast to float on writing series binary data, remove check agai…

2013f97

…nst explicit casting series to non-float

freeman-lab reviewed Nov 11, 2014
View reviewed changes

fix SeriesLoader data type handling; add data type assertions to tests

c20978d

freeman-lab added a commit that referenced this pull request Nov 15, 2014

Merge pull request #46 from industrial-sloth/0.4.x

92f44d2

fix incorrect propagation of dtype in Series normalize and other methods

freeman-lab merged commit 92f44d2 into thunder-project:0.4.x Nov 15, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix incorrect propagation of dtype in Series normalize and other methods #46

fix incorrect propagation of dtype in Series normalize and other methods #46

industrial-sloth commented Nov 6, 2014

freeman-lab Nov 7, 2014

industrial-sloth Nov 7, 2014

freeman-lab commented Nov 7, 2014

industrial-sloth commented Nov 7, 2014

industrial-sloth commented Nov 9, 2014

industrial-sloth commented Nov 9, 2014

freeman-lab commented Nov 9, 2014

industrial-sloth commented Nov 9, 2014

freeman-lab Nov 11, 2014

industrial-sloth Nov 11, 2014

freeman-lab commented Nov 11, 2014

industrial-sloth commented Nov 11, 2014

freeman-lab commented Nov 15, 2014

fix incorrect propagation of dtype in Series normalize and other methods #46

fix incorrect propagation of dtype in Series normalize and other methods #46

Conversation

industrial-sloth commented Nov 6, 2014

freeman-lab Nov 7, 2014

Choose a reason for hiding this comment

industrial-sloth Nov 7, 2014

Choose a reason for hiding this comment

freeman-lab commented Nov 7, 2014

industrial-sloth commented Nov 7, 2014

industrial-sloth commented Nov 9, 2014

industrial-sloth commented Nov 9, 2014

freeman-lab commented Nov 9, 2014

industrial-sloth commented Nov 9, 2014

freeman-lab Nov 11, 2014

Choose a reason for hiding this comment

industrial-sloth Nov 11, 2014

Choose a reason for hiding this comment

freeman-lab commented Nov 11, 2014

industrial-sloth commented Nov 11, 2014

freeman-lab commented Nov 15, 2014