-
Notifications
You must be signed in to change notification settings - Fork 41
DESIGN: NA values in floating point arrays #46
Comments
I think there are a number of considerations here:
A possibility is actually to fill in prob just muddying the waters...... |
We also will consistently want to track null counts, so having a single code path for doing anything with nulls would be nice (i.e. bitmaps everwhere). We can think some more about this. If we're making a performance argument for continuing to use NaN, we should definitely have the hard numbers before making a decision based on that. |
My inclination on thinking some about this would be to use bitmaps for everything. The downside of this is that people have code littered with
they could change this to
to get the same behavior. We could always alias NaN to NA, or make it configurable so that people can run their code and get a warning if they're assigning NaN values somewhere in their code. |
why would this matter? |
I support bitmaps everywhere, for what it's worth. |
@jreback we could make the design concession that NaN is treated as NA by default, so that these are all equivalent
I know that this is already the way things are (save for the lack of an NA singleton value), but would want to make sure that we don't wish to introduce a distinction between NaN and NA. I'm kind of betting not given the amount of legacy code out there that uses NaN to mask out values. |
How will the [1.0, nan, 3.0] Is the plan that any numerical array or column will have |
well as I said above, we have coerce on the setitem (e.g. translate and null-like to pa.NaN) or use a bit-mask or both, no matter how its internally represented. I think that's an orthogonal, mainly perf decision. E.g. also consider if |
@chrisaycock not sure, but my guess is the display will become NA rather than exposing the internal representation to the user |
@jreback in setitem we can definitely map a set of sentinel values (NaN, NaT, None) to NA, it's more a memory representation question -- if incoming data has NaNs we will need to construct a bitmap and then ignore the NaN values from that point forward. |
Even if we use a separate bit mask, we could do zero copy conversions to arrays with NaNs in most cases, because pandas won't care about these values, so we can modify the original array (as long as we have ownership). We could even set a flag so we don't have to do this more than once. I think this solves most of the issues for exporting data. |
That's a good point. Mutating the null slots is safe, so that would safe on memory / copying |
@shoyer that was my point 5) however, you introduce more complexity in the code. Then you have to make sure to actually set the value if a bit is changed from not-null to null internally (rather than just do it once on exporting). It may not be that big of a deal (and may or may not matter for perf). I would vote for keeping the code as simple as possible and so having only bitmaps may be preferable. |
We're already planning on having flags that are invalidated whenever there is mutation (#27). This would just be another one, e.g., |
No idea if this is a representative (probably not even particularly good code) but I was curious so here's a reduction microbenchmark. This is MSVC, with a only a 64 bit popcnt, but at least in this case checking https://gist.github.com/chris-b1/59743744458a2e0a628f84a817b14a50
|
I get somewhat different results on my machine (1.2 ghz core m5, clang 8.0) $ gcc -O3 -msse4.2 -std=c++1y -lc++ bitmap-test.cc
22:39 ~/code/tmp $ ./a.out
N = 100000000 missing = 0 blocks = 1
Bitmask / NAN = 0.820946
N = 100000000 missing = 0 blocks = 2
Bitmask / NAN = 0.747475
N = 100000000 missing = 0 blocks = 4
Bitmask / NAN = 0.8
N = 100000000 missing = 0.01 blocks = 1
Bitmask / NAN = 0.81982
N = 100000000 missing = 0.01 blocks = 2
Bitmask / NAN = 1.02516
N = 100000000 missing = 0.01 blocks = 4
Bitmask / NAN = 0.98503
N = 100000000 missing = 0.05 blocks = 1
Bitmask / NAN = 1.28716
N = 100000000 missing = 0.05 blocks = 2
Bitmask / NAN = 1.1372
N = 100000000 missing = 0.05 blocks = 4
Bitmask / NAN = 1.26923 |
My understanding is that AVX2 instructions are even faster for popcnt, so we can check larger blocks |
Just cross-referencing in case there's anything to be learnt from previous discussions: (though it's also linked to from @ResidentMario's blog) |
FWIW, my 6-years-later digest of NumPy's problems with missing data is that NumPy's user base is too heterogeneous. pandas's user base of "analytics" users is a smaller [EDIT: I had a typo "small"... pandas's user base is not small] subset of NumPy, but is also part of a broader ecosystem of tools, including SQL engines, parts of R, etc. I also think that efforts to unify tensor memory layout (aka ndarrays) with columnar/tabular memory are deeply problematic. I don't see a great deal of overlapping needs between the in-memory columnar problem (i.e. 95-99% of how pandas is used) and the tensor problem (i.e. how NumPy is used for scientific computing a la MATLAB / TensorFlow). For us, I think that simplifying the world to one-dimensional contiguous arrays with bitmaps to encode validity / nullness will make pandas much easier to understand and develop. If you can do a zero-copy cast from numeric data to a tensor / ndarray and use math functions in NumPy, then apply a bitmap to the result (also zero-copy), then you don't have the "throw the baby out with the bathwater" problem. |
Do we want to continue to use NaN? There are possible computational benefits to doing so, but we have the opportunity with pandas 2.0 to change to using bitmaps everywhere, which brings a lot of consistency to how missing data is handled (for example:
isnull
becomes free under this scenario). We may want to do some benchmarking to understand the overhead of using bitmaps in arithmetic and aggregations (rather than letting the CPU propagate NaN where relevant).One benefit of the bitmap route is that we can use hardware popcount to skip null checking on groups of 64 or 128 bits at a time (or more, if there are AVX popcount instructions available, not actually sure about this), so the performance in aggregations may actually get a lot better on mostly non-null data.
The text was updated successfully, but these errors were encountered: