Ability to force float64 instead of float32 issue #2304 #2751

Open · wants to merge 17 commits into base: main

Conversation

@DevDaoud commented Feb 7, 2019

Ability to promote to float64 instead of float32 when dealing with int variables that have a scale_factor.
Added a parameter force_promote_float64 (False by default) to open_dataset and open_dataarray that enables this behaviour when True.
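
For illustration, a minimal usage sketch of the behaviour proposed here (force_promote_float64 is the parameter proposed in this PR, not part of xarray's released API; the file name is hypothetical):

    import xarray as xr

    # Default behaviour: packed int16 data with a scale_factor decodes to float32.
    ds32 = xr.open_dataset('packed_int16.nc')

    # With the parameter proposed in this PR, the same data is promoted to float64.
    ds64 = xr.open_dataset('packed_int16.nc', force_promote_float64=True)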

Daoud Jahdou added 5 commits February 7, 2019 15:10
When dealing with int variables with a scale factor, xarray transforms them to float32
by default.
This enhancement aims to give the possibility to force that to float64 when needed.
The behaviour is unchanged when the parameter force_promote_float64 is not specified.
@DevDaoud (Author) commented Feb 8, 2019

@shoyer did you have a look at this?

@shoyer (Member) commented Feb 8, 2019

@daoudjahdou sorry I missed your comment over in #2304. Threading the new parameter down through other functions is definitely the preferred approach here -- mutating global variables makes it much harder to follow control flow. Though I would probably stop short of modifying maybe_promote and would put this logic inside CFMaskCoder and/or CFScaleOffsetCoder in xarray.coding.variables.

@DevDaoud (Author) commented Feb 12, 2019

@shoyer the logic is now propagated down to _choose_float_dtype inside CFScaleOffsetCoder. Please let me know what you think.

@shoyer (Member) commented Feb 12, 2019

@daoudjahdou what do you think of @magau's proposed solution over in #2304 (comment)? Checking the dtypes of add_offset and scale_factor might be a better alternative than requiring a custom option to be set.

@DevDaoud (Author)

@shoyer yes, sure, I'll update the pull request with the mentioned modifications.

@DevDaoud (Author) commented Feb 18, 2019

@shoyer now scale_factor and add_offset are taken into account when encoding and decoding data.
If neither of them is present, or if they're not a subtype of np.generic, the old behaviour takes place.

dtype = _choose_float_dtype(data.dtype, 'add_offset' in encoding)
scale_factor = pop_to(attrs, encoding, 'scale_factor', name=name)
add_offset = pop_to(attrs, encoding, 'add_offset', name=name)
dtype = _choose_float_dtype(data.dtype, scale_factor, add_offset)
data = data.astype(dtype=dtype, copy=True)
if 'add_offset' in encoding:
Review comment (Member):

This line (and the line below for scale_factor) needs to be updated to use the add_offset variable if it isn't None rather than checking for it in encoding again. (pop_to removes an item from the dict in which it's found.)
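
A minimal sketch of that suggestion in the context of the quoted lines (the `...` bodies stand for the existing scale/offset application; helper names such as pop_to and _choose_float_dtype are the ones already used in the diff, and this is an illustration, not the committed code):

    scale_factor = pop_to(attrs, encoding, 'scale_factor', name=name)
    add_offset = pop_to(attrs, encoding, 'add_offset', name=name)
    dtype = _choose_float_dtype(data.dtype, scale_factor, add_offset)
    data = data.astype(dtype=dtype, copy=True)
    # pop_to removes the key from whichever dict it was found in, so test
    # the returned values instead of checking `encoding` again:
    if add_offset is not None:
        ...
    if scale_factor is not None:
        ...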

# from the variable (containing the packed data)
# then the unpacked data should match
# the type of these attributes, which must both be of type float or both
# be of type double. An additional restriction in this case is that
Review comment (Member):

What should we do if add_offset/scale_factor are not floats?

Right now in your implementation, if scale_factor is mistakenly np.int64(10) then the data would get decoded as int64, which seems like a bad idea (e.g., if the data is some float type). I think we should probably ignore the type of scale_factor/add_offset in these cases instead.

@magau commented Feb 21, 2019

Hi @shoyer! Sorry for my delayed reply... First of all I'd like to thank and congratulate you for your efficiency, and say that I'm very glad to see that you've taken my considerations about this topic into account.

About the comment above I'd say that, since the CFScaleOffsetCoder will always multiply the raw data by the scale_factor, if the scale_factor is an integer the decoded values will always be integers too (unless the raw data dtype is float, as you've mentioned).
Even in these cases I'd return the decoded values with the same dtype as the scale_factor / add_offset, which is the expected behavior (data providers must be aware...). Users can always use open_dataset with decode_cf=False and call the decode_cf_variable function later, after fixing the variable.attrs['scale_factor / add_offset'] dtypes.
A possible solution is to add scale_factor and add_offset as arguments of the open_dataset function, to overwrite the original ones; these would receive labeled values like scale_factor={'var1': whatever, 'var2': whatever}, add_offset={'var1': whatever, 'var2': whatever}. This would also resolve the #2304 issue (with a better approach, in my opinion).

By the way, I've also noted that CFScaleOffsetCoder.encode changes the values.dtype before applying the inverse operation (i.e. dividing by the scale_factor), which results in truncated values. I've already faced this situation with my own netCDF encoding processes, and would advise you to only change the dtype after applying the transformation, and to use the numpy.round function (for integer encoding) to avoid truncation artifacts.

Once again, sorry for my extensive comments and for not opening an issue, as I should have, but my intention is that you evaluate my considerations, make your own decisions with your colleagues, and keep up the nice work you all have been doing. 👍

Cheers!
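
For illustration, a minimal standalone sketch of the ordering magau describes (apply the transform while still in float space, then round and cast; the function and names are illustrative, not the PR's code):

    import numpy as np

    def encode_packed(data, scale_factor, add_offset, out_dtype=np.int16):
        # Apply the inverse transform in float space first ...
        scaled = (np.asarray(data, dtype=np.float64) - add_offset) / scale_factor
        # ... then round before casting, so values are not silently truncated.
        return np.round(scaled).astype(out_dtype)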

# We first return scale_factor type
# as multiplying takes priority over
# adding values
if dtype is not type(scale_factor) and \
Review comment (Member):

Do we actually need to check whether the type of scale_factor matches dtype? I think the logic works the same either way.

# as multiplying takes priority over
# adding values
if dtype is not type(scale_factor) and \
isinstance(scale_factor, np.generic):
Review comment (Member):

nit: per PEP8, please prefer using parentheses () to split across multiple lines instead of \
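
For example, the quoted condition can be split with parentheses rather than a backslash continuation:

    if (dtype is not type(scale_factor)
            and isinstance(scale_factor, np.generic)):
        ...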

@@ -1156,6 +1157,25 @@ def test_mask_and_scale(self):
expected = create_masked_and_scaled_data()
assert_identical(expected, ds)

def test_mask_and_scale_with_float64_scale_factor(self):
Review comment (Member):

It would be nice to test all of the edge cases just using CFScaleOffsetCoder, e.g.,

  • what happens if add_offset and scale_factor have different types?
  • what if they're integers instead of floats?

See the tests in tests/test_coding.py for examples.
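
A rough sketch of what such a unit test could look like (the asserted dtype is an illustrative expectation under the behaviour this PR proposes, not a statement of current xarray behaviour):

    import numpy as np
    import xarray as xr
    from xarray.coding import variables

    def test_scale_offset_mixed_attr_dtypes():
        # scale_factor and add_offset deliberately have different float dtypes.
        var = xr.Variable(('x',), np.array([0, 1, 2], dtype='int16'),
                          attrs={'scale_factor': np.float32(0.1),
                                 'add_offset': np.float64(10.0)})
        decoded = variables.CFScaleOffsetCoder().decode(var)
        assert decoded.dtype == np.float64  # widest of the attribute dtypes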


# We first return scale_factor type
# as multiplying takes priority over
# adding values
Review comment (Member):

I'm a little nervous about making this assumption. The NumPy casting rules for a * x + b would give the result the largest dtype of any of a, x and b. Maybe we should similarly pick whichever of scale_factor, data and add_offset has the largest float dtype?
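
For reference, NumPy's promotion for a * x + b can be inspected directly with np.result_type, which applies the same promotion rules:

    import numpy as np

    # e.g. float32 data with a float64 scale_factor and an int32 add_offset
    np.result_type(np.float32, np.float64, np.int32)  # -> dtype('float64')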

@DevDaoud (Author) commented Feb 22, 2019

@shoyer I changed the implementation and took your comments into consideration.
Returning the largest type now only happens when decoding.
I added a test with all types of scale_factor, add_offset and variable.

@DevDaoud (Author)

@shoyer did you have a look at this?

np.dtype(type(add_offset)),
np.dtype(dtype)]
# scaled_type should be the largest type we find
scaled_dtype = max(types)
Review comment (Member):

Use xarray.core.dtypes.result_type() here instead of max().

I don't know how NumPy defines comparison between dtypes and I can't find it documented anywhere, so I'm a little nervous about relying on it.
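
A sketch of the suggested replacement in the context of the quoted diff (dtypes here is xarray.core.dtypes, as named in the comment; scale_factor, add_offset and dtype come from the surrounding function):

    from xarray.core import dtypes

    types = [np.dtype(type(scale_factor)),
             np.dtype(type(add_offset)),
             np.dtype(dtype)]
    # result_type follows NumPy's promotion rules instead of relying on
    # how dtype objects happen to order under max().
    scaled_dtype = dtypes.result_type(*types)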

Review comment (Author):

@shoyer I've tested it with all NumPy dtype pair combinations and it turns out that it always gives the largest; I can for sure modify it.

# It is not advised to unpack an int into a float as there
# is a potential precision loss.

if mode is 'decoding':
Review comment (Member):

Rather than a mode argument, could you write two separate functions, _choose_encoding_float_dtype and _choose_decoding_float_dtype? It would be OK for one to call the other, but mode arguments are mild code smell.
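
A rough sketch of that split (function bodies are placeholders; the names are the ones suggested in the comment above):

    def _choose_encoding_float_dtype(dtype, has_offset):
        # Existing behaviour: pick float32 unless the packed integer type
        # (or the presence of an offset) requires float64.
        ...

    def _choose_decoding_float_dtype(dtype, scale_factor, add_offset):
        # Decoding-specific behaviour: also take the dtypes of
        # scale_factor and add_offset into account, per CF conventions.
        ...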

@@ -685,7 +685,9 @@ def test_roundtrip_mask_and_scale(self, decoded_fn, encoded_fn):
with self.roundtrip(decoded) as actual:
for k in decoded.variables:
assert (decoded.variables[k].dtype
== actual.variables[k].dtype)
== actual.variables[k].dtype or
(decoded.variables[k].dtype == np.float32 and
Review comment (Member):

It's not clear to me why this test needs to change. Can you give an example of what behavior changed?

@DevDaoud (Author) commented Mar 8, 2019

@shoyer
Here is my understanding of what xarray does:

  • In all cases encoding returns either np.float32 or np.float64.
  • Encoding only cares about the existence of add_offset when the variable is an integer.
  • If the variable is a float, then the encoded dtype is np.float32.
  • If the variable is an integer with add_offset, the encoded dtype is np.float64.
  • If the variable is an integer with no add_offset, the encoded dtype is np.float32.
  • In all other cases the encoded dtype is np.float64.

Here is my understanding of what the CF conventions imply:

  • Decoding is the equivalent of the unpacking mentioned in the CF conventions.
  • In all cases decoding starts from the encoded dtype, which is either np.float32 or np.float64.
  • The decoded type is then the largest among scale_factor, add_offset and the encoded type, not the original variable.

That being said, in that test there are cases where the original type, scale_factor and add_offset are in the following configuration: np.float32, np.int64, np.float32. The encoded type is then np.float32, because the variable is np.float32; when decoding, the largest (dtypes.result_type()) of the three (encoded, scale_factor, add_offset) is np.float64.

That's why "decoded" (the original) in the test is np.float32 and "actual" (the roundtripped data) is np.float64.

In my next commit I took your remark into consideration and wrote out every possible case and what to expect as an encoded or decoded dtype.

# We make sure that roundtripped is larger than
# the original
assert roundtripped.dtype.itemsize >= original.dtype.itemsize
assert (roundtripped.dtype is np.dtype(np.float64)
Review comment (Member):

nit: it's safer to use == instead of is for comparing dtype objects
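
For instance, == compares dtypes by value (and coerces strings and scalar types), while is only checks object identity:

    import numpy as np

    np.dtype('float64') == np.float64   # True
    np.dtype('float64') == 'float64'    # True
    np.dtype('float64') is np.float64   # False: a dtype instance vs. the scalar type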

np.int64(10), np.uint8(10),
np.uint16(10), np.uint32(10),
np.uint64(10), np.uint64(10)])
def test_scaling_according_to_cf_convention(dtype, scale_factor, add_offset):
Review comment (Member):

This is a great set of invariant checks! But it would still be nice to see a more specific list of unit tests verifying that the decoded dtype is exactly as expected for a set of known inputs, e.g.,

@pytest.mark.parametrize(
    ('input_dtype', 'scale_factor', 'add_offset', 'expected_dtype'),
    [
        (np.float32, 1, 0, np.float32),
        (np.float32, 1, 10, np.float64),
        ...
    ]
)
def test_cfscaleoffset_decoding_dtype(input_dtype, scale_factor, add_offset, expected_dtype):
    original = xr.Variable(...)
    decoded = variables.CFScaleOffsetCoder().decode(original)
    assert decoded.dtype == expected_dtype

This would give us confidence that it's handling all the edge cases properly, e.g., only giving a bigger dtype when truly warranted.

@DevDaoud (Author) commented Mar 8, 2019

@shoyer tests are failing, but it doesn't seem to be coming from this PR; I saw the same error on other PRs as well. My tests were working fine until I made a git pull.

@shoyer (Member) commented Mar 15, 2019

@daoudjahdou thanks for sticking with this.

I'm concerned about how we lose the "roundtrip" invariant in some edge cases, where the user did not specifically indicate the desired dtype but rather used a Python type. Right now xarray seems to use inconsistent rules for casting attribute types, depending on the netCDF backend:

In [8]: for engine in ['scipy', 'netcdf4', 'h5netcdf']:
   ...:     ds = xarray.Dataset({'x': np.float32(0)})
   ...:     encoding = {'x': {'scale_factor': 1, 'add_offset': 1}}
   ...:     ds.to_netcdf(f'test-{engine}.nc', engine=engine, encoding=encoding)
   ...:

In [9]: ! ncdump -h test-scipy.nc
netcdf test-scipy {
variables:
	float x ;
		x:add_offset = 1 ;
		x:scale_factor = 1 ;
		x:_FillValue = NaNf ;
}

In [10]: ! ncdump -h test-netcdf4.nc
netcdf test-netcdf4 {
variables:
	float x ;
		x:_FillValue = NaNf ;
		x:add_offset = 1LL ;
		x:scale_factor = 1LL ;
}

In [11]: ! ncdump -h test-h5netcdf.nc
netcdf test-h5netcdf {
variables:
	float x ;
		x:_FillValue = NaNf ;
		x:add_offset = 1LL ;
		x:scale_factor = 1LL ;
}

At least for add_offset and scale_factor, I think we should be more careful when determining the attribute dtype. Python's int and float don't have any intrinsic dtype. If the attributes are specified as native Python types (not NumPy types) and can be safely cast into the dtype of the data, we should probably use that instead.
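
A minimal sketch of the kind of check described here (illustrative only, with a hypothetical helper name; this is not code from the PR):

    import numpy as np

    def attr_dtype(value, data_dtype):
        # A builtin Python int/float has no intrinsic dtype; if it can be
        # safely represented in the data's dtype, reuse that dtype instead
        # of defaulting to int64/float64.
        data_dtype = np.dtype(data_dtype)
        if (isinstance(value, (int, float))
                and not isinstance(value, np.generic)
                and np.can_cast(np.min_scalar_type(value), data_dtype)):
            return data_dtype
        return np.dtype(type(value))

    # e.g. attr_dtype(1, 'float32') -> dtype('float32')
    #      attr_dtype(np.float64(1), 'float32') -> dtype('float64')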

@DevDaoud (Author)

@shoyer do you mean that, by default, when we deal with Python's int and float we cast them to np.int64 and np.float64?

@shoyer (Member) commented Mar 16, 2019

@shoyer do you mean that, by default, when we deal with Python's int and float we cast them to np.int64 and np.float64?

Currently we cast to int32/float64 for netCDF3 files (which don't support int64) and int64/float64 for netCDF4 files.

That default behavior doesn't really make sense if the user provided builtin Python numbers for these attributes.

For example, if I write something like the following, the data for x should probably decode to float32:

ds = xarray.Dataset({'x': np.float32(0)})
encoding = {'x': {'scale_factor': 1, 'add_offset': 1, 'dtype': 'int8'}}
ds.to_netcdf('test.nc', encoding=encoding)

With the current implementation in this PR, I think the data will come back as float64.

To get something that will decode as float64 without the original data being float64, the user should need to intentionally choose dtypes for scale_factor and add_offset, e.g.,

ds = xarray.Dataset({'x': np.float32(0)})
encoding = {'x': {'scale_factor': np.float64(1), 'add_offset': np.float64(1), 'dtype': 'int8'}}
ds.to_netcdf('test.nc', encoding=encoding)

Does this make sense to you? From my perspective, two goals for this change should be:

  1. Adhere to CF conventions for deciding how to decode data from disk.
  2. Provide a consistent interface for xarray users, so that by default data is restored in the same form that it was written.

@DevDaoud (Author) commented Apr 4, 2019

@shoyer sorry for the delayed response.
dtypes.result_type(1, np.float32(1)) returns dtype('float64').
That's what produces this behaviour for Python's int and float.
Keeping the consistency would then require testing whether scale_factor * var_dtype + add_offset fits in var_dtype in the case of Python's int and float.
Correct me if I'm wrong, but this is a bit hard to do without knowing the max/min values, to avoid overflow.
Evaluating those could hurt performance if they are only used for encoding and decoding.
Do you have any idea how this could be achieved? Or is it simpler to keep it as it is?

@shoyer (Member) commented Apr 8, 2019

Correct me if I'm wrong, but this is a bit hard to do without knowing the max/min values, to avoid overflow.
Evaluating those could hurt performance if they are only used for encoding and decoding.

Right, we definitely don't want to inspect the data to verify if it fits.

I think we will need explicit checks for isinstance(x, (int, float)) (for scale_factor and add_offset) to identify these cases in the encode() method. Let me try to come up with a concrete suggestion... EDIT: See my suggestion below. I don't think we want these explicit checks, after all.

@@ -217,9 +258,9 @@ class CFScaleOffsetCoder(VariableCoder):

def encode(self, variable, name=None):
dims, data, attrs, encoding = unpack_for_encoding(variable)

if 'scale_factor' in encoding or 'add_offset' in encoding:
Review comment (Member):

Alternative logic for this clause that I think might work:

if 'scale_factor' in encoding or 'add_offset' in encoding:
    scale_factor = encoding.pop('scale_factor', 1)
    add_offset = encoding.pop('add_offset', 0)
    if 'dtype' in encoding:
        dtype = pop_to(encoding, attrs, 'dtype', name=name)
    else:
        dtype = _choose_encoding_float_dtype(data.dtype, add_offset != 0)

    # save scale_factor and add_offset with dtype matching the decoded
    # data, per CF conventions
    safe_setitem(attrs, 'scale_factor', data.dtype.type(scale_factor), name)
    safe_setitem(attrs, 'add_offset', data.dtype.type(add_offset), name)

    # apply scale/offset encoding
    data = data.astype(dtype=dtype, copy=True)
    if add_offset:
        data -= add_offset
    if scale_factor != 1:
        # multiplication is faster than division
        data *= 1/scale_factor

The idea here is to always write the scale_factor and add_offset attributes according to CF conventions, which ensures that data always gets read out the right way. This does remove some user flexibility, but xarray is OK with being opinionated about how it writes metadata -- it is not a general purpose tool for writing completely flexible netCDF files.


if add_offset is None:
types = (np.dtype(type(scale_factor)),
np.dtype(dtype))
Review comment (Member):

It would be nice if these five lines creating types handled add_offset and scale_factor in a symmetric way, e.g.,

types = [np.dtype(dtype)]
if scale_factor is not None:
    types.append(np.dtype(type(scale_factor)))
if add_offset is not None:
    types.append(np.dtype(type(add_offset)))

# It is not advised to unpack an int into a float as there
# is a potential precision loss.

if scale_factor or add_offset:
Review comment (Member):

It's safer to use is not None for checking default values:

Suggested change:
- if scale_factor or add_offset:
+ if scale_factor is not None or add_offset is not None:

and np.issubdtype(scaled_dtype, np.floating)):
return scaled_dtype

return _choose_encoding_float_dtype(dtype, add_offset is not None)
Review comment (Member):

Potentially we could use the add_offset value here -- 0 and None are equivalent as far as picking an automatic float dtype is concerned, indicating "no offset":

Suggested change:
- return _choose_encoding_float_dtype(dtype, add_offset is not None)
+ return _choose_encoding_float_dtype(dtype, add_offset)

@DevDaoud (Author)

@shoyer I've tested the solution provided; it works like a charm with my tests. However, many tests in test_backends.py are broken, in cases where we lose precision. I'll give you more detail.

@fmaussion (Member)

I've noted that the very recent ERA5 data is also cast to float32 at decoding, although the scale and offset params are float64. I think it would be a very valuable addition to xarray to do this properly: @daoudjahdou are you still willing to work on this PR? Otherwise I'll try to pick this up when time permits.

@mankoff mentioned this pull request Oct 6, 2022

Successfully merging this pull request may close these issues.

float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray