Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add method to format xarray.DataArray as strings for issue #5985 #7628

Closed
wants to merge 20 commits into from

Conversation

husainridwan
Copy link

@husainridwan husainridwan commented Mar 15, 2023

@husainridwan husainridwan changed the title Add method to format xarray.DataArray as strings after round them to significant figures. Add method to format xarray.DataArray as strings after round them to significant figures for issue #5985 Mar 15, 2023
@husainridwan
Copy link
Author

@TomNicholas Can you review this for me?

@headtr1ck
Copy link
Collaborator

headtr1ck commented Mar 15, 2023

Wouldn't it be more useful to have something like da.str.format("sometext: {:2.2f}")

Or maybe da.as_str(...)

@headtr1ck
Copy link
Collaborator

By the way, this works (but probably slow?):

def tostr(x):
    return f"{:.2f}"

tostr_v = np.vectorize(tostr)


da = xr.DataArray([1, 2, 3.456], dims="x")

tostr_v(da)

@husainridwan
Copy link
Author

I think the initial issue was to create a method that allows any specific significant figure to be used and then formatted as strings after. I think I'm having issues with a floating dtype. Check the image below
dataarray py - xarray - Visual Studio Code  Administrator  15_03_2023 11_30_44

@husainridwan
Copy link
Author

@k

@husainridwan
Copy link
Author

@TomNicholas & @headtr1ck , Can you review this so I can know what to change? I think it's failing just one test

@headtr1ck
Copy link
Collaborator

Again, I think that this implementation is too specific and the issue wanted a more general string formatting option.

Also, auch a method is better kept as part of the StrAccessor.

Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also feel that the suggestion here is a bit too specific, so before discussing the implementation (and fixing tests) I think we should figure out a good API. Below I'm suggesting .str.format because that's what the original issue proposed, but really the only requirement I have is that the API feels natural and does not do too much at once.

The original issue also suggested

xr.DataArray("%.3f").str % arr

which would allow having a array of formats, but I feel that's a bit too complicated for now (or in general, not sure).

doc/warnings.log Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has been added by accident. Could you remove that file?

@@ -6670,6 +6670,68 @@ def resample(
**indexer_kwargs,
)

def roundStringify(self, n: int) -> DataArray:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use snake-case for our functions, so it should be round_stringify. However, before you change that I agree with @headtr1ck that the method combines two different things (rounding plus string interpolation).

As an API, I would probably prefer something like this:

(
    arr.round(0.01)  # round to two decimal digits
    .str.format("%.03f")  # display three decimals
)

where .str.format can be implemented in a efficient manner using numpy.char.mod.

The advantage of that is that it is a lot more versatile, for example:

ds = xr.Dataset({"a": ("x", ["a", "b", "c"]), "b": ("x", [0.2756, 6.7279, 3.7832])})
ds.a.str.format("%s: ").str + ds.b.str.format("%.03")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keewis str.format already exists and behaves like the python str.format method (also different from the builtin format() method).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keewis str.format already exists and behaves like the python str.format method (also different from the builtin format() method).

Good find.
This method is working the other way around as the proposed solution, but it should work perfectly.

Untested:

fmt = xr.DataArray("{:.2f}")
da = xr.DataArray([1, 2, 3.456], dims="x")

fmt.str.format(da)

Copy link
Collaborator

@keewis keewis Mar 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue with that is that it will "stringify" the values before passing it to str.format, so the only format you can pass it is :s. So I guess the only thing to change is that?

Comment on lines 6731 to 6748
def test_roundStringify_output_type(self, my_obj):
result = my_obj.roundStringify(3)
assert isinstance(result, xr.DataArray), "output should be a DataArray"
assert isinstance(result, xr.DataArray), "output should be a DataArray"

def test_roundStringify_attrs(self, my_obj):
attrs = {"units": "meters"}
my_obj.attrs = attrs
new_da = my_obj.roundStringify(2)
assert "units" in new_da.attrs.keys(), "expected attribute not found"
assert new_da.attrs["units"] == attrs["units"], "attribute values not matched"

def test_roundStringify_edge_case(self):
# test case when all values are zero
zeros = xr.DataArray(np.zeros((3,), dtype=float), dims=["x"])
rounded_zeros = zeros.roundStringify(n=3)
expected_result = xr.DataArray(["0", "0", "0"], dims=["x"])
xr.testing.assert_allclose(rounded_zeros, expected_result)
Copy link
Collaborator

@keewis keewis Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually use a actual-expected pattern (if possible), combined with heavy use of pytest.mark.parametrize.

Edit: oh, sorry, I just saw that the last test already uses that pattern.

For example, the attrs check could become:

@pytest.mark.parametrize("keep_attrs", [True, False])
def test_str_format_attrs(self, keep_attrs):
    attrs = {"units": "meters"}
    arr = xr.DataArray(["a", "b", "c"], attrs=attrs)

    expected = arr.copy()
    if not keep_attrs:
        expected.attrs.clear()
    with xr.set_options(keep_attrs=keep_attrs):
        actual = arr.str.format("%s")
        xr.testing.assert_identical(actual, expected)

but of course if we test .str.format it should go into the module that contains the tests for the string accessor (xarray.tests.test_accessor_str, if I'm not mistaken)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much. I'll look into the changes and recommendations right away

@husainridwan husainridwan changed the title Add method to format xarray.DataArray as strings after round them to significant figures for issue #5985 Add method to format xarray.DataArray as strings for issue #5985 Mar 17, 2023
@husainridwan
Copy link
Author

@headtr1ck @keewis can you check it now? I’ve modified it a bit

Copy link
Collaborator

@headtr1ck headtr1ck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since da.str.format already exists, I am not sure how much benefit this function gives.

But apparently it is difficult to discover and use da.str.format, so maybe a simpler interface is useful.

Anyway, I'm the current state the implementation is still too specific.

@@ -51,6 +51,7 @@

import numpy as np

import xarray as xr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to import this

array(['1.46e+05', '-3.25e+06', '0.00125'], dtype='<U8')
Dimensions without coordinates: x
"""
self._obj = self._obj.astype(float)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not change the stored object

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to ensure all inputs were the same dtype before formatting in case of nan or inf values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's ok, but then store it in a temporary, local variable.

Dimensions without coordinates: x
"""
self._obj = self._obj.astype(float)
return xr.DataArray(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use type(self._obj)(...)

Copy link
Collaborator

@headtr1ck headtr1ck Mar 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, do you even need to wrap it into a DataArray? What does _apply return?

xarray/core/accessor_str.py Outdated Show resolved Hide resolved
@husainridwan
Copy link
Author

@keewis it seems .str.format does what we want the issue is requesting. Is there any need for the issue or can you clarify exactly what needs to be improved?

@husainridwan
Copy link
Author

@keewis @TomNicholas @headtr1ck any comments?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Formatting data array as strings?
4 participants