Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add from_pandas and from_xarray #2054

Merged
merged 17 commits into from Jul 19, 2021

Conversation

Tom-Willemsen
Copy link
Contributor

@Tom-Willemsen Tom-Willemsen commented Jul 14, 2021

Fixes #2049

  • I've implemented attr handling where it makes sense and is supported by both source and target
  • I haven't implemented any unit handling as it seems to be done in attributes for xarray, and parsing strings felt like it may be a bit too error-prone. Happy to reconsider if there are strong opinions about this.

Comment on lines 83 to 92
for attrib in xarray_dataset.attrs:
attrib_name = "{}{}".format(attrib_prefix, attrib)
if attrib_name in sc_data:
raise ValueError(
f"Attribute {attrib} would collide with an existing "
f"data object '{attrib_name}' (using attrib_prefix "
f"'{attrib_prefix}'). Specify a different attrib_prefix "
f"in the call to from_xarray_dataset")

sc_data[attrib_name] = scipp.scalar(xarray_dataset.attrs[attrib])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether this is the correct approach in scipp or not, scipp.DataSet doesn't seem to support attrs directly unlike scipp.DataArray? The alternative would probably be to drop the xarray attrs (possibly with a warning about loss of information?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dropping is likely the correct approach, especially given that xarray coords (which are also xr.DataArray) can also have attributes, and we don't have a way of supporting that either.

@SimonHeybrock SimonHeybrock changed the title Issue2049 add xarray pandas converters Add from_pandas and from_xarray Jul 15, 2021
Copy link
Member

@SimonHeybrock SimonHeybrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start!

I haven't implemented any unit handling as it seems to be done in attributes for xarray, and parsing strings felt like it may be a bit too error-prone. Happy to reconsider if there are strong opinions about this.

If you refer to parsing unit strings, that is supported already (using another library under the hood), you can, e.g., sc.Unit('m/s'). The question is maybe how to find attributes that represent a unit? xarray appears to be commonly used with NetCDF and it looks like the corresponding attribute name is 'units'. Thus I might suggest looking for this attribute and use it as the unit. We may also consider supporting singular 'unit'.

Another things to consider for from_xarray: xarray distinguishes "index coords" (marked with *) and other coords. In terms of how they behave we should:

  • Convert "index coords" into scipp coords
  • Convert other coord into scipp attrs

See below for other comments:

@@ -0,0 +1,47 @@
import pandas

import scipp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is part of scipp cannot import scipp (without running into recursion). Rather use things such as:

from .. import scalar

@@ -0,0 +1,47 @@
import pandas
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to import this file (or functions from it) in __init__.py we have to make this import optional, for example import only when from_pandas is called. Otherwise this turns into a mandatory scipp dependency (module won't load unless pandas is installed).

sc_attribs[attr] = scipp.scalar(dataframe.attrs[attr])

row_index = dataframe.axes[0]
rows_index_name = row_index.name or "pandas_row"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just 'row'?

)
}

if dataframe.ndim == 2:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think what is done below will work in general: What pandas calls "2 dimensions" for a DataFrame is actually a 1-dimensional dataset. I think what we should do is to return a 1-D Dataset. dataframe.axes[1] basically gives the names of the dataset entries.

One might consider returning a DataArray if there is only a single column, but maybe that is too surprising and should be avoided.

)

return scipp.DataArray(
data=scipp.Variable(values=dataframe.to_numpy(), dims=sc_dims),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that to_numpy does type conversions of columns if column dtype mismatches. Even worse, if there are things such as string columns the output array has object dtype (which is inefficient and not well supported by scipp). Basically: See above, return 1-D Dataset instead.

Comment on lines 73 to 75
for data_key in xarray_dataset.data_vars:
sc_data[data_key] = from_xarray_dataarray(
xarray_dataset.data_vars[data_key])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for data_key in xarray_dataset.data_vars:
sc_data[data_key] = from_xarray_dataarray(
xarray_dataset.data_vars[data_key])
items = { key:from_xarray(var) for key, var in ds.data_vars.items() }

would be briefer

Comment on lines 78 to 81
sc_coords[coord] = scipp.Variable(
dims=[coord],
values=xarray_dataset.coords[coord].values,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (make variable from xarray object) is repeated a couple of times, also in from_xarray_dataarray above. Can we use common code? xarray also has a Variable type, but I am not sure that is what you get from xarray_dataset.coords[coord] (that might be a DataArray), so I don't know if you can automatically do this using from_xarray, or need a helper such as _from_xarray_variable:

coords = { key:_from_xarray_variable(var) for key, var in ds.coords.items() }

(ideally we'd want to simply use coords = { key:from_xarray(var) for key, var in ds.coords.items() }, but you will have to see if you can figure out automatically that it should create a Variable and not a DataArray).

Comment on lines 83 to 92
for attrib in xarray_dataset.attrs:
attrib_name = "{}{}".format(attrib_prefix, attrib)
if attrib_name in sc_data:
raise ValueError(
f"Attribute {attrib} would collide with an existing "
f"data object '{attrib_name}' (using attrib_prefix "
f"'{attrib_prefix}'). Specify a different attrib_prefix "
f"in the call to from_xarray_dataset")

sc_data[attrib_name] = scipp.scalar(xarray_dataset.attrs[attrib])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dropping is likely the correct approach, especially given that xarray coords (which are also xr.DataArray) can also have attributes, and we don't have a way of supporting that either.

Comment on lines 11 to 12
assert (sc_dataarray.values == scipp.array(dims=["pandas_row"],
values=[1, 2, 3]).values).all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use

Suggested change
assert (sc_dataarray.values == scipp.array(dims=["pandas_row"],
values=[1, 2, 3]).values).all()
assert sc.identical(sc_dataarray, scipp.array(dims=["pandas_row"], values=[1, 2, 3]))

This will also ensure that unit, dimension labels, coords, ... are correct. Same goes for everything below, you should basically never assert using the values or value property.

from scipp.compat.xarray_compat import from_xarray


def test_empty_attrs_dataarray():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't see a test with coords?

Comment on lines 40 to 44
coords={
rows_index_name:
Variable(dims=[rows_index_name],
values=row_index)
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the same coord that gets added below? No need to repeat it here?

},
name=column_name or "")

return Dataset(data=sc_data,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get a Series, should we return a DataArray (or even just a Variable)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've modified the code so that if a Series is passed in, a DataArray is returned rather than a Dataset. However for the 1-D Dataframe case it is still returning a Dataset (with 1 item) rather than DataArray because I felt that changing the return type based on the number of dimensions (rather than the type of object being used) would be too confusing as an API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

"""
Converts xarray coords to scipp coords.

Note: only converts dimension coords
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code doesn't look like that?

Comment on lines 45 to 51
coords = {}

for coord_name, coord in xr_obj.coords.items():
is_index = (coord_name in xr_obj.indexes)
if is_index == index:
coords[coord_name] = Variable(dims=coord.dims, values=coord.values)
return coords
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

coords = {}
attrs = {}
for name, coord in xr_obj.coords.items():
  if name in xr_obj.indexes:
    coords[name] = 
  else
    attrs[name] = _var_from_xarray(coord)
return coords, attrs

with

def _var_from_xarray(xr_obj):
  unit = coord.attrs['units'] if units in coord.attrs else ''
  return Variable(dims=coord.dims, values=coord.values, unit=unit)

The latter should be used everywhere, since we want to look for units everywhere.

:param ds: an xarray.Dataset object to be converted.
:return: the converted scipp dataset object.
"""
sc_data = {k: from_xarray(v) for k, v in ds.data_vars.items()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the ds.data_vars.items() needed? What is the difference to ds.items()?

if name in da.indexes:
coords[name] = _var_from_xarray(coord)
else:
attrs[f"coord_{name}"] = _var_from_xarray(coord)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be renaming the coords? I suppose you do this to avoid clashes with xarray-attributes? I think we should either rename the xarray-attributes, or simply drop them if there is a name clash.

Copy link
Member

@SimonHeybrock SimonHeybrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: Sorry, had posted comment as individual comment above, ignore this.

@SimonHeybrock SimonHeybrock merged commit 425de99 into main Jul 19, 2021
@SimonHeybrock SimonHeybrock deleted the Issue2049_add_xarray_pandas_converters branch July 19, 2021 11:28
@SimonHeybrock
Copy link
Member

Added link to these new features in docs and release notes as part of #2058.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add from_pandas and from_xarray
2 participants