New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add from_pandas
and from_xarray
#2054
Conversation
for attrib in xarray_dataset.attrs: | ||
attrib_name = "{}{}".format(attrib_prefix, attrib) | ||
if attrib_name in sc_data: | ||
raise ValueError( | ||
f"Attribute {attrib} would collide with an existing " | ||
f"data object '{attrib_name}' (using attrib_prefix " | ||
f"'{attrib_prefix}'). Specify a different attrib_prefix " | ||
f"in the call to from_xarray_dataset") | ||
|
||
sc_data[attrib_name] = scipp.scalar(xarray_dataset.attrs[attrib]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether this is the correct approach in scipp or not, scipp.DataSet
doesn't seem to support attrs
directly unlike scipp.DataArray
? The alternative would probably be to drop the xarray attrs (possibly with a warning about loss of information?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dropping is likely the correct approach, especially given that xarray
coords (which are also xr.DataArray
) can also have attributes, and we don't have a way of supporting that either.
from_pandas
and from_xarray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start!
I haven't implemented any unit handling as it seems to be done in attributes for xarray, and parsing strings felt like it may be a bit too error-prone. Happy to reconsider if there are strong opinions about this.
If you refer to parsing unit strings, that is supported already (using another library under the hood), you can, e.g., sc.Unit('m/s')
. The question is maybe how to find attributes that represent a unit? xarray
appears to be commonly used with NetCDF and it looks like the corresponding attribute name is 'units'
. Thus I might suggest looking for this attribute and use it as the unit. We may also consider supporting singular 'unit'
.
Another things to consider for from_xarray
: xarray
distinguishes "index coords" (marked with *
) and other coords. In terms of how they behave we should:
- Convert "index coords" into scipp coords
- Convert other coord into scipp attrs
See below for other comments:
@@ -0,0 +1,47 @@ | |||
import pandas | |||
|
|||
import scipp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is part of scipp
cannot import scipp
(without running into recursion). Rather use things such as:
from .. import scalar
@@ -0,0 +1,47 @@ | |||
import pandas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to import this file (or functions from it) in __init__.py
we have to make this import optional, for example import only when from_pandas
is called. Otherwise this turns into a mandatory scipp dependency (module won't load unless pandas is installed).
sc_attribs[attr] = scipp.scalar(dataframe.attrs[attr]) | ||
|
||
row_index = dataframe.axes[0] | ||
rows_index_name = row_index.name or "pandas_row" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just 'row'
?
) | ||
} | ||
|
||
if dataframe.ndim == 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think what is done below will work in general: What pandas
calls "2 dimensions" for a DataFrame
is actually a 1-dimensional dataset. I think what we should do is to return a 1-D Dataset
. dataframe.axes[1]
basically gives the names of the dataset entries.
One might consider returning a DataArray
if there is only a single column, but maybe that is too surprising and should be avoided.
) | ||
|
||
return scipp.DataArray( | ||
data=scipp.Variable(values=dataframe.to_numpy(), dims=sc_dims), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here is that to_numpy
does type conversions of columns if column dtype
mismatches. Even worse, if there are things such as string columns the output array has object
dtype (which is inefficient and not well supported by scipp). Basically: See above, return 1-D Dataset
instead.
for data_key in xarray_dataset.data_vars: | ||
sc_data[data_key] = from_xarray_dataarray( | ||
xarray_dataset.data_vars[data_key]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for data_key in xarray_dataset.data_vars: | |
sc_data[data_key] = from_xarray_dataarray( | |
xarray_dataset.data_vars[data_key]) | |
items = { key:from_xarray(var) for key, var in ds.data_vars.items() } |
would be briefer
sc_coords[coord] = scipp.Variable( | ||
dims=[coord], | ||
values=xarray_dataset.coords[coord].values, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (make variable from xarray object) is repeated a couple of times, also in from_xarray_dataarray
above. Can we use common code? xarray also has a Variable
type, but I am not sure that is what you get from xarray_dataset.coords[coord]
(that might be a DataArray
), so I don't know if you can automatically do this using from_xarray
, or need a helper such as _from_xarray_variable
:
coords = { key:_from_xarray_variable(var) for key, var in ds.coords.items() }
(ideally we'd want to simply use coords = { key:from_xarray(var) for key, var in ds.coords.items() }
, but you will have to see if you can figure out automatically that it should create a Variable
and not a DataArray
).
for attrib in xarray_dataset.attrs: | ||
attrib_name = "{}{}".format(attrib_prefix, attrib) | ||
if attrib_name in sc_data: | ||
raise ValueError( | ||
f"Attribute {attrib} would collide with an existing " | ||
f"data object '{attrib_name}' (using attrib_prefix " | ||
f"'{attrib_prefix}'). Specify a different attrib_prefix " | ||
f"in the call to from_xarray_dataset") | ||
|
||
sc_data[attrib_name] = scipp.scalar(xarray_dataset.attrs[attrib]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dropping is likely the correct approach, especially given that xarray
coords (which are also xr.DataArray
) can also have attributes, and we don't have a way of supporting that either.
assert (sc_dataarray.values == scipp.array(dims=["pandas_row"], | ||
values=[1, 2, 3]).values).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use
assert (sc_dataarray.values == scipp.array(dims=["pandas_row"], | |
values=[1, 2, 3]).values).all() | |
assert sc.identical(sc_dataarray, scipp.array(dims=["pandas_row"], values=[1, 2, 3])) |
This will also ensure that unit, dimension labels, coords, ... are correct. Same goes for everything below, you should basically never assert
using the values
or value
property.
from scipp.compat.xarray_compat import from_xarray | ||
|
||
|
||
def test_empty_attrs_dataarray(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't see a test with coords
?
…han a single 2D DataArray
coords={ | ||
rows_index_name: | ||
Variable(dims=[rows_index_name], | ||
values=row_index) | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this the same coord that gets added below? No need to repeat it here?
}, | ||
name=column_name or "") | ||
|
||
return Dataset(data=sc_data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we get a Series
, should we return a DataArray
(or even just a Variable
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've modified the code so that if a Series
is passed in, a DataArray
is returned rather than a Dataset
. However for the 1-D Dataframe
case it is still returning a Dataset
(with 1 item) rather than DataArray
because I felt that changing the return type based on the number of dimensions (rather than the type of object being used) would be too confusing as an API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
""" | ||
Converts xarray coords to scipp coords. | ||
|
||
Note: only converts dimension coords |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code doesn't look like that?
coords = {} | ||
|
||
for coord_name, coord in xr_obj.coords.items(): | ||
is_index = (coord_name in xr_obj.indexes) | ||
if is_index == index: | ||
coords[coord_name] = Variable(dims=coord.dims, values=coord.values) | ||
return coords |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
coords = {}
attrs = {}
for name, coord in xr_obj.coords.items():
if name in xr_obj.indexes:
coords[name] =
else
attrs[name] = _var_from_xarray(coord)
return coords, attrs
with
def _var_from_xarray(xr_obj):
unit = coord.attrs['units'] if units in coord.attrs else ''
return Variable(dims=coord.dims, values=coord.values, unit=unit)
The latter should be used everywhere, since we want to look for units everywhere.
:param ds: an xarray.Dataset object to be converted. | ||
:return: the converted scipp dataset object. | ||
""" | ||
sc_data = {k: from_xarray(v) for k, v in ds.data_vars.items()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the ds.data_vars.items()
needed? What is the difference to ds.items()
?
if name in da.indexes: | ||
coords[name] = _var_from_xarray(coord) | ||
else: | ||
attrs[f"coord_{name}"] = _var_from_xarray(coord) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should be renaming the coords? I suppose you do this to avoid clashes with xarray-attributes? I think we should either rename the xarray-attributes, or simply drop them if there is a name clash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: Sorry, had posted comment as individual comment above, ignore this.
Added link to these new features in docs and release notes as part of #2058. |
Fixes #2049
attr
handling where it makes sense and is supported by both source and targetxarray
, and parsing strings felt like it may be a bit too error-prone. Happy to reconsider if there are strong opinions about this.