Enhancement of xarray.Dataset.from_dataframe #9015

loco-philippe · 2024-05-07T21:32:25Z

Is your feature request related to a problem?

The current xarray.Dataset.from_dataframe method converts DataFrame columns corresponding to non-index coordinates into variables as explained in the user-guide.

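This is not optimal because the round trip does not recover the structure of the initial data: non-index coordinates come back as data variables broadcast over every dimension.
It also creates larger datasets (152 B instead of 88 B in the example below).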

The user-guide example is below:

In [1]: ds = xr.Dataset(
              {"foo": (("x", "y"), np.random.randn(2, 3))},
              coords={
                  "x": [10, 20],
                  "y": ["a", "b", "c"],
                  "along_x": ("x", np.random.randn(2)),
                  "scalar": 123,
              },
         )
         ds
Out[1]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) <U1 12B 'a' 'b' 'c'
    along_x  (x) float64 16B -0.03376 0.8059
    scalar   int32 4B 123
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287

In [2]: df = ds.to_dataframe()
        xr.Dataset.from_dataframe(df)
Out[2]:
<xarray.Dataset> Size: 152B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) object 24B 'a' 'b' 'c'
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287
    along_x  (x, y) float64 48B -0.03376 -0.03376 -0.03376 0.8059 0.8059 0.8059
    scalar   (x, y) int32 24B 123 123 123 123 123 123


Describe the solution you'd like

If we analyse the relationships between columns, we can distinguish between variables, dimension coordinates and non-dimension coordinates.

In the example above, the round-trip conversion with npd returns the same dataset as the original:

In [3]: df.npd.to_xarray()
Out[3]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) <U1 12B 'a' 'b' 'c'
    along_x  (x) float64 16B -0.03376 0.8059
    scalar   int32 4B 123
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287

Note:

npd is the ntv_pandas package (listed in the pandas ecosystem). It can convert complex DataFrames (see examples).

Describe alternatives you've considered

Three options are available for an efficient converter:

option 1: keep the current xarray.Dataset.from_dataframe as it is and use the third-party npd solution as the optimized converter,
option 2: reuse the analysis package to find dims, coordinates and variables, then modify the xarray.Dataset.from_dataframe method to generate a dataset from that result,
option 3: include the analysis functions in the xarray.Dataset.from_dataframe method itself.

It seems to me that option 3 is complex.
Options 1 and 2 are both feasible.
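
To make option 2 more concrete, here is a minimal sketch of how a relation mapping could drive the conversion. It assumes a mapping shaped like the `relation_partition()` output shown under "Additional context". The helper name `from_dataframe_with_relations` is hypothetical and is not part of xarray or ntv_pandas; it only uses existing xarray calls (`Dataset.from_dataframe`, `isel`, `drop_vars`, `assign_coords`).

```python
import numpy as np
import xarray as xr


def from_dataframe_with_relations(df, relations):
    """Hypothetical helper: rebuild a Dataset and demote broadcast columns to coordinates.

    `relations` maps each column name to the list of dimensions it depends on.
    """
    # Standard conversion: every column becomes a variable over all index levels.
    full = xr.Dataset.from_dataframe(df)
    coords = {}
    for name in list(full.data_vars):
        deps = relations[name]
        extra = {dim: 0 for dim in full[name].dims if dim not in deps}
        if not extra:
            continue  # depends on every dimension: a genuine data variable
        # Assumption: the column is constant along the extra dimensions,
        # so taking the first slice along each of them loses nothing.
        reduced = full[name].isel(extra, drop=True)
        coords[name] = (reduced.dims, reduced.values)
    return full.drop_vars(list(coords)).assign_coords(coords)


# Same structure as the user-guide example, with fixed values instead of random ones.
ds = xr.Dataset(
    {"foo": (("x", "y"), np.array([[-1.811, 0.4769, 0.1201], [-0.0352, -1.61, -1.287]]))},
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", [-0.03376, 0.8059]),
        "scalar": 123,
    },
)
df = ds.to_dataframe()
relations = {"along_x": ["x"], "scalar": [], "foo": ["x", "y"]}
restored = from_dataframe_with_relations(df, relations)
```

In this sketch `along_x` and `scalar` come back as coordinates with their original dimensions, while `foo` stays a data variable. Dtypes are not restored (for example, `y` remains `object`), so the result is not guaranteed to be identical to the initial dataset.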

Additional context

The analysis (package tab_analysis) applied to the example above gives the results below:

In [4]: analys = df.reset_index().npd.analysis(distr=True)
        analys.partitions()
Out[4]: [['x', 'y'], ['foo']] # two partitions (dims) are found

In [5]: analys.field_partition() # use the first partition: ['x', 'y']
Out[5]: 
{'primary': ['x', 'y'],
 'secondary': ['along_x'],
 'mixte': [],
 'unique': ['scalar'],
 'variable': ['foo']}

In [6]: analys.relation_partition()
Out[6]: {'x': ['x'], 'y': ['y'], 'along_x': ['x'], 'scalar': [], 'foo': ['x', 'y']}
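
The mapping in Out[6] can be approximated with a plain functional-dependency check: a column depends on a set of dimensions if it takes a single value for each combination of those dimensions. The sketch below is pure Python and is not the tab_analysis algorithm; the function names `depends_on` and `minimal_dims` are hypothetical, and the dimensions `['x', 'y']` are supplied by hand rather than discovered.

```python
from itertools import combinations


def depends_on(rows, column, dims):
    """Return True if `column` has exactly one value per combination of `dims`."""
    seen = {}
    for row in rows:
        key = tuple(row[d] for d in dims)
        if seen.setdefault(key, row[column]) != row[column]:
            return False
    return True


def minimal_dims(rows, column, dims):
    """Return the smallest subset of `dims` that determines `column`, or None."""
    for size in range(len(dims) + 1):
        for subset in combinations(dims, size):
            if depends_on(rows, column, subset):
                return list(subset)
    return None


# Rows of the flattened DataFrame from the example (values taken from Out[2]).
foo = [-1.811, 0.4769, 0.1201, -0.0352, -1.61, -1.287]
along_x = {10: -0.03376, 20: 0.8059}
rows = [
    {"x": x, "y": y, "along_x": along_x[x], "scalar": 123, "foo": foo[3 * i + j]}
    for i, x in enumerate([10, 20])
    for j, y in enumerate(["a", "b", "c"])
]

relations = {name: minimal_dims(rows, name, ["x", "y"]) for name in rows[0]}
```

A column that depends on no dimension is a scalar coordinate, one that depends on a strict subset is a non-dimension coordinate, and one that needs every dimension is a variable.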


max-sixty · 2024-05-07T22:31:57Z

This looks very cool!

I think the first thing we could do is add a link to the library from the documentation — at least the from_dataframe method...

loco-philippe · 2024-05-08T08:06:58Z

@max-sixty

Thank you, Maximilian, for your quick response!

Yes, it's a good idea. Do you need any additional information for this?

By the way, I'm trying to find out whether another theory of tabular structure analysis (see presentation) exists, but I can't find any references. Do you have any contacts or references on this topic?

max-sixty · 2024-05-08T16:54:48Z

> Yes, it's a good idea. Do you need any additional information for this?

This would be a PR you could make to the docs!

loco-philippe · 2024-05-08T20:43:30Z

OK, that's perfect!

I will prepare a modification of the 'doc/user-guide/pandas.rst' file and then include it in a PR.

Can you confirm that it is not necessary to create a development environment?

max-sixty · 2024-05-08T21:40:32Z

> Can you confirm that it is not necessary to create a development environment?

No, it shouldn't be required!

loco-philippe added the enhancement label May 7, 2024

loco-philippe mentioned this issue May 10, 2024

User-guide - pandas : Add alternative to xarray.Dataset.from_dataframe #9020

Open
