Enhancement of xarray.Dataset.from_dataframe #9015

Open
loco-philippe opened this issue May 7, 2024 · 5 comments

@loco-philippe

loco-philippe commented May 7, 2024

Is your feature request related to a problem?

The current xarray.Dataset.from_dataframe method converts DataFrame columns corresponding to non-index coordinates into variables as explained in the user-guide.

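This solution is not optimal because it does not recover the structure of the initial data: non-index coordinates come back as data variables broadcast over every dimension.
It also creates larger datasets (152B instead of 88B in the example below).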

The user-guide example is below:

In [1]: import numpy as np
        import xarray as xr

        ds = xr.Dataset(
            {"foo": (("x", "y"), np.random.randn(2, 3))},
            coords={
                "x": [10, 20],
                "y": ["a", "b", "c"],
                "along_x": ("x", np.random.randn(2)),
                "scalar": 123,
            },
        )
        ds
Out[1]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) <U1 12B 'a' 'b' 'c'
    along_x  (x) float64 16B -0.03376 0.8059
    scalar   int32 4B 123
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287

In [2]: df = ds.to_dataframe()
        xr.Dataset.from_dataframe(df)
Out[2]:
<xarray.Dataset> Size: 152B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) object 24B 'a' 'b' 'c'
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287
    along_x  (x, y) float64 48B -0.03376 -0.03376 -0.03376 0.8059 0.8059 0.8059
    scalar   (x, y) int32 24B 123 123 123 123 123 123
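
For readers who want to reproduce the effect, here is a minimal sketch using only public xarray calls. It shows the broadcasting after the round trip and one possible manual repair, which assumes you already know that `along_x` depends only on `x` and that `scalar` is constant. The names `roundtrip` and `repaired` are illustrative and not part of any library.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"foo": (("x", "y"), rng.standard_normal((2, 3)))},
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", rng.standard_normal(2)),
        "scalar": 123,
    },
)

# Round trip through pandas: the non-index coordinates come back
# as data variables broadcast over both dimensions.
roundtrip = xr.Dataset.from_dataframe(ds.to_dataframe())
print(roundtrip["along_x"].dims, roundtrip["scalar"].dims)
print(ds.nbytes, roundtrip.nbytes)

# Manual repair, assuming prior knowledge of the structure:
# keep one slice along the dimensions each column does not depend on.
along_x = roundtrip["along_x"].isel(y=0, drop=True)
scalar = roundtrip["scalar"].isel(x=0, y=0, drop=True)
repaired = roundtrip.drop_vars(["along_x", "scalar"]).assign_coords(
    along_x=along_x, scalar=scalar
)
print(repaired)
```

The repair only works because the dependencies are supplied by hand; finding them automatically is the purpose of this request.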

Describe the solution you'd like

If we analyse the relationships between columns, we can distinguish between variables, dimension coordinates and non-dimension coordinates.

In the example above, the round-trip conversion with npd returns the original dataset:

In [3]: df.npd.to_xarray()
Out[3]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) <U1 12B 'a' 'b' 'c'
    along_x  (x) float64 16B -0.03376 0.8059
    scalar   int32 4B 123
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287

Note:

  • npd is the ntv_pandas package (present in the pandas ecosystem). This package is capable of converting complex DataFrames (see examples).

Describe alternatives you've considered

Three options are available to obtain an efficient converter:

  • option 1: keep the current xarray.Dataset.from_dataframe and use the third-party npd solution as the optimized converter,
  • option 2: reuse the analysis package to find dims, coordinates and variables, then modify the xarray.Dataset.from_dataframe method to generate the dataset from that result,
  • option 3: include the analysis functions in the xarray.Dataset.from_dataframe method.

Option 3 seems complex to me.
Options 1 and 2 are both feasible.
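
To make option 2 more concrete, here is a rough sketch of how a dataset could be built once the column relationships are known. The function `dataset_from_relations` and its parameters are hypothetical; this is neither the ntv_pandas implementation nor an existing xarray API. It assumes the dims form a complete grid and that `relations` has the same shape as the `relation_partition()` output shown under "Additional context".

```python
import pandas as pd
import xarray as xr


def dataset_from_relations(df, relations, variables):
    """Hypothetical helper: build a Dataset from a flat DataFrame.

    relations maps each column to the dims it depends on;
    variables lists the columns to keep as data variables.
    """
    # A column that depends only on itself is a dimension.
    dims = [col for col, deps in relations.items() if deps == [col]]
    ds = xr.Dataset.from_dataframe(df.set_index(dims)[variables])
    for col, deps in relations.items():
        if col in dims or col in variables:
            continue
        if not deps:
            # Constant column: store it as a scalar coordinate.
            ds = ds.assign_coords({col: df[col].iloc[0]})
        else:
            # Keep one value per combination of the dims it depends on.
            reduced = df.drop_duplicates(subset=deps).set_index(deps)[col]
            ds = ds.assign_coords({col: xr.DataArray.from_series(reduced)})
    return ds


df = pd.DataFrame(
    {
        "x": [10, 10, 10, 20, 20, 20],
        "y": ["a", "b", "c", "a", "b", "c"],
        "foo": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "along_x": [0.1, 0.1, 0.1, 0.2, 0.2, 0.2],
        "scalar": [123] * 6,
    }
)
relations = {"x": ["x"], "y": ["y"], "along_x": ["x"], "scalar": [], "foo": ["x", "y"]}
result = dataset_from_relations(df, relations, variables=["foo"])
print(result)
```

With this split, the analysis stays in a separate package and xarray only needs to consume its result, which is why option 2 looks lighter than option 3.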

Additional context

The analysis (package tab_analysis) applied to the example above gives the results below:

In [4]: analys = df.reset_index().npd.analysis(distr=True)
        analys.partitions()
Out[4]: [['x', 'y'], ['foo']] # two partitions (dims) are found

In [5]: analys.field_partition() # use the first partition : ['x', 'y']
Out[5]: 
{'primary': ['x', 'y'],
 'secondary': ['along_x'],
 'mixte': [],
 'unique': ['scalar'],
 'variable': ['foo']}

In [6]: analys.relation_partition()
Out[6]: {'x': ['x'], 'y': ['y'], 'along_x': ['x'], 'scalar': [], 'foo': ['x', 'y']}
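
As an illustration of what such a relationship analysis has to determine, the sketch below derives a similar column-to-dims mapping with plain pandas. It is not the tab_analysis algorithm: the function `column_dependencies` is hypothetical, the dims are given rather than discovered, and the test is only valid when the dims form a complete grid without missing values.

```python
import pandas as pd


def column_dependencies(df, dims):
    """Hypothetical helper: list the dims each column varies with."""
    deps = {}
    for col in df.columns:
        if col in dims:
            deps[col] = [col]
            continue
        needed = []
        for dim in dims:
            others = [d for d in dims if d != dim]
            if others:
                # Hold the other dims fixed: does the column still change?
                varies = df.groupby(others)[col].nunique().max() > 1
            else:
                varies = df[col].nunique() > 1
            if varies:
                needed.append(dim)
        deps[col] = needed
    return deps


flat = pd.DataFrame(
    {
        "x": [10, 10, 10, 20, 20, 20],
        "y": ["a", "b", "c", "a", "b", "c"],
        "foo": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "along_x": [0.1, 0.1, 0.1, 0.2, 0.2, 0.2],
        "scalar": [123] * 6,
    }
)
deps = column_dependencies(flat, dims=["x", "y"])
print(deps)
```

A column with an empty list is a scalar, a column that depends on a strict subset of the dims is a non-dimension coordinate, and a column that depends on all dims is a candidate data variable.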

@max-sixty
Collaborator

This looks very cool!

I think the first thing we could do is add a link to the library from the documentation — at least the from_dataframe method...

@loco-philippe
Author

@max-sixty

Thank you, Maximilian, for your quick response!

Yes, it's a good idea. Do you need any additional information for this?

By the way, I'm trying to find out whether another theory of tabular structure analysis exists (see presentation), but I can't find any references. Do you have any contacts or references on this?

@max-sixty
Collaborator

> Yes, it's a good idea. Do you need any additional information for this?

This would be a PR you could make to the docs!

@loco-philippe
Author

OK, that's perfect!

I will prepare a modification of the 'doc/user-guide/pandas.rst' file and then include it in a PR.

Can you confirm that it is not necessary to create a development environment?

@max-sixty
Collaborator

> Can you confirm that it is not necessary to create a development environment?

No, it shouldn't be required!
