Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DataArray.to_dask_dataframe() #7635

Merged
merged 75 commits into from
Apr 28, 2023

Conversation

dsgreen2
Copy link
Contributor

@dsgreen2 dsgreen2 commented Mar 16, 2023

Adds a method to_dask_dataframe() to convert a dataarray to a dask dataframe.

I have added the function to_dask_dataframe() in dataarray.py . This implementation is as suggested in issue #7409 . The function first converts the data array to a temporary dataset and then calls Dataset.to_dask_dataframe() method.

Could you please review it . Thank you.

Examples
--------

da=xr.DataArray(np.random.rand(4,3,2),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example is not correctly formatted, see other functions as reference.

Also, don't use random values in examples, simply use np.ones(...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.ones can hide errors when dealing with tricky shapes so something like np.arange(4*3*2).reshape(4,3,2) is a little better.

@@ -3205,6 +3205,39 @@ def test_to_dataframe_0length(self) -> None:
assert len(actual) == 0
assert_array_equal(actual.index.names, list("ABC"))

def test_to_dask_dataframe(self) -> None:
arr_np = np.random.randn(3, 4)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though it doesn't really matter in most cases, we try to avoid random values in tests.

Maybe use np.arange(3*4).reshape(3,4).

if self.ndim == 0:
raise ValueError("Cannot convert a scalar to a dataframe")

tmp_dataset = Dataset({name: self})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally we use the to tmp dataset method here, but since we only use it to construct the data frame and don't roundtrip it doesn't actually matter?

xarray/core/dataarray.py Outdated Show resolved Hide resolved
Comment on lines 6686 to 6687
dim_order: Sequence of Hashable or None , optional
Hierarchical dimension order for the resulting dataframe.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dim_order: Sequence of Hashable or None , optional
Hierarchical dimension order for the resulting dataframe.
dim_order: Sequence of Hashable or None , optional
Hierarchical dimension order for the resulting dataframe.

Follow numpys docstring conventions. More errors above and below.

xarray/core/dataarray.py Outdated Show resolved Hide resolved
Comment on lines 6725 to 6732
if name is None:
name = self.name

if name is None:
raise ValueError(
"Cannot convert an unnamed DataArray to a "
"dask dataframe : use the ``name`` parameter"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if name is None:
name = self.name
if name is None:
raise ValueError(
"Cannot convert an unnamed DataArray to a "
"dask dataframe : use the ``name`` parameter"
)

Not needed when using self._to_dataset_whole.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be better to keep this error message ? When I removed it, The error shown was ' unable to convert unnamed DataArray to a Dataset without providing an explicit name ' . Keeping these lines can show the error message specific to dataarray to daskdataframe conversion.

xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved
Examples
--------

da=xr.DataArray(np.random.rand(4,3,2),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.ones can hide errors when dealing with tricky shapes so something like np.arange(4*3*2).reshape(4,3,2) is a little better.

@dsgreen2
Copy link
Contributor Author

I have made the changes as suggested. Please review them .Thanks

Comment on lines 6732 to 6741
if name is None:
name = self.name

if name is None:
raise ValueError(
"Cannot convert an unnamed DataArray to a "
"dask dataframe : use the ``name`` parameter ."
)
ds = self._to_dataset_whole(name)
return ds.to_dask_dataframe(dim_order, set_index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if name is None:
name = self.name
if name is None:
raise ValueError(
"Cannot convert an unnamed DataArray to a "
"dask dataframe : use the ``name`` parameter ."
)
ds = self._to_dataset_whole(name)
return ds.to_dask_dataframe(dim_order, set_index)
name = self.name if self.name is not None else _THIS_ARRAY
ds = self._to_dataset_whole(name, shallow_copy=False)
return ds.to_dask_dataframe(dim_order, set_index)

I think we go with this. I don't think it should be necessary to name the dataarray which is more in line with how self._to_temp_dataset works and setting dataarray.name = "new_name" is easy enough.

Comment on lines 6694 to 6700

name : Hashable or None, optional
Name given to this array(required if unnamed).
It is a keyword-only argument. A keyword-only argument can only be passed
to the function using its name as a keyword argument , and not as a
positional argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name : Hashable or None, optional
Name given to this array(required if unnamed).
It is a keyword-only argument. A keyword-only argument can only be passed
to the function using its name as a keyword argument , and not as a
positional argument.

Comment on lines 6681 to 6682
*,
name: Hashable | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*,
name: Hashable | None = None,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the changes .Please review them.

Copy link
Contributor

@Illviljan Illviljan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still an issue with the docstring example, probably some whitespace mismatch somewhere. It should just be copy/pasting the results from the ipython console.

@@ -70,6 +71,7 @@ New Features
By `Deepak Cherian <https://github.com/dcherian>`_.
- Improved performance in ``open_dataset`` for datasets with large object arrays (:issue:`7484`, :pull:`7494`).
By `Alex Goodman <https://github.com/agoodm>`_ and `Deepak Cherian <https://github.com/dcherian>`_.
- Added new method :py:meth:`DataArray.to_dask_dataframe`,convert a dataarray into a dask dataframe (:issue:`7409`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove.

xarray/core/dataarray.py Outdated Show resolved Hide resolved
vectors in contiguous order , so the last dimension in this list
will be contiguous in the resulting DataFrame. This has a major influence
on which operations are efficient on the resulting dask dataframe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Copy link
Contributor Author

@dsgreen2 dsgreen2 Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the changes .Please review them. Thanks

if self.name is None:
raise ValueError(
"Cannot convert an unnamed DataArray to a "
"dask dataframe : use the ``name`` parameter ."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"dask dataframe : use the ``name`` parameter ."
"dask dataframe : use the ``.rename`` method to assign a name."

@dcherian
Copy link
Contributor

dcherian commented Apr 28, 2023

Thanks for your patience here @dsgreen2 . This is a nice contribution. Welcome to Xarray!

@dcherian dcherian enabled auto-merge (squash) April 28, 2023 14:29
@dcherian dcherian merged commit 25d9a28 into pydata:main Apr 28, 2023
26 checks passed
@welcome
Copy link

welcome bot commented Apr 28, 2023

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

dcherian added a commit to dcherian/xarray that referenced this pull request May 6, 2023
* main:
  Introduce Grouper objects internally (pydata#7561)
  [skip-ci] Add cftime groupby, resample benchmarks (pydata#7795)
  Fix groupby binary ops when grouped array is subset relative to other (pydata#7798)
  adjust the deprecation policy for python (pydata#7793)
  [pre-commit.ci] pre-commit autoupdate (pydata#7803)
  Allow the label run-upstream to run upstream CI (pydata#7787)
  Update asv links in contributing guide (pydata#7801)
  Implement DataArray.to_dask_dataframe() (pydata#7635)
  `ds.to_dict` with data as arrays, not lists (pydata#7739)
  Add lshift and rshift operators (pydata#7741)
  Use canonical name for set_horizonalalignment over alias set_ha (pydata#7786)
  Remove pandas<2 pin (pydata#7785)
  [pre-commit.ci] pre-commit autoupdate (pydata#7783)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement DataArray.to_dask_dataframe()
4 participants