
Wrap open_dataset and open_mfdataset to flexibly open datasets #385

Merged (9 commits into main) on Mar 6, 2023

Conversation

pochedls
Collaborator

@pochedls pochedls commented Nov 9, 2022

Description

This PR is intended to wrap xc.open_dataset() and xc.open_mfdataset() such that xcdat can flexibly open the dataset path passed in (whether it is a directory, wildcard search, xml file, single file, etc.).
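The kinds of inputs the wrapper is meant to accept can be sketched with a small stdlib-only classifier (a rough illustration of the dispatch idea, not the PR's actual API; the function name and return values here are made up):

```python
import os
import pathlib
from typing import List, Union


def classify_path(path: Union[str, pathlib.Path, List[str]]) -> str:
    """Illustrative: decide how a flexible open() might treat its input."""
    if isinstance(path, list):
        return "list"  # a list of files -> open_mfdataset
    p = str(path)
    if p.endswith(".xml"):
        return "xml"  # deprecated CDAT XML file
    if any(ch in p for ch in "*?["):
        return "wildcard"  # glob pattern -> open_mfdataset
    if os.path.isdir(p):
        return "directory"  # open all files inside the directory
    return "file"  # single file -> open_dataset


print(classify_path(["a.nc", "b.nc"]))  # list
print(classify_path("data/datasets.xml"))  # xml
print(classify_path("data/*.nc"))  # wildcard
```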

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@github-actions github-actions bot added the type: enhancement New enhancement request label Nov 9, 2022
@codecov-commenter

codecov-commenter commented Nov 9, 2022

Codecov Report

Patch coverage: 100.00% and no project coverage change

Comparison is base (1efb807) 100.00% compared to head (b53a6e7) 100.00%.


Additional details and impacted files
@@            Coverage Diff            @@
##              main      #385   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           14        14           
  Lines         1266      1293   +27     
=========================================
+ Hits          1266      1293   +27     
Impacted Files Coverage Δ
xcdat/dataset.py 100.00% <100.00%> (ø)



@pochedls
Collaborator Author

pochedls commented Nov 9, 2022

I created this draft PR in part to fine-tune the proposed API and so others could identify (and help fix?) issues with the approach in this PR.

I'm wondering if we should just make xcdat.open_dataset() open directories, xml files, individual files, lists, etc. (rather than xcdat.open(), which is what is in the initial PR and what is outlined in Issue #128). Or maybe it's better to distinguish the proposed functionality from open_dataset and open_mfdataset (that way you could always fall back to these foundational functions).

There's not a lot of error handling in the initial PR and I assume there are gaps in the PR logic (and other cases to be considered aside from xml/directory/list/etc). Help identifying these would be useful.

A big assumption in the current PR is that when the user passes in a directory, xcdat should try to open all files in the directory as a multi-file dataset. This gets around having to assume they are all .nc files, but if a directory has mixed files, this could be a problem.

@pochedls pochedls requested a review from lee1043 November 9, 2022 23:19
@pochedls pochedls marked this pull request as draft November 9, 2022 23:19
@lee1043
Collaborator

lee1043 commented Nov 10, 2022

@pochedls I like the idea of having an open wrapper that can also open xml. I have drafted a similar ad-hoc function in the PMP because the current PMP workflow relies on xml. I was considering using my function in the interim until the PMP workflow adapts to your xsearch. The way you coded it looks great and solid. I'd be happy to give it a test when you feel this PR is ready.

@lee1043
Collaborator

lee1043 commented Nov 11, 2022

@pochedls I confirm the open wrapper is working as expected. I tried giving an xml (str), a directory path for netcdf files (str), a list of netcdf files (list), and a single netcdf file (str) to xcdat.open. All work as expected! Thank you very much for adding this convenient wrapper. I am happy with the addition.

@tomvothecoder
Collaborator

Thanks for this PR. I'll comment once I take a closer look.

@tomvothecoder
Collaborator

Notes from meetings:

  • Consider other file types
  • HDF files might be tricky – they can embed datasets within one another?

Tom will review PR

  • Naming of function (open vs. open_dataset/open_mfdataset())
  • Potential inconsistency with xarray usage
  • Make explicit that open wraps open_dataset()/open_mfdataset()
  • Handling mixed file types, file types outside supported xarray

@lee1043
Collaborator

lee1043 commented Feb 15, 2023

Naming of function (open vs. open_dataset/open_mfdataset())

I think we can consider open as an addition to existing open_dataset/open_mfdataset, so "plus" rather than "vs".

xcdat/dataset.py Outdated
Comment on lines 40 to 61
def open(
    path: str,
    data_var: Optional[str] = None,
    add_bounds: bool = True,
    decode_times: bool = True,
    center_times: bool = False,
    lon_orient: Optional[Tuple[float, float]] = None,
    **kwargs,
) -> xr.Dataset:
    """
    Documentation to be written when this PR is
    further along...
    """

    # Inspect path attributes and use either open_dataset
    # or open_mfdataset as appropriate:
    # * list (open list of files with open_mfdataset)
    # * pathlib.Path (open pathlib.Path with open_mfdataset)
    # * deprecated CDAT xml file type
    # * directory (opens with open_dataset or open_mfdataset
    #   depending on the number of files)
    # * wildcard input (opens data with open_mfdataset)
Collaborator

From my understanding, the motivation behind adding a new open function is to provide an abstracted API that determines whether to use open_dataset()/open_mfdataset() based on the path argument to make opening datasets hopefully more intuitive. Additionally, open would include support for the deprecated CDAT XML format.

Let me know if this isn't correct.

xcdat/dataset.py Outdated
# or open_mfdataset as appropriate
# * list (open list of files with open_mfdataset)
# * pathlib.Path (open pathlib.Path with open_mfdataset)
# * deprecated CDAT xml file type
Collaborator

A few questions I had about supporting XML:

  • I noticed in the docstring it says "deprecated CDAT xml file type". Is the structure of the XML formatted specifically by CDAT, and are there plans to eventually move away from it?
    • I think if users pass XML files that don't align with a specific format, then open will break. We'd have to include documentation that the XML file must follow a specific format (not sure if this is appropriate to support though since it isn't generalized like a list of paths).
  • I'm not really familiar with XML usage. How common is the XML format used for reading datasets in the community?

Collaborator Author
@pochedls pochedls Feb 18, 2023

  • The structure of CDAT xml files (climate data markup language) is a dialect of xml files that have a defined set of attributes.
  • We only care that the xml file has a directory attribute – we can probably just return an error saying "the xml file provided does not have a root directory attribute" or something like that
  • XML probably isn't used that widely by the broad climate community, but it is used pretty widely by former CDAT users
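The directory-attribute check described above can be illustrated with the standard library's ElementTree (a sketch only — the directory path is hypothetical, the error message is paraphrased from the comment above, and the PR discussion elsewhere notes lxml is preferred for XXE-safe parsing):

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical CDML-style document: only the root "directory"
# attribute matters for this wrapper.
cdml = '<dataset directory="/p/css03/some/model/run"></dataset>'

root = ET.fromstring(cdml)
directory = root.attrib.get("directory")
if directory is None:
    # Mirrors the suggested error: the xml file lacks a root directory attribute.
    raise KeyError("The xml file provided does not have a root directory attribute")

glob_pattern = directory + "/*.nc"
print(glob_pattern)  # /p/css03/some/model/run/*.nc
```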

Collaborator

Gotcha, this makes sense to me. I think being explicit about these details as you describe is a good idea.

xcdat/dataset.py Outdated
Comment on lines 99 to 119
# Use open_dataset or open_mfdataset depending on the provided path.
if mfdataset:
    ds = open_mfdataset(
        path,
        data_var=data_var,
        add_bounds=add_bounds,
        decode_times=decode_times,
        center_times=center_times,
        lon_orient=lon_orient,
        **kwargs,
    )
else:
    ds = open_dataset(
        path,
        data_var=data_var,
        add_bounds=add_bounds,
        decode_times=decode_times,
        center_times=center_times,
        lon_orient=lon_orient,
        **kwargs,
    )
Collaborator
@tomvothecoder tomvothecoder Feb 17, 2023

I noticed this logic for determining whether to use open_dataset() or open_mfdataset().

Note, open_mfdataset() supports opening a single dataset file and multiple dataset files. The main difference (that I'm aware of) with opening a single dataset with open_dataset() vs. open_mfdataset() is that data variables are automatically Dask arrays with open_mfdataset().

I'm not sure of pros/cons, implications (e.g. performance), etc. of using open_dataset() vs. open_mfdataset() for a single dataset file, but I'd probably lean towards always using open_mfdataset() if you're looping over models or lots of directories. This would eliminate the need for determining which of these APIs to use.

Collaborator Author

When I loop over models I currently use open_mfdataset universally: xc.open_mfdataset(dpath + '*.nc'). xc.open_dataset() used to be faster than xc.open_mfdataset(), though this may have been addressed a few months ago (I remember discussing this, perhaps as part of the large PR that changed the logic for dealing with coords/bounds).

I just checked for one dataset and they were close (24 versus 21 ms), though I should look to see if both functions are close in speed across a broader range of datasets.

Collaborator
@tomvothecoder tomvothecoder Feb 21, 2023

As an alternative to a separate open() function, what are your thoughts on extending xcdat.open_mfdataset() to support CDAT XML?

This seems more intuitive to me because xarray.open_mfdataset() supports glob with wildcard, which returns a list of file paths in the directory regardless if there is 1 or more files. We can extend xcdat.open_mfdataset() to parse the CDAT XML files similar to glob and set path arg to the list of file paths.

As mentioned previously, my general recommendation for users who expect to loop over directories and models is to use open_mfdataset() universally.
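The glob behavior mentioned above — returning a list of matching paths regardless of whether one or many files match — can be demonstrated with the standard library (the temporary directory and file names here are just for the demo):

```python
import glob
import os
import tempfile

# Create a throwaway directory with one .nc file and one unrelated file.
with tempfile.TemporaryDirectory() as tmpdir:
    open(os.path.join(tmpdir, "ts_2000.nc"), "w").close()
    open(os.path.join(tmpdir, "notes.txt"), "w").close()

    # glob returns a list of matching paths whether there is 1 file or many,
    # which is why open_mfdataset can treat both cases uniformly.
    matches = glob.glob(os.path.join(tmpdir, "*.nc"))
    print(len(matches))  # 1
```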

Collaborator Author

I'd be fine with this, but ideally open_mfdataset() could also handle a string directory. I think this would handle all cases in this PR.

I think I personally would then just use open_mfdataset().

The case where open() would still be useful is if open_dataset() is faster than open_mfdataset(); open() would be faster / better by intelligently using open_mfdataset() only when necessary.

Collaborator
@tomvothecoder tomvothecoder Feb 22, 2023

The case where open() would still be useful is if open_dataset() is faster than open_mfdataset(); open() would be faster / better by intelligently using open_mfdataset() only when necessary.

I found some xarray GH issues related to the slow performance of open_mfdataset(), which are mostly related to coordinate compatibility checks when concatenating files and decoding time coordinates with more than 1 dataset. There are workarounds for users if they find open_mfdataset() too slow.

For our specific case of opening only 1 dataset with open_dataset() vs. open_mfdataset(), I don't think the performance difference is that significant because coordinate concatenation isn't necessary like it is with multiple datasets. It seems safe to extend open_mfdataset(), but some performance metrics would be nice to confirm this.

I'd be fine with this, but ideally open_mfdataset() could also handle a string directory. I think this would handle all cases in this PR.

I'll take a look at your implementation and suggest some changes to extend open_mfdataset().

Collaborator
@tomvothecoder tomvothecoder left a comment

Hey @pochedls and @lee1043, this PR is ready for review.

It was pretty easy to adapt @pochedls's code to extend open_mfdataset(). The hardest part was figuring out how to generate a dummy XML file for the unit tests, but I figured it out lol.

xcdat/dataset.py Outdated
Comment on lines 208 to 212
if not isinstance(paths, list):
    if _is_paths_to_xml(paths):
        paths = _parse_xml_for_nc_glob(paths)  # type: ignore
    elif os.path.isdir(paths):
        paths = _parse_dir_for_nc_glob(paths)  # type: ignore
Collaborator
@tomvothecoder tomvothecoder Feb 22, 2023

The new conditional in open_mfdataset() for the paths argument.

Comment on lines +370 to +385
def _is_paths_to_xml(paths: Union[str, pathlib.Path]) -> bool:
    """Checks if the ``paths`` argument is a path to an XML file.

    Parameters
    ----------
    paths : Union[str, pathlib.Path]
        A string or pathlib.Path representing a file path.

    Returns
    -------
    bool
    """
    if isinstance(paths, str):
        return paths.split(".")[-1] == "xml"
    elif isinstance(paths, pathlib.Path):
        return paths.parts[-1].endswith("xml")
Collaborator
@tomvothecoder tomvothecoder Feb 22, 2023

We also need to support pathlib.Path alongside str.
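The str/pathlib split in the check above can be exercised directly; this stdlib-only sketch reproduces the same two branches (note the str branch compares the final extension exactly, while the pathlib branch only checks the suffix text):

```python
import pathlib
from typing import Union


def is_path_to_xml(paths: Union[str, pathlib.Path]) -> bool:
    """Sketch mirroring the PR's check for an XML file path."""
    if isinstance(paths, str):
        # Compare the final dot-separated extension exactly.
        return paths.split(".")[-1] == "xml"
    # pathlib branch: check the filename's trailing text.
    return paths.parts[-1].endswith("xml")


print(is_path_to_xml("datasets.xml"))                # True
print(is_path_to_xml(pathlib.Path("datasets.xml")))  # True
print(is_path_to_xml("datasets.nc"))                 # False
```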

xcdat/dataset.py Outdated
Comment on lines 388 to 431
def _parse_xml_for_nc_glob(xml_path: str) -> str:
    """Parses a CDAT XML file for a glob of `*.nc` paths.

    The CDAT "Climate Data Markup Language" (CDML) is a dialect of XML with a
    defined set of attributes. The "directory" attribute pointing to a directory
    of `.nc` files is extracted from the specified XML file. Refer to [6]_ for
    more information on CDML.

    Parameters
    ----------
    xml_path : str
        The CDML XML file path.

    Returns
    -------
    str
        A glob of `*.nc` paths in the directory.

    Notes
    -----
    CDML is a deprecated XML dialect that is still used by current and former
    users in the CDAT community. xCDAT supports CDML to enable these users to
    more easily integrate xCDAT into their existing CDML/CDAT-based workflows.

    References
    ----------
    .. [6] https://cdms.readthedocs.io/en/latest/manual/cdms_6.html
    """
    # `resolve_entities=False` and `no_network=True` guard against XXE attacks.
    # Source: https://rules.sonarsource.com/python/RSPEC-2755
    parser = etree.XMLParser(resolve_entities=False, no_network=True)
    tree = etree.parse(xml_path, parser)
    root = tree.getroot()

    dir_attr = root.attrib.get("directory")
    if dir_attr is None:
        raise KeyError(
            f"The XML file ({xml_path}) does not have a 'directory' attribute "
            "that points to a directory of `.nc` dataset files."
        )

    glob_path = dir_attr + "/*.nc"

    return glob_path
Collaborator

New function _parse_xml_for_nc_glob() for parsing CDML XML files.

Comment on lines +434 to +456
def _parse_dir_for_nc_glob(dir_path: str) -> str:
    """Parses a directory for a glob of `*.nc` paths.

    Parameters
    ----------
    dir_path : str
        The directory.

    Returns
    -------
    str
        A glob of `*.nc` paths in the directory.
    """
    file_list = os.listdir(dir_path)

    if len(file_list) == 0:
        raise ValueError(
            f"The directory {dir_path} has no netcdf (`.nc`) files to open."
        )

    glob_path = dir_path + "/*.nc"

    return glob_path
Collaborator

New function _parse_dir_for_nc_glob() for parsing a directory

Comment on lines +416 to +421
# `resolve_entities=False` and `no_network=True` guard against XXE attacks.
# Source: https://rules.sonarsource.com/python/RSPEC-2755
parser = etree.XMLParser(resolve_entities=False, no_network=True)
tree = etree.parse(xml_path, parser)
root = tree.getroot()

Collaborator
@tomvothecoder tomvothecoder Feb 22, 2023

I'm using lxml with resolve_entities=False and no_network=True to avoid XXE attacks.
These settings aren't easily configurable with ElementTree from the Python standard lib.


Comment on lines 629 to 696
def test_raises_error_if_xml_does_not_have_root_directory(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    # Create the XML file
    xml_path = f"{self.dir}/datasets.xml"
    page = etree.Element("dataset")
    doc = etree.ElementTree(page)
    doc.write(xml_path, xml_declaration=True, encoding="utf-16")

    with pytest.raises(KeyError):
        open_mfdataset(xml_path, decode_times=True)

def test_opens_datasets_from_xml_using_str_path(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    # Create the XML file
    xml_path = f"{self.dir}/datasets.xml"
    page = etree.Element("dataset", directory=str(self.dir))
    doc = etree.ElementTree(page)
    doc.write(xml_path, xml_declaration=True, encoding="utf-16")

    result = open_mfdataset(xml_path, decode_times=True)
    expected = ds1.merge(ds2)

    assert result.identical(expected)

def test_opens_datasets_from_xml_using_pathlib_path(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    # Create the XML file
    xml_path = self.dir / "datasets.xml"
    page = etree.Element("dataset", directory=str(self.dir))
    doc = etree.ElementTree(page)
    doc.write(xml_path, xml_declaration=True, encoding="utf-16")

    result = open_mfdataset(xml_path, decode_times=True)
    expected = ds1.merge(ds2)

    assert result.identical(expected)

def test_raises_error_if_directory_has_no_netcdf_files(self):
    with pytest.raises(ValueError):
        open_mfdataset(str(self.dir), decode_times=True)

def test_opens_netcdf_files_in_a_directory(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    result = open_mfdataset(str(self.dir), decode_times=True)
    expected = ds1.merge(ds2)

    assert result.identical(expected)

Collaborator

New unit tests covering XML and directory paths argument

@tomvothecoder tomvothecoder marked this pull request as ready for review February 22, 2023 23:52
Collaborator Author
@pochedls pochedls left a comment

This looks good to me. I think we could be more sophisticated in globbing directories. We could see if all files are either .nc, .hdf, .grib, etc. and, if so, open them as a multi-file dataset. If the files are mixed, we can throw an error. This sort of functionality can be left to future development (and worked on when someone identifies this need).

@tomvothecoder
Copy link
Collaborator

This looks good to me. I think we could be more sophisticated in globbing directories. We could see if all files are either .nc, .hdf, .grib, etc. and, if so, open them as a multi-file dataset. If the files are mixed, we can throw an error. This sort of functionality can be left to future development (and worked on when someone identifies this need).

Great idea, I agree with you. I'll go ahead and merge this PR!

@tomvothecoder tomvothecoder self-assigned this Mar 6, 2023
@tomvothecoder tomvothecoder added this to In progress in Next Release (v0.5.0) via automation Mar 6, 2023
@tomvothecoder tomvothecoder merged commit e6dbed5 into main Mar 6, 2023
Next Release (v0.5.0) automation moved this from In progress to Done Mar 6, 2023
@tomvothecoder tomvothecoder deleted the feature/128-flexible-open branch March 6, 2023 18:03
@tomvothecoder tomvothecoder mentioned this pull request Jun 9, 2023
9 tasks
Labels
type: enhancement New enhancement request
Development

Successfully merging this pull request may close these issues.

Add xcdat.open() to wrap xcdat.open_dataset() and xcdat.open_mfdataset()
4 participants