open_virtual_dataset with dmr++ #113

Status: Open
ayushnag wants to merge 17 commits into main
Conversation

@ayushnag (Contributor) commented May 14, 2024

@TomNicholas added the labels "references generation" (Reading byte ranges from archival files) and "enhancement" (New feature or request) on May 14, 2024
@ayushnag changed the title from "basic dmr parsing functionality" to "open_dataset with dmr++" on May 14, 2024
@TomNicholas changed the title from "open_dataset with dmr++" to "open_virtual_dataset with dmr++" on May 14, 2024
virtualizarr/xarray.py: two resolved review threads (outdated)
chunk_num = chunk_pos // chunks  # [0,1023,10235] // [1, 1023, 2047] -> [0,1,5]
chunk_key = ".".join(map(str, chunk_num))  # [0,0,1] -> "0.0.1"
Collaborator:
I have a join function in virtualizarr.zarr for doing this.
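
For reference, a minimal runnable sketch of the chunk-key derivation quoted above (assuming chunk_pos and chunks are 1-D NumPy integer arrays; the example values come from the inline comments, not from real data):

import numpy as np

# Starting index of one chunk along each dimension, and the chunk shape.
chunk_pos = np.array([0, 1023, 10235])
chunks = np.array([1, 1023, 2047])

chunk_num = chunk_pos // chunks            # element-wise floor division -> [0, 1, 5]
chunk_key = ".".join(map(str, chunk_num))  # Zarr-style chunk key: "0.1.5"
print(chunk_key)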

virtualizarr/dmrpp.py: resolved review thread (outdated)
@agoodm commented May 15, 2024

Thanks for taking a look and giving my suggested changes to the chunk key parsing a try, @ayushnag!

Continuing the discussion on performance: I think the remaining bottlenecks (aside from your point about cloud I/O, perhaps) now lie primarily outside the scope of this work, and I don't expect changing XML readers to make a significant improvement.

Comment on lines 72 to 73
group : str, default None
    Group path within the dataset to open, e.g. a netCDF4 or HDF5 group.
Collaborator:
It would be nice to separate out the addition of this kwarg into a separate pull request, and implement it for the existing HDF5 reader. Then this PR wouldn't need to change the API of open_virtual_dataset.
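
For context, the kwarg under discussion would let callers do something like the following (a hypothetical sketch: the filename, the filetype value, and the group path are made up, and the final signature is whatever this PR or a follow-up settles on):

from virtualizarr import open_virtual_dataset

# Hypothetical usage of the proposed `group` kwarg: open only the variables
# under one netCDF4/HDF5-style group of the referenced granule.
vds = open_virtual_dataset(
    "ATL03_example.h5.dmrpp",  # made-up filename
    filetype="dmrpp",          # assumed filetype hook for this reader
    group="/gt1l/heights",     # made-up group path
)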

virtualizarr/readers/dmrpp.py: resolved review thread
@@ -0,0 +1,331 @@
from typing import Optional
from xml.etree import ElementTree as ET
Collaborator:
Is this the only extra import required? (And this is a built-in Python library module, right?)

Contributor Author (@ayushnag):
Yes, this is the only extra import, and it is built in.
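
For context, a minimal standalone sketch of the kind of parsing ElementTree enables here; the XML fragment is invented for illustration, and the element and attribute names (dmrpp:chunk, offset, nBytes, chunkPositionInArray) plus the namespace URI follow the published DMR++ convention rather than anything specific to this PR:

from xml.etree import ElementTree as ET

# Tiny invented DMR++ fragment: one chunks element holding two chunks.
dmrpp_xml = """
<Dataset xmlns:dmrpp="http://xml.opendap.org/dap/dmrpp/1.0.0#">
  <dmrpp:chunks>
    <dmrpp:chunk offset="4096" nBytes="131072" chunkPositionInArray="[0,0]"/>
    <dmrpp:chunk offset="135168" nBytes="131072" chunkPositionInArray="[0,2047]"/>
  </dmrpp:chunks>
</Dataset>
"""

ns = {"dmrpp": "http://xml.opendap.org/dap/dmrpp/1.0.0#"}
root = ET.fromstring(dmrpp_xml)
for chunk in root.iterfind(".//dmrpp:chunk", ns):
    # The byte offset and length are what end up in the virtual chunk manifest.
    print(chunk.get("offset"), chunk.get("nBytes"), chunk.get("chunkPositionInArray"))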

Contributor Author @ayushnag, Jun 27, 2024:
The int32 to int64 change had to be made because I ran into some large byte offsets with the ICESat-2 ATLAS dataset. Here is an example error: OverflowError: Python integer 6751178683 out of bounds for int32
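
A minimal sketch of that failure mode and the fix (the offset value is taken from the error message above; everything else is illustrative):

import numpy as np

offset = 6751178683  # larger than 2**31 - 1, so it cannot be an int32

# Recent NumPy versions raise OverflowError here, matching the error above;
# older versions may warn or wrap instead.
try:
    np.array([offset], dtype=np.int32)
except OverflowError as err:
    print(err)

# Storing manifest offsets and lengths as int64 (or uint64) avoids the problem.
offsets = np.array([offset], dtype=np.int64)
print(offsets)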

@ayushnag (Contributor Author) commented Jun 27, 2024

Some questions about writing unit tests:

  • How should test dmrpp files be loaded?
    • These files are available over HTTPS but require netrc login (NASA Earthdata authentication).
    • I will check how earthaccess gets credentials and handles this in its tests (see the sketch after this list).
  • What should I compare my result to?
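
One possible shape for such a test, hedged heavily: it assumes earthaccess.login() (which can read ~/.netrc or Earthdata environment variables) and earthaccess.get_requests_https_session() behave as in earthaccess's public API, skips when no credentials are available, and uses a placeholder URL rather than a real fixture:

import pytest

earthaccess = pytest.importorskip("earthaccess")

# Placeholder URL; a real test would point at a known dmrpp granule.
DMRPP_URL = "https://example.invalid/granule.h5.dmrpp"


@pytest.fixture(scope="session")
def earthdata_session():
    """Skip tests when NASA Earthdata credentials are not configured."""
    try:
        auth = earthaccess.login(strategy="netrc")
    except Exception:
        pytest.skip("no Earthdata credentials available")
    if not getattr(auth, "authenticated", False):
        pytest.skip("no Earthdata credentials available")
    return earthaccess.get_requests_https_session()


def test_fetch_dmrpp(earthdata_session):
    resp = earthdata_session.get(DMRPP_URL)
    resp.raise_for_status()
    assert resp.text.lstrip().startswith("<?xml")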

Labels: enhancement (New feature or request), references generation (Reading byte ranges from archival files)
Projects: none yet
Development: successfully merging this pull request may close the issue "Reading from dmrcp index files?"
3 participants