Integration with Hugging Face Datasets #60

Open
rabernat opened this issue Apr 21, 2022 · 5 comments

@rabernat
Contributor

I've recently been learning about Hugging Face Datasets. It's a great data sharing platform for ML. The datasets package is based on TensorFlow Datasets.

It would be great to think about how best to integrate Xarray and Xbatcher with Hugging Face Datasets. Opening this issue just as a placeholder. Will update with more detail as I explore.

@weiji14
Member

weiji14 commented Apr 27, 2022

Not to hijack this thread, but I just found out about xbatcher and was wondering how it fits into the ML ecosystem, and whether there are ways we can share efforts and avoid reinventing the wheel. I recently started a pull request at microsoft/torchgeo#509 to connect xarray datasets (technically via rioxarray) to torchgeo, and was pleasantly surprised to find that xbatcher already implemented something similar a year ago in #25!

Happy to hear any thoughts on this. I might pop in to the Pangeo ML Working Group meeting to discuss it.

cc @adamjstewart

@maxrjones
Member

Hi @weiji14! @jhamman added this issue and your discussion points to the agenda for the next Pangeo ML Working Group meeting. I would be excited to discuss opportunities to share efforts. I'm just starting to work on xbatcher and also plan to attend next week's meeting.

@weiji14
Member

weiji14 commented Apr 28, 2022

Oh hi Meghan! It always surprises me how small the open source world is 😆 Will definitely see what others are up to next Monday. My initial impression was to think of it in terms of a PyTorch/TensorFlow split, or to have the two libraries specialize in loading from a GeoTIFF/Zarr file vs. in-memory xarray objects. But the lines aren't quite that clear-cut, and given that PyTorch 1.11 recently introduced TorchData/DataPipes, it'll be good to put some smart people together and think about the best way forward.

@rabernat
Contributor Author

or to have the two libraries specialize in terms of loading from a GeoTIFF/Zarr file vs in-memory xarray objects

There might not be such a difference between these two approaches if you remove the "in-memory" part. When you open data with Xarray, it is automatically "lazy" about loading it into memory. It just puts a light "lazy indexing" wrapper around the underlying array in a GeoTIFF / Zarr / NetCDF / GRIB file. A downstream library (xbatcher, PyTorch, etc.) can use these arrays in a streaming fashion. The advantage of using Xarray as a loader is that it already speaks all the weird file formats. The disadvantage is that there is some overhead in creating a Dataset, particularly around eager loading of coordinates. There may be workarounds for those, particularly after the Xarray indexes refactor.
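To make the "lazy indexing wrapper plus streaming consumer" idea concrete, here is a toy sketch in plain Python. It is not Xarray's or xbatcher's actual implementation: `LazyArray`, `fake_file_read`, and `iter_batches` are hypothetical names made up for illustration, and the "file" is simulated. The only point is that opening reads nothing, and each batch triggers a read of just its own slice.

```python
from typing import Callable, Iterator, List, Tuple


class LazyArray:
    """Toy stand-in for a lazily indexed on-disk array (hypothetical, not Xarray's class)."""

    def __init__(self, length: int, read_fn: Callable[[int, int], List[float]]):
        self.length = length
        self._read_fn = read_fn  # only called when a slice is requested
        self.reads: List[Tuple[int, int]] = []  # log of (start, stop) ranges fetched

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, key: slice) -> List[float]:
        start, stop, step = key.indices(self.length)
        if step != 1:
            raise ValueError("this sketch only supports contiguous slices")
        self.reads.append((start, stop))
        return self._read_fn(start, stop)


def fake_file_read(start: int, stop: int) -> List[float]:
    """Pretend to read from a file; element i simply has the value float(i)."""
    return [float(i) for i in range(start, stop)]


def iter_batches(arr: LazyArray, batch_size: int) -> Iterator[List[float]]:
    """Stream fixed-size batches; each batch reads only its own slice."""
    for start in range(0, len(arr), batch_size):
        yield arr[start:start + batch_size]


arr = LazyArray(10, fake_file_read)  # "opening" the data reads nothing
batches = iter_batches(arr, batch_size=4)
first = next(batches)                # the first read happens here
print(first)      # [0.0, 1.0, 2.0, 3.0]
print(arr.reads)  # [(0, 4)]
```

In this picture, the wrapper corresponds to what Xarray provides when it opens a file, and `iter_batches` corresponds to the downstream consumer. The coordinate-loading overhead mentioned above is not modeled here.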

@rabernat
Contributor Author

rabernat commented May 2, 2022

Twitter thread related to huggingface and Zarr: https://twitter.com/rabernat/status/1517182069943713792
