Integration with Hugging Face Datasets #60

rabernat · 2022-04-21T15:57:41Z

I've recently been learning about Hugging Face Datasets. It's a great data sharing platform for ML. The datasets package is based on tensorflow datasets.

It would be great to think about how to best integrate Xarray and Xbatcher with huggingface datasets. Opening this issue just as a placeholder. Will update with more detail as I explore.

weiji14 · 2022-04-27T21:30:14Z

Not to hijack this thread, but just found out about xbatcher and was wondering how this fits into the ML ecosystem, and if there are ways we can share efforts and avoid reinventing the wheel. I've started a Pull Request recently at microsoft/torchgeo#509 to connect xarray datasets (technically via rioxarray) to torchgeo, and was pleasantly surprised to have found that xbatcher has implemented something similar a year ago at #25 already!

Will be happy to hear any thoughts on this, I might pop in for the Pangeo ML Working Group meeting to discuss this.

cc @adamjstewart

maxrjones · 2022-04-28T19:22:40Z

Hi @weiji14! @jhamman added this issue and your discussion points to the agenda for the next Pangeo ML Working Group meeting. I would be excited to discuss opportunities to share efforts. I'm just starting to work on xbatcher and also plan to attend next week's meeting.

weiji14 · 2022-04-28T19:43:01Z

Oh hi Meghan! It always surprises me how small the open source world is 😆 Will definitely see what others are up to next Monday. My initial impression was to think of it in terms of a Pytorch/Tensorflow split, or to have the two libraries specialize in terms of loading from a GeoTIFF/Zarr file vs in-memory xarray objects. But the lines aren't quite as clear cut, and given that Pytorch 1.11 recently introduced TorchData/DataPipes, it'll be good to put some smart people together and think about what's the best way forward.

rabernat · 2022-04-28T21:13:37Z

> or to have the two libraries specialize in terms of loading from a GeoTIFF/Zarr file vs in-memory xarray objects

There might not be such a difference between these two approaches, if you remove the "in-memory" part. When you open data with Xarray, it is automatically "lazy" about loading it into memory. It just puts a light "lazy indexing" wrapper around the underlying array in a GeoTIFF / Zarr / NetCDF / GRIB file. A downstream library (xbatcher, pytorch, etc.) can use these arrays in a streaming fashion. The advantage of using Xarray as a loader is that it already speaks all the weird file formats. The disadvantage is that there is some overhead in creating a Dataset, particularly around eager loading of coordinates. There may be workarounds for those, particularly post-Xarray-indexes-refactor.
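To make the "lazy indexing" idea concrete, here is a minimal, standard-library-only sketch. It is not Xarray's or xbatcher's actual implementation; `LazyFloatArray`, `iter_batches`, and `write_demo_file` are hypothetical names invented for illustration, and the flat binary file of float64 values is an assumed stand-in for a real GeoTIFF / Zarr / NetCDF store. The point is only that opening the wrapper reads no data, and each batch pulls just the bytes it needs.

```python
import os
import struct
import tempfile


class LazyFloatArray:
    """Hypothetical lazy wrapper over a file of little-endian float64 values.

    Nothing is read from disk until the object is sliced.
    """

    ITEM_SIZE = struct.calcsize("<d")  # 8 bytes per float64

    def __init__(self, path):
        self.path = path
        # Only file metadata is touched here, not the values themselves.
        self.length = os.path.getsize(path) // self.ITEM_SIZE
        self.values_read = 0  # counts values actually loaded from disk

    def __len__(self):
        return self.length

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        start, stop, step = key.indices(self.length)
        if step != 1:
            raise ValueError("this sketch only supports contiguous slices")
        count = max(stop - start, 0)
        # Seek to the requested window and read only that window.
        with open(self.path, "rb") as f:
            f.seek(start * self.ITEM_SIZE)
            raw = f.read(count * self.ITEM_SIZE)
        self.values_read += count
        return list(struct.unpack(f"<{count}d", raw))


def iter_batches(lazy_array, batch_size):
    """Stream fixed-size batches; the final batch may be shorter."""
    for start in range(0, len(lazy_array), batch_size):
        yield lazy_array[start:start + batch_size]


def write_demo_file(n_values):
    """Write 0.0, 1.0, ..., n_values - 1 to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(struct.pack(f"<{n_values}d", *(float(i) for i in range(n_values))))
    return path


if __name__ == "__main__":
    demo_path = write_demo_file(10)
    try:
        arr = LazyFloatArray(demo_path)  # opening reads no values
        for batch in iter_batches(arr, 4):
            print(batch, "values read so far:", arr.values_read)
    finally:
        os.remove(demo_path)
```

A downstream consumer such as a training loop would iterate over `iter_batches` in the same way, which is roughly the streaming pattern described above. The overhead mentioned for eagerly loaded coordinates has no counterpart in this sketch.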

rabernat · 2022-05-02T15:25:13Z

Twitter thread related to huggingface and Zarr: https://twitter.com/rabernat/status/1517182069943713792
