
Add PyTorch dataloader #25

Merged: 9 commits merged on Feb 23, 2022
Conversation

@jhamman (Contributor) commented Aug 12, 2021

This PR is an initial attempt at adding a set of data loaders. It makes some changes to the base BatchGenerator class and adds a new loaders.torch module to prototype real-life data loaders.
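As background for the discussion below, a PyTorch-style loader is built around the map-style dataset protocol. This is an illustrative sketch only, written with plain Python lists so it runs without torch or xbatcher installed; in the PR, the batches would come from a BatchGenerator, and the class name here is hypothetical, not the PR's API.

```python
class MapDataset:
    """Minimal map-style dataset: implements __len__ and __getitem__,
    which is all torch.utils.data.DataLoader needs for indexed access."""

    def __init__(self, X_batches, y_batches):
        assert len(X_batches) == len(y_batches)
        self.X_batches = X_batches
        self.y_batches = y_batches

    def __len__(self):
        return len(self.X_batches)

    def __getitem__(self, idx):
        # a real loader would convert these to torch tensors here
        return self.X_batches[idx], self.y_batches[idx]

ds = MapDataset([[0, 1], [2, 3]], [0, 1])
```

The key point is random access by index, which shapes the eager-versus-lazy discussion in the review comments below.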

Comment on lines +147 to +150
def _gen_batches(self) -> dict:
# in the future, we will want to do the batch generation lazily
# going the eager route for now is allowing me to fill out the loader api
# but it is likely to perform poorly.
@jhamman (Contributor, Author) commented Aug 12, 2021

Flagging this as something to discuss and work out a design for. It feels quite important that we are able to generate arbitrary batches on the fly. The current implementation eagerly generates batches, which will not scale well. However, the pure generator approach doesn't work if you need to randomly access batches (e.g. via __getitem__).
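One way to reconcile the two constraints above is to compute each batch's slice from its index on demand, keeping random access without eager materialization. This is a hedged sketch of that idea under simplified assumptions (1-D data, fixed batch size), not the implementation in this PR:

```python
class LazyBatches:
    """Lazily sliced batches with random access: nothing is
    materialized until a specific batch index is requested."""

    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size

    def __len__(self):
        # number of complete batches
        return len(self.data) // self.batch_size

    def __getitem__(self, idx):
        if idx < 0 or idx >= len(self):
            raise IndexError(idx)
        # compute the slice for batch `idx` on the fly
        start = idx * self.batch_size
        return self.data[start : start + self.batch_size]

batches = LazyBatches(list(range(10)), batch_size=3)
```

Because the slice bounds are a pure function of the index, this supports both sequential iteration and shuffled access from a DataLoader-style consumer.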

# TODO: figure out the dataset -> array workflow
# currently hardcoding a variable name
X_batch = self.X_generator[idx]['x'].torch.to_tensor()
y_batch = self.y_generator[idx]['y'].torch.to_tensor()
@jhamman (Contributor, Author)

Flagging that we can't use named tensors here while we wait for pytorch/pytorch#29010.
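Until named tensors are usable, one workaround is to return plain arrays and carry the dimension names alongside them explicitly. This helper is purely illustrative (the function name and return shape are assumptions, not part of the PR):

```python
def to_plain(values, dims):
    """Pair raw batch values with their dimension names, instead of
    relying on torch named tensors (blocked on pytorch/pytorch#29010)."""
    return {"data": values, "dims": tuple(dims)}

batch = to_plain([[1, 2], [3, 4]], dims=("sample", "x"))
```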

codecov bot commented Feb 23, 2022

Codecov Report

Merging #25 (8bcd870) into main (802bbd5) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main       #25   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            2         3    +1     
  Lines           77       134   +57     
  Branches        18        30   +12     
=========================================
+ Hits            77       134   +57     
Impacted Files Coverage Δ
xbatcher/accessors.py 100.00% <100.00%> (ø)
xbatcher/generators.py 100.00% <100.00%> (ø)
xbatcher/loaders/torch.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 802bbd5...8bcd870.

@ghiggi commented Feb 25, 2022

Hi @jhamman. @djhoese suggested I get in touch with you to share some of my previous work on a PyTorch DataLoader designed specifically for loading spatio-temporal data stored in Zarr.

Our implementation targets the development of autoregressive forecasting models, where we differentiate between three sources of data:

  1. dynamic data: varies over time, is the prediction target, and can be reinjected as input into subsequent model iterations
  2. boundary condition data: injected into the model at each time step and either known a priori (e.g. top-of-atmosphere solar radiation as a function of time) or computable from previous predictions
  3. static data: invariant over time
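The three-source split above can be sketched as a single sample-assembly step. Everything here is a hypothetical illustration of the described layout (field names, `history` window, and the function itself are assumptions, not ghiggi's actual API):

```python
def make_sample(dynamic, boundary, static, t, history=2):
    """Assemble one autoregressive training sample at time index t."""
    return {
        # past dynamic states are the input; the next state is the target,
        # and predictions can later be reinjected in place of dynamic[t]
        "x_dynamic": dynamic[t - history : t],
        "y_target": dynamic[t],
        # boundary conditions are known (or computable) for every step
        "boundary": boundary[t],
        # static fields carry no time index
        "static": static,
    }

sample = make_sample(
    dynamic=[10, 11, 12, 13],
    boundary=[0.1, 0.2, 0.3, 0.4],
    static={"elevation": 5.0},
    t=2,
)
```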

Our implementation does not currently sample spatial patches, although we plan to work on that in the coming months, and we plan to formalize everything in an xforecasting library.

The proof-of-concept is available here

We also plan to design a sort of xscaler library to preprocess nD tensors in scikit-learn scaler fashion ;)

Cheers

@maxrjones maxrjones changed the title Add pytorch dataloader Add PyTorch dataloader Oct 17, 2022