Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prepending batch dimensions to match Keras interfaces #39

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

cmdupuis3
Copy link

See issues #36 and #38

@jhamman @rabernat I'm not sure if this solves the problem in all cases, can you have a look?

@rabernat
Copy link
Contributor

Thanks a lot for getting this PR started Chris!

I'm not sure if this solves the problem in all cases, can you have a look?

It's not possible for me to know that just by looking at the code. That's why we have tests!

As you can see your changes have caused existing tests to fail. That's because the generated batch dataset is being compared with an "expected" dataset.

https://github.com/pangeo-data/xbatcher/blob/d98ad21f65affa306dfd09d4b1d00e14f599fe65/xbatcher/tests/test_generators.py#L23-L32

Since the expected dataset does not have the extra dimension, the test fails. This raises the following question: should this feature be optional? If so, we need to create a new option for it. Just changing the behavior of the code in this way is a "breaking change" to our API, and could cause problems for other xbatcher users. Tests help guard against this.

Changing the API is not out of the question--this is a very new and experimental package. But it would need to be discussed with the other developers and motivated by a clear argument.

If we do add an option for this feature (e.g. squeeze_batch_dimension=True as a default, but optionally False to enable the behavior in this PR), that needs to be covered by tests as well.

You can run the test suite yourself locally as

pytest -v

(You'll need pytest installed.)

@cmdupuis3
Copy link
Author

Sounds good, I'll try adding the squeeze option and see what happens. I'm guessing xbatcher is supporting non-Keras use cases as well?

@cmdupuis3
Copy link
Author

Still need to add unit test(s)

@codecov
Copy link

codecov bot commented Nov 19, 2021

Codecov Report

Merging #39 (f569190) into main (decee57) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main       #39   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            2         2           
  Lines           77        82    +5     
  Branches        18        20    +2     
=========================================
+ Hits            77        82    +5     
Impacted Files Coverage Δ
xbatcher/generators.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update decee57...f569190. Read the comment docs.

@cmdupuis3
Copy link
Author

@jhamman @rabernat I think this is ready to be merged

@cmdupuis3 cmdupuis3 marked this pull request as draft November 19, 2021 21:18
@cmdupuis3
Copy link
Author

Doing some cleanup atm

@cmdupuis3 cmdupuis3 marked this pull request as ready for review November 30, 2021 19:54
@cmdupuis3
Copy link
Author

@jhamman @rabernat Sorry for the confusion, I think this is in a mergeable state now.

Copy link
Member

@weiji14 weiji14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxrjones, since you're a bit more familiar with Keras, could you take a quick look at this PR in relation to #35 and #36 when you have time? This PR actually doesn't conflict too much with recent changes, though the tests might need some refactoring in line with #124.

That said, I'm thinking about how useful the squeeze_batch_dim is compared to just calling xr.DataArray.squeeze or xr.DataArray.expand_dims manually. I suppose that adding this extra squeeze_batch_dim option increases usability, but it can also be a source of extra confusion, especially since the required shape of an input tensor is very subjective to what model architecture people are using.

@@ -105,6 +117,7 @@ def __init__(
batch_dims={},
concat_input_dims=False,
preload_batch=True,
squeeze_batch_dim=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to set the default to True? I.e. squeeze the batch dimension if all dims are included in the input dims. This might break existing behaviour, so wondering if the default should be False instead for backward compatibility.

Copy link
Author

@cmdupuis3 cmdupuis3 Nov 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would definitely break existing behavior. I'm not really sure which would be more intuitive, I think that's kind of subjective, but personally I would agree with you.

The problem is that the existing behavior already squeezes the array, so I think it would be kind of a pain to have to use xr.DataArray.expand_dims, because you don't really know which dimension is being squeezed (because it's not there anymore). You'd probably end up digging through the code and iterating a few times, like I did in this scenario. Also, the fact that this behavior results in different-dimensional array results just from changing the batch dims, I think breaks the grammar that is being established here. I think your proposal to have False as default plus xr.DataArray.squeeze is much more logical here.

Copy link
Member

@weiji14 weiji14 Nov 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the existing behavior already squeezes the array, so I think it would be kind of a pain to have to use xr.DataArray.expand_dims,

Yeah, I also find this unintuitive, the fact that xbatcher squeezes any non input_dims into a dim called sample (edit: there's actually a related issue at #127). So another way to think of this squeeze_batch_dim feature is that xbatcher will just return the cropped/sliced/chipped array without squeezing the dims. This is more important with higher dimensional (>3D) arrays because sometimes you do want to preserve the extra dims.

@@ -90,6 +98,10 @@ class BatchGenerator:
preload_batch : bool, optional
If ``True``, each batch will be loaded into memory before reshaping /
processing, triggering any dask arrays to be computed.
squeeze_batch_dim : bool, optional
If ``False`` and all dims are input dims, each batch's dataset will have a
"batch" dimension of size 1 prepended to the array. This functionality is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions on the documentation. Maybe L86 and L87 could be edited to indicate the squeeze behavior is controllable with squeeze_batch_dim?

Also, this sentence might be a bit confusing to read. So an extra dimension of size 1 is added/prepended only when batch_dims is None or unset. For cases where len(batch_dims) >= 1), the squeezing/collapsing of dimensions is still happening though. I'm wondering what's a good way to reword this to make it clearer on what is happening and why this option exists.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem would only appear in this one corner case, as far as I could tell. So, my solution was coded to only apply when there were no batch_dims (at least that was my intention). If you're changing the default behavior though, you can probably make this simpler, and therefore the docs would be less confusing too.

@cmdupuis3
Copy link
Author

Actually, if we go with the proposal of default False and rely on xr.DataArray.squeeze, this option becomes completely unnecessary.

@weiji14
Copy link
Member

weiji14 commented Nov 30, 2022

Actually, if we go with the proposal of default False and rely on xr.DataArray.squeeze, this option becomes completely unnecessary.

Hmm, let's get a second opinion from someone who uses Keras. If they agree, we can close this Pull Request?

@maxrjones
Copy link
Member

Sorry for taking so long to respond here. My understanding of the conversation above is that the purpose of this PR can be broken down to two issues:

  1. Should there be an easier option to retain all the original dataset dimensions in the generated batches? Currently one would need to include all the original dims in the dict provided to input_dims. My opinion is that this feature would be useful. One option would be for the user to specify input_dims=False to avoid any stacking step. Since this is not the main purpose of this PR, I suggest opening a separate issue to discuss further if any of you agree that this is worthwhile.
  2. Should there always be a sample dimension (previously called the batch dimension)? This is the topic of Should there always be a sample dimension? #127. If we chose to implement the feature suggested in point 1, I think it makes sense to include a sample dimension in the case in which the user specifies that stacking should take place. In the case without a .stack step, I think it’s trivial to add a length 1 dimension and that it is not worth the API complexity to support that through the batch generator.

If either of these points would benefit from some in-depth discussion, we could put this PR on the agenda for next Monday’s Pangeo ML working group meeting.

@cmdupuis3
Copy link
Author

Hi all, just wondering what the status of this is?

I'm writing some xbatcher example notebooks and I'm running into this issue again.

@maxrjones
Copy link
Member

Hi all, just wondering what the status of this is?

I'm writing some xbatcher example notebooks and I'm running into this issue again.

Apologies again for the delay - my attention was drawn elsewhere due to illness, travel, and some other urgent tasks. If I recall correctly we proposed at our last meeting to always have a defined axis order including a sample dimension and provide the option to specify that no transposing/stacking should occur using input_dims=None. I will try to draft this alternative solution by Monday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants