Skip to content

Conversation

@annxingyuan
Copy link
Contributor

@annxingyuan annxingyuan commented Sep 2, 2020

This fixes #3869.

The leak happens because we call reduce recursively without cleaning up intermediate outputs. This PR changes max to call reduce iteratively and cleans up intermediate outputs along the way.

The tests I added actually fail in master because of our check in engine for leaked dataIds. We didn't happen to have a max test with a large enough dimension to trigger multiple reductions.

To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.


This change is Reviewable

@annxingyuan annxingyuan self-assigned this Sep 2, 2020
@tafsiri
Copy link
Contributor

tafsiri commented Sep 3, 2020

@annxingyuan before reviewing this in detail would you be able to outline in the PR description where the mem leak is in the existing code and any thoughts on why the mem leak checker doesn't catch this?

@annxingyuan
Copy link
Contributor Author

@tafsiri yes - I added some notes in the description - let me know if anything is unclear!

Copy link
Contributor

@tafsiri tafsiri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great description and fix! Left a few suggested changes. But LGTM! Thanks for digging into this.

Reviewed 5 of 6 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @annxingyuan, @lina128, and @tafsiri)


tfjs-backend-webgl/src/kernel_utils/reduce.ts, line 25 at r2 (raw file):

type ReduceTypes = 'all'|'any'|'max'|'min'|'sum'|'prod';

function getReductionSizes(inShape: number[]):

Would you be able to add a comment for this function? It would add some flavour to the notion of multiple reductions being needed to get the overall reduction done effectively.


tfjs-backend-webgl/src/kernels/Max_test.ts, line 23 at r2 (raw file):

describeWithFlags('Max', ALL_ENVS, () => {
  it('does not have memory leak.', async () => {

Could you name this something like "does not have memory leak when calling reduce multiple times" to help future us remember why this test is here in addition to the general test in core.


tfjs-core/src/ops/max_test.ts, line 36 at r2 (raw file):

  it('with a large dimension', async () => {
    const aData = new Float32Array(100);

would making this a bit bigger (say 100x100) (e.g. tf.ones([100, 100])) be better for catching problems in implementations that may have larger effective window sizes

Copy link
Contributor Author

@annxingyuan annxingyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @lina128 and @tafsiri)


tfjs-backend-webgl/src/kernel_utils/reduce.ts, line 25 at r2 (raw file):

Previously, tafsiri (Yannick Assogba) wrote…

Would you be able to add a comment for this function? It would add some flavour to the notion of multiple reductions being needed to get the overall reduction done effectively.

Done

I renamed the function - hopefully now it's more clear what the purpose is.


tfjs-backend-webgl/src/kernels/Max_test.ts, line 23 at r2 (raw file):

Previously, tafsiri (Yannick Assogba) wrote…

Could you name this something like "does not have memory leak when calling reduce multiple times" to help future us remember why this test is here in addition to the general test in core.

Done


tfjs-core/src/ops/max_test.ts, line 36 at r2 (raw file):

Previously, tafsiri (Yannick Assogba) wrote…

would making this a bit bigger (say 100x100) (e.g. tf.ones([100, 100])) be better for catching problems in implementations that may have larger effective window sizes

Done

Copy link
Collaborator

@lina128 lina128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @annxingyuan, @lina128, and @tafsiri)


tfjs-backend-webgl/src/kernel_utils/reduce.ts, line 38 at r3 (raw file):

  }

  return reduce(output, dtype, reductionType, backend);

Hi Ann, I wonder instead of changing to iterative approach, can we do this in recursive approach, something like:

const result = reduce(output, dtype, reductionType, backend);
backend.disposeData(output.dataId);
return result;

Copy link
Contributor Author

@annxingyuan annxingyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @lina128 and @tafsiri)


tfjs-backend-webgl/src/kernel_utils/reduce.ts, line 38 at r3 (raw file):

Previously, lina128 (Na Li) wrote…

Hi Ann, I wonder instead of changing to iterative approach, can we do this in recursive approach, something like:

const result = reduce(output, dtype, reductionType, backend);
backend.disposeData(output.dataId);
return result;

Hi Na, I think that could result in disposing the input tensor? Also the iterative approach aligns with other kernels such as cumsum.

Copy link
Contributor Author

@annxingyuan annxingyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @lina128 and @tafsiri)


tfjs-backend-webgl/src/kernel_utils/reduce.ts, line 38 at r3 (raw file):

Previously, annxingyuan (Ann Yuan) wrote…

Hi Na, I think that could result in disposing the input tensor? Also the iterative approach aligns with other kernels such as cumsum.

Discussed offline - this approach does not dispose the input tensor, but the iterative approach aligns with other cumulation kernels in WebGL.

@annxingyuan annxingyuan merged commit 6d39fab into master Sep 3, 2020
@annxingyuan annxingyuan deleted the memleak branch September 3, 2020 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory leak issues in tensor.max()

4 participants