@d1jang d1jang commented Sep 23, 2021

Summary:
This change fixes a bug in which Static Runtime's aten::embedding_bag out variant implementation creates aliases among its managed output tensors.

Managed output tensors must never alias each other, because writing to one can silently overwrite another's contents. This exact problem caused the bug in T97393697, where SR returned wrong values.

The bug is detected in inline_cvr/remote_ro by a DCHECK, verify_no_memory_overlap (introduced by D30211705 (3fb33b3)), but it went unnoticed until now because our testing did not run the model in debug mode. Fortunately, the bug does not affect production, since the aliased outputs are not used there.
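
To illustrate the kind of condition such an overlap check guards against, here is a minimal sketch. It is not the Static Runtime implementation of verify_no_memory_overlap: it assumes each output tensor can be modeled as a hypothetical (start, nbytes) byte range inside one shared arena, and the function names are made up for this example.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical model: each output tensor is a (start_offset, nbytes) byte range
# inside one shared memory arena. The real DCHECK inspects actual tensors.
ByteRange = Tuple[int, int]


def ranges_overlap(a: ByteRange, b: ByteRange) -> bool:
    """Return True if two non-empty byte ranges share at least one byte."""
    a_start, a_len = a
    b_start, b_len = b
    if a_len == 0 or b_len == 0:
        return False
    return a_start < b_start + b_len and b_start < a_start + a_len


def no_memory_overlap(ranges: List[ByteRange]) -> bool:
    """Return True only if no pair of ranges overlaps."""
    return not any(ranges_overlap(a, b) for a, b in combinations(ranges, 2))
```

Under this model, two outputs that alias the same storage have identical ranges, so the check fails for them, while adjacent ranges pass.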

This change fixes the root cause in _embedding_bag_cpu_impl_out by replacing alias creation with copying.
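
The difference between aliasing and copying can be shown with a toy example. This is only a sketch of the general pattern, not the code in _embedding_bag_cpu_impl_out: the two op functions and the overwrite helper are hypothetical, and plain bytearray buffers stand in for tensors.

```python
def op_with_aliased_outputs(values):
    """Hypothetical op showing the buggy pattern: out1 aliases out0."""
    out0 = bytearray(values)
    out1 = out0  # alias: both names refer to the same buffer
    return out0, out1


def op_with_copied_outputs(values):
    """Hypothetical op showing the fixed pattern: out1 is an independent copy."""
    out0 = bytearray(values)
    out1 = bytearray(out0)  # copy: same contents, separate buffer
    return out0, out1


def overwrite_in_place(buf, values):
    """Stand-in for a runtime reusing a managed buffer for new data."""
    buf[:] = bytes(values)


# With aliasing, writing to one output clobbers the other.
a0, a1 = op_with_aliased_outputs([1, 2, 3])
overwrite_in_place(a0, [9, 9, 9])
aliased_result = list(a1)

# With copying, the second output keeps its own contents.
c0, c1 = op_with_copied_outputs([1, 2, 3])
overwrite_in_place(c0, [9, 9, 9])
copied_result = list(c1)
```

Copying costs an extra buffer and a memcpy-like step, but it keeps each managed output independently writable.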

Note that this change also includes a fundamental change to Static Runtime's unit testing: testStaticRuntime now exercises the given graph 3 times:

  1. profile run
  2. run using the profile to allocate managed tensors
  3. reuse the managed tensors -- newly added

Adding step 3 reveals this bug through the new unit test EmbeddingBagWithManagedOutput.
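
The three-run structure can be sketched as follows. This is a toy model under stated assumptions, not the real testStaticRuntime helper: ToyRuntime, check_three_runs, and the single list-valued output are all hypothetical. Each run is compared against a reference result, so a bug that only appears when managed buffers are reused shows up on the third run.

```python
from typing import Callable, List, Optional


class ToyRuntime:
    """Hypothetical stand-in for a runtime with one managed output buffer."""

    def __init__(self, fn: Callable[[List[int]], List[int]]):
        self.fn = fn
        self.profiled_size: Optional[int] = None  # learned in the profile run
        self.managed: Optional[List[int]] = None  # allocated in run 2, reused later
        self.phases: List[str] = []

    def run(self, inputs: List[int]) -> List[int]:
        result = self.fn(inputs)
        if self.profiled_size is None:
            # Run 1: only record how large the output is.
            self.profiled_size = len(result)
            self.phases.append("profile")
            return list(result)
        if self.managed is None:
            # Run 2: allocate the managed buffer from the profile.
            self.managed = [0] * self.profiled_size
            self.phases.append("allocate")
        else:
            # Run 3 and later: write into the existing buffer.
            self.phases.append("reuse")
        self.managed[:] = result
        return list(self.managed)


def check_three_runs(fn, inputs):
    """Run the graph three times and compare each result with the reference."""
    runtime = ToyRuntime(fn)
    expected = fn(inputs)
    for _ in range(3):
        assert runtime.run(inputs) == expected
    return runtime.phases
```

For example, `check_three_runs(lambda xs: [2 * x for x in xs], [1, 2, 3])` walks through the profile, allocate, and reuse phases in that order.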

Test Plan:

  • Confirmed that the crash experienced by StaticRuntime.EmbeddingBagWithManagedOutput disappears with this change (crash paste: P459807248).

  • Added StaticRuntime.EmbeddingBagWithManagedOutput to detect the same problem in the future.

Differential Revision: D31104345

facebook-github-bot commented Sep 23, 2021

💊 CI failures summary and remediations

As of commit 7b873d3 (more details on the Dr. CI page):



🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-xenial-py3.6-clang7-asan / test (default, 1, 2, linux.2xlarge) (1/3)

2021-10-01T00:19:56.6543012Z     #9 0x55dfc6d978f2 in PyEval_EvalCode /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:731
2021-10-01T00:19:56.6543814Z     #10 0x55dfc6dffcd5 in run_mod /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/pythonrun.c:1025
2021-10-01T00:19:56.6544662Z     #11 0x55dfc6e01d5d in PyRun_StringFlags /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/pythonrun.c:949
2021-10-01T00:19:56.6545605Z     #12 0x55dfc6e01dbb in PyRun_SimpleStringFlags /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/pythonrun.c:445
2021-10-01T00:19:56.6546469Z     #13 0x55dfc6e02926 in run_command /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Modules/main.c:301
2021-10-01T00:19:56.6547228Z     #14 0x55dfc6e02926 in Py_Main /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Modules/main.c:749
2021-10-01T00:19:56.6548205Z     #15 0x55dfc6d3c196 in main /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Programs/python.c:69
2021-10-01T00:19:56.7116667Z     #16 0x7f6bab16383f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
2021-10-01T00:19:56.7117486Z     #17 0x55dfc6dcc33d in _start (/opt/conda/bin/python3.6+0x1a733d)
2021-10-01T00:19:56.7117853Z 
2021-10-01T00:19:56.7118750Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in 
2021-10-01T00:19:56.7381034Z + retcode=1
2021-10-01T00:19:56.7382103Z + set -e
2021-10-01T00:19:56.7382627Z + return 1
2021-10-01T00:19:56.7384483Z + [[ linux-xenial-py3.6-clang7-asan-default == *-NO_AVX-* ]]
2021-10-01T00:19:56.7385730Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X ]]
2021-10-01T00:19:56.7387243Z + [[ linux-xenial-py3.6-clang7-asan-default == *-NO_AVX2-* ]]
2021-10-01T00:19:56.7388849Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]]
2021-10-01T00:19:56.7390358Z + [[ linux-xenial-py3.6-clang7-asan-default == *-NO_AVX512-* ]]
2021-10-01T00:19:56.7391676Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\5\1\2 ]]
2021-10-01T00:19:56.7392342Z ++ mktemp

See GitHub Actions build linux-xenial-cuda11.3-py3.6-gcc7 / build (2/3)

2021-09-30T23:14:57.7185300Z fi
2021-09-30T23:14:57.7185722Z # Covers the case where a previous tag doesn't exist for the tree
2021-09-30T23:14:57.7186397Z # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly
2021-09-30T23:14:57.7187032Z if ! git rev-parse "$MERGE_BASE:.circleci/docker"; then
2021-09-30T23:14:57.7187716Z   echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit"
2021-09-30T23:14:57.7188268Z   exit 1
2021-09-30T23:14:57.7188572Z fi
2021-09-30T23:14:57.7189092Z PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker")
2021-09-30T23:14:57.7189728Z # If no image exists but the hash is the same as the previous hash then we should error out here
2021-09-30T23:14:57.7190319Z if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
2021-09-30T23:14:57.7190965Z   echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
2021-09-30T23:14:57.7191650Z   echo "       contact the PyTorch team to restore the original images"
2021-09-30T23:14:57.7192079Z   exit 1
2021-09-30T23:14:57.7192331Z fi
2021-09-30T23:14:57.7192679Z echo ::set-output name=rebuild::yes
2021-09-30T23:14:57.7202827Z shell: /usr/bin/bash -e {0}
2021-09-30T23:14:57.7203135Z env:
2021-09-30T23:14:57.7203634Z   BUILD_ENVIRONMENT: linux-xenial-cuda11.3-py3.6-gcc7
2021-09-30T23:14:57.7204776Z   DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7
2021-09-30T23:14:57.7205908Z   SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
2021-09-30T23:14:57.7206760Z   XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / build (3/3)

2021-09-30T23:13:12.7450441Z fi
2021-09-30T23:13:12.7450883Z # Covers the case where a previous tag doesn't exist for the tree
2021-09-30T23:13:12.7451569Z # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly
2021-09-30T23:13:12.7452227Z if ! git rev-parse "$MERGE_BASE:.circleci/docker"; then
2021-09-30T23:13:12.7452911Z   echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit"
2021-09-30T23:13:12.7453470Z   exit 1
2021-09-30T23:13:12.7453727Z fi
2021-09-30T23:13:12.7454167Z PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker")
2021-09-30T23:13:12.7454825Z # If no image exists but the hash is the same as the previous hash then we should error out here
2021-09-30T23:13:12.7455400Z if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
2021-09-30T23:13:12.7456049Z   echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
2021-09-30T23:13:12.7456765Z   echo "       contact the PyTorch team to restore the original images"
2021-09-30T23:13:12.7457186Z   exit 1
2021-09-30T23:13:12.7457454Z fi
2021-09-30T23:13:12.7457799Z echo ::set-output name=rebuild::yes
2021-09-30T23:13:12.7468144Z shell: /usr/bin/bash -e {0}
2021-09-30T23:13:12.7468445Z env:
2021-09-30T23:13:12.7468896Z   BUILD_ENVIRONMENT: linux-xenial-py3.6-gcc5.4
2021-09-30T23:13:12.7469828Z   DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4
2021-09-30T23:13:12.7470833Z   SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
2021-09-30T23:13:12.7471724Z   XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI.


@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D31104345

@facebook-github-bot added the oncall: jit and cla signed labels Sep 23, 2021
@d1jang force-pushed the export-D31104345 branch 2 times, most recently from 5c5baa2 to 107df03, September 30, 2021 06:04
…g_bag (pytorch#65516)

Summary:
Pull Request resolved: pytorch#65516

This change fixes a bug in which Static Runtime's `aten::embedding_bag` out variant implementation creates aliases among its managed output tensors.

Managed output tensors must never alias each other, because writing to one can silently overwrite another's contents. This exact problem caused the bug in T97393697, where SR returned wrong values.

The bug is detected in inline_cvr/remote_ro by a DCHECK, `verify_no_memory_overlap` (introduced by D30211705 (pytorch@3fb33b3)), but it went unnoticed until now because our testing did not run the model in debug mode. Fortunately, the bug does not affect production, since the aliased outputs are not used there.

This change fixes the root cause in `_embedding_bag_cpu_impl_out` by replacing alias creation with copying.

Note that this change also includes a fundamental change to Static Runtime's unit testing: `testStaticRuntime` now exercises the given graph 3 times:
 1. profile run
 2. run using the profile to allocate managed tensors
 3. reuse the managed tensors -- newly added

Adding step 3 reveals this bug through the new unit test `EmbeddingBagWithManagedOutput`.

Test Plan:
- Confirmed that the crash experienced by `StaticRuntime.EmbeddingBagWithManagedOutput` disappears with this change (crash paste: P459807248).

- Added `StaticRuntime.EmbeddingBagWithManagedOutput` to detect the same problem in the future.

Reviewed By: hlu1

Differential Revision: D31104345

fbshipit-source-id: 89eee43111d4979b07a0d9bdd5049664a7d6774b
