Skip to content

Conversation

@wconstab
Copy link
Contributor

Summary:
Hash(c10::Scalar) made a bad assumption that it was valid to just hash over all the bytes of data of the c10::Scalar struct.

Becuase c10::Scalar stores a union of different (float/int/complex) types with different sizes, not all bytes are valid in all cases. Hash() should only read the bytes corresponding to the currently active type.

Test Plan: Added new unit tests. Verified HashTest.Scalar failed with the original Hash() impl and then fixed.

Differential Revision: D32367564

@pytorch-probot
Copy link

pytorch-probot bot commented Nov 11, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/wconstab/pytorch/blob/c77d02a0087cea29ed266973852de8d4225a376c/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-dynamic ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
docker-builds ciflow/all 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
linux-xenial-py3-clang5-mobile-code-analysis ciflow/all, ciflow/linux, ciflow/mobile 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos 🚫 skipped
macos-10-15-py3-x86-64 ciflow/all, ciflow/macos 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Nov 11, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit c77d02a (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D32367564

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do static_cast<uint8_t*> here? Can you explain the memory layout here a bit such that it's easier for me to understand why writing to "offset 0" makes sense? Is there any way we can tell the differences from a to b such that we can also verify that this operation succeeds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't any way to verify it without violating the API of the Scalar. I didn't know the memory layout either, except by experimenting. I found it's sizeof() was 32 bytes, which surprised me. I don't know exactly what memory i'm stepping on here, but, for the purposes of this experiment it's not important as long as it isn't memory used by the Long member field, which it isn't.

I did verify it though, by observing that the below EXPECTs all failed when I clobber this memory without fixing the hash function.

I don't see the point of changing the cast. I don't benefit from any compiler safety checks here, in fact, i'm doing something 'unsafe' on purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for you, I decided to try this :)
But it doesn't appear to work:
Static_cast from 'c10::Scalar *' to 'uint8_t *' (aka 'unsigned char *') is not allowed

Anyway, regular cast is fine.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, my bad. It should be reinterpret_cast for your use case:
*(reinterpret_cast<uint8_t*>(&b)) = 1;.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a little research. It seems non-trivial to find any resources to answer me the memory layout question. So, it's fine to keep the comment intact.

Summary:
Pull Request resolved: pytorch#68201

Hash(c10::Scalar) made a bad assumption that it was valid to just hash over all the bytes of data of the c10::Scalar struct.

Becuase c10::Scalar stores a union of different (float/int/complex) types with different sizes, not all bytes are valid in all cases.  Hash() should only read the bytes corresponding to the currently active type.

Test Plan: Added new unit tests.  Verified HashTest.Scalar failed with the original Hash() impl and then fixed.

Differential Revision: D32367564

fbshipit-source-id: be1dbc4932890ebb70898184483b6776b068c6b0
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D32367564

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D32367564

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in dc24503.

desertfire pushed a commit that referenced this pull request Nov 15, 2021
Summary:
Pull Request resolved: #68201

Hash(c10::Scalar) made a bad assumption that it was valid to just hash over all the bytes of data of the c10::Scalar struct.

Becuase c10::Scalar stores a union of different (float/int/complex) types with different sizes, not all bytes are valid in all cases.  Hash() should only read the bytes corresponding to the currently active type.

Test Plan: Added new unit tests.  Verified HashTest.Scalar failed with the original Hash() impl and then fixed.

Reviewed By: alanwaketan

Differential Revision: D32367564

fbshipit-source-id: ac30dd4f6dd0513954986d3d23c0c11ba802c37b
desertfire pushed a commit that referenced this pull request Nov 15, 2021
Summary:
Pull Request resolved: #68201

Hash(c10::Scalar) made a bad assumption that it was valid to just hash over all the bytes of data of the c10::Scalar struct.

Becuase c10::Scalar stores a union of different (float/int/complex) types with different sizes, not all bytes are valid in all cases.  Hash() should only read the bytes corresponding to the currently active type.

Test Plan: Added new unit tests.  Verified HashTest.Scalar failed with the original Hash() impl and then fixed.

Reviewed By: alanwaketan

Differential Revision: D32367564

fbshipit-source-id: ac30dd4f6dd0513954986d3d23c0c11ba802c37b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants