
[ROCm] bfloat16 enablement #1560

Closed · wants to merge 7 commits

Conversation

liligwu (Contributor) commented Jan 25, 2023

Enable bfloat16 on ROCm

netlify bot commented Jan 25, 2023

Deploy Preview for pytorch-fbgemm-docs canceled.

Latest commit: 2cde382
Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/63e1837a882c1a0008455cf0

amathews-amd

cc: @jianyuh @shintaro-iwasaki

q10 (Contributor) commented Feb 1, 2023

LGTM; importing to Phabricator

facebook-github-bot (Contributor)

@q10 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

q10 (Contributor) commented Feb 6, 2023

@liligwu Sorry to bug you one more time. Could I trouble you to do two things to keep our build system happy:

  1. Rebase this PR branch on top of the latest main
  2. Apply approximately the following diff to keep the internal linter happy:
@@ -1398,12 +1398,13 @@
 };

 #ifdef __HIP_PLATFORM_HCC__
-using __nv_bfloat16=hip_bfloat16;
+using __nv_bfloat16 = hip_bfloat16;

 typedef struct __align__(4) {
   uint16_t x;
   uint16_t y;
-} __nv_bfloat162_raw;
+}
+__nv_bfloat162_raw;

 struct __align__(4) __nv_bfloat162 {
   __nv_bfloat16 x;
@@ -1479,12 +1480,12 @@
 #ifdef __HIP_PLATFORM_HCC__
 // the descriptions of __float2bfloat16 and __float2bfloat16_rn are identical
 // https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____BFLOAT16__MISC.html#group__CUDA__MATH____BFLOAT16__MISC
-static __host__ __device__ __nv_bfloat16 __float2bfloat16(float f){
+static __host__ __device__ __nv_bfloat16 __float2bfloat16(float f) {
   __nv_bfloat16 output;
   return output.round_to_bfloat16(f);
 }

-static __host__ __device__ __nv_bfloat16 __float2bfloat16_rn(float f){
+static __host__ __device__ __nv_bfloat16 __float2bfloat16_rn(float f) {
   __nv_bfloat16 output;
   return output.round_to_bfloat16(f);
 }

After this, we should be good to merge. Thanks for your patience!
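
For context, the shim being reformatted above exists so that kernel code written against the CUDA bfloat16 conversion API can compile unchanged under ROCm. Below is a minimal, hypothetical usage sketch; it is not taken from this PR, and the kernel name and parameters are illustrative only.

// A minimal usage sketch (not part of this PR), assuming the shim above is in
// scope on the ROCm path; `scale_to_bf16_kernel` and `scale` are made-up names.
#ifdef __HIP_PLATFORM_HCC__
#include <hip/hip_runtime.h>
#include <hip/hip_bfloat16.h>  // provides hip_bfloat16, aliased to __nv_bfloat16 by the shim
#else
#include <cuda_bf16.h>         // provides __nv_bfloat16 and __float2bfloat16 natively
#endif

// Scales each float element and converts it to bfloat16. The same source
// compiles on both CUDA and ROCm because __float2bfloat16 resolves either to
// the native CUDA intrinsic or to the hip_bfloat16-based shim shown above.
__global__ void scale_to_bf16_kernel(
    const float* __restrict__ in,
    __nv_bfloat16* __restrict__ out,
    float scale,
    int n) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = __float2bfloat16(in[idx] * scale);
  }
}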

liligwu (Contributor, Author) commented Feb 6, 2023

@q10, it's done.

liligwu (Contributor, Author) commented Feb 7, 2023

@q10, @shintaro-iwasaki. Could you please close this PR if everything looks good? Thank you.

facebook-github-bot (Contributor)

@q10 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

q10 (Contributor) commented Feb 7, 2023

> @q10, @shintaro-iwasaki. Could you please close this PR if everything looks good? Thank you.

Yep, was making sure the GitHub builds were passing.

liligwu (Contributor, Author) commented Feb 7, 2023

> > @q10, @shintaro-iwasaki. Could you please close this PR if everything looks good? Thank you.
>
> Yep, was making sure the GitHub builds were passing.

Regarding the ROCm CI failure:

2023-02-07T19:53:19.5087315Z test_gpu (__main__.TableBatchedEmbeddingsTest) ... Memory access fault by GPU node-2 (Agent handle: 0x55e30ef37ad0) on address 0x7f38a5a82000. Reason: Page not present or supervisor privilege.
2023-02-07T19:53:19.7516263Z /workspace/fbgemm-private-jenkins/.jenkins/rocm/build_and_test.sh: line 49: 19939 Aborted (core dumped) python batched_unary_embeddings_test.py --verbose
2023-02-07T19:53:20.3188419Z ##[error]Process completed with exit code 134.

I opened issue #1559.

I don't know whether it was observed in CI before, but it also failed in a CUDA container that I tested.

q10 (Contributor) commented Feb 7, 2023

> > > @q10, @shintaro-iwasaki. Could you please close this PR if everything looks good? Thank you.
> >
> > Yep, was making sure the GitHub builds were passing.
>
> Regarding the ROCm CI failure:
>
> 2023-02-07T19:53:19.5087315Z test_gpu (__main__.TableBatchedEmbeddingsTest) ... Memory access fault by GPU node-2 (Agent handle: 0x55e30ef37ad0) on address 0x7f38a5a82000. Reason: Page not present or supervisor privilege.
> 2023-02-07T19:53:19.7516263Z /workspace/fbgemm-private-jenkins/.jenkins/rocm/build_and_test.sh: line 49: 19939 Aborted (core dumped) python batched_unary_embeddings_test.py --verbose
> 2023-02-07T19:53:20.3188419Z ##[error]Process completed with exit code 134.
>
> I opened issue #1559.
>
> I don't know whether it was observed in CI before, but it also failed in a CUDA container that I tested.

Thanks for filing the issue. The failure has been observed in CI, but it is flaky: one of the earlier builds on this PR passed prior to your last lint commit. I suspect it is an issue with the instance the job was running on. We will investigate further after this PR is merged.

facebook-github-bot (Contributor)

@q10 merged this pull request in abefe30.
