
[ROCm] bfloat16 enablement #1560

Closed · wants to merge 7 commits

Conversation

liligwu (Contributor) commented Jan 25, 2023

Enable bfloat16 on ROCm

netlify bot commented Jan 25, 2023

Deploy Preview for pytorch-fbgemm-docs canceled.

Latest commit: 2cde382
Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/63e1837a882c1a0008455cf0

amathews-amd

cc: @jianyuh @shintaro-iwasaki

q10 (Contributor) commented Feb 1, 2023

LGTM; importing to Phabricator

facebook-github-bot (Contributor)

@q10 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

q10 (Contributor) commented Feb 6, 2023

@liligwu Sorry to bug you one more time. Could I trouble you to do two things to keep our build system happy:

  1. Rebase this PR branch on top of the latest main
  2. Apply approximately the following diff to keep the internal linter happy:
@@ -1398,12 +1398,13 @@
 };

 #ifdef __HIP_PLATFORM_HCC__
-using __nv_bfloat16=hip_bfloat16;
+using __nv_bfloat16 = hip_bfloat16;

 typedef struct __align__(4) {
   uint16_t x;
   uint16_t y;
-} __nv_bfloat162_raw;
+}
+__nv_bfloat162_raw;

 struct __align__(4) __nv_bfloat162 {
   __nv_bfloat16 x;
@@ -1479,12 +1480,12 @@
 #ifdef __HIP_PLATFORM_HCC__
 // the descriptions of __float2bfloat16 and __float2bfloat16_rn are identical
 // https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____BFLOAT16__MISC.html#group__CUDA__MATH____BFLOAT16__MISC
-static __host__ __device__ __nv_bfloat16 __float2bfloat16(float f){
+static __host__ __device__ __nv_bfloat16 __float2bfloat16(float f) {
   __nv_bfloat16 output;
   return output.round_to_bfloat16(f);
 }

-static __host__ __device__ __nv_bfloat16 __float2bfloat16_rn(float f){
+static __host__ __device__ __nv_bfloat16 __float2bfloat16_rn(float f) {
   __nv_bfloat16 output;
   return output.round_to_bfloat16(f);
 }

After this, we should be good to merge. Thanks for your patience!
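
For context, the shim being reformatted above exists so that kernel code written against the CUDA bfloat16 conversion API can compile unchanged under ROCm. Below is a minimal, hypothetical usage sketch; it is not taken from this PR, and the kernel name and parameters are illustrative only.

// A minimal usage sketch (not part of this PR), assuming the shim above is in
// scope on the ROCm path; `scale_to_bf16_kernel` and `scale` are made-up names.
#ifdef __HIP_PLATFORM_HCC__
#include <hip/hip_runtime.h>
#include <hip/hip_bfloat16.h>  // provides hip_bfloat16, aliased to __nv_bfloat16 by the shim
#else
#include <cuda_bf16.h>         // provides __nv_bfloat16 and __float2bfloat16 natively
#endif

// Scales each float element and converts it to bfloat16. The same source
// compiles on both CUDA and ROCm because __float2bfloat16 resolves either to
// the native CUDA intrinsic or to the hip_bfloat16-based shim shown above.
__global__ void scale_to_bf16_kernel(
    const float* __restrict__ in,
    __nv_bfloat16* __restrict__ out,
    float scale,
    int n) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = __float2bfloat16(in[idx] * scale);
  }
}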

liligwu (Contributor, Author) commented Feb 6, 2023

@q10, it's done.

liligwu (Contributor, Author) commented Feb 7, 2023

@q10, @shintaro-iwasaki. Could you please close this PR if everything looks good? Thank you.

facebook-github-bot (Contributor)

@q10 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

q10 (Contributor) commented Feb 7, 2023

> @q10, @shintaro-iwasaki. Could you please close this PR if everything looks good? Thank you.

Yep, was making sure the GitHub builds were passing.

liligwu (Contributor, Author) commented Feb 7, 2023

> > @q10, @shintaro-iwasaki. Could you please close this PR if everything looks good? Thank you.
>
> Yep, was making sure the GitHub builds were passing.

Regarding the ROCm CI failure:

2023-02-07T19:53:19.5087315Z test_gpu (__main__.TableBatchedEmbeddingsTest) ... Memory access fault by GPU node-2 (Agent handle: 0x55e30ef37ad0) on address 0x7f38a5a82000. Reason: Page not present or supervisor privilege.
2023-02-07T19:53:19.7516263Z /workspace/fbgemm-private-jenkins/.jenkins/rocm/build_and_test.sh: line 49: 19939 Aborted (core dumped) python batched_unary_embeddings_test.py --verbose
2023-02-07T19:53:20.3188419Z ##[error]Process completed with exit code 134.

I opened issue #1559.

I don't know whether it was observed in CI before, but it also failed in a CUDA container that I tested.

q10 (Contributor) commented Feb 7, 2023

> > > @q10, @shintaro-iwasaki. Could you please close this PR if everything looks good? Thank you.
> >
> > Yep, was making sure the GitHub builds were passing.
>
> Regarding the ROCm CI failure:
>
> 2023-02-07T19:53:19.5087315Z test_gpu (__main__.TableBatchedEmbeddingsTest) ... Memory access fault by GPU node-2 (Agent handle: 0x55e30ef37ad0) on address 0x7f38a5a82000. Reason: Page not present or supervisor privilege.
> 2023-02-07T19:53:19.7516263Z /workspace/fbgemm-private-jenkins/.jenkins/rocm/build_and_test.sh: line 49: 19939 Aborted (core dumped) python batched_unary_embeddings_test.py --verbose
> 2023-02-07T19:53:20.3188419Z ##[error]Process completed with exit code 134.
>
> I opened issue #1559.
>
> I don't know whether it was observed in CI before, but it also failed in a CUDA container that I tested.

Thanks for filing the issue. The failure has been observed in CI, but it is flaky: one of the earlier builds on this PR passed prior to your last lint commit. I suspect it is an issue with the instance the job was running on. We will investigate further after this PR is merged.

facebook-github-bot (Contributor)

@q10 merged this pull request in abefe30.
