
Conversation

shintaro-iwasaki
Contributor

Summary: This patch replaces CUDA-specific `__shfl_sync` used in D35758762 (pytorch@dfb36cd) with `shfl_sync`, which is a wrapper that supports both NVIDIA and AMD GPUs (like D33231489 (pytorch@c6df576)).
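For reference, a minimal sketch of what such a wrapper can look like, assuming HIP's `__HIP_PLATFORM_*` macros and the usual warp/wavefront sizes (the actual fbgemm_gpu helper may differ in details):

```cpp
// Hedged sketch only; the real fbgemm_gpu utility may differ.
#if defined(__HIP_PLATFORM_HCC__) || defined(__HIP_PLATFORM_AMD__)
static constexpr int kWarpSize = 64;  // AMD wavefront size
#else
static constexpr int kWarpSize = 32;  // NVIDIA warp size
#endif

template <typename T>
__device__ __forceinline__ T shfl_sync(
    const T val,
    int src_lane,
    int width = kWarpSize,
    unsigned mask = 0xffffffffu) {
#if defined(__HIP_PLATFORM_HCC__) || defined(__HIP_PLATFORM_AMD__)
  // HIP has no *_sync shuffle variants; plain __shfl covers the wavefront.
  return __shfl(val, src_lane, width);
#else
  return __shfl_sync(mask, val, src_lane, width);
#endif
}
```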

Differential Revision: D35980472

fbshipit-source-id: ae76d3c6303ddcfd345fdbb16cc9c69a5860a1f2
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D35980472

liligwu added a commit to ROCm/FBGEMM that referenced this pull request May 2, 2022
* Make WeightDecayMode consistent (pytorch#1063)

Summary:
Pull Request resolved: pytorch#1063

Currently in FE we define `L2=1` and `DECOUPLE=2` but in FBGEMM we use `L2=0` and `DECOUPLE=1` (https://fburl.com/code/65u4a608). While this is functionally OK since the interface converts between them, it can introduce unnecessary confusion about the numbering. Here we make them consistent across FE/BE by using `L2=1` and `DECOUPLE=2` for both.
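A hedged illustration of the intended end state (names are illustrative; the frontend definition is a Python `IntEnum` and the backend receives an integer `weight_decay_mode`):

```cpp
// Hedged sketch only: after this change FE and BE agree on the same numbering.
enum class WeightDecayMode : int {
  NONE = 0,      // no weight decay
  L2 = 1,        // L2 regularization
  DECOUPLE = 2,  // decoupled weight decay
};
```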

Differential Revision: D35763365

fbshipit-source-id: c61041f38844b02fdecac0fb1182a3184711d3bd

* Add default values for func args in FBGEMM codegen (pytorch#1066)

Summary:
Pull Request resolved: pytorch#1066

We mandate default values for float/int function args (usually hyper-parameters for optimizers) when generating FBGEMM code using codegen. This makes maintaining backward compatibility easier, as we can add more parameters without breaking existing callers.

Note: developers need to be cautious when adding new args with default values; the behavior with the default values should remain unchanged. If no default value is provided for a float/int parameter, it is set to 0.0/0 by default.
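A hedged illustration of the pattern (argument names are made up, not the actual generated signature): new float/int optimizer arguments are appended with defaults so existing call sites keep working.

```cpp
#include <cstdint>

// Hedged sketch only; not the real codegen output.
// Callers that predate weight_decay / weight_decay_mode still compile,
// because the new trailing arguments have defaults (0.0 / 0 when none is given).
void rowwise_adagrad_update(
    double learning_rate,
    double eps,
    double weight_decay = 0.0,
    std::int64_t weight_decay_mode = 0) {
  // ... optimizer update elided; silence unused-parameter warnings ...
  (void)learning_rate; (void)eps; (void)weight_decay; (void)weight_decay_mode;
}

int main() {
  rowwise_adagrad_update(0.01, 1e-8);          // old-style call, uses defaults
  rowwise_adagrad_update(0.01, 1e-8, 0.1, 2);  // new call with explicit values
  return 0;
}
```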

Reviewed By: jianyuh

Differential Revision: D35795294

fbshipit-source-id: 2632e1452c164d2ae7f999e9b17033ea77fe3864

* Enabling cuda (#25)

* Aligning merge_pooled_embeddings_test.py with upstream and enabling CUDA.

* Disabling use_cpu in split_table_batched_embeddings_test since it's still unstable.

Co-authored-by: root <root@ixt-rack-61.local.lan>

* enable merge_pooled_embeddings in oss (pytorch#1064)

Summary:
Pull Request resolved: pytorch#1064

For OSS inference we need to build fbgemm from source, and we need the `merge_pooled_embeddings` operator.

This is not available in fbgemm oss because of this: https://www.internalfb.com/diff/D30037992 (pytorch@41ab9713cb1c083414bd9759ebb95d47609101b7)?dst_version_fbid=1066324687448445&transaction_fbid=198310519085547, a dependency on nvml.h.

However, NVML is generally present on systems with CUDA, and its stub library can be located at `${CUDA_TOOLKIT_ROOT_DIR}/lib64/stubs/libnvidia-ml.so`, as detailed here: https://tianyuliukingcrimson.wordpress.com/2018/07/23/findnvml-cmake-done-correctly-how-to-have-cmake-find-nvidia-management-library-nvml-on-windows-and-linux/.

**However**, sometimes systems don't have it preinstalled with CUDA for whatever reason, in which case you can get it by installing cudatoolkit-dev:

`conda install -c conda-forge cudatoolkit-dev` (as I had to for my system)

This changes the path where `libnvidia-ml.so` lives, so we add an option, `nvml_lib_path`, that lets people specify where this library is located.

post: https://fb.workplace.com/groups/2126278550786248/posts/5357069087707162

Reviewed By: jspark1105

Differential Revision: D35785768

fbshipit-source-id: a2cb10fb54d5d97cbb6ecadfbbcb0c37bce7043b

* Add GLIBCXX_USE_CXX11_ABI compile option (pytorch#1073)

Summary: Pull Request resolved: pytorch#1073

Reviewed By: s4ayub

Differential Revision: D35682606

fbshipit-source-id: 58c78ec52a9b5caebbded97f836e658c59fb0d51

* Add even division checker for offsets in boundary checker (pytorch#1071)

Summary:
Pull Request resolved: pytorch#1071

As title. This might help detect and diagnose the issues in s268163.

Enforce the following checks (a hedged sketch follows the list):
1. The size of offsets needs to be exactly B * T + 1.
2. The last element of offsets should equal indices.numel().
3. The max pooling size should be less than or equal to the indices weight size.
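A hedged host-side sketch of the first two checks, assuming the usual ATen types (the actual bounds-check code differs):

```cpp
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

// Hedged sketch only. B = batch size, T = number of tables.
void check_offsets_shape(
    const at::Tensor& offsets,
    const at::Tensor& indices,
    int64_t B,
    int64_t T) {
  TORCH_CHECK(
      offsets.numel() == B * T + 1,
      "offsets.numel() = ", offsets.numel(),
      " is not equal to B * T + 1 = ", B * T + 1);
  TORCH_CHECK(
      offsets[offsets.numel() - 1].item<int64_t>() == indices.numel(),
      "the last element of offsets must equal indices.numel() = ",
      indices.numel());
  // The third check (max pooling size vs. indices weight size) is omitted here.
}
```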

Reviewed By: zhenqin95

Differential Revision: D35768276

fbshipit-source-id: d942dfc7b01bfdbcf5b3d3fb76a50f1abe2da325

* Make variable type consistent in CPU code (pytorch#1076)

Summary:
Pull Request resolved: pytorch#1076

Variable types got mixed up across code versions of the CPU code. Here we clean this up and make the variable types consistent.

Reviewed By: shintaro-iwasaki

Differential Revision: D35817968

fbshipit-source-id: 4de43cbac3388896d1ae81c2eafd0d154dda6fca

* Follow up on throw errors directly on host code for CUDA bounds check op (pytorch#1075)

Summary:
Pull Request resolved: pytorch#1075

Follow up for D35768276 (pytorch@7be1fcb): throw errors directly on host code.

Reviewed By: yinghai

Differential Revision: D35905891

fbshipit-source-id: f97047ff9cb27f7f169dc0223fa0295cc14a8fe8

* Add dtype <-> SparseType conversion util function (pytorch#1057)

Summary:
Pull Request resolved: pytorch#1057

As title

Reviewed By: geyyer

Differential Revision: D35532366

fbshipit-source-id: 73891dd0eadcb0c79d6d0a06d7e0da911bd2519a

* Implement kernel for counter based weight decay and learning rate adjustment in rowwise_adagrad (pytorch#1068)

Summary:
Pull Request resolved: pytorch#1068

Implemented the kernel for counter-based weight decay and learning rate adjustment in rowwise_adagrad.

Reviewed By: csmiler

Differential Revision: D35758762

fbshipit-source-id: 1953ca950c8ebd3f45c0e5c343a5c2214393b487

* add bf16 support in jagged tensor ops (pytorch#1079)

Summary:
Pull Request resolved: pytorch#1079

To support bf16 training

Reviewed By: ajtulloch

Differential Revision: D35955466

fbshipit-source-id: 0f740f29074576c026005362c78f872fec80bbcc

* allow FP16-type grad_t (pytorch#1072)

Summary:
Pull Request resolved: pytorch#1072

This Diff partially revives D31432199 (pytorch@127f813), but only enables `grad_t = FP16` (no `BF16` support) to reduce the adverse side effects (e.g., increased binary size and compilation time).

Specifically, D31432199 (pytorch@127f813) provided FP32, FP16, and BF16 for `grad_t`.
This Diff removes the BF16 option for `grad_t` (so only FP32 and FP16 remain).
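A hedged sketch of the idea (not FBGEMM's actual dispatch macros or codegen): `grad_t` stays a template parameter, but only the FP32 and FP16 instantiations are used.

```cpp
#include <ATen/ATen.h>

// Hedged illustration only; the real code goes through FBGEMM's codegen
// and dispatch macros.
template <typename grad_t>
void launch_backward_kernel(const at::Tensor& grad_output) {
  // ... launch the backward kernel with gradients of type grad_t ...
  (void)grad_output;
}

void dispatch_backward(const at::Tensor& grad_output) {
  if (grad_output.scalar_type() == at::ScalarType::Half) {
    launch_backward_kernel<at::Half>(grad_output);  // grad_t = FP16
  } else {
    launch_backward_kernel<float>(grad_output);     // grad_t = FP32
  }
  // grad_t = BF16 is intentionally not instantiated, to limit binary size
  // and compilation time.
}
```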

Reviewed By: jianyuh

Differential Revision: D35120293

fbshipit-source-id: b9a1d35f901b26277a220360a2a68583c65c8554

* use shfl_sync instead of __shfl_sync (pytorch#1080)

Summary:
Pull Request resolved: pytorch#1080

This patch replaces CUDA-specific `__shfl_sync` used in D35758762 (pytorch@dfb36cd) with `shfl_sync`, which is a wrapper that supports both NVIDIA and AMD GPUs (like D33231489 (pytorch@c6df576)).

Reviewed By: dvksabin

Differential Revision: D35980472

fbshipit-source-id: f77c9e9dce31d55e80a201f80f98e44bbe8dce9e

* allow specify output_dtype for split no_bag embedding forward (pytorch#1067)

Summary:
"split_embedding_nobag_forward" did not accept "output_dtype" parameters when "{% if not dense and not nobag %}".

So when user created "SplitTableBatchedEmbeddingBagsCodegen" with the "output_dtype" to some type needed, it is not passed into split_embedding_nobag_forward, so the real output data type is not aligned with output_dtype user specified. And also there is no warning or error happens as well.

This PR added the "output_dtype" support for "split_embedding_nobag_forward".

Pull Request resolved: pytorch#1067

Reviewed By: brad-mengchi, shintaro-iwasaki

Differential Revision: D35866293

Pulled By: jianyuh

fbshipit-source-id: 4cf95c649dcd25408668644788f3817561d35c20

* Fix the OSS nightly build; Release FBGEMM v0.1.0 for TorchRec OSS release (pytorch#1088)

Summary:
Pull Request resolved: pytorch#1088

The OSS nightly build for CPU and GPU is broken due to package name configuration conflicts between pyproject.toml and setup.py. This Diff removes pyproject.toml and keeps only setup.py as the ground truth.

Reviewed By: geyyer, brad-mengchi

Differential Revision: D36040950

fbshipit-source-id: 2ca5a6f1da6cc4e8e1fecdf98c6ef6921cbce4ae

* Fix the OSS CUDA GPG key CI test failure (pytorch#1089)

Summary:
Pull Request resolved: pytorch#1089

Check https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212772

This Diff fixes the OSS test failure in https://github.com/pytorch/FBGEMM/runs/6242695168?check_suite_focus=true

Reviewed By: brad-mengchi

Differential Revision: D36048995

fbshipit-source-id: 13fd7fc24c41f4042392849b22e29b8659b782b8

* Add permute_pooled_embedding_ops_split for cpu_only and gpu (pytorch#1082)

Summary:
Pull Request resolved: pytorch#1082

Following up on the post https://fb.workplace.com/groups/2126278550786248/permalink/5353232054757532/

Reviewed By: jianyuh

Differential Revision: D35971699

fbshipit-source-id: a3c8a9d8ce453abb732bd0774cb4f95ef10240f9

* clean up output_dtype tensor allocation branches (pytorch#1086)

Summary:
Pull Request resolved: pytorch#1086

As title

Reviewed By: brad-mengchi

Differential Revision: D36018114

fbshipit-source-id: 9d8d6b5af53a4a75b917dee673629e6feeaa7ba3

* Fix build for embedding_inplace_update/embedding_inplace_update_cpu (pytorch#1081)

Summary: Pull Request resolved: pytorch#1081

Reviewed By: jasonjk-park, jianyuh, houseroad

Differential Revision: D35984814

fbshipit-source-id: 40a4d3dd5cfffb4240b517abb70e723abb396dff

Co-authored-by: Wang Zhou <wangzhou@fb.com>
Co-authored-by: root <root@ixt-rack-61.local.lan>
Co-authored-by: Shabab Ayub <shababayub@fb.com>
Co-authored-by: Jianyu Huang <jianyuhuang@fb.com>
Co-authored-by: Sabin Devkota <devkotasabin@fb.com>
Co-authored-by: Jongsoo Park <jongsoo@fb.com>
Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
Co-authored-by: pengwa@microsoft.com <pengwa@microsoft.com>
Co-authored-by: Rostyslav Geyyer <grostyslav@fb.com>
Co-authored-by: Mengchi Zhang <mengchi@fb.com>