
Conversation

shintaro-iwasaki
Contributor

Summary: This patch replaces CUDA-specific `__shfl_sync` used in D35758762 (pytorch@dfb36cd) with `shfl_sync`, which is a wrapper that supports both NVIDIA and AMD GPUs (like D33231489 (pytorch@c6df576)).
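For reference, a minimal sketch of what such a wrapper can look like, assuming HIP's `__HIP_PLATFORM_*` macros and the usual warp/wavefront sizes (the actual fbgemm_gpu helper may differ in details):

```cpp
// Hedged sketch only; the real fbgemm_gpu utility may differ.
#if defined(__HIP_PLATFORM_HCC__) || defined(__HIP_PLATFORM_AMD__)
static constexpr int kWarpSize = 64;  // AMD wavefront size
#else
static constexpr int kWarpSize = 32;  // NVIDIA warp size
#endif

template <typename T>
__device__ __forceinline__ T shfl_sync(
    const T val,
    int src_lane,
    int width = kWarpSize,
    unsigned mask = 0xffffffffu) {
#if defined(__HIP_PLATFORM_HCC__) || defined(__HIP_PLATFORM_AMD__)
  // HIP has no *_sync shuffle variants; plain __shfl covers the wavefront.
  return __shfl(val, src_lane, width);
#else
  return __shfl_sync(mask, val, src_lane, width);
#endif
}
```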

Differential Revision: D35980472

fbshipit-source-id: ae76d3c6303ddcfd345fdbb16cc9c69a5860a1f2
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D35980472

liligwu added a commit to ROCm/FBGEMM that referenced this pull request May 2, 2022
* Make WeightDecayMode consistent (pytorch#1063)

Summary:
Pull Request resolved: pytorch#1063

Currently in FE we define `L2=1` and `DECOUPLE=2` but in FBGEMM we use `L2=0` and `DECOUPLE=1` (https://fburl.com/code/65u4a608). While this is functionally OK since the interface converts between them, it can introduce unnecessary confusion about the numbering. Here we make them consistent across FE/BE by using `L2=1` and `DECOUPLE=2` for both.
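A hedged illustration of the intended end state (names are illustrative; the frontend definition is a Python `IntEnum` and the backend receives an integer `weight_decay_mode`):

```cpp
// Hedged sketch only: after this change FE and BE agree on the same numbering.
enum class WeightDecayMode : int {
  NONE = 0,      // no weight decay
  L2 = 1,        // L2 regularization
  DECOUPLE = 2,  // decoupled weight decay
};
```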

Differential Revision: D35763365

fbshipit-source-id: c61041f38844b02fdecac0fb1182a3184711d3bd

* Add default values for func args in FBGEMM codegen (pytorch#1066)

Summary:
Pull Request resolved: pytorch#1066

We mandate default values for float/int function args (usually hyper-parameters for optimizers) when generating FBGEMM code using codegen. This makes maintaining backward compatibility easier, as we can add more parameters without breaking existing callers.

Note: developers need to be cautious when adding new args with default values; the behavior with the default values should remain unchanged. If no default value is provided for a float/int parameter, it is set to 0.0/0 by default.
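A hedged illustration of the pattern (argument names are made up, not the actual generated signature): new float/int optimizer arguments are appended with defaults so existing call sites keep working.

```cpp
#include <cstdint>

// Hedged sketch only; not the real codegen output.
// Callers that predate weight_decay / weight_decay_mode still compile,
// because the new trailing arguments have defaults (0.0 / 0 when none is given).
void rowwise_adagrad_update(
    double learning_rate,
    double eps,
    double weight_decay = 0.0,
    std::int64_t weight_decay_mode = 0) {
  // ... optimizer update elided; silence unused-parameter warnings ...
  (void)learning_rate; (void)eps; (void)weight_decay; (void)weight_decay_mode;
}

int main() {
  rowwise_adagrad_update(0.01, 1e-8);          // old-style call, uses defaults
  rowwise_adagrad_update(0.01, 1e-8, 0.1, 2);  // new call with explicit values
  return 0;
}
```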

Reviewed By: jianyuh

Differential Revision: D35795294

fbshipit-source-id: 2632e1452c164d2ae7f999e9b17033ea77fe3864

* Enabling cuda (#25)

* Aligning merge_pooled_embeddings_test.py with upstream and enabling CUDA.

* Disabling use_cpu in split_table_batched_embeddings_test since it's still unstable.

Co-authored-by: root <root@ixt-rack-61.local.lan>

* enable merge_pooled_embeddings in oss (pytorch#1064)

Summary:
Pull Request resolved: pytorch#1064

For OSS inference we need to build fbgemm from source, and we need the `merge_pooled_embeddings` operator.

This is not available in fbgemm oss because of this: https://www.internalfb.com/diff/D30037992 (pytorch@41ab9713cb1c083414bd9759ebb95d47609101b7)?dst_version_fbid=1066324687448445&transaction_fbid=198310519085547, a dependency on nvml.h.

However, NVML is generally present on systems with CUDA, and its stub library can be located at `${CUDA_TOOLKIT_ROOT_DIR}/lib64/stubs/libnvidia-ml.so`, as detailed here: https://tianyuliukingcrimson.wordpress.com/2018/07/23/findnvml-cmake-done-correctly-how-to-have-cmake-find-nvidia-management-library-nvml-on-windows-and-linux/.

**However**, sometimes systems don't have it preinstalled with CUDA for whatever reason, in which case you can get it by installing cudatoolkit-dev:

`conda install -c conda-forge cudatoolkit-dev` (as I had to for my system)

This changes the path where `libnvidia-ml.so` lives, so we add an option, `nvml_lib_path`, that lets people specify where this library is located.

post: https://fb.workplace.com/groups/2126278550786248/posts/5357069087707162

Reviewed By: jspark1105

Differential Revision: D35785768

fbshipit-source-id: a2cb10fb54d5d97cbb6ecadfbbcb0c37bce7043b

* Add GLIBCXX_USE_CXX11_ABI compile option (pytorch#1073)

Summary: Pull Request resolved: pytorch#1073

Reviewed By: s4ayub

Differential Revision: D35682606

fbshipit-source-id: 58c78ec52a9b5caebbded97f836e658c59fb0d51

* Add even division checker for offsets in boundary checker (pytorch#1071)

Summary:
Pull Request resolved: pytorch#1071

As title. This might help detect and diagnose the issues in s268163.

Enforce the following checks (a hedged sketch follows the list):
1. The size of offsets needs to be exactly B * T + 1.
2. The last element of offsets should equal indices.numel().
3. The max pooling size should be less than or equal to the indices weight size.
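A hedged host-side sketch of the first two checks, assuming the usual ATen types (the actual bounds-check code differs):

```cpp
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

// Hedged sketch only. B = batch size, T = number of tables.
void check_offsets_shape(
    const at::Tensor& offsets,
    const at::Tensor& indices,
    int64_t B,
    int64_t T) {
  TORCH_CHECK(
      offsets.numel() == B * T + 1,
      "offsets.numel() = ", offsets.numel(),
      " is not equal to B * T + 1 = ", B * T + 1);
  TORCH_CHECK(
      offsets[offsets.numel() - 1].item<int64_t>() == indices.numel(),
      "the last element of offsets must equal indices.numel() = ",
      indices.numel());
  // The third check (max pooling size vs. indices weight size) is omitted here.
}
```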

Reviewed By: zhenqin95

Differential Revision: D35768276

fbshipit-source-id: d942dfc7b01bfdbcf5b3d3fb76a50f1abe2da325

* Make variable type consistent in CPU code (pytorch#1076)

Summary:
Pull Request resolved: pytorch#1076

Variable types got mixed up across code versions of the CPU code. Here we clean this up and make the variable types consistent.

Reviewed By: shintaro-iwasaki

Differential Revision: D35817968

fbshipit-source-id: 4de43cbac3388896d1ae81c2eafd0d154dda6fca

* Follow up on throw errors directly on host code for CUDA bounds check op (pytorch#1075)

Summary:
Pull Request resolved: pytorch#1075

Follow up for D35768276 (pytorch@7be1fcb): throw errors directly on host code.

Reviewed By: yinghai

Differential Revision: D35905891

fbshipit-source-id: f97047ff9cb27f7f169dc0223fa0295cc14a8fe8

* Add dtype <-> SparseType conversion util function (pytorch#1057)

Summary:
Pull Request resolved: pytorch#1057

As title

Reviewed By: geyyer

Differential Revision: D35532366

fbshipit-source-id: 73891dd0eadcb0c79d6d0a06d7e0da911bd2519a

* Implement kernel for counter based weight decay and learning rate adjustment in rowwise_adagrad (pytorch#1068)

Summary:
Pull Request resolved: pytorch#1068

Implemented the kernel for counter-based weight decay and learning rate adjustment in rowwise_adagrad.

Reviewed By: csmiler

Differential Revision: D35758762

fbshipit-source-id: 1953ca950c8ebd3f45c0e5c343a5c2214393b487

* add bf16 support in jagged tensor ops (pytorch#1079)

Summary:
Pull Request resolved: pytorch#1079

To support bf16 training

Reviewed By: ajtulloch

Differential Revision: D35955466

fbshipit-source-id: 0f740f29074576c026005362c78f872fec80bbcc

* allow FP16-type grad_t (pytorch#1072)

Summary:
Pull Request resolved: pytorch#1072

This Diff partially revives D31432199 (pytorch@127f813), but only enables `grad_t = FP16` (no `BF16` support) to reduce the adverse side effects (e.g., increased binary size and compilation time).

Specifically, D31432199 (pytorch@127f813) provided FP32, FP16, and BF16 for `grad_t`.
This Diff removes the BF16 option for `grad_t` (so only FP32 and FP16 remain).
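A hedged sketch of the idea (not FBGEMM's actual dispatch macros or codegen): `grad_t` stays a template parameter, but only the FP32 and FP16 instantiations are used.

```cpp
#include <ATen/ATen.h>

// Hedged illustration only; the real code goes through FBGEMM's codegen
// and dispatch macros.
template <typename grad_t>
void launch_backward_kernel(const at::Tensor& grad_output) {
  // ... launch the backward kernel with gradients of type grad_t ...
  (void)grad_output;
}

void dispatch_backward(const at::Tensor& grad_output) {
  if (grad_output.scalar_type() == at::ScalarType::Half) {
    launch_backward_kernel<at::Half>(grad_output);  // grad_t = FP16
  } else {
    launch_backward_kernel<float>(grad_output);     // grad_t = FP32
  }
  // grad_t = BF16 is intentionally not instantiated, to limit binary size
  // and compilation time.
}
```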

Reviewed By: jianyuh

Differential Revision: D35120293

fbshipit-source-id: b9a1d35f901b26277a220360a2a68583c65c8554

* use shfl_sync instead of __shfl_sync (pytorch#1080)

Summary:
Pull Request resolved: pytorch#1080

This patch replaces CUDA-specific `__shfl_sync` used in D35758762 (pytorch@dfb36cd) with `shfl_sync`, which is a wrapper that supports both NVIDIA and AMD GPUs (like D33231489 (pytorch@c6df576)).

Reviewed By: dvksabin

Differential Revision: D35980472

fbshipit-source-id: f77c9e9dce31d55e80a201f80f98e44bbe8dce9e

* allow specify output_dtype for split no_bag embedding forward (pytorch#1067)

Summary:
"split_embedding_nobag_forward" did not accept "output_dtype" parameters when "{% if not dense and not nobag %}".

So when user created "SplitTableBatchedEmbeddingBagsCodegen" with the "output_dtype" to some type needed, it is not passed into split_embedding_nobag_forward, so the real output data type is not aligned with output_dtype user specified. And also there is no warning or error happens as well.

This PR added the "output_dtype" support for "split_embedding_nobag_forward".

Pull Request resolved: pytorch#1067

Reviewed By: brad-mengchi, shintaro-iwasaki

Differential Revision: D35866293

Pulled By: jianyuh

fbshipit-source-id: 4cf95c649dcd25408668644788f3817561d35c20

* Fix the OSS nightly build; Release FBGEMM v0.1.0 for TorchRec OSS release (pytorch#1088)

Summary:
Pull Request resolved: pytorch#1088

The OSS nightly build for CPU and GPU is broken due to package name configuration conflicts between pyproject.toml and setup.py. This Diff removes pyproject.toml and keeps only setup.py as the ground truth.

Reviewed By: geyyer, brad-mengchi

Differential Revision: D36040950

fbshipit-source-id: 2ca5a6f1da6cc4e8e1fecdf98c6ef6921cbce4ae

* Fix the OSS CUDA GPG key CI test failure (pytorch#1089)

Summary:
Pull Request resolved: pytorch#1089

Check https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212772

This Diff fixes the OSS test failure in https://github.com/pytorch/FBGEMM/runs/6242695168?check_suite_focus=true

Reviewed By: brad-mengchi

Differential Revision: D36048995

fbshipit-source-id: 13fd7fc24c41f4042392849b22e29b8659b782b8

* Add permute_pooled_embedding_ops_split for cpu_only and gpu (pytorch#1082)

Summary:
Pull Request resolved: pytorch#1082

Following up on the post https://fb.workplace.com/groups/2126278550786248/permalink/5353232054757532/

Reviewed By: jianyuh

Differential Revision: D35971699

fbshipit-source-id: a3c8a9d8ce453abb732bd0774cb4f95ef10240f9

* clean up output_dtype tensor allocation branches (pytorch#1086)

Summary:
Pull Request resolved: pytorch#1086

As title

Reviewed By: brad-mengchi

Differential Revision: D36018114

fbshipit-source-id: 9d8d6b5af53a4a75b917dee673629e6feeaa7ba3

* Fix build for embedding_inplace_update/embedding_inplace_update_cpu (pytorch#1081)

Summary: Pull Request resolved: pytorch#1081

Reviewed By: jasonjk-park, jianyuh, houseroad

Differential Revision: D35984814

fbshipit-source-id: 40a4d3dd5cfffb4240b517abb70e723abb396dff

Co-authored-by: Wang Zhou <wangzhou@fb.com>
Co-authored-by: root <root@ixt-rack-61.local.lan>
Co-authored-by: Shabab Ayub <shababayub@fb.com>
Co-authored-by: Jianyu Huang <jianyuhuang@fb.com>
Co-authored-by: Sabin Devkota <devkotasabin@fb.com>
Co-authored-by: Jongsoo Park <jongsoo@fb.com>
Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
Co-authored-by: pengwa@microsoft.com <pengwa@microsoft.com>
Co-authored-by: Rostyslav Geyyer <grostyslav@fb.com>
Co-authored-by: Mengchi Zhang <mengchi@fb.com>