per-group and per-channel quantization #25
Closed
Conversation
Branch force-pushed from 24815db to ba364c6 (Compare)
jspark1105 added a commit to jspark1105/pytorch that referenced this pull request on Nov 26, 2018
Summary:
Pull Request resolved: pytorch#14340
Pull Request resolved: pytorch/FBGEMM#25

Per-group and per-channel quantization in fbgemm. This diff also cleans up explicit template instantiation using macro expansion, and changes the randFill interface, which previously made it easy to mistakenly generate integer random numbers for floating-point vectors. Using this in DNNLOWP operators will be done in a separate diff.

Differential Revision: D13176386
fbshipit-source-id: 3137039d2822e42a16881638d54897d9c8bc75f4
Summary:
Pull Request resolved: pytorch/pytorch#14340
Pull Request resolved: pytorch#25

Per-group and per-channel quantization in fbgemm. This diff also cleans up explicit template instantiation using macro expansion, and changes the randFill interface, which previously made it easy to mistakenly generate integer random numbers for floating-point vectors. Using this in DNNLOWP operators will be done in a separate diff.

Differential Revision: D13176386
fbshipit-source-id: e08c676b6b9cf301f76b87cdb901ecc51c4cc8a4
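For readers unfamiliar with the terminology: per-channel (and per-group) quantization picks a separate scale and zero point for each output channel (or group of channels) of the weight matrix, instead of a single pair for the whole tensor, which tightens the quantization range per channel. The snippet below is a minimal, self-contained sketch of that parameter selection in plain C++; it is not the fbgemm API, and the names `QuantParams` and `choose_quant_params_per_channel` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for asymmetric uint8 quantization parameters.
struct QuantParams {
  float scale;
  std::int32_t zero_point;
};

// Choose one (scale, zero_point) pair per output channel of a row-major
// [num_channels x K] weight matrix, mapping each channel's value range
// onto the uint8 range [0, 255].
std::vector<QuantParams> choose_quant_params_per_channel(
    const std::vector<float>& weights, int num_channels, int K) {
  std::vector<QuantParams> params(num_channels);
  for (int c = 0; c < num_channels; ++c) {
    const float* row = weights.data() + static_cast<std::size_t>(c) * K;
    float lo = *std::min_element(row, row + K);
    float hi = *std::max_element(row, row + K);
    // Make sure the range contains zero so that zero is exactly representable.
    lo = std::min(lo, 0.0f);
    hi = std::max(hi, 0.0f);
    float scale = (hi - lo) / 255.0f;
    if (scale == 0.0f) {
      scale = 1.0f;  // all-zero channel: any scale works
    }
    std::int32_t zero_point =
        static_cast<std::int32_t>(std::lround(-lo / scale));
    params[c] = {scale, zero_point};
  }
  return params;
}
```

Per-group quantization is the same computation with the channels partitioned into groups that share one (scale, zero_point) pair.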
Branch force-pushed from ba364c6 to 86109d3 (Compare)
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request on Nov 27, 2018
Summary:
Pull Request resolved: #14340
Pull Request resolved: pytorch/FBGEMM#25

Per-group and per-channel quantization in fbgemm. This diff also cleans up explicit template instantiation using macro expansion, and changes the randFill interface, which previously made it easy to mistakenly generate integer random numbers for floating-point vectors. Using this in DNNLOWP operators will be done in a separate diff.

Reviewed By: dskhudia
Differential Revision: D13176386
fbshipit-source-id: e46c53e31e21520bded71b8ed86e8b19e010e2dd
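The randFill change mentioned in the summary addresses a usability trap: a fill routine that always draws integers makes it easy to populate a floating-point test vector with only whole numbers by accident. As a rough illustration of a type-aware interface (a sketch, not fbgemm's actual `randFill` signature):

```cpp
#include <random>
#include <type_traits>
#include <vector>

// Hypothetical type-aware fill: integral element types get an integer
// distribution, floating-point element types get a real-valued one, so a
// std::vector<float> can no longer end up holding only whole numbers by
// accident.
template <typename T>
void randFill(std::vector<T>& vec, T low, T high, std::mt19937& gen) {
  if constexpr (std::is_integral_v<T>) {
    std::uniform_int_distribution<T> dist(low, high);
    for (auto& v : vec) {
      v = dist(gen);
    }
  } else {
    std::uniform_real_distribution<T> dist(low, high);
    for (auto& v : vec) {
      v = dist(gen);
    }
  }
}
```

With an interface shaped like this, `randFill(float_vec, 0.0f, 1.0f, gen)` draws real values in [0, 1) while `randFill(int_vec, 0, 255, gen)` draws integers.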
pruthvistony referenced this pull request in ROCm/FBGEMM on Apr 22, 2022
* Aligning with upstream on merge_pooled_embeddings_test.py and enabling cuda.
* Disabling use_cpu in split_table_batched_embeddings_test since it's still unstable.

Co-authored-by: root <root@ixt-rack-61.local.lan>
liligwu referenced this pull request in ROCm/FBGEMM on May 2, 2022
* Aligning with upstream on merge_pooled_embeddings_test.py and enabling cuda.
* Disabling use_cpu in split_table_batched_embeddings_test since it's still unstable.

Co-authored-by: root <root@ixt-rack-61.local.lan>
liligwu referenced this pull request in ROCm/FBGEMM on May 2, 2022
* Make WeightDecayMode consistent (pytorch#1063)
  Summary: Pull Request resolved: pytorch#1063
  Currently in FE we define `L2=1` and `DECOUPLE=2`, but in FBGEMM we use `L2=0` and `DECOUPLE=1` (https://fburl.com/code/65u4a608). While function-wise it is OK since the interface is converted, it may introduce unnecessary confusion about the numbering. Here we make them consistent across FE/BE by using `L2=1` and `DECOUPLE=2` for both (see the enum sketch after this commit list).
  Differential Revision: D35763365
  fbshipit-source-id: c61041f38844b02fdecac0fb1182a3184711d3bd

* Add default values for func args in FBGEMM codegen (pytorch#1066)
  Summary: Pull Request resolved: pytorch#1066
  We mandate default values for float/int function args (usually hyper-parameters for optimizers) when generating FBGEMM code using codegen. This makes backward compatibility easier, as we can add more parameters without breaking compatibility. Note: developers need to be cautious when adding new args with default values; the behavior should remain the same with default values. If no default values are provided for float/int parameters, they'll be set to 0.0/0 by default.
  Reviewed By: jianyuh
  Differential Revision: D35795294
  fbshipit-source-id: 2632e1452c164d2ae7f999e9b17033ea77fe3864

* Enabling cuda (#25)
  * Aligning with upstream on merge_pooled_embeddings_test.py and enabling cuda.
  * Disabling use_cpu in split_table_batched_embeddings_test since it's still unstable.
  Co-authored-by: root <root@ixt-rack-61.local.lan>

* enable merge_pooled_embeddings in oss (pytorch#1064)
  Summary: Pull Request resolved: pytorch#1064
  In inference OSS we need to build fbgemm from source and we need the `merge_pooled_embeddings` operator. This is not available in fbgemm oss because of this: https://www.internalfb.com/diff/D30037992 (pytorch@41ab9713cb1c083414bd9759ebb95d47609101b7)?dst_version_fbid=1066324687448445&transaction_fbid=198310519085547, a dependency on nvml.h. However, generally nvml.h is present on systems and can be located at `${CUDA_TOOLKIT_ROOT_DIR}/lib64/stubs/libnvidia-ml.so`, as detailed here: https://tianyuliukingcrimson.wordpress.com/2018/07/23/findnvml-cmake-done-correctly-how-to-have-cmake-find-nvidia-management-library-nvml-on-windows-and-linux/. **However**, sometimes systems don't have it preinstalled with cuda for whatever reason, in which case you can get it by installing cudatoolkit-dev: `conda install -c conda-forge cudatoolkit-dev` (as I had to for my system). This changes the path that `libnvidia-ml.so` exists on, so we give people the option to specify where this library lives: `nvml_lib_path`. Post: https://fb.workplace.com/groups/2126278550786248/posts/5357069087707162
  Reviewed By: jspark1105
  Differential Revision: D35785768
  fbshipit-source-id: a2cb10fb54d5d97cbb6ecadfbbcb0c37bce7043b

* Add GLIBCXX_USE_CXX11_ABI compile option (pytorch#1073)
  Summary: Pull Request resolved: pytorch#1073
  Reviewed By: s4ayub
  Differential Revision: D35682606
  fbshipit-source-id: 58c78ec52a9b5caebbded97f836e658c59fb0d51

* Add even division checker for offsets in boundary checker (pytorch#1071)
  Summary: Pull Request resolved: pytorch#1071
  As title. This might be helpful to detect and check the issues for s268163. Enforce the following checks (see the bounds-check sketch after this commit list):
  1. the size of offsets needs to be exactly B * T + 1;
  2. the last element of offsets should be equal to indices.numel();
  3. the max pooling size should be less than or equal to the indices weight size.
  Reviewed By: zhenqin95
  Differential Revision: D35768276
  fbshipit-source-id: d942dfc7b01bfdbcf5b3d3fb76a50f1abe2da325

* Make variable type consistent in CPU code (pytorch#1076)
  Summary: Pull Request resolved: pytorch#1076
  Variable types got mixed up in code versions for CPU code. Here we clean it up and make variable types consistent.
  Reviewed By: shintaro-iwasaki
  Differential Revision: D35817968
  fbshipit-source-id: 4de43cbac3388896d1ae81c2eafd0d154dda6fca

* Follow up on throw errors directly on host code for CUDA bounds check op (pytorch#1075)
  Summary: Pull Request resolved: pytorch#1075
  Follow-up for D35768276 (pytorch@7be1fcb): throw errors directly on host code.
  Reviewed By: yinghai
  Differential Revision: D35905891
  fbshipit-source-id: f97047ff9cb27f7f169dc0223fa0295cc14a8fe8

* Add dtype <-> SparseType conversion util function (pytorch#1057)
  Summary: Pull Request resolved: pytorch#1057
  As title.
  Reviewed By: geyyer
  Differential Revision: D35532366
  fbshipit-source-id: 73891dd0eadcb0c79d6d0a06d7e0da911bd2519a

* Implement kernel for counter based weight decay and learning rate adjustment in rowwise_adagrad (pytorch#1068)
  Summary: Pull Request resolved: pytorch#1068
  Implemented the kernel for counter-based weight decay and learning rate adjustment in rowwise_adagrad.
  Reviewed By: csmiler
  Differential Revision: D35758762
  fbshipit-source-id: 1953ca950c8ebd3f45c0e5c343a5c2214393b487

* add bf16 support in jagged tensor ops (pytorch#1079)
  Summary: Pull Request resolved: pytorch#1079
  To support bf16 training.
  Reviewed By: ajtulloch
  Differential Revision: D35955466
  fbshipit-source-id: 0f740f29074576c026005362c78f872fec80bbcc

* allow FP16-type grad_t (pytorch#1072)
  Summary: Pull Request resolved: pytorch#1072
  This Diff partially revives D31432199 (pytorch@127f813), but only enables `grad_t = FP16` (no `BF16` support) to reduce the adverse side effects (e.g., the increase in binary size and compilation time). Specifically, D31432199 (pytorch@127f813) provided FP32, FP16, and BF16 for `grad_t`. This Diff removes the BF16 option for `grad_t` (so only FP32 and FP16 for `grad_t`).
  Reviewed By: jianyuh
  Differential Revision: D35120293
  fbshipit-source-id: b9a1d35f901b26277a220360a2a68583c65c8554

* use shfl_sync instead of __shfl_sync (pytorch#1080)
  Summary: Pull Request resolved: pytorch#1080
  This patch replaces the CUDA-specific `__shfl_sync` used in D35758762 (pytorch@dfb36cd) with `shfl_sync`, which is a wrapper that supports both NVIDIA and AMD GPUs (like D33231489 (pytorch@c6df576)).
  Reviewed By: dvksabin
  Differential Revision: D35980472
  fbshipit-source-id: f77c9e9dce31d55e80a201f80f98e44bbe8dce9e

* allow specify output_dtype for split no_bag embedding forward (pytorch#1067)
  Summary: "split_embedding_nobag_forward" did not accept the "output_dtype" parameter when "{% if not dense and not nobag %}". So when a user created "SplitTableBatchedEmbeddingBagsCodegen" with "output_dtype" set to some needed type, it was not passed into split_embedding_nobag_forward, so the real output data type did not match the output_dtype the user specified, and no warning or error was raised either. This PR adds "output_dtype" support for "split_embedding_nobag_forward".
  Pull Request resolved: pytorch#1067
  Reviewed By: brad-mengchi, shintaro-iwasaki
  Differential Revision: D35866293
  Pulled By: jianyuh
  fbshipit-source-id: 4cf95c649dcd25408668644788f3817561d35c20

* Fix the OSS nightly build; Release FBGEMM v0.1.0 for TorchRec OSS release (pytorch#1088)
  Summary: Pull Request resolved: pytorch#1088
  The OSS nightly build for CPU and GPU is broken due to package name configuration conflicts between pyproject.toml and setup.py. This Diff removes pyproject.toml and keeps only setup.py as the ground truth.
  Reviewed By: geyyer, brad-mengchi
  Differential Revision: D36040950
  fbshipit-source-id: 2ca5a6f1da6cc4e8e1fecdf98c6ef6921cbce4ae

* Fix the OSS CUDA GPG key CI test failure (pytorch#1089)
  Summary: Pull Request resolved: pytorch#1089
  Check https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212772. This Diff fixes the OSS test failure in https://github.com/pytorch/FBGEMM/runs/6242695168?check_suite_focus=true.
  Reviewed By: brad-mengchi
  Differential Revision: D36048995
  fbshipit-source-id: 13fd7fc24c41f4042392849b22e29b8659b782b8

* Add permute_pooled_embedding_ops_split for cpu_only and gpu (pytorch#1082)
  Summary: Pull Request resolved: pytorch#1082
  Following up on the post https://fb.workplace.com/groups/2126278550786248/permalink/5353232054757532/.
  Reviewed By: jianyuh
  Differential Revision: D35971699
  fbshipit-source-id: a3c8a9d8ce453abb732bd0774cb4f95ef10240f9

* clean up output_dtype tensor allocation branches (pytorch#1086)
  Summary: Pull Request resolved: pytorch#1086
  As title.
  Reviewed By: brad-mengchi
  Differential Revision: D36018114
  fbshipit-source-id: 9d8d6b5af53a4a75b917dee673629e6feeaa7ba3

* Fix build for embedding_inplace_update/embedding_inplace_update_cpu (pytorch#1081)
  Summary: Pull Request resolved: pytorch#1081
  Reviewed By: jasonjk-park, jianyuh, houseroad
  Differential Revision: D35984814
  fbshipit-source-id: 40a4d3dd5cfffb4240b517abb70e723abb396dff

Co-authored-by: Wang Zhou <wangzhou@fb.com>
Co-authored-by: root <root@ixt-rack-61.local.lan>
Co-authored-by: Shabab Ayub <shababayub@fb.com>
Co-authored-by: Jianyu Huang <jianyuhuang@fb.com>
Co-authored-by: Sabin Devkota <devkotasabin@fb.com>
Co-authored-by: Jongsoo Park <jongsoo@fb.com>
Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
Co-authored-by: pengwa@microsoft.com <pengwa@microsoft.com>
Co-authored-by: Rostyslav Geyyer <grostyslav@fb.com>
Co-authored-by: Mengchi Zhang <mengchi@fb.com>
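Two of the commits above lend themselves to short illustrations. First, the WeightDecayMode change (pytorch#1063) is purely about keeping the enum numbering identical on both sides of the frontend/backend interface. A minimal C++ sketch of the agreed numbering follows; only `L2 = 1` and `DECOUPLE = 2` come from the commit message, while the `NONE = 0` member and the comments are assumptions added for illustration.

```cpp
// Keeping the same numeric values on both sides of the interface avoids
// silently reinterpreting the raw integer when it crosses the FE/BE boundary.
enum class WeightDecayMode : int {
  NONE = 0,      // assumed default member, not stated in the commit message
  L2 = 1,        // L2 regularization folded into the gradient
  DECOUPLE = 2,  // decoupled (AdamW-style) weight decay
};
```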
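Second, the boundary-checker commit (pytorch#1071) enumerates three invariants on the offsets tensor. A hypothetical host-side helper expressing one reading of those checks might look like the following; `check_offsets` and its parameters are illustrative, not the actual FBGEMM op.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical host-side validation of an offsets tensor for B samples and
// T tables: exactly B * T + 1 entries, last entry equal to the number of
// indices, and no pooling segment longer than the allowed weight size.
void check_offsets(const std::vector<std::int64_t>& offsets,
                   std::int64_t B, std::int64_t T,
                   std::int64_t num_indices,
                   std::int64_t max_weight_size) {
  if (static_cast<std::int64_t>(offsets.size()) != B * T + 1) {
    throw std::invalid_argument("offsets must have exactly B * T + 1 elements");
  }
  if (offsets.back() != num_indices) {
    throw std::invalid_argument("offsets.back() must equal indices.numel()");
  }
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    if (offsets[i] - offsets[i - 1] > max_weight_size) {
      throw std::invalid_argument("pooling segment exceeds the weight size");
    }
  }
}
```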
liligwu added a commit to liligwu/FBGEMM that referenced this pull request on Nov 30, 2022
…granu Enable arbitrary embedding dimensions for ROCm
Summary:
Per-group and per-channel quantization in fbgemm.
This diff also cleans up explicit template instantiation using macro expansion.
Using this in DNNLOWP operators will be done in a separate diff.
Differential Revision: D13176386
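The note about cleaning up explicit template instantiation with macro expansion refers to a common C++ pattern: list the supported element types once and let a macro stamp out the `template void ...;` instantiation lines, instead of repeating each full signature by hand. A generic sketch of the pattern follows; the `Quantize` template and the macro name are illustrative, not fbgemm's actual code.

```cpp
#include <cstdint>

// A function template whose instantiations the library ships precompiled.
// Simplified: real quantization also rounds and saturates to the target range.
template <typename T>
void Quantize(const float* src, T* dst, int len, float scale,
              std::int32_t zero_point) {
  for (int i = 0; i < len; ++i) {
    dst[i] = static_cast<T>(src[i] / scale + zero_point);
  }
}

// One macro lists the supported element types; adding a new type means adding
// one line instead of repeating the full explicit-instantiation signature.
#define INSTANTIATE_QUANTIZE(T) \
  template void Quantize<T>(const float*, T*, int, float, std::int32_t);

INSTANTIATE_QUANTIZE(std::uint8_t)
INSTANTIATE_QUANTIZE(std::int8_t)
INSTANTIATE_QUANTIZE(std::int32_t)

#undef INSTANTIATE_QUANTIZE
```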