
expanded weights without fast rules #70140


Closed · wants to merge 26 commits

Conversation

samdow
Contributor

@samdow samdow commented Dec 17, 2021

Stack from ghstack:

[Design Doc for Expanded Weights](https://gist.github.com/samdow/fa0a164fec7963f93ff45284989cfc55) <-- gives an overview of the design for Expanded Weights

Introduces the ExpandedWeights mechanism and user-facing API without any custom-implemented faster rules.

  • User-facing API is in `_stateless.py` (with documentation)
  • Testing is in test_expanded_weights
  • The rest is the implementation of the erroring fallback + the mechanism for registering faster per-sample grad rules. Only linear is implemented here; the rest are implemented in expanded weights: instance norm faster rules (#70141)

Differential Revision: D34350950
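(A minimal usage sketch of the intended flow, hedged: the import path is hypothetical since the PR only says the API lives in `_stateless.py`, and the `call_with_per_sample_grads` name and `.grad_sample` attribute are taken from the review discussion further down, so the exact signature may differ.)

```python
import torch
import torch.nn as nn
# hypothetical import path; the PR only says the API lives in _stateless.py
from torch.nn.utils._stateless import call_with_per_sample_grads

model = nn.Linear(10, 2)
batch_size = 64
x = torch.randn(batch_size, 10)

# run the module with its parameters treated as ExpandedWeights
out = call_with_per_sample_grads(model, batch_size, x)
out.sum().backward()

# after backward, each parameter carries one gradient per batch element
print(model.weight.grad_sample.shape)  # expected: torch.Size([64, 2, 10])
```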

@pytorch-probot

pytorch-probot bot commented Dec 17, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/93f25fc1fe663d2a1ab8b3f06a7e19ca05144d10/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows | Labels (bold = enabled) | Status
Triggered Workflows
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Contributor

facebook-github-bot commented Dec 17, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit f6fa517 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build win-vs2019-cpu-py3 / build (1/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2022-02-22T18:34:47.4037740Z FAILED: bin/torch_cpu.dll lib/torch_cpu.lib
2022-02-22T18:31:39.5538266Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-gemm\gen\4x16c8-minmax-gemmlowp-avx512skx.c.obj 
2022-02-22T18:31:39.5539242Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\1x16c8-minmax-fp32-avx512skx.c.obj 
2022-02-22T18:31:39.5540284Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\1x16c8-minmax-gemmlowp-avx512skx.c.obj 
2022-02-22T18:31:39.5541259Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\2x16c8-minmax-fp32-avx512skx.c.obj 
2022-02-22T18:31:39.5542247Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\2x16c8-minmax-gemmlowp-avx512skx.c.obj 
2022-02-22T18:31:39.5543236Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\3x16c8-minmax-fp32-avx512skx.c.obj 
2022-02-22T18:31:39.5544205Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\3x16c8-minmax-gemmlowp-avx512skx.c.obj 
2022-02-22T18:31:39.5545186Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\4x16c8-minmax-fp32-avx512skx.c.obj 
2022-02-22T18:31:39.5546161Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\4x16c8-minmax-gemmlowp-avx512skx.c.obj 
2022-02-22T18:34:47.4045578Z LINK: command "C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe @CMakeFiles\torch_cpu.rsp /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:vcomp -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST /MANIFESTFILE:bin\torch_cpu.dll.manifest" failed (exit code 1120) with the following output:
2022-02-22T18:34:47.4061357Z Microsoft (R) Incremental Linker Version 14.28.29337.0
2022-02-22T18:34:47.4062172Z Copyright (C) Microsoft Corporation.  All rights reserved.
2022-02-22T18:34:47.4062669Z 
2022-02-22T18:34:47.4063414Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\AccumulateType.cpp.obj 
2022-02-22T18:34:47.4064544Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchedFallback.cpp.obj 
2022-02-22T18:34:47.4065692Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchedTensorImpl.cpp.obj 
2022-02-22T18:34:47.4066926Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchingRegistrations.cpp.obj 
2022-02-22T18:34:47.4068138Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\CPUGeneratorImpl.cpp.obj 

See GitHub Actions build win-vs2019-cuda11.3-py3 / build (2/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2022-02-22T18:36:25.2274845Z FAILED: bin/torch_cpu.dll lib/torch_cpu.lib
2022-02-22T18:36:06.7404828Z MultiLabelMarginCriterion.cu
2022-02-22T18:36:09.6705814Z LinearAlgebra.cu
2022-02-22T18:36:13.9445706Z MultiMarginLoss.cu
2022-02-22T18:36:14.1773359Z Indexing.cu
2022-02-22T18:36:19.1071492Z MultinomialKernel.cu
2022-02-22T18:36:25.2284218Z LINK: command "C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe @CMakeFiles\torch_cpu.rsp /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:vcomp -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST /MANIFESTFILE:bin\torch_cpu.dll.manifest" failed (exit code 1120) with the following output:
2022-02-22T18:36:25.2287197Z Microsoft (R) Incremental Linker Version 14.28.29337.0
2022-02-22T18:36:25.2288132Z Copyright (C) Microsoft Corporation.  All rights reserved.
2022-02-22T18:36:25.2288687Z 
2022-02-22T18:36:25.2289553Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\AccumulateType.cpp.obj 
2022-02-22T18:36:25.2290894Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchedFallback.cpp.obj 
2022-02-22T18:36:25.2292175Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchedTensorImpl.cpp.obj 
2022-02-22T18:36:25.2293676Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchingRegistrations.cpp.obj 
2022-02-22T18:36:25.2295216Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\CPUGeneratorImpl.cpp.obj 

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@samdow samdow requested a review from zou3519 December 21, 2021 21:12
Contributor

@zou3519 zou3519 left a comment

Need to run but here are some initial comments

Comment on lines 58 to 60
@property
def shape(self):
    return self.orig_weight.shape
Contributor

What happens if you call expanded_weight.size()? does that return the correct thing?

Contributor Author

Added size() function
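(A hedged sketch of one possible shape for that override, assuming the wrapper forwards metadata to orig_weight the same way the shape property above does; the actual implementation is in the PR diff.)

```python
import torch

class ExpandedWeightSketch(torch.Tensor):
    # construction details (__new__/_make_subclass) elided; assumes
    # self.orig_weight holds the wrapped parameter, as in the PR
    @property
    def shape(self):
        return self.orig_weight.shape

    def size(self, dim=None):
        # mirror torch.Tensor.size(): no arg returns torch.Size, int arg returns int
        return self.orig_weight.size() if dim is None else self.orig_weight.size(dim)
```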

Contributor

@zou3519 zou3519 Dec 22, 2021

Follow-up question (maybe more for @albanD): is it user error to pass a __torch_function__ tensor subclass into some C++ code that requires a Tensor? For the PyTorch frontend API the answer is probably no, because it should enter __torch_function__, but if users have things like custom C++ operators that they've pybind'ed into Python, do we consider this to be a user error?

Collaborator

It is not an error, and the code will use the underlying Tensor as-is. This is how nn.Parameter works today.
But it is a subtlety the user needs to be aware of: as soon as you pass the frontend API binding or enter any other C++ function, only the C++ Tensor associated with your subclass will exist.
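(A minimal sketch of the subtlety described here, with illustrative names not from the PR: calls through the Python frontend hit __torch_function__, while code that crosses into C++ directly only ever sees the underlying Tensor.)

```python
import torch

class Logged(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print("frontend dispatch:", getattr(func, "__name__", func))
        return super().__torch_function__(func, types, args, kwargs)

t = torch.Tensor._make_subclass(Logged, torch.randn(3))
out = torch.add(t, 1)  # routed through __torch_function__
# a custom C++ operator pybind'ed to take at::Tensor would instead receive
# the underlying Tensor directly, with no __torch_function__ hook
```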


# dependency on `functional_call` means that this can't be exposed in utils
# without creating a circular dependency
def per_sample_call(module, batch_size, args, kwargs=None):
Contributor

nit: we might want to bikeshed this name some more. I would be really confused if I saw per_sample_call(model) in someone's code if I didn't already know about per-sample-gradients

Contributor Author

Yeah, I definitely see that. What about call_with_per_sample_gradients?

Contributor

@zou3519 zou3519 left a comment

The code in this PR looks good to me, but I have a suggestion around organization and testing. This PR introduces:

  1. The ExpandedWeights Object
  2. The per_sample_grad API
  3. A lot of helper functions

Only (1) is being tested in test_expanded_weights.py. (2) and (3) are probably tested in the next PR in the stack (I haven't looked yet), but in general each PR in a stack should be able to stand on its own. Maybe we should include the per-sample-grad rule for a simple layer (like linear) in this PR so we can test the per_sample_call API as well as the helper functions and demonstrate how everything works end-to-end.

Regarding the ExpandedWeights Object: is there a list of common Tensor attributes to override somewhere? I think we are missing stride(), is_contiguous(), memory_format(), but there might be more. It might be good to find some prominent user of __torch_function__ (I don't know of any) and see what attributes they override. (One possible starting point is sketched below.)
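(On the "is there a list" question: torch.overrides exposes the full set of overridable APIs, which could serve as that checklist. A small hedged sketch; get_testing_overrides is a real torch API, and the printed expectations are assumptions.)

```python
import torch
from torch.overrides import get_testing_overrides

# get_testing_overrides() maps every overridable torch API, including
# torch.Tensor methods, to a dummy lambda with a matching signature;
# its key set is a reasonable checklist for a wrapper like ExpandedWeight
overridable = get_testing_overrides()
print(len(overridable), "overridable APIs")
print(torch.Tensor.stride in overridable)  # Tensor methods appear as keys
```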

@samdow samdow requested a review from zou3519 February 1, 2022 17:03
samdow added 5 commits February 3, 2022 23:30
@samdow
Contributor Author

samdow commented Feb 7, 2022

@zou3519 The remaining test failures look unrelated to the PR. I can try to look at CI and rebase if it's green.

Added in this update:

  • Functions now take in the flattened args plus the keys for all kwargs, so we can reconstruct the kwargs in the backend
  • Linear's __torch_function__ was moved to the generated layer, which means that, like conv, we have to add back in any default kwargs that weren't passed. This was also moved to the function level instead of living in expanded_weights_impl
  • Multiple per-sample gradient computations with the same underlying weight accumulate, since this is what happens in RNNs (see the sketch below). There's also an error if we try to run call_with_per_sample_grads while any of the weights still have their grad_sample field set
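
(A hedged sketch of the accumulation behavior from the last bullet; accumulate_grad_sample is a hypothetical helper name, while grad_sample is the attribute named above.)

```python
import torch

def accumulate_grad_sample(weight, per_sample_grad):
    # if the same underlying weight is used multiple times in one forward
    # (e.g. in an RNN), per-sample gradients accumulate instead of overwriting
    if getattr(weight, "grad_sample", None) is not None:
        weight.grad_sample = weight.grad_sample + per_sample_grad
    else:
        weight.grad_sample = per_sample_grad

w = torch.randn(2, 3)
accumulate_grad_sample(w, torch.ones(5, 2, 3))
accumulate_grad_sample(w, torch.ones(5, 2, 3))
print(w.grad_sample[0, 0, 0])  # tensor(2.): both uses accumulated
```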

samdow added 2 commits February 7, 2022 17:13
Contributor

@zou3519 zou3519 left a comment

TODO:

  • check to see if _make_wrapper_subclass works
  • (rzou): aux_output, num_true_outputs is a bit weird, check next stack up

results = []

results.append(grad_if_exists_for_input(
    input, lambda: grad_output.matmul(unpack_expanded_weight_or_tensor(weight))))
results.extend([None] * 3)  # weight and bias don't compute batched gradients
Contributor

I'm confused: why 3? This means we're returning a total of 4 values from backward, right?

Contributor

input, weight, bias, kwarg_names are the inputs

Contributor Author

The update hopefully made this easier to understand: kwarg_names is now at the front, so its None gets added at the start instead of here.

class ExpandedWeight(torch.Tensor):
    def __init__(self, orig_weight, batch_size):
        self.batch_size = batch_size
        self.orig_weight = orig_weight
Contributor

The ExpandedWeight object has orig_weight's data, and then we're assigning orig_weight to it, so this effectively doubles the parameter's memory.

Comment on lines 16 to 20
expanded_args_without_kwargs = expanded_args[:2]
output, aux_outputs = forward_helper(F.linear, expanded_args_without_kwargs, expanded_kwargs, 1)
ctx.args = expanded_args_without_kwargs
ctx.kwargs = expanded_kwargs
ctx.aux_outputs = aux_outputs
Contributor

Okay, the reason these lines tripped me up is that they are generic, and that makes them difficult to read.

Roughly speaking, here's what we're doing in the forward pass of this (and all autograd.Function for per-sample-grads):

  • run f(*unexpanded_args, **unexpanded_kwargs) (or some hacked version of f, if we need intermediate values)
  • save values for backward: save the required unexpanded arguments and intermediates
  • return the output (as opposed to all intermediate values)

Now, the reason why this code is confusing is that it's not clear what is being saved. We are saving all the args and kwargs, but we're also saving "aux outputs", which turn out to be nothing in this case.

It would make sense for this to be generic if we planned to refactor all of the autograd.Function forward passes to look the same. Is that a good idea?

For F.linear: there is actually no need to save bias for the backward pass, and if input does not require gradient, then there is no need to save the input! (This doesn't need to happen in this PR, but those are potential optimizations.) So we might not want all the autograd.Function forward passes to look similar (especially because there are already checks specific to F.linear here).

If we decide that we want this to look generic, I'd probably recommend aux_outputs be renamed to intermediates.

If we decide we don't want this to look generic, it could read better as:

output, = forward_helper(F.linear, expanded_args, expanded_kwargs)
ctx.unexpanded_weight = expanded_args[1]
return output

For e.g. group_norm this could look like:

output, mean, rstd = forward_helper(torch.ops.aten.native_group_norm, expanded_args, expanded_kwargs)
ctx.mean = mean
ctx.rstd = rstd
return output

The benefit of the non-generic form is that one doesn't have to go deep diving into the backward() part of the autograd.Function to see exactly what args, kwargs, aux_outputs are.
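
(To make the non-generic style concrete, here is a self-contained sketch for F.linear with 2D inputs. The class name is hypothetical, and it skips the kwarg_names threading and ExpandedWeight unwrapping that the PR's real rule needs, so it is illustrative only.)

```python
import torch
import torch.nn.functional as F

class LinearPerSampleGradSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, bias):
        # non-generic: save exactly what this op's backward needs
        ctx.save_for_backward(input)
        ctx.weight = weight  # keep the original Python object for .grad_sample
        return F.linear(input, weight, bias)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        weight = ctx.weight
        grad_input = grad_output.matmul(weight) if ctx.needs_input_grad[0] else None
        # per-sample weight gradient: a batched outer product, stored
        # out-of-band on .grad_sample rather than returned to autograd
        weight.grad_sample = torch.einsum("ni,nj->nij", grad_output, input)
        return grad_input, None, None

x = torch.randn(8, 3, requires_grad=True)
w = torch.randn(4, 3)
LinearPerSampleGradSketch.apply(x, w, None).sum().backward()
print(w.grad_sample.shape)  # torch.Size([8, 4, 3])
```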

samdow added 2 commits February 11, 2022 23:00
@samdow
Contributor Author

samdow commented Feb 15, 2022

@zou3519 All the comments make sense! I moved the output unpacking to be function-specific and added the comments. Per offline discussion, _make_wrapper_subclass didn't work without a __torch_dispatch__ function.
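
(For context, a minimal sketch of the _make_wrapper_subclass pattern referenced here, assuming current PyTorch APIs: the wrapper allocates no storage of its own, so every op must be re-dispatched onto the wrapped tensor from __torch_dispatch__, which is why the approach can't work without one.)

```python
import torch

class Wrapper(torch.Tensor):
    @staticmethod
    def __new__(cls, elem):
        # metadata-only tensor: no storage of its own
        r = torch.Tensor._make_wrapper_subclass(
            cls, elem.size(), dtype=elem.dtype, device=elem.device,
            requires_grad=elem.requires_grad)
        r.elem = elem
        return r

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        unwrap = lambda t: t.elem if isinstance(t, Wrapper) else t
        # re-run the op on the wrapped tensors (flat args only, for brevity)
        return func(*(unwrap(a) for a in args),
                    **{k: unwrap(v) for k, v in kwargs.items()})

w = Wrapper(torch.randn(3))
print(torch.add(w, 1))  # dispatched onto the wrapped tensor
```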

Contributor

@zou3519 zou3519 left a comment

Cool!

We should probably file a follow-up issue to see if we actually duplicate the memory when using _make_subclass.
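
(A quick hedged check for that follow-up: if _make_subclass aliases rather than copies, the storage pointers should match.)

```python
import torch

class Sub(torch.Tensor):
    pass

w = torch.randn(4, 3)
sub = torch.Tensor._make_subclass(Sub, w)
print(w.data_ptr() == sub.data_ptr())  # True would mean no memory duplication
```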

@samdow
Contributor Author

samdow commented Feb 18, 2022

@samdow has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

samdow added 2 commits February 22, 2022 17:07
@samdow
Contributor Author

samdow commented Feb 22, 2022

@samdow has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Feb 22, 2022
Summary:
Pull Request resolved: #70140

[Design Doc for Expanded Weights](https://gist.github.com/samdow/fa0a164fec7963f93ff45284989cfc55) <-- gives an overview of the design for Expanded Weights

Introduces the ExpandedWeights mechanism and user-facing API without any custom implemented, faster rules.
 - User facing API is in `_stateless.py` (with documentation)
 - Testing is in test_expanded_weights
 - The rest is the implementation of the erroring fallback + the mechanism for being able to register faster per sample grad rules. Only linear is implemented here, but they are all implemented in #70141

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D34350950

Pulled By: samdow

fbshipit-source-id: 69c664b0bc3dff6951358d79d7e5d94882f7aef2
@github-actions
Contributor

Hey @samdow.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@samdow
Contributor Author

samdow commented Feb 28, 2022

Added 'not user facing' because we don't add release notes for prototype features. Proper tags will be added when this becomes beta.

cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 3, 2022
Summary:
Pull Request resolved: pytorch/pytorch#70140

[Design Doc for Expanded Weights](https://gist.github.com/samdow/fa0a164fec7963f93ff45284989cfc55) <-- gives an overview of the design for Expanded Weights

Introduces the ExpandedWeights mechanism and user-facing API without any custom implemented, faster rules.
 - User facing API is in `_stateless.py` (with documentation)
 - Testing is in test_expanded_weights
 - The rest is the implementation of the erroring fallback + the mechanism for being able to register faster per sample grad rules. Only linear is implemented here, but they are all implemented in #70141

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D34350950

Pulled By: samdow

fbshipit-source-id: 69c664b0bc3dff6951358d79d7e5d94882f7aef2
(cherry picked from commit ae1620d3b6507b27c3bc08ecfb2b1418aa8ce7d7)