
expanded weights without fast rules #70140


Closed · wants to merge 26 commits

Conversation

samdow
Contributor

@samdow samdow commented Dec 17, 2021

Stack from ghstack:

[Design Doc for Expanded Weights](https://gist.github.com/samdow/fa0a164fec7963f93ff45284989cfc55) <-- gives an overview of the design for Expanded Weights

Introduces the ExpandedWeights mechanism and user-facing API without any custom-implemented faster rules.

  • User-facing API is in `_stateless.py` (with documentation)
  • Testing is in test_expanded_weights
  • The rest is the implementation of the erroring fallback + the mechanism for registering faster per-sample grad rules. Only linear is implemented here; the rest are implemented in expanded weights: instance norm faster rules (#70141)

Differential Revision: D34350950
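(A minimal usage sketch of the intended flow, hedged: the import path is hypothetical since the PR only says the API lives in `_stateless.py`, and the `call_with_per_sample_grads` name and `.grad_sample` attribute are taken from the review discussion further down, so the exact signature may differ.)

```python
import torch
import torch.nn as nn
# hypothetical import path; the PR only says the API lives in _stateless.py
from torch.nn.utils._stateless import call_with_per_sample_grads

model = nn.Linear(10, 2)
batch_size = 64
x = torch.randn(batch_size, 10)

# run the module with its parameters treated as ExpandedWeights
out = call_with_per_sample_grads(model, batch_size, x)
out.sum().backward()

# after backward, each parameter carries one gradient per batch element
print(model.weight.grad_sample.shape)  # expected: torch.Size([64, 2, 10])
```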

@pytorch-probot

pytorch-probot bot commented Dec 17, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/93f25fc1fe663d2a1ab8b3f06a7e19ca05144d10/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows | Labels (bold = enabled) | Status
Triggered Workflows
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Contributor

facebook-github-bot commented Dec 17, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit f6fa517 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build win-vs2019-cpu-py3 / build (1/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2022-02-22T18:34:47.4037740Z FAILED: bin/torch_cpu.dll lib/torch_cpu.lib
2022-02-22T18:31:39.5538266Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-gemm\gen\4x16c8-minmax-gemmlowp-avx512skx.c.obj 
2022-02-22T18:31:39.5539242Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\1x16c8-minmax-fp32-avx512skx.c.obj 
2022-02-22T18:31:39.5540284Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\1x16c8-minmax-gemmlowp-avx512skx.c.obj 
2022-02-22T18:31:39.5541259Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\2x16c8-minmax-fp32-avx512skx.c.obj 
2022-02-22T18:31:39.5542247Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\2x16c8-minmax-gemmlowp-avx512skx.c.obj 
2022-02-22T18:31:39.5543236Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\3x16c8-minmax-fp32-avx512skx.c.obj 
2022-02-22T18:31:39.5544205Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\3x16c8-minmax-gemmlowp-avx512skx.c.obj 
2022-02-22T18:31:39.5545186Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\4x16c8-minmax-fp32-avx512skx.c.obj 
2022-02-22T18:31:39.5546161Z confu-deps\XNNPACK\CMakeFiles\XNNPACK.dir\src\qs8-igemm\gen\4x16c8-minmax-gemmlowp-avx512skx.c.obj 
2022-02-22T18:34:47.4045578Z LINK: command "C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe @CMakeFiles\torch_cpu.rsp /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:vcomp -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST /MANIFESTFILE:bin\torch_cpu.dll.manifest" failed (exit code 1120) with the following output:
2022-02-22T18:34:47.4061357Z Microsoft (R) Incremental Linker Version 14.28.29337.0
2022-02-22T18:34:47.4062172Z Copyright (C) Microsoft Corporation.  All rights reserved.
2022-02-22T18:34:47.4062669Z 
2022-02-22T18:34:47.4063414Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\AccumulateType.cpp.obj 
2022-02-22T18:34:47.4064544Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchedFallback.cpp.obj 
2022-02-22T18:34:47.4065692Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchedTensorImpl.cpp.obj 
2022-02-22T18:34:47.4066926Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchingRegistrations.cpp.obj 
2022-02-22T18:34:47.4068138Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\CPUGeneratorImpl.cpp.obj 

See GitHub Actions build win-vs2019-cuda11.3-py3 / build (2/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2022-02-22T18:36:25.2274845Z FAILED: bin/torch_cpu.dll lib/torch_cpu.lib
2022-02-22T18:36:06.7404828Z MultiLabelMarginCriterion.cu
2022-02-22T18:36:09.6705814Z LinearAlgebra.cu
2022-02-22T18:36:13.9445706Z MultiMarginLoss.cu
2022-02-22T18:36:14.1773359Z Indexing.cu
2022-02-22T18:36:19.1071492Z MultinomialKernel.cu
2022-02-22T18:36:25.2284218Z LINK: command "C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe @CMakeFiles\torch_cpu.rsp /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:vcomp -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST /MANIFESTFILE:bin\torch_cpu.dll.manifest" failed (exit code 1120) with the following output:
2022-02-22T18:36:25.2287197Z Microsoft (R) Incremental Linker Version 14.28.29337.0
2022-02-22T18:36:25.2288132Z Copyright (C) Microsoft Corporation.  All rights reserved.
2022-02-22T18:36:25.2288687Z 
2022-02-22T18:36:25.2289553Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\AccumulateType.cpp.obj 
2022-02-22T18:36:25.2290894Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchedFallback.cpp.obj 
2022-02-22T18:36:25.2292175Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchedTensorImpl.cpp.obj 
2022-02-22T18:36:25.2293676Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\BatchingRegistrations.cpp.obj 
2022-02-22T18:36:25.2295216Z caffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\CPUGeneratorImpl.cpp.obj 

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@samdow samdow requested a review from zou3519 December 21, 2021 21:12
Contributor

@zou3519 zou3519 left a comment

Need to run but here are some initial comments

Comment on lines 58 to 60
@property
def shape(self):
    return self.orig_weight.shape
Contributor

What happens if you call expanded_weight.size()? does that return the correct thing?

Contributor Author

Added size() function
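(A hedged sketch of one possible shape for that override, assuming the wrapper forwards metadata to orig_weight the same way the shape property above does; the actual implementation is in the PR diff.)

```python
import torch

class ExpandedWeightSketch(torch.Tensor):
    # construction details (__new__/_make_subclass) elided; assumes
    # self.orig_weight holds the wrapped parameter, as in the PR
    @property
    def shape(self):
        return self.orig_weight.shape

    def size(self, dim=None):
        # mirror torch.Tensor.size(): no arg returns torch.Size, int arg returns int
        return self.orig_weight.size() if dim is None else self.orig_weight.size(dim)
```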

Contributor

@zou3519 zou3519 Dec 22, 2021

Follow-up question (maybe more for @albanD): is it user error to pass a __torch_function__ tensor subclass into some C++ code that requires a Tensor? For the PyTorch frontend API the answer is probably no, because it should enter __torch_function__, but if users have things like custom C++ operators that they've pybind'ed into Python, do we consider this to be a user error?

Collaborator

It is not an error, and the code will use the underlying Tensor as-is. This is how nn.Parameter works today.
But it is a subtlety the user needs to be aware of: as soon as you pass the frontend API binding or enter any other C++ function, only the C++ Tensor associated with your subclass will exist.
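(A minimal sketch of the subtlety described here, with illustrative names not from the PR: calls through the Python frontend hit __torch_function__, while code that crosses into C++ directly only ever sees the underlying Tensor.)

```python
import torch

class Logged(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print("frontend dispatch:", getattr(func, "__name__", func))
        return super().__torch_function__(func, types, args, kwargs)

t = torch.Tensor._make_subclass(Logged, torch.randn(3))
out = torch.add(t, 1)  # routed through __torch_function__
# a custom C++ operator pybind'ed to take at::Tensor would instead receive
# the underlying Tensor directly, with no __torch_function__ hook
```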


# dependency on `functional_call` means that this can't be exposed in utils
# without creating a circular dependency
def per_sample_call(module, batch_size, args, kwargs=None):
Contributor

nit: we might want to bikeshed this name some more. I would be really confused if I saw per_sample_call(model) in someone's code if I didn't already know about per-sample-gradients

Contributor Author

Yeah, I definitely see that. What about call_with_per_sample_gradients?

Contributor

@zou3519 zou3519 left a comment

The code in this PR looks good to me, but I have a suggestion around organization and testing. This PR introduces:

  1. The ExpandedWeights Object
  2. The per_sample_grad API
  3. A lot of helper functions

Only (1) is being tested in test_expanded_weights.py. (2) and (3) are probably tested in the next PR in the stack (I haven't looked yet), but in general each PR in a stack should be able to stand on its own. Maybe we should include the per-sample-grad rule for a simple layer (like linear) in this PR so we can test the per_sample_call API as well as the helper functions and demonstrate how everything works end-to-end.

Regarding the ExpandedWeights Object: is there a list of common Tensor attributes to override somewhere? I think we are missing stride(), is_contiguous(), memory_format(), but there might be more. It might be good to find some prominent user of __torch_function__ (I don't know of any) and see what attributes they override. (One possible starting point is sketched below.)
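(On the "is there a list" question: torch.overrides exposes the full set of overridable APIs, which could serve as that checklist. A small hedged sketch; get_testing_overrides is a real torch API, and the printed expectations are assumptions.)

```python
import torch
from torch.overrides import get_testing_overrides

# get_testing_overrides() maps every overridable torch API, including
# torch.Tensor methods, to a dummy lambda with a matching signature;
# its key set is a reasonable checklist for a wrapper like ExpandedWeight
overridable = get_testing_overrides()
print(len(overridable), "overridable APIs")
print(torch.Tensor.stride in overridable)  # Tensor methods appear as keys
```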

@samdow samdow requested a review from zou3519 February 1, 2022 17:03
samdow added 5 commits February 3, 2022 23:30
@samdow
Contributor Author

samdow commented Feb 7, 2022

@zou3519 The remaining test failures look unrelated to the PR. I can try to look at CI and rebase if it's green.

Added in this update:

  • Functions now take in the flattened args plus the keys for all kwargs, so we can reconstruct the kwargs in the backend
  • Linear's __torch_function__ was moved to the generated layer, which means that, like conv, we have to add back in any default kwargs that weren't passed. This was also moved to the function level instead of living in expanded_weights_impl
  • Multiple per-sample gradient computations with the same underlying weight accumulate, since this is what happens in RNNs (see the sketch below). There's also an error if we try to run call_with_per_sample_grads while any of the weights still have their grad_sample field set
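
(A hedged sketch of the accumulation behavior from the last bullet; accumulate_grad_sample is a hypothetical helper name, while grad_sample is the attribute named above.)

```python
import torch

def accumulate_grad_sample(weight, per_sample_grad):
    # if the same underlying weight is used multiple times in one forward
    # (e.g. in an RNN), per-sample gradients accumulate instead of overwriting
    if getattr(weight, "grad_sample", None) is not None:
        weight.grad_sample = weight.grad_sample + per_sample_grad
    else:
        weight.grad_sample = per_sample_grad

w = torch.randn(2, 3)
accumulate_grad_sample(w, torch.ones(5, 2, 3))
accumulate_grad_sample(w, torch.ones(5, 2, 3))
print(w.grad_sample[0, 0, 0])  # tensor(2.): both uses accumulated
```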

samdow added 2 commits February 7, 2022 17:13
Contributor

@zou3519 zou3519 left a comment

TODO:

  • check to see if _make_wrapper_subclass works
  • (rzou): aux_output, num_true_outputs is a bit weird, check next stack up

results = []

results.append(grad_if_exists_for_input(
    input, lambda: grad_output.matmul(unpack_expanded_weight_or_tensor(weight))))
results.extend([None] * 3)  # weight and bias don't compute batched gradients
Contributor

I'm confused: why 3? This means we're returning a total of 4 values from backward, right?

Contributor

input, weight, bias, kwarg_names are the inputs

Contributor Author

The update hopefully made this easier to understand: kwarg_names is now at the front, so its None gets added at the start instead of here.

class ExpandedWeight(torch.Tensor):
    def __init__(self, orig_weight, batch_size):
        self.batch_size = batch_size
        self.orig_weight = orig_weight
Contributor

The ExpandedWeight object has orig_weight's data, and then we're assigning orig_weight to it, so this effectively doubles the parameter's memory.

Comment on lines 16 to 20
expanded_args_without_kwargs = expanded_args[:2]
output, aux_outputs = forward_helper(F.linear, expanded_args_without_kwargs, expanded_kwargs, 1)
ctx.args = expanded_args_without_kwargs
ctx.kwargs = expanded_kwargs
ctx.aux_outputs = aux_outputs
Contributor

Okay, the reason these lines tripped me up is that they are generic, and that makes them difficult to read.

Roughly speaking, here's what we're doing in the forward pass of this (and all autograd.Function for per-sample-grads):

  • run f(*unexpanded_args, **unexpanded_kwargs) (or some hacked version of f, if we need intermediate values)
  • save values for backward: save the required unexpanded arguments and intermediates
  • return the output (as opposed to all intermediate values)

Now, the reason why this code is confusing is that it's not clear what is being saved. We are saving all the args and kwargs, but we're also saving "aux outputs", which turn out to be nothing in this case.

It would make sense for this to be generic if we planned to refactor all of the autograd.Function forward passes to look the same. Is that a good idea?

For F.linear: there is actually no need to save bias for the backward pass, and if input does not require gradient, then there is no need to save the input! (This doesn't need to happen in this PR, but those are potential optimizations.) So we might not want all the autograd.Function forward passes to look similar (especially because there are already checks specific to F.linear here).

If we decide that we want this to look generic, I'd probably recommend aux_outputs be renamed to intermediates.

If we decide we don't want this to look generic, it could read better as:

output, = forward_helper(F.linear, expanded_args, expanded_kwargs)
ctx.unexpanded_weight = expanded_args[1]
return output

For e.g. group_norm this could look like:

output, mean, rstd = forward_helper(torch.ops.aten.native_group_norm, expanded_args, expanded_kwargs)
ctx.mean = mean
ctx.rstd = rstd
return output

The benefit of the non-generic form is that one doesn't have to go deep diving into the backward() part of the autograd.Function to see exactly what args, kwargs, aux_outputs are.
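
(To make the non-generic style concrete, here is a self-contained sketch for F.linear with 2D inputs. The class name is hypothetical, and it skips the kwarg_names threading and ExpandedWeight unwrapping that the PR's real rule needs, so it is illustrative only.)

```python
import torch
import torch.nn.functional as F

class LinearPerSampleGradSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, bias):
        # non-generic: save exactly what this op's backward needs
        ctx.save_for_backward(input)
        ctx.weight = weight  # keep the original Python object for .grad_sample
        return F.linear(input, weight, bias)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        weight = ctx.weight
        grad_input = grad_output.matmul(weight) if ctx.needs_input_grad[0] else None
        # per-sample weight gradient: a batched outer product, stored
        # out-of-band on .grad_sample rather than returned to autograd
        weight.grad_sample = torch.einsum("ni,nj->nij", grad_output, input)
        return grad_input, None, None

x = torch.randn(8, 3, requires_grad=True)
w = torch.randn(4, 3)
LinearPerSampleGradSketch.apply(x, w, None).sum().backward()
print(w.grad_sample.shape)  # torch.Size([8, 4, 3])
```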

samdow added 2 commits February 11, 2022 23:00
@samdow
Contributor Author

samdow commented Feb 15, 2022

@zou3519 All the comments make sense! I moved the output unpacking to be function-specific and added the comments. Per offline discussion, _make_wrapper_subclass didn't work without a __torch_dispatch__ function.
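
(For context, a minimal sketch of the _make_wrapper_subclass pattern referenced here, assuming current PyTorch APIs: the wrapper allocates no storage of its own, so every op must be re-dispatched onto the wrapped tensor from __torch_dispatch__, which is why the approach can't work without one.)

```python
import torch

class Wrapper(torch.Tensor):
    @staticmethod
    def __new__(cls, elem):
        # metadata-only tensor: no storage of its own
        r = torch.Tensor._make_wrapper_subclass(
            cls, elem.size(), dtype=elem.dtype, device=elem.device,
            requires_grad=elem.requires_grad)
        r.elem = elem
        return r

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        unwrap = lambda t: t.elem if isinstance(t, Wrapper) else t
        # re-run the op on the wrapped tensors (flat args only, for brevity)
        return func(*(unwrap(a) for a in args),
                    **{k: unwrap(v) for k, v in kwargs.items()})

w = Wrapper(torch.randn(3))
print(torch.add(w, 1))  # dispatched onto the wrapped tensor
```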

Contributor

@zou3519 zou3519 left a comment

Cool!

We should probably file a follow-up issue to see if we actually duplicate the memory when using _make_subclass.
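
(A quick hedged check for that follow-up: if _make_subclass aliases rather than copies, the storage pointers should match.)

```python
import torch

class Sub(torch.Tensor):
    pass

w = torch.randn(4, 3)
sub = torch.Tensor._make_subclass(Sub, w)
print(w.data_ptr() == sub.data_ptr())  # True would mean no memory duplication
```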

@samdow
Contributor Author

samdow commented Feb 18, 2022

@samdow has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

samdow added 2 commits February 22, 2022 17:07
@samdow
Contributor Author

samdow commented Feb 22, 2022

@samdow has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Feb 22, 2022
Summary:
Pull Request resolved: #70140

[Design Doc for Expanded Weights](https://gist.github.com/samdow/fa0a164fec7963f93ff45284989cfc55) <-- gives an overview of the design for Expanded Weights

Introduces the ExpandedWeights mechanism and user-facing API without any custom implemented, faster rules.
 - User facing API is in `_stateless.py` (with documentation)
 - Testing is in test_expanded_weights
 - The rest is the implementation of the erroring fallback + the mechanism for being able to register faster per sample grad rules. Only linear is implemented here, but they are all implemented in #70141

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D34350950

Pulled By: samdow

fbshipit-source-id: 69c664b0bc3dff6951358d79d7e5d94882f7aef2
@github-actions
Contributor

Hey @samdow.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@samdow
Contributor Author

samdow commented Feb 28, 2022

Added 'not user facing' because we don't add release notes for prototype features. Proper tags will be added when this becomes beta.

cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 3, 2022
Summary:
Pull Request resolved: pytorch/pytorch#70140

[Design Doc for Expanded Weights](https://gist.github.com/samdow/fa0a164fec7963f93ff45284989cfc55) <-- gives an overview of the design for Expanded Weights

Introduces the ExpandedWeights mechanism and user-facing API without any custom implemented, faster rules.
 - User facing API is in `_stateless.py` (with documentation)
 - Testing is in test_expanded_weights
 - The rest is the implementation of the erroring fallback + the mechanism for being able to register faster per sample grad rules. Only linear is implemented here, but they are all implemented in #70141

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D34350950

Pulled By: samdow

fbshipit-source-id: 69c664b0bc3dff6951358d79d7e5d94882f7aef2
(cherry picked from commit ae1620d3b6507b27c3bc08ecfb2b1418aa8ce7d7)