Don't support legacy Python #2
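The usual fix for an issue like this is an import-time version guard that fails fast on legacy interpreters. A minimal sketch, assuming a hypothetical minimum of Python 3.8 (not necessarily the check PyTorch adopted):

```python
import sys

def require_python(minimum=(3, 8)):
    """Raise immediately on a legacy interpreter instead of failing
    later with obscure syntax or unicode errors."""
    if sys.version_info < minimum:
        raise RuntimeError(
            "Python >= %s required; found %s"
            % (".".join(map(str, minimum)), sys.version.split()[0])
        )

require_python()  # no-op on any supported interpreter
```

A guard like this normally lives at the top of `setup.py` or the package's `__init__.py` so the failure message is clear.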
Comments
colesbury referenced this issue in colesbury/pytorch on Apr 28, 2017
apaszke pushed a commit that referenced this issue on Apr 28, 2017
apaszke pushed a commit that referenced this issue on May 1, 2017
Jiaming-Liu pushed a commit to Jiaming-Liu/pytorch that referenced this issue on May 18, 2017
tfriedel pushed a commit to tfriedel/pytorch that referenced this issue on Aug 9, 2017
soumith pushed a commit that referenced this issue on Oct 5, 2017
Closed
williamwen42 added a commit that referenced this issue on Feb 5, 2024
…ard function is invalidated [attempt 2]"

Attempt #2 for #117875 to fix #112090. Summary of changes:

- ~~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 aakhundov kadeng

[ghstack-poisoned]
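The weakref change above (storing `nn.GraphModule` values in `code_context` as weak references) can be illustrated outside PyTorch. A minimal sketch with hypothetical stand-in classes, not Dynamo's actual types, showing that a weak back-reference lets both objects be collected:

```python
import gc
import weakref

class Code:
    """Stand-in for per-code state whose code_context refers to a module."""
    def __init__(self, module):
        # A strong reference here would close a cycle (module -> code -> module);
        # a weakref leaves the module collectible.
        self.module_ref = weakref.ref(module)

class Module:
    """Stand-in for an nn.GraphModule that points back at its code state."""
    def __init__(self):
        self.code = Code(self)

m = Module()
finalized = []
weakref.finalize(m, finalized.append, True)
del m
gc.collect()
assert finalized == [True]  # both objects were freed despite the back-pointer
```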
williamwen42 added a commit that referenced this issue on Feb 5, 2024
williamwen42 added a commit that referenced this issue on Feb 6, 2024
williamwen42 added a commit that referenced this issue on Feb 6, 2024
jorgep31415 added a commit that referenced this issue on Feb 6, 2024
Pull Request resolved: #118835

We borrow MatMul's work to do the re-packing: https://www.internalfb.com/code/fbsource/[7e8ef1b8adeda224a736f8cc4bf870e0a659df95]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/Mm.cpp?lines=20%2C50

# GLSL Change #1 - Reduce calls to `texelFetch(uKernel, ...)` by a factor of 4

In V2, this was the only change. We created an inner for-loop (which executes up to 4 times) and moved this call out of it:

```glsl
for (int k = k_start; k < k_end;) {
  const ivec3 w_pos = ivec3(k / 4, in_c % in_group_size, out_c);
  const vec4 weight = texelFetch(uKernel, w_pos, 0);
  for (int k_off = k % 4; k_off < 4 && k < k_end; ++k, ++k_off) {
    int in_pos_x = in_l + k * dilation;
    const ivec3 in_pos = ivec3(in_pos_x, in_c, n / 4);
    const vec4 input_value = texelFetch(uInput, in_pos, 0);
    v += weight[k_off] * input_value;
  }
}
```

On its own, however, this actually results in worse performance because of the complex for-loop conditions, especially `int k_off = k % 4`, which the compiler cannot unroll.

# GLSL Change #2 - Unroll loops around `texelFetch(uInput, ...)`

The `k_start` and `k_end` bounds "smartly" avoid computations that would contribute a sum of zero. However, these theoretical gains lead to physical branching that cannot be optimized away.

## W/o diff (690ms)

```
Kernel Name           Workgroup Size  Duration (ns)
===========           ==============  =============
vulkan.nchw_to_image  {30, 20, 2}     35984
vulkan.nchw_to_image  {32, 4, 3}      11128
vulkan.nchw_to_image  {10, 1, 1}      6292
vulkan.conv1d         {1, 10, 1}      669084
vulkan.image_to_nchw  {2, 10, 2}      7748
vulkan.nchw_to_image  {30, 20, 2}     31044
vulkan.nchw_to_image  {32, 4, 3}      10868
vulkan.nchw_to_image  {10, 1, 1}      6136
vulkan.conv1d         {1, 10, 1}      671216
vulkan.image_to_nchw  {2, 10, 2}      8164
vulkan.nchw_to_image  {30, 20, 2}     31148
vulkan.nchw_to_image  {32, 4, 3}      10920
vulkan.nchw_to_image  {10, 1, 1}      6084
vulkan.conv1d         {1, 10, 1}      674232
vulkan.image_to_nchw  {2, 10, 2}      8008
vulkan.nchw_to_image  {30, 20, 2}     31096
vulkan.nchw_to_image  {32, 4, 3}      11024
vulkan.nchw_to_image  {10, 1, 1}      6500
vulkan.conv1d         {1, 10, 1}      671736
vulkan.image_to_nchw  {2, 10, 2}      8164
vulkan.nchw_to_image  {30, 20, 2}     31824
vulkan.nchw_to_image  {32, 4, 3}      11284
vulkan.nchw_to_image  {10, 1, 1}      6604
vulkan.conv1d         {1, 10, 1}      691340
vulkan.image_to_nchw  {2, 10, 2}      7644

-------------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-------------------------------------------------------------------------------------------------
conv1d_op_benchmark/iterations:5/manual_time/threads:1     0.676 ms         35.0 ms            5
```

## W/ diff (330ms)

```
Kernel Name                              Workgroup Size  Duration (ns)
===========                              ==============  =============
vulkan.nchw_to_image                     {30, 20, 2}     35828
vulkan.nchw_to_image                     {32, 4, 3}      11024
vulkan.nchw_to_image                     {10, 1, 1}      6344
vulkan.convert_channels_to_width_packed  {8, 4, 10}      13208
vulkan.conv1d                            {1, 10, 1}      326664
vulkan.image_to_nchw                     {2, 10, 2}      8164
vulkan.nchw_to_image                     {30, 20, 2}     30940
vulkan.nchw_to_image                     {32, 4, 3}      10972
vulkan.nchw_to_image                     {10, 1, 1}      6188
vulkan.convert_channels_to_width_packed  {8, 4, 10}      12844
vulkan.conv1d                            {1, 10, 1}      326872
vulkan.image_to_nchw                     {2, 10, 2}      8112
vulkan.nchw_to_image                     {30, 20, 2}     31304
vulkan.nchw_to_image                     {32, 4, 3}      10972
vulkan.nchw_to_image                     {10, 1, 1}      6240
vulkan.convert_channels_to_width_packed  {8, 4, 10}      12584
vulkan.conv1d                            {1, 10, 1}      323492
vulkan.image_to_nchw                     {2, 10, 2}      7488
vulkan.nchw_to_image                     {30, 20, 2}     31772
vulkan.nchw_to_image                     {32, 4, 3}      10868
vulkan.nchw_to_image                     {10, 1, 1}      6396
vulkan.convert_channels_to_width_packed  {8, 4, 10}      13312
vulkan.conv1d                            {1, 10, 1}      332956
vulkan.image_to_nchw                     {2, 10, 2}      8216
vulkan.nchw_to_image                     {30, 20, 2}     31772
vulkan.nchw_to_image                     {32, 4, 3}      11024
vulkan.nchw_to_image                     {10, 1, 1}      6292
vulkan.convert_channels_to_width_packed  {8, 4, 10}      13104
vulkan.conv1d                            {1, 10, 1}      330408
vulkan.image_to_nchw                     {2, 10, 2}      7592

-------------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-------------------------------------------------------------------------------------------------
conv1d_op_benchmark/iterations:5/manual_time/threads:1     0.341 ms         41.0 ms            5
```

ghstack-source-id: 214201402
@exported-using-ghexport

Differential Revision: [D53204674](https://our.internmc.facebook.com/intern/diff/D53204674/)
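The factor-of-4 reduction in `texelFetch(uKernel, ...)` calls works because weights are packed four per texel, so the texel index is `k / 4` and the component is `k % 4`. A small Python sketch of that addressing, illustrative only and not the shader itself:

```python
# Weights "packed" four per texel: texel i holds flat indices 4*i .. 4*i+3.
flat_weights = list(range(23))  # arbitrary length, deliberately not a multiple of 4
texels = [flat_weights[i:i + 4] for i in range(0, len(flat_weights), 4)]

recovered = []
k = 0
k_end = len(flat_weights)
while k < k_end:
    texel = texels[k // 4]            # one fetch covers up to four weights
    k_off = k % 4
    while k_off < 4 and k < k_end:    # inner loop reuses the fetched texel
        recovered.append(texel[k_off])
        k += 1
        k_off += 1

assert recovered == flat_weights  # every weight read once, one fetch per group of 4
```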
williamwen42 added a commit that referenced this issue on Feb 6, 2024
williamwen42 added a commit that referenced this issue on Feb 6, 2024
williamwen42 added a commit that referenced this issue on Feb 6, 2024
williamwen42 added a commit that referenced this issue on Feb 6, 2024
williamwen42 added a commit that referenced this issue on Feb 7, 2024
williamwen42 added a commit that referenced this issue on Feb 7, 2024
pytorchmergebot pushed a commit that referenced this issue on Feb 7, 2024
… [attempt 2] (#119107)

Attempt #2 for #117875 to fix #112090. Summary of changes:

- ~~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.

Pull Request resolved: #119107
Approved by: https://github.com/jansel
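The memory-leak tests mentioned above follow a standard pattern: keep only a weak reference to the object that should die, drop the strong references, collect, and assert the weakref is cleared. A generic sketch (hypothetical names, not the actual test code from this PR):

```python
import gc
import weakref

class CacheEntryLike:
    """Hypothetical stand-in for an object that must not leak."""

def assert_collected(make_obj):
    """Fail if the object produced by make_obj survives garbage collection."""
    obj = make_obj()
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    assert ref() is None, "object leaked: something still holds a strong reference"

assert_collected(CacheEntryLike)  # passes: nothing retains the instance
```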
jorgep31415 added a commit that referenced this issue on Feb 7, 2024
jorgep31415 added a commit that referenced this issue on Feb 7, 2024
pytorch-bot bot pushed a commit that referenced this issue on Feb 8, 2024
In a big code base, a user may not know which line of code called collectives. When debugging, we can print a Python/C++ stacktrace in case the user calls ``ProcessGroup.reduce`` instead of ``torch.distributed.reduce``:

```
LOG(INFO) << "ProcessGroupNCCL::_allgather_base stacktrace: " << get_python_cpp_trace();
```

Output (using `_allgather_base` as an example); one example Python-side frame is ``all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838``:

```
ProcessGroupNCCL::_allgather_base stacktrace:
#0 torch::unwind::unwind() from ??:0
#1 torch::CapturedTraceback::gather(bool, bool, bool) from ??:0
#2 c10d::get_python_cpp_trace[abi:cxx11]() from :0
#3 c10d::ProcessGroupNCCL::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from ??:0
#4 c10d::ops::(anonymous namespace)::_allgather_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long) from Ops.cpp:0
#5 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#6 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#7 c10d::ProcessGroup::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from :0
#8 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0
#9 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#10 cfunction_call from /usr/local/src/conda/python-3.10.12/Objects/methodobject.c:543
#11 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#12 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#13 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#14 all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838
#15 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#16 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#17 wrapper from /data/users/weif/pytorch/torch/distributed/c10d_logger.py:75
#18 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#19 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#20 _all_gather_flat_param from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1399
#21 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#23 unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1308
#24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#25 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#26 _unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:332
#27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#29 _pre_forward_unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:448
#30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#31 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#32 _pre_forward from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:413
#33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#35 forward from /data/users/weif/pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py:839
#36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#37 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#38 _call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1520
#39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#40 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#41 _wrapped_call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1511
#42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#43 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.12/Objects/call.c:431
#44 slot_tp_call from /usr/local/src/conda/python-3.10.12/Objects/typeobject.c:7494
#45 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#47 inner from /data/users/weif/pytorch/run_fsdp.py:72
#48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#50 run from /data/users/weif/pytorch/run_fsdp.py:76
#51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#53 main from /data/users/weif/pytorch/run_fsdp.py:133
#54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#56 <module> from /data/users/weif/pytorch/run_fsdp.py:137
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#58 PyEval_EvalCode from /usr/local/src/conda/python-3.10.12/Python/ceval.c:1134
#59 run_eval_code_obj from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1291
#60 run_mod from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1312
#61 pyrun_file from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1208
#62 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:456
#63 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:90
#64 pymain_run_file_obj from /usr/local/src/conda/python-3.10.12/Modules/main.c:357
#65 Py_BytesMain from /usr/local/src/conda/python-3.10.12/Modules/main.c:1090
#66 __libc_start_call_main from ??:0
#67 <unwind unsupported> from ??:0
```

Pull Request resolved: #118924
Approved by: https://github.com/kwen2501
pytorch-bot bot
pushed a commit
that referenced
this issue
Feb 8, 2024
… [attempt 2] (#119107)

Attempt #2 for #117875 to fix #112090.

Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.

Pull Request resolved: #119107
Approved by: https://github.com/jansel
vfdev-5
pushed a commit
to vfdev-5/pytorch
that referenced
this issue
Feb 9, 2024
… [attempt 2] (pytorch#119107)

Attempt pytorch#2 for pytorch#117875 to fix pytorch#112090.

Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.

Pull Request resolved: pytorch#119107
Approved by: https://github.com/jansel
clee2000
pushed a commit
that referenced
this issue
Feb 14, 2024
… [attempt 2] (#119107)

Attempt #2 for #117875 to fix #112090.

Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.

Pull Request resolved: #119107
Approved by: https://github.com/jansel
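The "tests that check for memory leaks" mentioned in the last bullet typically reduce to a standard weakref pattern; here is a minimal, generic sketch of that pattern (not the actual PyTorch test suite):

```python
import gc
import weakref


def assert_collected(make_obj):
    """Build an object, drop the last strong reference, and verify
    the garbage collector can actually reclaim it."""
    obj = make_obj()
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    assert ref() is None, "object leaked: something still holds a reference"


class Node:
    def __init__(self):
        self.self_ref = self  # deliberate cycle; gc can still break it


assert_collected(Node)
print("no leak detected")
```

A test written this way fails loudly when a new strong reference (such as the GuardFn back-references described above) accidentally keeps the object alive.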
atalman
added a commit
to atalman/pytorch
that referenced
this issue
Mar 13, 2024
atalman
added a commit
that referenced
this issue
Mar 13, 2024
* [Release only changes] Release only changes #2
* common+lint
guangy10
added a commit
that referenced
this issue
Mar 26, 2024
* [Release only changes] Release only changes #2
* common+lint

[ghstack-poisoned]
chsivic
pushed a commit
to chsivic/pytorch
that referenced
this issue
Apr 16, 2024
Summary: The caffe2/utils threadpool impl used to set the thread name, since D8266344 https://www.internalfb.com/code/fbsource/[3ba3d30d6841]/xplat/caffe2/caffe2/utils/threadpool/WorkersPool.h?lines=271-273. But now we no longer use caffe2's own impl (since D21232894?) and use the third-party threadpool instead, which doesn't set the thread name.

This diff achieves the same effect as D8266344, so that we can tell which threads are PyTorch threads from a perfetto trace. The idea comes from https://stackoverflow.com/questions/32375034/how-to-obtain-thread-name-in-android-ndk and folly ThreadName https://www.internalfb.com/code/fbsource/[3ba3d30d6841]/xplat/folly/system/ThreadName.cpp?lines=30-41. I'm not sure if this is the right place to put this change. BTW, the PyTorch thread pool caller thread is worker #0 https://www.internalfb.com/code/fbsource/[3ba3d30d6841281c140db1c8bd2f85ede310a01b]/xplat/third-party/pthreadpool/pthreadpool/src/pthreads.c?lines=289-292

Test Plan:

## Before

```
--num_cpu_threads 2 --num_pytorch_threads -1  # defaults to size equal to 4 cpu cores
mos:/ $ ps -T -p `pidof transcribe_bin`
USER  PID  TID  PPID VSZ    RSS   WCHAN      ADDR S CMD
shell 8985 8985 8983 118576 47688 hrtimer_n+ 0    S transcribe_bin   <-- main thread
shell 8985 8986 8983 118576 47688 0          0    R transcribe_bin   <-- pytorch thread #1
shell 8985 8987 8983 118576 47688 0          0    R transcribe_bin   <-- pytorch thread #2
shell 8985 8988 8983 118576 47688 0          0    R transcribe_bin   <-- pytorch thread #3
shell 8985 8989 8983 118576 47688 0          0    R CPUThreadPool0
shell 8985 8990 8983 118576 47688 futex_wai+ 0    S CPUThreadPool1
shell 8985 8991 8983 118576 47688 ep_poll    0    S IOThreadPool0
shell 8985 8992 8983 118576 47688 futex_wai+ 0    S FutureTimekeepr
shell 8985 8993 8983 118576 47688 pipe_wait  0    S snapshot_thread
shell 8985 8994 8983 118576 47688 hrtimer_n+ 0    S snapshot_thread
shell 8985 8997 8983 118576 47688 futex_wai+ 0    S AsyncDataQueue
```

## After

```
--num_cpu_threads 2 --num_pytorch_threads -1
mos:/ $ ps -T -p `pidof transcribe_bin`
USER  PID   TID   PPID  VSZ    RSS   WCHAN      ADDR S CMD
shell 11901 11901 11899 118128 40748 futex_wai+ 0    S transcribe_bin   <-- main thread serves as pytorch thread #0
shell 11901 11902 11899 118132 40748 futex_wai+ 0    S c10pthreadpool   <-- pytorch thread #1
shell 11901 11903 11899 118132 40748 futex_wai+ 0    S c10pthreadpool   <-- pytorch thread #2
shell 11901 11904 11899 118132 40748 futex_wai+ 0    S c10pthreadpool   <-- pytorch thread #3
shell 11901 11905 11899 118152 40752 futex_wai+ 0    S CPUThreadPool0
shell 11901 11906 11899 118148 40752 0          0    R CPUThreadPool1
shell 11901 11907 11899 118148 40756 ep_poll    0    S IOThreadPool0
shell 11901 11908 11899 118152 40756 futex_wai+ 0    S FutureTimekeepr
shell 11901 11909 11899 118164 40756 pipe_wait  0    S snapshot_thread
shell 11901 11910 11899 118168 40756 hrtimer_n+ 0    S snapshot_thread
shell 11901 11913 11899 118160 40760 futex_wai+ 0    S AsyncDataQueue
```

Example Perfetto trace: {F1483727859}

Looks like the pytorch thread pool was originally created with 4 threads during ASR loading (`loadTunaFactory`), and later recreated with 3 threads during inference.

Differential Revision: D55990584

Pulled By: chsivic
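The mechanism behind this diff, giving a worker thread a name that `ps -T` and perfetto can display, can be sketched from Python on Linux with a raw `prctl(PR_SET_NAME, ...)` call. This is only an illustration of the OS facility the commit relies on, not the actual C/C++ threadpool change; `PR_SET_NAME` is 15 per `<sys/prctl.h>`, and the kernel truncates names to 15 characters.

```python
import ctypes
import threading

PR_SET_NAME = 15  # from <sys/prctl.h>; Linux-only
libc = ctypes.CDLL("libc.so.6", use_errno=True)


def set_thread_name(name: str) -> None:
    # prctl(PR_SET_NAME) renames the *calling* thread; the kernel keeps
    # at most 15 bytes plus a NUL terminator.
    libc.prctl(PR_SET_NAME, name.encode()[:15], 0, 0, 0)


def read_thread_name() -> str:
    # /proc/self/task/<tid>/comm reflects what `ps -T` prints in CMD.
    tid = threading.get_native_id()
    with open(f"/proc/self/task/{tid}/comm") as f:
        return f.read().strip()


def worker():
    set_thread_name("c10pthreadpool")
    print(read_thread_name())


t = threading.Thread(target=worker)
t.start()
t.join()
```

Because `prctl` only affects the calling thread, a pool has each worker name itself at startup, which is exactly why the "After" listing above shows one `c10pthreadpool` entry per pool thread.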
There is really no reason to support Python 2. Python 3 has been out for 8 years now. There are plenty of good articles written about this. Maintaining a dual codebase is going to be a major pain, and it prevents you from using a whole bunch of new Python 3 features (`six` only gets you so far).
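To make the parenthetical concrete: `six` can shim renamed modules and metaclasses, but it cannot backport Python 3 *syntax*. Two examples of features a Python 2 compatible codebase has to give up (the function below is invented for illustration):

```python
# Keyword-only arguments: callers must spell out `strict=...`,
# preventing positional mix-ups. Python 2 has no syntax for this.
def load(path, *, strict=True):
    return (path, strict)

# Extended iterable unpacking (PEP 3132): also Python 3 only.
first, *rest = [1, 2, 3, 4]

print(load("model.pt", strict=False))  # ('model.pt', False)
print(first, rest)                     # 1 [2, 3, 4]
```

Both snippets are outright `SyntaxError`s under Python 2, so no compatibility library can paper over them.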