Don't support legacy Python #2

Closed
bartvm opened this issue Aug 16, 2016 · 5 comments
Comments

bartvm (Contributor) commented Aug 16, 2016

There is really no reason to support Python 2. Python 3 has been out for 8 years now, and there are plenty of good articles written about this. Maintaining a dual codebase is going to be a major pain, and it prevents you from using a whole bunch of new Python 3 features (six only gets you so far).
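For illustration only (not part of the original comment): the kind of Python 3-only syntax a dual 2/3 codebase has to give up, next to the six idiom it would lean on instead. The `broadcast` signature below is hypothetical.

```
import six

# Keyword-only arguments and annotations are Python 3-only syntax; this
# function definition is a SyntaxError on Python 2 (hypothetical signature).
def broadcast(tensor, *, src: int = 0, async_op: bool = False) -> None:
    ...

# The six idiom runs on both interpreters, but is noisier than the
# Python 3-native dict.items() it papers over.
for key, value in six.iteritems({"py2": "legacy", "py3": "current"}):
    print(key, value)
```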

apaszke (Contributor) commented Sep 9, 2016

This looks like a reason to me:

[image: pip downloads]

[image: pip scientific downloads]

@soumith soumith closed this as completed Sep 10, 2016
Jiaming-Liu pushed a commit to Jiaming-Liu/pytorch that referenced this issue May 18, 2017
tfriedel pushed a commit to tfriedel/pytorch that referenced this issue Aug 9, 2017
tfriedel pushed a commit to tfriedel/pytorch that referenced this issue Aug 9, 2017
soumith pushed a commit that referenced this issue Oct 5, 2017
williamwen42 added a commit that referenced this issue Feb 5, 2024
…ard function is invalidated [attempt 2]"


Attempt #2 for #117875 to fix #112090.

Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 aakhundov kadeng

[ghstack-poisoned]
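As a rough illustration of the weakref point in the summary above (a minimal sketch with made-up names, not the actual dynamo `code_context`/`ExtraState` types): holding the cached value behind `weakref.ref` keeps the cache from forming a reference cycle that would pin the GraphModule in memory.

```
import weakref

class CodeContext:
    """Illustrative stand-in for the per-code cache; not the real dynamo type."""
    def __init__(self):
        self._values = {}

    def put_graph_module(self, key, gm):
        # Hold only a weak reference so this cache entry does not keep the
        # GraphModule (and any cycle back to this context) alive forever.
        self._values[key] = weakref.ref(gm)

    def get_graph_module(self, key):
        ref = self._values.get(key)
        # Returns None once the GraphModule has been garbage collected.
        return ref() if ref is not None else None
```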
williamwen42 added a commit that referenced this issue Feb 5, 2024
williamwen42 added a commit that referenced this issue Feb 6, 2024
williamwen42 added a commit that referenced this issue Feb 6, 2024
jorgep31415 added a commit that referenced this issue Feb 6, 2024
Pull Request resolved: #118835

We borrow MatMul's work to do the re-packing:

https://www.internalfb.com/code/fbsource/[7e8ef1b8adeda224a736f8cc4bf870e0a659df95]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/Mm.cpp?lines=20%2C50

# GLSL Change #1 - Reduce calls to `texelFetch(uKernel, ...)` by a factor of 4.
In V2, this was the only change: we created an inner for-loop (which executes up to 4 times) and hoisted the `texelFetch(uKernel, ...)` call out of it.
```
// Fetch one packed weight texel, then reuse its four lanes in the inner loop.
for (int k = k_start; k < k_end;) {
  const ivec3 w_pos = ivec3(k / 4, in_c % in_group_size, out_c);
  const vec4 weight = texelFetch(uKernel, w_pos, 0);

  // Consume up to four input elements per fetched weight texel.
  for (int k_off = k % 4; k_off < 4 && k < k_end; ++k, ++k_off) {
    int in_pos_x = in_l + k * dilation;
    const ivec3 in_pos = ivec3(in_pos_x, in_c, n / 4);
    const vec4 input_value = texelFetch(uInput, in_pos, 0);

    v += weight[k_off] * input_value;
  }
}
```

However, it actually results in worse performance because of the complex for-loop conditions, especially the `int k_off = k % 4` initializer, which the compiler can't unroll.
# GLSL Change #2 - Unroll loops to `texelFetch(uInput, ...)`.

The `k_start` and `k_end` bounds "smartly" skip computations that would contribute zero to the sum. However, these theoretical gains introduce real branching that the compiler cannot optimize away.

## W/o diff (690ms)
```
Kernel Name                             Workgroup Size             Duration (ns)
===========                             ==============               ===========
vulkan.nchw_to_image                    {30, 20, 2}                        35984
vulkan.nchw_to_image                    {32, 4, 3}                         11128
vulkan.nchw_to_image                    {10, 1, 1}                          6292
vulkan.conv1d                           {1, 10, 1}                        669084
vulkan.image_to_nchw                    {2, 10, 2}                          7748
vulkan.nchw_to_image                    {30, 20, 2}                        31044
vulkan.nchw_to_image                    {32, 4, 3}                         10868
vulkan.nchw_to_image                    {10, 1, 1}                          6136
vulkan.conv1d                           {1, 10, 1}                        671216
vulkan.image_to_nchw                    {2, 10, 2}                          8164
vulkan.nchw_to_image                    {30, 20, 2}                        31148
vulkan.nchw_to_image                    {32, 4, 3}                         10920
vulkan.nchw_to_image                    {10, 1, 1}                          6084
vulkan.conv1d                           {1, 10, 1}                        674232
vulkan.image_to_nchw                    {2, 10, 2}                          8008
vulkan.nchw_to_image                    {30, 20, 2}                        31096
vulkan.nchw_to_image                    {32, 4, 3}                         11024
vulkan.nchw_to_image                    {10, 1, 1}                          6500
vulkan.conv1d                           {1, 10, 1}                        671736
vulkan.image_to_nchw                    {2, 10, 2}                          8164
vulkan.nchw_to_image                    {30, 20, 2}                        31824
vulkan.nchw_to_image                    {32, 4, 3}                         11284
vulkan.nchw_to_image                    {10, 1, 1}                          6604
vulkan.conv1d                           {1, 10, 1}                        691340
vulkan.image_to_nchw                    {2, 10, 2}                          7644

-------------------------------------------------------------------------------------------------
Benchmark                                                       Time             CPU   Iterations
-------------------------------------------------------------------------------------------------
conv1d_op_benchmark/iterations:5/manual_time/threads:1      0.676 ms         35.0 ms            5
```

## W/ diff (330ms)
```
Kernel Name                             Workgroup Size             Duration (ns)
===========                             ==============               ===========
vulkan.nchw_to_image                    {30, 20, 2}                        35828
vulkan.nchw_to_image                    {32, 4, 3}                         11024
vulkan.nchw_to_image                    {10, 1, 1}                          6344
vulkan.convert_channels_to_width_packed {8, 4, 10}                         13208
vulkan.conv1d                           {1, 10, 1}                        326664
vulkan.image_to_nchw                    {2, 10, 2}                          8164
vulkan.nchw_to_image                    {30, 20, 2}                        30940
vulkan.nchw_to_image                    {32, 4, 3}                         10972
vulkan.nchw_to_image                    {10, 1, 1}                          6188
vulkan.convert_channels_to_width_packed {8, 4, 10}                         12844
vulkan.conv1d                           {1, 10, 1}                        326872
vulkan.image_to_nchw                    {2, 10, 2}                          8112
vulkan.nchw_to_image                    {30, 20, 2}                        31304
vulkan.nchw_to_image                    {32, 4, 3}                         10972
vulkan.nchw_to_image                    {10, 1, 1}                          6240
vulkan.convert_channels_to_width_packed {8, 4, 10}                         12584
vulkan.conv1d                           {1, 10, 1}                        323492
vulkan.image_to_nchw                    {2, 10, 2}                          7488
vulkan.nchw_to_image                    {30, 20, 2}                        31772
vulkan.nchw_to_image                    {32, 4, 3}                         10868
vulkan.nchw_to_image                    {10, 1, 1}                          6396
vulkan.convert_channels_to_width_packed {8, 4, 10}                         13312
vulkan.conv1d                           {1, 10, 1}                        332956
vulkan.image_to_nchw                    {2, 10, 2}                          8216
vulkan.nchw_to_image                    {30, 20, 2}                        31772
vulkan.nchw_to_image                    {32, 4, 3}                         11024
vulkan.nchw_to_image                    {10, 1, 1}                          6292
vulkan.convert_channels_to_width_packed {8, 4, 10}                         13104
vulkan.conv1d                           {1, 10, 1}                        330408
vulkan.image_to_nchw                    {2, 10, 2}                          7592

-------------------------------------------------------------------------------------------------
Benchmark                                                       Time             CPU   Iterations
-------------------------------------------------------------------------------------------------
conv1d_op_benchmark/iterations:5/manual_time/threads:1      0.341 ms         41.0 ms            5
```

ghstack-source-id: 214201402
@exported-using-ghexport

Differential Revision: [D53204674](https://our.internmc.facebook.com/intern/diff/D53204674/)
williamwen42 added a commit that referenced this issue Feb 6, 2024
williamwen42 added a commit that referenced this issue Feb 6, 2024
williamwen42 added a commit that referenced this issue Feb 6, 2024
williamwen42 added a commit that referenced this issue Feb 6, 2024
williamwen42 added a commit that referenced this issue Feb 7, 2024
williamwen42 added a commit that referenced this issue Feb 7, 2024
pytorchmergebot pushed a commit that referenced this issue Feb 7, 2024
… [attempt 2] (#119107)

Pull Request resolved: #119107
Approved by: https://github.com/jansel
jorgep31415 added a commit that referenced this issue Feb 7, 2024
jorgep31415 added a commit that referenced this issue Feb 7, 2024
pytorch-bot bot pushed a commit that referenced this issue Feb 8, 2024
In a big code base, the user may not know which line of code called a collective. When debugging, we can print a combined Python/C++ stack trace, for example in case the user calls ``ProcessGroup.reduce`` instead of ``torch.distributed.reduce``:

```
LOG(INFO) << "ProcessGroupNCCL::_allgather_base stacktrace: "
                       << get_python_cpp_trace();
```

Output (using `_allgather_base` as an example); one Python-side frame in the trace is ``all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838``:
```
ProcessGroupNCCL::_allgather_base stacktrace: #0 torch::unwind::unwind() from ??:0
#1 torch::CapturedTraceback::gather(bool, bool, bool) from ??:0
#2 c10d::get_python_cpp_trace[abi:cxx11]() from :0
#3 c10d::ProcessGroupNCCL::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from ??:0
#4 c10d::ops::(anonymous namespace)::_allgather_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long) from Ops.cpp:0
#5 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#6 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#7 c10d::ProcessGroup::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from :0
#8 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0
#9 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#10 cfunction_call from /usr/local/src/conda/python-3.10.12/Objects/methodobject.c:543
#11 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#12 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#13 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#14 all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838
#15 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#16 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#17 wrapper from /data/users/weif/pytorch/torch/distributed/c10d_logger.py:75
#18 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#19 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#20 _all_gather_flat_param from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1399
#21 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#23 unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1308
#24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#25 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#26 _unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:332
#27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#29 _pre_forward_unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:448
#30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#31 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#32 _pre_forward from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:413
#33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#35 forward from /data/users/weif/pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py:839
#36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#37 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#38 _call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1520
#39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#40 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#41 _wrapped_call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1511
#42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#43 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.12/Objects/call.c:431
#44 slot_tp_call from /usr/local/src/conda/python-3.10.12/Objects/typeobject.c:7494
#45 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#47 inner from /data/users/weif/pytorch/run_fsdp.py:72
#48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#50 run from /data/users/weif/pytorch/run_fsdp.py:76
#51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#53 main from /data/users/weif/pytorch/run_fsdp.py:133
#54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#56 <module> from /data/users/weif/pytorch/run_fsdp.py:137
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#58 PyEval_EvalCode from /usr/local/src/conda/python-3.10.12/Python/ceval.c:1134
#59 run_eval_code_obj from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1291
#60 run_mod from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1312
#61 pyrun_file from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1208
#62 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:456
#63 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:90
#64 pymain_run_file_obj from /usr/local/src/conda/python-3.10.12/Modules/main.c:357
#65 Py_BytesMain from /usr/local/src/conda/python-3.10.12/Modules/main.c:1090
#66 __libc_start_call_main from ??:0
#67 <unwind unsupported> from ??:0
```

Pull Request resolved: #118924
Approved by: https://github.com/kwen2501
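As a Python-only analogue of the logging above (a stdlib-only sketch; it cannot reproduce the C++ frames that `torch::unwind` provides, and the wrapper call at the end is hypothetical), the Python call site of a collective can be captured with `traceback`:

```
import logging
import traceback

logger = logging.getLogger("collective_debug")

def log_collective_callsite(op_name):
    # Capture the Python-side call stack so the log shows the user's file and
    # line, e.g. ".../distributed_c10d.py:2838" as in the trace above.
    stack = "".join(traceback.format_stack(limit=8))
    logger.info("%s called from:\n%s", op_name, stack)

# Hypothetical use at the top of a collective wrapper:
log_collective_callsite("_allgather_base")
```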
pytorch-bot bot pushed a commit that referenced this issue Feb 8, 2024
pytorch-bot bot pushed a commit that referenced this issue Feb 9, 2024
vfdev-5 pushed a commit to vfdev-5/pytorch that referenced this issue Feb 9, 2024
clee2000 pushed a commit that referenced this issue Feb 14, 2024
atalman added a commit to atalman/pytorch that referenced this issue Mar 13, 2024
atalman added a commit that referenced this issue Mar 13, 2024
* [Release only changes] Release only changes #2

* common+lint
guangy10 added a commit that referenced this issue Mar 26, 2024
* [Release only changes] Release only changes #2

* common+lint

[ghstack-poisoned]
chsivic pushed a commit to chsivic/pytorch that referenced this issue Apr 16, 2024
Summary:
The caffe2/utils threadpool implementation used to set the thread name (since D8266344):
https://www.internalfb.com/code/fbsource/[3ba3d30d6841]/xplat/caffe2/caffe2/utils/threadpool/WorkersPool.h?lines=271-273

But now we no longer use caffe2's own implementation (since D21232894?); we use the third-party threadpool instead, which doesn't set the thread name.

This diff achieves the same effect as D8266344, so that we can tell which threads are pytorch threads from a perfetto trace.

The idea comes from https://stackoverflow.com/questions/32375034/how-to-obtain-thread-name-in-android-ndk and folly ThreadName
https://www.internalfb.com/code/fbsource/[3ba3d30d6841]/xplat/folly/system/ThreadName.cpp?lines=30-41

I'm not sure if this is the right place to put this change.
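For illustration, a minimal Linux-only Python sketch of the same idea (naming the OS-level thread so it shows up in `ps -T` and perfetto); the helper and the way the symbols are resolved are assumptions, not the code in this diff, and only the `c10pthreadpool` label comes from the test plan below.

```
import ctypes
import threading

# Resolve pthread_self/pthread_setname_np from symbols already loaded into
# the process (glibc on Linux); this is a sketch, not the actual diff.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.pthread_self.restype = ctypes.c_ulong
_libc.pthread_setname_np.argtypes = [ctypes.c_ulong, ctypes.c_char_p]

def set_native_thread_name(name):
    # The kernel truncates thread names to 15 bytes plus a trailing NUL, so
    # "c10pthreadpool" fits; this is the name shown under CMD in `ps -T`.
    _libc.pthread_setname_np(_libc.pthread_self(), name.encode()[:15])

def _worker():
    set_native_thread_name("c10pthreadpool")
    # ... run pool work ...

threading.Thread(target=_worker).start()
```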


BTW, the PyTorch thread pool's caller thread is worker #0:

https://www.internalfb.com/code/fbsource/[3ba3d30d6841281c140db1c8bd2f85ede310a01b]/xplat/third-party/pthreadpool/pthreadpool/src/pthreads.c?lines=289-292


Test Plan:
## Before

```
--num_cpu_threads 2 --num_pytorch_threads -1     # default to size equal to 4 cpu cores
mos:/ $ ps -T -p `pidof transcribe_bin`
USER            PID   TID   PPID     VSZ    RSS WCHAN            ADDR S CMD
shell          8985  8985   8983  118576  47688 hrtimer_n+          0 S transcribe_bin        <-- main thread
shell          8985  8986   8983  118576  47688 0                   0 R transcribe_bin         <-- pytorch thread #1
shell          8985  8987   8983  118576  47688 0                   0 R transcribe_bin         <-- pytorch thread #2
shell          8985  8988   8983  118576  47688 0                   0 R transcribe_bin         <-- pytorch thread #3
shell          8985  8989   8983  118576  47688 0                   0 R CPUThreadPool0
shell          8985  8990   8983  118576  47688 futex_wai+          0 S CPUThreadPool1
shell          8985  8991   8983  118576  47688 ep_poll             0 S IOThreadPool0
shell          8985  8992   8983  118576  47688 futex_wai+          0 S FutureTimekeepr
shell          8985  8993   8983  118576  47688 pipe_wait           0 S snapshot_thread
shell          8985  8994   8983  118576  47688 hrtimer_n+          0 S snapshot_thread
shell          8985  8997   8983  118576  47688 futex_wai+          0 S AsyncDataQueue
```

## After
```
--num_cpu_threads 2 --num_pytorch_threads -1
mos:/ $ ps -T -p `pidof transcribe_bin`
USER            PID   TID   PPID     VSZ    RSS WCHAN            ADDR S CMD
shell         11901 11901  11899  118128  40748 futex_wai+          0 S transcribe_bin         <-- main thread serves as pytorch thread #0
shell         11901 11902  11899  118132  40748 futex_wai+          0 S c10pthreadpool         <-- pytorch thread #1
shell         11901 11903  11899  118132  40748 futex_wai+          0 S c10pthreadpool         <-- pytorch thread #2
shell         11901 11904  11899  118132  40748 futex_wai+          0 S c10pthreadpool         <-- pytorch thread #3
shell         11901 11905  11899  118152  40752 futex_wai+          0 S CPUThreadPool0
shell         11901 11906  11899  118148  40752 0                   0 R CPUThreadPool1
shell         11901 11907  11899  118148  40756 ep_poll             0 S IOThreadPool0
shell         11901 11908  11899  118152  40756 futex_wai+          0 S FutureTimekeepr
shell         11901 11909  11899  118164  40756 pipe_wait           0 S snapshot_thread
shell         11901 11910  11899  118168  40756 hrtimer_n+          0 S snapshot_thread
shell         11901 11913  11899  118160  40760 futex_wai+          0 S AsyncDataQueue
```

Example Perfetto trace:

 {F1483727859} 
It looks like the pytorch thread pool was originally created with 4 threads during ASR loading (`loadTunaFactory`), and later recreated with 3 threads during inference.

Differential Revision: D55990584

Pulled By: chsivic