Use libkineto in profiler #46470

Closed
wants to merge 91 commits into from

Commits on Oct 16, 2020

  1. Use libkineto in profiler

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    python test/test_profiler.py
    
    [ghstack-poisoned]
    ilia-cher committed Oct 16, 2020
    a4d4124

Commits on Oct 27, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Oct 27, 2020
    e27f74c
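
The `Self CUDA %` column in the table above looks like each row's `Self CUDA` time divided by the sum of that column. A minimal Python sketch checking that arithmetic against the four non-zero rows follows; the dictionary keys are shortened row names and `self_cuda_percent` is a hypothetical helper, not part of the profiler.

```python
# Self CUDA times in microseconds, copied from the profiler table above.
# Keys are shortened row names chosen for this sketch.
self_cuda_us = {
    "sgemm_32x32x32_NN": 12.000,
    "vectorized_elementwise_kernel": 2.750,
    "Memcpy HtoD": 2.250,
    "Memcpy DtoH": 2.000,
}


def self_cuda_percent(times_us):
    """Return each entry's share of the summed time, in percent, rounded to 2 places."""
    total = sum(times_us.values())
    return {name: round(100.0 * t / total, 2) for name, t in times_us.items()}


percentages = self_cuda_percent(self_cuda_us)
for name, pct in percentages.items():
    print(f"{name:32s} {pct:6.2f}%")
```

Under that assumption the computed shares match the 63.16%, 14.47%, 11.84%, and 10.53% reported in the table.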

Commits on Nov 2, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 2, 2020
    5c3833e
  2. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 2, 2020
    662431b
  3. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 2, 2020
    ea956aa
  4. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 2, 2020
    7dfdbc9
  5. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 2, 2020
    6725778
  6. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 2, 2020
    e9a219b
  7. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 2, 2020
    49a9fee
  8. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 2, 2020
    8edb346
  9. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 2, 2020
    f288623
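
The `Self CUDA %` column in the profiler table above appears to be each row's `Self CUDA` time divided by the sum of that column. The sketch below checks that reading against the four GPU rows of the table. The dictionary keys are shortened labels and `self_cuda_percent` is a hypothetical helper written for this illustration; neither comes from the profiler's API.

```python
# Self CUDA times in microseconds, copied from the profiler table above.
# Keys are shortened labels chosen for this sketch, not profiler identifiers.
self_cuda_us = {
    "sgemm_32x32x32_NN": 12.000,
    "vectorized_elementwise_kernel": 2.750,
    "Memcpy HtoD": 2.250,
    "Memcpy DtoH": 2.000,
}


def self_cuda_percent(times_us):
    """Return each entry's share of the summed Self CUDA time, in percent.

    Assumption: the percentage is taken over the sum of the Self CUDA column,
    which is what the numbers in the table are consistent with.
    """
    total = sum(times_us.values())
    return {name: round(100.0 * t / total, 2) for name, t in times_us.items()}


percentages = self_cuda_percent(self_cuda_us)
for name, pct in percentages.items():
    print(f"{name:32s} {pct:6.2f}%")
```

The four GPU rows sum to 19.000us, and the computed shares match the 63.16%, 14.47%, 11.84% and 10.53% shown in the table.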

Commits on Nov 3, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    979cdfa
  2. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    c8cbeb0
  3. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    226089c
  4. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    266b75f
  5. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    6958eac
  6. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    97e5070
  7. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    8d111d2
  8. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    bfb0360
  9. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
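
As a reading aid for the table above: the `Self CUDA %` values appear to be each row's `Self CUDA` time divided by the sum of the `Self CUDA` column. That relationship is an assumption inferred from the numbers shown, not something the commit message states. The sketch below checks it with the four device-side rows; the dictionary keys and the helper name `self_cuda_percent` are hypothetical and are not part of the profiler API.

```python
# Self CUDA times in microseconds, copied from the four device-side rows
# of the table above. Kernel names are shortened for readability.
self_cuda_us = {
    "sgemm_32x32x32_NN": 12.000,
    "vectorized_elementwise_kernel": 2.750,
    "Memcpy HtoD": 2.250,
    "Memcpy DtoH": 2.000,
}


def self_cuda_percent(times_us):
    """Return each entry's share of the summed self CUDA time, in percent.

    Assumption: the percentage is row time / column total. This matches the
    table above but is inferred, not taken from the profiler source.
    """
    total = sum(times_us.values())
    return {name: round(100.0 * t / total, 2) for name, t in times_us.items()}


percentages = self_cuda_percent(self_cuda_us)
for name, pct in percentages.items():
    print(f"{name:32s} {pct:6.2f}%")
```

Under that assumption the four rows sum to 19.000us, and the computed shares agree with the 63.16%, 14.47%, 11.84% and 10.53% reported in the table.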
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    1ff1a12
  10. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    b3b69d8
  11. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    2faeb8a
  12. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    67c890d
  13. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    ed8babe
  14. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    ffc11fd
  15. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    fe76b84
  16. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    76ee80c
  17. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    5761ea2
  18. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
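
    As a reading aid for the table above: the `Self CUDA %` column appears to be each row's `Self CUDA` time divided by the sum of all `Self CUDA` times. The sketch below is not part of the PR; it only rechecks that arithmetic with the four non-zero values copied from the table. The helper name `self_cuda_percent` and the shortened kernel label are made up for illustration.

    ```python
    # Self CUDA times in microseconds, copied from the profiler table above.
    # Rows with 0.000us Self CUDA are omitted because they do not change the sum.
    self_cuda_us = {
        "sgemm_32x32x32_NN": 12.000,
        "vectorized_elementwise_kernel": 2.750,  # label shortened for this sketch
        "Memcpy HtoD (Pagable -> Device)": 2.250,
        "Memcpy DtoH (Device -> Pagable)": 2.000,
    }


    def self_cuda_percent(times_us):
        """Return each entry's share of the total Self CUDA time, in percent."""
        total = sum(times_us.values())
        return {name: round(100.0 * t / total, 2) for name, t in times_us.items()}


    percentages = self_cuda_percent(self_cuda_us)
    for name, pct in percentages.items():
        print(f"{name:<40} {pct:6.2f}%")
    ```

    Under that assumption the shares come out to 63.16%, 14.47%, 11.84% and 10.53%, matching the `Self CUDA %` column.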
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    dde5ec3
  19. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    3a25bd2
  20. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    6023998
  21. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    0bc66a6
  22. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    aa2d09e
  23. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    91718ac
  24. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    1556a7c
  25. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    4a0fec9
  26. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    bb6396a
  27. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    60b5dee
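The `Self CUDA %` column in the test-plan table can be cross-checked by hand. The sketch below assumes each percentage is that row's self CUDA time divided by the sum of all self CUDA times; the commit message does not state this definition, but the four device-side rows are consistent with it. The dictionary keys are shortened labels and `self_cuda_percent` is a hypothetical helper, not part of the profiler.

```python
# Self CUDA times in microseconds, copied from the four device-side rows
# of the profiler table in the commit message.
self_cuda_us = {
    "sgemm_32x32x32_NN": 12.000,
    "vectorized_elementwise_kernel": 2.750,
    "Memcpy HtoD": 2.250,
    "Memcpy DtoH": 2.000,
}


def self_cuda_percent(times_us):
    """Assumed definition: each row's share of the summed self CUDA time."""
    total = sum(times_us.values())
    return {name: round(100.0 * t / total, 2) for name, t in times_us.items()}


percentages = self_cuda_percent(self_cuda_us)
for name, pct in percentages.items():
    print(f"{name:32s} {pct:6.2f}%")
```

Under that assumption the total self CUDA time is 19.000us and the computed shares match the table's 63.16%, 14.47%, 11.84%, and 10.53%.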
  28. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    e1a5480
  29. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    38a37dd
  30. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    c6c6039
  31. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    17767d1
  32. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 3, 2020
    3537e9d

Commits on Nov 4, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 4, 2020
    043dcd2
  2. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 4, 2020
    aa17339

Commits on Nov 11, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 11, 2020
    9262f92
  2. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 11, 2020
    8371b33
  3. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 11, 2020
    67d4acb
  4. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 11, 2020
    9f1d24f
  5. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 11, 2020
    e864205
  6. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 11, 2020
    380b874
  7. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 11, 2020
    165bb7c

Commits on Nov 12, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 12, 2020
    445b8c1
  2. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 12, 2020
    7c317f5
  3. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls       Node ID
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      11.000us        64.71%      11.000us      11.000us             1             0
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us        17.65%       3.000us       3.000us             1             0
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        11.76%       2.000us       2.000us             1             0
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         5.88%       1.000us       1.000us             1             0
                                                   aten::mm        13.86%     421.014ms        27.73%     842.019ms     421.010ms       0.000us         0.00%       0.000us       0.000us             2             0
                                                aten::empty         0.00%      25.000us         0.00%      25.000us      12.500us       0.000us         0.00%       0.000us       0.000us             2             0
                                               aten::stride         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us             3             0
                                                  aten::add        36.55%        1.110s        73.11%        2.220s        1.110s       0.000us         0.00%       0.000us       0.000us             2             0
                                                   aten::to         0.00%       9.000us         0.00%      99.000us      99.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                        aten::empty_strided         0.00%      21.000us         0.00%      21.000us      21.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                                aten::copy_         0.00%      69.000us         0.00%     133.000us      66.500us       0.000us         0.00%       0.000us       0.000us             2             0
                                                   cudaFree        13.00%     394.907ms        13.00%     394.907ms     394.907ms       0.000us         0.00%       0.000us       0.000us             1             0
                                     cudaDeviceGetAttribute         0.00%       1.000us         0.00%       1.000us       0.091us       0.000us         0.00%       0.000us       0.000us            11             0
                                                 cudaMalloc         0.02%     632.000us         0.02%     632.000us     210.667us       0.000us         0.00%       0.000us       0.000us             3             0
                                                 cudaMemcpy         0.00%      20.000us         0.00%      20.000us      20.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                   cudaEventCreateWithFlags         0.00%       9.000us         0.00%       9.000us       0.562us       0.000us         0.00%       0.000us       0.000us            16             0
                                           cudaLaunchKernel        36.55%        1.110s        36.55%        1.110s     555.021ms       0.000us         0.00%       0.000us       0.000us             2             0
                                            cudaMemcpyAsync         0.00%      33.000us         0.00%      33.000us      33.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                      cudaStreamSynchronize         0.00%       4.000us         0.00%       4.000us       4.000us       0.000us         0.00%       0.000us       0.000us             1             0
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 12, 2020
    c904443
  4. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls       Node ID
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      11.000us        64.71%      11.000us      11.000us             1             0
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us        17.65%       3.000us       3.000us             1             0
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        11.76%       2.000us       2.000us             1             0
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         5.88%       1.000us       1.000us             1             0
                                                   aten::mm        13.86%     421.014ms        27.73%     842.019ms     421.010ms       0.000us         0.00%       0.000us       0.000us             2             0
                                                aten::empty         0.00%      25.000us         0.00%      25.000us      12.500us       0.000us         0.00%       0.000us       0.000us             2             0
                                               aten::stride         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us             3             0
                                                  aten::add        36.55%        1.110s        73.11%        2.220s        1.110s       0.000us         0.00%       0.000us       0.000us             2             0
                                                   aten::to         0.00%       9.000us         0.00%      99.000us      99.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                        aten::empty_strided         0.00%      21.000us         0.00%      21.000us      21.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                                aten::copy_         0.00%      69.000us         0.00%     133.000us      66.500us       0.000us         0.00%       0.000us       0.000us             2             0
                                                   cudaFree        13.00%     394.907ms        13.00%     394.907ms     394.907ms       0.000us         0.00%       0.000us       0.000us             1             0
                                     cudaDeviceGetAttribute         0.00%       1.000us         0.00%       1.000us       0.091us       0.000us         0.00%       0.000us       0.000us            11             0
                                                 cudaMalloc         0.02%     632.000us         0.02%     632.000us     210.667us       0.000us         0.00%       0.000us       0.000us             3             0
                                                 cudaMemcpy         0.00%      20.000us         0.00%      20.000us      20.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                   cudaEventCreateWithFlags         0.00%       9.000us         0.00%       9.000us       0.562us       0.000us         0.00%       0.000us       0.000us            16             0
                                           cudaLaunchKernel        36.55%        1.110s        36.55%        1.110s     555.021ms       0.000us         0.00%       0.000us       0.000us             2             0
                                            cudaMemcpyAsync         0.00%      33.000us         0.00%      33.000us      33.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                      cudaStreamSynchronize         0.00%       4.000us         0.00%       4.000us       4.000us       0.000us         0.00%       0.000us       0.000us             1             0
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
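The `Self CUDA %` column in the tables above appears to be each row's `Self CUDA` time divided by the sum of that column. A minimal sketch that checks this against the first table, assuming that interpretation (the helper name `self_cuda_percent` is hypothetical and not part of the profiler):

```python
# Self CUDA times in microseconds, copied from the first table above.
# Rows whose Self CUDA time is 0.000us are omitted; they do not change the sum.
self_cuda_us = {
    "sgemm_32x32x32_NN": 12.000,
    "vectorized_elementwise_kernel": 2.750,
    "Memcpy HtoD": 2.250,
    "Memcpy DtoH": 2.000,
}


def self_cuda_percent(times_us):
    """Return each entry's share of the summed self CUDA time, in percent.

    Hypothetical helper: it assumes the percentage is taken over the
    column total, which is an interpretation of the table, not profiler code.
    """
    total = sum(times_us.values())
    return {name: 100.0 * t / total for name, t in times_us.items()}


percents = self_cuda_percent(self_cuda_us)
for name, pct in percents.items():
    print(f"{name:>32}  {pct:6.2f}%")
```

Under the same assumption, the second table's kernel times (11, 3, 2 and 1 us, summing to 17 us) give 11 / 17 = 64.71% for `sgemm_32x32x32_NN`, which matches the value shown there.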
    
    [ghstack-poisoned]
    ilia-cher committed Nov 12, 2020
    1f600f8
  5. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls       Node ID
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      11.000us        64.71%      11.000us      11.000us             1             0
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us        17.65%       3.000us       3.000us             1             0
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        11.76%       2.000us       2.000us             1             0
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         5.88%       1.000us       1.000us             1             0
                                                   aten::mm        13.86%     421.014ms        27.73%     842.019ms     421.010ms       0.000us         0.00%       0.000us       0.000us             2             0
                                                aten::empty         0.00%      25.000us         0.00%      25.000us      12.500us       0.000us         0.00%       0.000us       0.000us             2             0
                                               aten::stride         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us             3             0
                                                  aten::add        36.55%        1.110s        73.11%        2.220s        1.110s       0.000us         0.00%       0.000us       0.000us             2             0
                                                   aten::to         0.00%       9.000us         0.00%      99.000us      99.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                        aten::empty_strided         0.00%      21.000us         0.00%      21.000us      21.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                                aten::copy_         0.00%      69.000us         0.00%     133.000us      66.500us       0.000us         0.00%       0.000us       0.000us             2             0
                                                   cudaFree        13.00%     394.907ms        13.00%     394.907ms     394.907ms       0.000us         0.00%       0.000us       0.000us             1             0
                                     cudaDeviceGetAttribute         0.00%       1.000us         0.00%       1.000us       0.091us       0.000us         0.00%       0.000us       0.000us            11             0
                                                 cudaMalloc         0.02%     632.000us         0.02%     632.000us     210.667us       0.000us         0.00%       0.000us       0.000us             3             0
                                                 cudaMemcpy         0.00%      20.000us         0.00%      20.000us      20.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                   cudaEventCreateWithFlags         0.00%       9.000us         0.00%       9.000us       0.562us       0.000us         0.00%       0.000us       0.000us            16             0
                                           cudaLaunchKernel        36.55%        1.110s        36.55%        1.110s     555.021ms       0.000us         0.00%       0.000us       0.000us             2             0
                                            cudaMemcpyAsync         0.00%      33.000us         0.00%      33.000us      33.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                      cudaStreamSynchronize         0.00%       4.000us         0.00%       4.000us       4.000us       0.000us         0.00%       0.000us       0.000us             1             0
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 12, 2020
    5aacc1c
  6. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls       Node ID
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      11.000us        64.71%      11.000us      11.000us             1             0
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us        17.65%       3.000us       3.000us             1             0
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        11.76%       2.000us       2.000us             1             0
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         5.88%       1.000us       1.000us             1             0
                                                   aten::mm        13.86%     421.014ms        27.73%     842.019ms     421.010ms       0.000us         0.00%       0.000us       0.000us             2             0
                                                aten::empty         0.00%      25.000us         0.00%      25.000us      12.500us       0.000us         0.00%       0.000us       0.000us             2             0
                                               aten::stride         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us             3             0
                                                  aten::add        36.55%        1.110s        73.11%        2.220s        1.110s       0.000us         0.00%       0.000us       0.000us             2             0
                                                   aten::to         0.00%       9.000us         0.00%      99.000us      99.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                        aten::empty_strided         0.00%      21.000us         0.00%      21.000us      21.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                                aten::copy_         0.00%      69.000us         0.00%     133.000us      66.500us       0.000us         0.00%       0.000us       0.000us             2             0
                                                   cudaFree        13.00%     394.907ms        13.00%     394.907ms     394.907ms       0.000us         0.00%       0.000us       0.000us             1             0
                                     cudaDeviceGetAttribute         0.00%       1.000us         0.00%       1.000us       0.091us       0.000us         0.00%       0.000us       0.000us            11             0
                                                 cudaMalloc         0.02%     632.000us         0.02%     632.000us     210.667us       0.000us         0.00%       0.000us       0.000us             3             0
                                                 cudaMemcpy         0.00%      20.000us         0.00%      20.000us      20.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                   cudaEventCreateWithFlags         0.00%       9.000us         0.00%       9.000us       0.562us       0.000us         0.00%       0.000us       0.000us            16             0
                                           cudaLaunchKernel        36.55%        1.110s        36.55%        1.110s     555.021ms       0.000us         0.00%       0.000us       0.000us             2             0
                                            cudaMemcpyAsync         0.00%      33.000us         0.00%      33.000us      33.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                      cudaStreamSynchronize         0.00%       4.000us         0.00%       4.000us       4.000us       0.000us         0.00%       0.000us       0.000us             1             0
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 12, 2020
    651f556

Commits on Nov 13, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                            Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                            Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                                   aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                                aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                                  aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                                   aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                        aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                                aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls       Node ID
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      11.000us        64.71%      11.000us      11.000us             1             0
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us        17.65%       3.000us       3.000us             1             0
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        11.76%       2.000us       2.000us             1             0
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         5.88%       1.000us       1.000us             1             0
                                                   aten::mm        13.86%     421.014ms        27.73%     842.019ms     421.010ms       0.000us         0.00%       0.000us       0.000us             2             0
                                                aten::empty         0.00%      25.000us         0.00%      25.000us      12.500us       0.000us         0.00%       0.000us       0.000us             2             0
                                               aten::stride         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us             3             0
                                                  aten::add        36.55%        1.110s        73.11%        2.220s        1.110s       0.000us         0.00%       0.000us       0.000us             2             0
                                                   aten::to         0.00%       9.000us         0.00%      99.000us      99.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                        aten::empty_strided         0.00%      21.000us         0.00%      21.000us      21.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                                aten::copy_         0.00%      69.000us         0.00%     133.000us      66.500us       0.000us         0.00%       0.000us       0.000us             2             0
                                                   cudaFree        13.00%     394.907ms        13.00%     394.907ms     394.907ms       0.000us         0.00%       0.000us       0.000us             1             0
                                     cudaDeviceGetAttribute         0.00%       1.000us         0.00%       1.000us       0.091us       0.000us         0.00%       0.000us       0.000us            11             0
                                                 cudaMalloc         0.02%     632.000us         0.02%     632.000us     210.667us       0.000us         0.00%       0.000us       0.000us             3             0
                                                 cudaMemcpy         0.00%      20.000us         0.00%      20.000us      20.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                   cudaEventCreateWithFlags         0.00%       9.000us         0.00%       9.000us       0.562us       0.000us         0.00%       0.000us       0.000us            16             0
                                           cudaLaunchKernel        36.55%        1.110s        36.55%        1.110s     555.021ms       0.000us         0.00%       0.000us       0.000us             2             0
                                            cudaMemcpyAsync         0.00%      33.000us         0.00%      33.000us      33.000us       0.000us         0.00%       0.000us       0.000us             1             0
                                      cudaStreamSynchronize         0.00%       4.000us         0.00%       4.000us       4.000us       0.000us         0.00%       0.000us       0.000us             1             0
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 13, 2020
    9997011
  2. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 13, 2020
    cfd0424
  3. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 13, 2020
    30114d8
  4. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 13, 2020
    27e4e9c
  5. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 13, 2020
    bde96f6
  6. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 13, 2020
    b1a0292
  7. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 13, 2020
    b7fda07
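
The `Self CUDA %` column in the tables above appears to be each row's `Self CUDA` time as a share of the summed `Self CUDA` time across all rows. That reading is an assumption inferred from the printed numbers, not something the commit message states. A minimal sketch that reproduces the percentages from the copied timings (the helper name `self_cuda_percent` is hypothetical and not part of the profiler API):

```python
# Self CUDA times in microseconds, copied from the profiler table above.
self_cuda_us = {
    "Memcpy HtoD (Pageable -> Device)": 2.0,
    "sgemm_32x32x32_NN": 2.0,
    "vectorized_elementwise_kernel": 1.0,
    "Memcpy DtoH (Device -> Pageable)": 1.0,
}


def self_cuda_percent(times_us):
    """Return each entry's share of the total self CUDA time, in percent.

    Assumption: the profiler divides by the sum of all self CUDA times.
    """
    total = sum(times_us.values())
    if total == 0:
        # No GPU activity recorded; avoid dividing by zero.
        return {name: 0.0 for name in times_us}
    return {name: round(100.0 * t / total, 2) for name, t in times_us.items()}


percent = self_cuda_percent(self_cuda_us)
for name, pct in percent.items():
    print(f"{name:<40} {pct:6.2f}%")
```

With the four GPU rows summing to 6 us, the 2 us entries come out at 33.33% and the 1 us entries at 16.67%, matching the table. The same calculation applied to the earlier table with the `Node ID` column (11, 3, 2 and 1 us) gives 64.71%, 17.65%, 11.76% and 5.88%.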

Commits on Nov 17, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 17, 2020
    459df8e
  2. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 17, 2020
    09a4762
  3. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 17, 2020
    cafee0f
  4. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 17, 2020
    39ff2b3
  5. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 17, 2020
    5502837

Commits on Nov 20, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 20, 2020
    7c2017b
  2. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 20, 2020
    525e5b5
  3. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
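The percentage and average columns in the table above appear to be derived from the raw timings: "Self CUDA %" looks like each row's Self CUDA time divided by the sum over all rows, and the "time avg" columns look like the total divided by "# of Calls". The sketch below recomputes a few of those values in plain Python from numbers copied out of the table. The helper names (`self_cuda_percent`, `cpu_time_avg`) are hypothetical and are not part of the profiler API; the derivation is an assumption checked only against this one table.

```python
# Rows copied from the profiler table above:
# (name, self_cuda_us, cpu_total_us, number_of_calls)
# Rows not listed here all report 0.000us Self CUDA, so they do not change the sum.
rows = [
    ("Memcpy HtoD (Pageable -> Device)", 2.0, 0.0, 2),
    ("sgemm_32x32x32_NN", 2.0, 0.0, 1),
    ("vectorized_elementwise_kernel", 1.0, 0.0, 1),
    ("Memcpy DtoH (Device -> Pageable)", 1.0, 0.0, 1),
    ("aten::randn", 0.0, 96.0, 2),
    ("aten::to", 0.0, 1310.0, 3),
]


def self_cuda_percent(rows):
    """Hypothetical helper: share of total Self CUDA time per row, in percent."""
    total = sum(self_cuda for _, self_cuda, _, _ in rows)
    return {name: round(100.0 * self_cuda / total, 2)
            for name, self_cuda, _, _ in rows}


def cpu_time_avg(rows):
    """Hypothetical helper: CPU total time divided by the number of calls."""
    return {name: round(cpu_total / calls, 3)
            for name, _, cpu_total, calls in rows}


percent = self_cuda_percent(rows)
average = cpu_time_avg(rows)

for name, _, _, _ in rows:
    print(f"{name:<36} {percent[name]:>6.2f}%  {average[name]:>9.3f}us")
```

The recomputed values match the table for these rows (for example 33.33% for `sgemm_32x32x32_NN` and 436.667us for `aten::to`), which is consistent with the assumed derivation but does not confirm how the profiler computes them internally.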
    
    [ghstack-poisoned]
    ilia-cher committed Nov 20, 2020
    1f50e4b
  4. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    [ghstack-poisoned]
    ilia-cher committed Nov 20, 2020
    f70a95c
  5. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
    
    [ghstack-poisoned]
    ilia-cher committed Nov 20, 2020
    5fed8be
  6. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
    
    [ghstack-poisoned]
    ilia-cher committed Nov 20, 2020
    2494879
  7. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
    
    [ghstack-poisoned]
    ilia-cher committed Nov 20, 2020
    4f401ff

Commits on Nov 21, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
    
    [ghstack-poisoned]
    ilia-cher committed Nov 21, 2020
    c689e6b

Commits on Nov 22, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
    
    Differential Revision: [D25142223](https://our.internmc.facebook.com/intern/diff/D25142223)
    
    [ghstack-poisoned]
    ilia-cher committed Nov 22, 2020
    4a5632f
  2. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
    
    Differential Revision: [D25142223](https://our.internmc.facebook.com/intern/diff/D25142223)
    
    [ghstack-poisoned]
    ilia-cher committed Nov 22, 2020
    6d0e7ab
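
The derived columns in the profiler table above can be cross-checked against the raw ones. The sketch below, in plain Python, recomputes `Self CUDA %` as each entry's self CUDA time over the sum of all self CUDA times, and `CPU time avg` as `CPU total` divided by `# of Calls`, using figures copied from the table. The variable and function names are illustrative, and the two formulas are an assumption inferred from the numbers, not taken from the profiler source.

```python
# Values copied from the profiler table above (times in microseconds).
# The kernel name is shortened here; the table shows it truncated as well.
self_cuda_us = {
    "Memcpy HtoD (Pageable -> Device)": 2.0,
    "sgemm_32x32x32_NN": 2.0,
    "vectorized_elementwise_kernel": 1.0,
    "Memcpy DtoH (Device -> Pageable)": 1.0,
}

# (CPU total in us, number of calls) for a few operators from the same table.
cpu_total_and_calls = {
    "aten::to": (1310.0, 3),
    "aten::copy_": (160.0, 3),
    "aten::randn": (96.0, 2),
}


def self_cuda_percent(times):
    """Share of each entry in the summed self CUDA time, in percent."""
    total = sum(times.values())
    return {name: round(100.0 * t / total, 2) for name, t in times.items()}


def cpu_time_avg(totals):
    """Average CPU time per call: CPU total divided by the call count."""
    return {
        name: round(total / calls, 3)
        for name, (total, calls) in totals.items()
    }


percents = self_cuda_percent(self_cuda_us)
averages = cpu_time_avg(cpu_total_and_calls)

for name, pct in percents.items():
    print(f"{name}: {pct:.2f}%")
for name, avg in averages.items():
    print(f"{name}: {avg:.3f}us")
```

Under these assumptions the recomputed values agree with the table: 33.33% and 16.67% for the CUDA entries, and 436.667us, 53.333us and 48.000us for `aten::to`, `aten::copy_` and `aten::randn`.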

Commits on Nov 23, 2020

  1. Update on "Use libkineto in profiler"

    ilia-cher committed Nov 23, 2020
    95b686f
  2. Update on "Use libkineto in profiler"

    ilia-cher committed Nov 23, 2020
    d98a5fb
  3. Update on "Use libkineto in profiler"

    ilia-cher committed Nov 23, 2020
    cb7367e
  4. Update on "Use libkineto in profiler"

    ilia-cher committed Nov 23, 2020
    5ad0a34
  5. Update on "Use libkineto in profiler"

    ilia-cher committed Nov 23, 2020
    d6bd96e
  6. Update on "Use libkineto in profiler"

    ilia-cher committed Nov 23, 2020
    0c4faaa
  7. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
    
    Differential Revision: [D25142223](https://our.internmc.facebook.com/intern/diff/D25142223)
    
    [ghstack-poisoned]
    ilia-cher committed Nov 23, 2020
    ab754e1

Commits on Nov 24, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
    
    Differential Revision: [D25142223](https://our.internmc.facebook.com/intern/diff/D25142223)
    
    [ghstack-poisoned]
    ilia-cher committed Nov 24, 2020
    aee38e8
  2. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
    
    Differential Revision: [D25142223](https://our.internmc.facebook.com/intern/diff/D25142223)
    
    [ghstack-poisoned]
    ilia-cher committed Nov 24, 2020
    671785f
  3. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
    
    Differential Revision: [D25142223](https://our.internmc.facebook.com/intern/diff/D25142223)
    
    [ghstack-poisoned]
    ilia-cher committed Nov 24, 2020
    8fde042

Commits on Nov 25, 2020

  1. Update on "Use libkineto in profiler"

    Summary:
    Adding ability to use Kineto (CUPTI) to profile CUDA kernels
    
    Test Plan:
    USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
    python test/test_profiler.py
    
    python test/test_autograd.py -k test_profile
    python test/test_autograd.py -k test_record
    
    ```
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                           Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                          sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
    void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                           Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                                aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                                aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                              aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                                   aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                        aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                                aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                            cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                      cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                                   aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                               aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                           cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                                  aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    ```
    
    benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
    
    Differential Revision: [D25142223](https://our.internmc.facebook.com/intern/diff/D25142223)
    
    [ghstack-poisoned]
    ilia-cher committed Nov 25, 2020
    ca6cb73