Use libkineto in profiler #46470

ilia-cher · 2020-10-16T14:57:08Z

Stack from ghstack:

#48438 Output stacks (support for SVG visualization)
#48433 Eager module attribution in profiler stack traces
#48391 Add Kineto CI job
#48280 New profiler API
#46470 Use libkineto in profiler

Summary:
Add the ability to use Kineto (CUPTI) to profile CUDA kernels.

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
python test/test_profiler.py

python test/test_autograd.py -k test_profile
python test/test_autograd.py -k test_record

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                      sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                            aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                            aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                          aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                    aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                            aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                        cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                               aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                           aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                              aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
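
In the table above, the Self CUDA % column appears to be each row's Self CUDA time as a share of the summed Self CUDA time of all rows. A minimal sketch of that arithmetic, assuming the four GPU rows are the only non-zero contributors (the values are copied by hand from the table, and `self_cuda_percent` is a hypothetical helper, not a profiler API):

```python
# Self CUDA times in microseconds, copied by hand from the GPU rows of the table.
# The kernel name is abbreviated here; the table itself truncates it with "...".
self_cuda_us = {
    "Memcpy HtoD (Pageable -> Device)": 2.0,
    "sgemm_32x32x32_NN": 2.0,
    "vectorized_elementwise_kernel": 1.0,
    "Memcpy DtoH (Device -> Pageable)": 1.0,
}


def self_cuda_percent(times_us):
    """Hypothetical helper: each row's share of the total self CUDA time, in percent."""
    total = sum(times_us.values())
    return {name: round(100.0 * t / total, 2) for name, t in times_us.items()}


percents = self_cuda_percent(self_cuda_us)
for name, pct in percents.items():
    print(f"{name:>40}  {pct:6.2f}%")
```

With a total of 6 us, the 2 us rows come out at 33.33% and the 1 us rows at 16.67%, which agrees with the reported column.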

benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a

Differential Revision: D25142223

dr-ci · 2020-10-16T14:58:31Z

💊 CI failures summary and remediations

As of commit ca6cb73 (more details on the Dr. CI page):

1/1 failures possibly* introduced in this PR
- 1/1 non-CircleCI failure(s)

Extra GitHub checks: 1 failed

Failed: GitHub Actions - flake8-py3

This comment has been revised 507 times.

facebook-github-bot · 2020-10-16T14:59:26Z

💊 CI failures summary and remediations

As of commit a4d4124 (more details on the Dr. CI page):

3/3 failures possibly* introduced in this PR
- 2/3 non-CircleCI failure(s)

1 failure not recognized by patterns:

Job	Step	Action
binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build	Checkout pytorch/builder repo	🔁 rerun

Extra GitHub checks: 2 failed

Failed: GitHub Actions - flake8-py3
Failed: GitHub Actions - cmakelint

This comment has been revised 2 times.

ilia-cher · 2020-10-16T15:06:34Z

(wip)

ilia-cher · 2020-10-16T15:10:24Z

added profiler output when using libkineto (CUPTI)

dzhulgakov · 2020-10-20T19:45:00Z

torch/csrc/autograd/profiler.cpp

@@ -569,6 +648,44 @@ thread_event_lists disableProfiler(c10::optional<ProfilerDisableOptions> profile
    at::removeCallback(state_ptr->callbackHandle());
  }

+#ifdef USE_KINETO
+  if (state_ptr->config().state == ProfilerState::KINETO) {


nit: extract into a subfunction (it's pretty long)

dzhulgakov · 2020-10-20T19:46:09Z

torch/csrc/autograd/profiler.h

@@ -248,6 +261,10 @@ struct TORCH_API Event final {
    return device_;
  }

+  void setDevice(int device) {


What is the meaning of device here, and why is it different from c10::Device? If it's just a GPU index, call it device_index. In any case, add a comment about the semantics.

Yes, it's an index. I think device_id may be better than device_index, as it can potentially mean other things.

gdankel · 2020-10-22T05:08:21Z

torch/csrc/autograd/profiler.cpp

+  if (new_config.state == ProfilerState::KINETO) {
+    std::set<libkineto::ActivityType> k_activities;
+    if (activities.count(ActivityType::CPU)) {
+      k_activities.insert(libkineto::ActivityType::EXTERNAL_CORRELATION);


We probably want this together with CUDA_RUNTIME. I need to add another type for PyTorch events.

This provides links between CUDA runtime events and PyTorch observer events, so once a new type for PyTorch observer events is added (e.g. APPLICATION or CLIENT), this can be enabled when both that type and CUDA_RUNTIME are selected.
Another use case where it may be useful is when you want to focus only on PyTorch and GPU events, removing the CUDA_RUNTIME layer but keeping the links. We can then link directly to PyTorch events. Unfortunately, we are still required to collect runtime events to do this, so it won't reduce overhead, only trace size.
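
To make the linking idea concrete, here is a minimal, hypothetical sketch in plain Python (it is not libkineto code). It assumes, for illustration only, that a PyTorch op, the CUDA runtime call it issues, and the resulting GPU activity all carry the same correlation ID. The `Event` class, its field names, and `link_gpu_to_ops` are invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Event:
    """Hypothetical trace event; not a libkineto type."""
    kind: str                      # "op", "runtime", or "gpu"
    name: str
    correlation_id: int
    linked_op: Optional[str] = None


def link_gpu_to_ops(events: List[Event]) -> List[Event]:
    """Link each GPU event to the op with the same correlation ID and
    drop runtime events from the output trace."""
    ops: Dict[int, str] = {
        e.correlation_id: e.name for e in events if e.kind == "op"
    }
    linked: List[Event] = []
    for e in events:
        if e.kind == "runtime":
            # Runtime events were still collected; they are only omitted here.
            continue
        if e.kind == "gpu":
            e = Event(e.kind, e.name, e.correlation_id, ops.get(e.correlation_id))
        linked.append(e)
    return linked


trace = [
    Event("op", "aten::mm", 1),
    Event("runtime", "cudaLaunchKernel", 1),
    Event("gpu", "sgemm_32x32x32_NN", 1),
]
result = link_gpu_to_ops(trace)
for e in result:
    print(e.kind, e.name, "->", e.linked_op)
```

The runtime events are filtered out only after collection, which mirrors the point in the comment: the trace gets smaller, but the collection overhead stays the same.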

gdankel · 2020-10-22T05:11:30Z

torch/csrc/autograd/profiler.cpp

+          k_evt.threadId,
+          false,
+          k_evt.correlationId);
+      push_evt.setDevice(k_evt.deviceId);


Hmm... maybe we can add an adapter instead to avoid copying? Or do we want to copy anyway? Currently I keep the profiler "busy" until reset() is called, and the returned events stay alive until then. We could add clone or similar if we need to extend the lifetime.

Summary: Adding ability to use Kineto (CUPTI) to profile CUDA kernels

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
python test/test_profiler.py

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                      sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us        63.16%      12.000us      12.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.750us        14.47%       2.750us       2.750us             1
                        Memcpy HtoD (Pagable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.250us        11.84%       2.250us       2.250us             1
                        Memcpy DtoH (Device -> Pagable)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        10.53%       2.000us       2.000us             1
                                               aten::mm        25.87%     364.400ms        25.87%     364.426ms     364.426ms       0.000us         0.00%       0.000us       0.000us             1
                                            aten::empty         0.00%      39.585us         0.00%      39.585us      19.792us       0.000us         0.00%       0.000us       0.000us             2
                                           aten::stride         0.00%       3.363us         0.00%       3.363us       1.121us       0.000us         0.00%       0.000us       0.000us             3
                                              aten::add        74.12%        1.044s        74.12%        1.044s        1.044s       0.000us         0.00%       0.000us       0.000us             1
                                               aten::to         0.00%      13.155us         0.01%     116.398us     116.398us       0.000us         0.00%       0.000us       0.000us             1
                                    aten::empty_strided         0.00%      30.365us         0.00%      30.365us      30.365us       0.000us         0.00%       0.000us       0.000us             1
                                            aten::copy_         0.01%      72.878us         0.01%      72.878us      72.878us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
```

dzhulgakov · 2020-11-02T17:59:35Z

torch/csrc/autograd/profiler.cpp

@@ -169,6 +173,14 @@ struct FileLineFunc {
  std::string funcname;
 };

+thread_local size_t corr_id_ = 0;


So, any conclusion on constructing the id instead of introducing another thread local?

At the very least, can it be folded in with the other thread-local vars?
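
The diff under review is C++, so the following is only a Python analogy of the "fold it into the existing thread-local state" option, not a proposed patch. All names (`_ProfilerThreadState`, `event_stack`, `next_correlation_id`) are hypothetical; the idea is that the per-thread counter lives in one thread-local object alongside the other per-thread fields instead of being a separate variable.

```python
import threading


class _ProfilerThreadState(threading.local):
    """Hypothetical per-thread state: one object instead of several thread-locals."""

    def __init__(self):
        # __init__ runs once per thread, on that thread's first access.
        self.correlation_id = 0   # plays the role of corr_id_ in the diff
        self.event_stack = []     # placeholder for other per-thread profiler data


_state = _ProfilerThreadState()


def next_correlation_id() -> int:
    """Return the next correlation ID for the calling thread."""
    _state.correlation_id += 1
    return _state.correlation_id


results = {}


def worker(name: str) -> None:
    results[name] = [next_correlation_id() for _ in range(3)]


threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Each thread counts independently from 1, so IDs are unique only within a thread. Whether that is enough, or whether the ID should instead be constructed from other data, is the question raised in the comment.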

ilia-cher · 2020-11-20T20:15:47Z

example of a trace with profiler step information

ngimel

lgtm

ilia-cher · 2020-11-24T01:13:34Z

all of CI is green (89 successful checks)

ilia-cher · 2020-11-24T06:19:23Z

(new lint failure is on files not in this PR)

ilia-cher · 2020-11-24T20:34:42Z

the linter seems to have picked up some errors in unrelated files (test/test_torch.py, torch/nn/modules/conv.py to name a few)

albanD · 2021-02-02T22:37:02Z

test/test_autograd.py

            x = torch.randn(10, 10, requires_grad=True)
            y = torch.randn(10, 10, requires_grad=True)
            z = x + y
            s = z.sum()
            s.backward()
+        print(p.key_averages().table(


@ilia-cher is this a forgotten debugging print? Can it be removed?
I see that a few of the profiler tests have print("") in them as well. Are these used?

It is convenient for debugging; I can remove these outputs if the tests are too spammy.

I'm removing the debug output in #51421.

Use libkineto in profiler

a4d4124

ilia-cher requested review from albanD and apaszke as code owners October 16, 2020 14:57

This was referenced Oct 16, 2020

Add Kineto submodule #45887

Closed

Add USE_KINETO build option #45888

Closed

ilia-cher requested a review from ngimel October 16, 2020 15:04

ilia-cher requested a review from robieta October 16, 2020 15:14

dzhulgakov self-requested a review October 19, 2020 20:05

dzhulgakov reviewed Oct 20, 2020

gdankel reviewed Oct 22, 2020

facebook-github-bot added the cla signed label Oct 30, 2020

ilia-cher requested review from mingzhe09088, mrshenli, pietern, pritamdamania87, rohan-varma and zhaojuanmao as code owners November 2, 2020 10:25

dzhulgakov reviewed Nov 2, 2020

ilia-cher added 2 commits November 20, 2020 11:35

ilia-cher requested a review from ngimel November 20, 2020 20:14

ngimel approved these changes Nov 20, 2020

ilia-cher added 9 commits November 21, 2020 00:05

ilia-cher mentioned this pull request Nov 23, 2020

Add Kineto CI job #48391

Closed

ilia-cher added 2 commits November 24, 2020 14:03

This was referenced Nov 25, 2020

Eager module attribution in profiler stack traces #48433

Merged

Output stacks (support for SVG visualization) #48438

Closed

facebook-github-bot closed this in f7a8bf2 Nov 25, 2020

facebook-github-bot deleted the gh/ilia-cher/55/head branch December 25, 2020 15:16

albanD reviewed Feb 2, 2021
