[PROTON] Refactor GPU profilers #4056

Jokeren · 2024-06-03T01:40:18Z

Extract duplicated code into GPUProfiler.h
Track finished correlation ids for both cupti and amd profilers

Jokeren · 2024-06-04T02:19:53Z

Waiting for @CRobeck's feedback

CRobeck

LGTM. A few minor comments.

CRobeck · 2024-06-04T14:15:04Z

third_party/proton/csrc/lib/Profiler/CuptiProfiler.cpp

-  }
-  cupti::activityFlushAll<true>(CUPTI_ACTIVITY_FLAG_FLUSH_FORCED);
+  profiler.correlation.flush(
+      /*maxRetries=*/100, /*sleepMs=*/10,


Did we do any testing of these maxRetries and sleepMs values? maxRetries is probably fine at 100, but does the 10 ms come from a documented latency value somewhere? Maybe it's worth adding a command-line argument to set these?

There isn't any documentation for it. In general, increasing it may yield more valid records in corner cases. Unfortunately, we don't support passing configurations through the command line or a Python function in Proton. We would like to keep the number of profiling knobs very small to avoid steepening the learning curve.

I think we could use environment variables to control it, though.

I can leave it to you to do a bit of investigation and add the support. What do you think?

Sure. Will look into it.
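For reference, the environment-variable idea discussed in this thread could look roughly like the sketch below. It is not Proton's implementation: the helper names `envOrDefault` and `flushWithRetries`, and any variable name passed to them, are hypothetical and only illustrate reading the two values with the defaults from the diff (100 retries, 10 ms) as fallbacks.

```cpp
#include <cassert>
#include <cctype>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <thread>

// Hypothetical helper: read a non-negative decimal integer from an
// environment variable, falling back to `fallback` when the variable is
// unset, empty, or contains anything other than digits (e.g. "10ms").
inline std::size_t envOrDefault(const char *name, std::size_t fallback) {
  const char *raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0')
    return fallback;
  for (const char *p = raw; *p != '\0'; ++p) {
    if (!std::isdigit(static_cast<unsigned char>(*p)))
      return fallback;
  }
  return static_cast<std::size_t>(std::strtoull(raw, nullptr, 10));
}

// Hypothetical retry loop: call `flushOnce`, check `isDone`, and sleep
// `sleepMs` milliseconds between attempts, at most `maxRetries` times.
// Returns true if `isDone` reported completion.
inline bool flushWithRetries(std::size_t maxRetries, std::size_t sleepMs,
                             const std::function<void()> &flushOnce,
                             const std::function<bool()> &isDone) {
  for (std::size_t attempt = 0; attempt < maxRetries; ++attempt) {
    flushOnce();
    if (isDone())
      return true;
    std::this_thread::sleep_for(std::chrono::milliseconds(sleepMs));
  }
  return isDone();
}
```

A caller would then pass `envOrDefault("SOME_RETRY_VAR", 100)` and `envOrDefault("SOME_SLEEP_VAR", 10)` to the retry loop, so the behavior is unchanged unless a user opts in. This keeps the knobs out of the command line and the Python API, in line with the concern raised above.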

CRobeck · 2024-06-04T14:17:30Z

third_party/proton/csrc/lib/Profiler/RoctracerProfiler.cpp

-                                        std::set<Data *> &dataSet,
-                                        const roctracer_record_t *record) {
+void processActivity(std::mutex &corrIdToExternIdMutex,
+                     std::map<uint64_t, size_t> &corrIdToExternId,


Is this data type inconsistency with the code above expected? Is it a known difference that cupti uses uint32_t and roctracer uses uint64_t?

Yes, it's a known issue.

https://rocm.docs.amd.com/projects/roctracer/en/latest/reference/roctracer-spec.html#activity-apis

https://docs.nvidia.com/cupti/api/structCUpti__ActivityAPI.html#_CPPv4N17CUpti_ActivityAPI13correlationIdE
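One way to live with the width difference is to key any shared bookkeeping on the wider type, since a uint32_t id converts to uint64_t without loss. The sketch below is an illustration only; `CorrelationTable` is a hypothetical class and not the structure introduced in GPUProfiler.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical shared table. Keys are stored as uint64_t so that both a
// 32-bit (CUPTI-style) and a 64-bit (roctracer-style) correlation id fit.
class CorrelationTable {
public:
  void insert(std::uint64_t correlationId, std::size_t externId) {
    std::lock_guard<std::mutex> lock(mutex_);
    corrIdToExternId_[correlationId] = externId;
  }

  // Returns true and writes `externId` if the correlation id is known.
  bool lookup(std::uint64_t correlationId, std::size_t &externId) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = corrIdToExternId_.find(correlationId);
    if (it == corrIdToExternId_.end())
      return false;
    externId = it->second;
    return true;
  }

private:
  mutable std::mutex mutex_;
  std::map<std::uint64_t, std::size_t> corrIdToExternId_;
};
```

A 32-bit id passed to `insert` or `lookup` is widened implicitly, so only the backend-specific callback code needs to know the native width.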

CRobeck · 2024-06-04T14:59:16Z

third_party/proton/csrc/lib/Profiler/RoctracerProfiler.cpp

      (symbol != NULL)
-          ? abi::__cxa_demangle(symbol, NULL, &funcnamesize, &status)
+          ? abi::__cxa_demangle(symbol, NULL, &funcNameSize, &status)


Nit: Maybe move this into a check around abi::__cxa_demangle:
if (const char* name = abi::__cxa_demangle(symbol, NULL, &funcnamesize, &status))
and then just return the original, still-mangled name if demangling is unsuccessful.
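A minimal sketch of that suggestion, assuming a GCC or Clang toolchain that provides `<cxxabi.h>`. The function name `demangleOrOriginal` is hypothetical and not taken from the PR; the sketch also frees the buffer that `abi::__cxa_demangle` allocates with `malloc`.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical helper: return the demangled form of `symbol`, or the
// original string unchanged when demangling fails.
inline std::string demangleOrOriginal(const char *symbol) {
  if (symbol == nullptr)
    return "";
  int status = 0;
  // With a null output buffer, __cxa_demangle returns a malloc'ed string on
  // success and a null pointer on failure.
  if (char *name = abi::__cxa_demangle(symbol, nullptr, nullptr, &status)) {
    std::string result(name);
    std::free(name); // the caller owns the returned buffer
    return result;
  }
  return symbol;
}
```

Returning a `std::string` avoids leaking the demangled buffer and removes the need to track `status` at the call site.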

CRobeck · 2024-06-04T15:43:34Z

third_party/proton/csrc/lib/Profiler/CuptiProfiler.cpp

@@ -173,10 +146,10 @@ struct CuptiProfiler::CuptiProfilerPimpl {
  static void callbackFn(void *userData, CUpti_CallbackDomain domain,
                         CUpti_CallbackId cbId, const void *cbData);

-  const inline static size_t AlignSize = 8;
-  const inline static size_t BufferSize = 64 * 1024 * 1024;
+  inline const static size_t AlignSize = 8;


Maybe constexpr instead of, or in addition to, inline here?
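For context on this suggestion: since C++17, a `static constexpr` data member is implicitly `inline`, so `constexpr` alone is enough and the values also become usable in constant expressions. The sketch below assumes C++17; `BufferConfig` is a hypothetical stand-in for the pimpl struct, and the values mirror the ones in the diff.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the struct in the diff.
struct BufferConfig {
  // In C++17, static constexpr data members are implicitly inline, so no
  // out-of-class definition is needed.
  static constexpr std::size_t AlignSize = 8;
  static constexpr std::size_t BufferSize = 64 * 1024 * 1024;
};

// constexpr members can be checked at compile time, which a plain
// `inline const static` member initialized at run time could not guarantee.
static_assert(BufferConfig::BufferSize % BufferConfig::AlignSize == 0,
              "buffer size must be a multiple of the alignment");
```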

1. Extract duplicated code into GPUProfiler.h 2. Track finished correlation ids for both cupti and amd profilers (cherry picked from commit 328b86d)

Cherry picks for release/3.0.x

General:
- e8bc45d [BACKEND][AMD] Disable linear layout due to perf regression (#4126)
- 9a0a7c2 [AMD] Add basic verification to MFMA encoding (#4117)

for RDNA:
- 100e2aa [AMD][WMMA] Support dot3d (#3674)
- 4a1ea8e [AMD][gfx11] Fix BF16 wmma instr generation (#4135)

Proton HIP PRs:
- 328b86d [PROTON] Refactor GPU profilers (#4056)
- 60613fb [PROTON] Roctracer: convert agent id to gpu id for gpu ops (#4090)
- c1776fa [PROTON][AMD] Add Proton HIP GPU Utilization Metrics (#4119)

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
Co-authored-by: Ilya V <152324710+joviliast@users.noreply.github.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: mwootton <michael.wootton@amd.com>
Co-authored-by: Corbin Robeck <corbin.robeck@amd.com>

Jokeren added 16 commits May 24, 2024 12:56

Update

f3461ea

Remove yield

6a80acc

Update

d92b078

Merge branch 'main' into keren/gpuprofiler

8f18ea7

Update

969fdbb

Update

d034c63

Update

883b5ad

Update

7958959

Update

455c984

Update

113f7e3

Update

6f1a233

Update

8e657b3

Update

1fb499a

Update

4dfbbd2

Update

8864cb0

Update

326da20

Jokeren marked this pull request as ready for review June 3, 2024 18:09

Jokeren requested a review from ptillet as a code owner June 3, 2024 18:09

Jokeren requested a review from antiagainst June 3, 2024 18:10

ptillet approved these changes Jun 4, 2024

CRobeck approved these changes Jun 4, 2024

Address comments

03f1efa

Jokeren merged commit 328b86d into main Jun 4, 2024
6 checks passed

Jokeren deleted the keren/gpuprofiler branch June 4, 2024 17:57

jlebar mentioned this pull request Jun 4, 2024

Use LLs in AsyncCopyGlobalToLocalOp lowering. #4070

Merged

jataylo mentioned this pull request Jun 20, 2024

[RELEASE] [AMD] Additional AMD cherry-picks #4175

Merged
