
[Core] Enable multiple profiler consumers and add a timeline/tracing profiler #1788

Merged
merged 11 commits into stride3d:master on Oct 15, 2023

Conversation

@froce (Collaborator) commented Sep 18, 2023

PR Details

Description

This PR adds a profiler outputting data in chrome://tracing format (for now) and changes Profiler to make that possible.

ChromeTracingProfileWriter consumes events and dumps them to a file which can then be analyzed with tools like chrome://tracing, perfetto or speedscope.

Usage

// Start
ChromeTracingProfileWriter.Start(string outputPath, bool indentOutput = false)
// Stop
ChromeTracingProfileWriter.Stop()

// To add another output format/client and receive events, use
// Profiler.Subscribe() and Profiler.Unsubscribe()
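For example, wrapping a specific scenario might look like this (a minimal sketch assuming the API above; the output path and RunStressTest() are placeholders, not part of this PR):

// Illustrative usage only; "trace.json" and RunStressTest() are placeholders.
ChromeTracingProfileWriter.Start("trace.json", indentOutput: false);

RunStressTest(); // the workload to capture

ChromeTracingProfileWriter.Stop();
// Open trace.json in ui.perfetto.dev, chrome://tracing or speedscope.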

Related Issue

N/A

Motivation and Context

I had trouble pinpointing performance issues with both the existing Profiler and external tools like dotTrace, so I started experimenting with adding something in the vein of chrome://tracing or Tracy. In particular, the existing Profiler is quite heavyweight and doesn't play well with multithreading, while everything happening inside the ThreadPool / Dispatcher is difficult to understand with sampling profilers because a lot of work happens in anonymous functions.

The TraceProfiler lets you see work on all threads and lets you look at individual frames to analyze spikes instead of just averages. For me this was also very helpful in understanding the structure of the engine.

An example trace from a rendering stress test opened in perfetto:
[screenshot omitted]

The example trace:
tracing_example.json.gz

Open Questions / Todo

  • Decide what to do with this. I've already found this very useful compared with the existing Profiler, so I'd like to see it added, but of course having multiple internal profilers is not ideal. Merging it with Profiler seemed quite challenging, but maybe it's a bit easier after the refactoring in Refactor Profiler and submit metrics to .NET Diagnostics API #1774 and I should take another look.
  • Add tests.
  • Update outdated comments and add new ones where needed.
  • Verify performance.
  • Update docs.

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My change requires a change to the documentation.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@manio143 (Member) commented:

That is very interesting. I want to see this as an opt-in. And given the implementation is not large I think we could merge it into the source code.
What do you think about implementing a hook into the existing Profiler?
Something like

public delegate void EventProcessedHandler(ref ProfilingEvent profilingEvent, ProfilingEventType eventType);
public static event EventProcessedHandler EventProcessed;

// at the end of Profiler.ProcessEvent
EventProcessed?.Invoke(ref profilingEvent, eventType);

// In your implementation (TraceProfiler), new static methods:
public static void StartCollectingEvents() => Profiler.EventProcessed += OnProfilingEventProcessed;
public static void StopCollectingEvents() => Profiler.EventProcessed -= OnProfilingEventProcessed;

And then just add more ProfilingKeys to places where they were missing - like you did with your implementation specifically?

And if there's something in the current Profiler implementation that prevents it from being that simple - maybe we can look at why that is?
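For illustration, the consuming side of such a hook might look roughly like this (a sketch only; the handler name and body are assumptions, not part of the proposal above):

// Sketch of a subscriber to the proposed Profiler.EventProcessed hook.
private static void OnProfilingEventProcessed(ref ProfilingEvent profilingEvent, ProfilingEventType eventType)
{
    if (eventType != ProfilingEventType.CpuProfilingEvent)
        return;

    // Convert the event into a chrome://tracing record and hand it to a
    // background writer (e.g. via a queue) instead of writing inline here.
}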

@manio143 (Member) commented:

We can easily add caller info into the Profiler. Other than that, I see mainly that you're adding thread info and maintaining a per-thread stack to stop the Stopwatch for Begin/End pairs (already done by ProfilingState).
The last difference: instead of using a per-frame ring buffer to store events, you use async channels to stream them to a file.
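For readers unfamiliar with the pattern, streaming profiling events through a channel to a single writer looks roughly like this (a generic System.Threading.Channels sketch, not the PR's actual code):

using System.Threading.Channels;
using System.Threading.Tasks;
using Stride.Core.Diagnostics;

// Many producers (profiled threads) write without taking a lock:
var channel = Channel.CreateUnbounded<ProfilingEvent>(
    new UnboundedChannelOptions { SingleReader = true, SingleWriter = false });

void Emit(ProfilingEvent evt) => channel.Writer.TryWrite(evt);

// A single consumer drains the channel and serializes events to a file:
async Task WriteLoopAsync()
{
    await foreach (var evt in channel.Reader.ReadAllAsync())
    {
        // serialize evt as a chrome://tracing record here
    }
}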

@jazzay commented Sep 18, 2023

I'm curious how we ensure profiling call sites get stripped for release/shipping builds?

@froce (Collaborator, Author) commented Sep 18, 2023

And then just add more ProfilingKeys to places where they were missing - like you did with your implementation specifically?

And if there's something in the current Profiler implementation that prevents it from being that simple - maybe we can look at why that is?

My first concern here would be performance. How big a deal that is really depends on what you're profiling: a few hundred scopes spread around should be fine; thousands of scopes in a small function that gets called by worker threads is perf death by synchronization overhead.

As a quick best-case vs. worst-case example, this is just collecting 100k events, without any output, either in a regular for loop or in a Parallel.For:

BenchmarkDotNet=v0.13.2, OS=Windows 10 (10.0.19045.3324)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.7.23376.3
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT AVX2
  Job-JPLRAA : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT AVX2

InvocationCount=1  UnrollFactor=1  
| Method                 | Mean      | Error     | StdDev    | Ratio | RatioSD |
|----------------------- |----------:|----------:|----------:|------:|--------:|
| StrideProfiler         | 42.977 ms | 0.7726 ms | 0.6032 ms |  3.85 |    0.35 |
| Trace                  | 11.706 ms | 0.6019 ms | 1.7556 ms |  1.00 |    0.00 |
| StrideProfilerParallel | 77.534 ms | 1.5432 ms | 2.8985 ms |  6.86 |    0.75 |
| TraceParallel          |  2.490 ms | 0.1216 ms | 0.3566 ms |  0.22 |    0.05 |

I also thought that adding lower level profiling keys would just clutter the existing profiling UI.
Maybe adding a ProfilingKeyFlags.TracingOnly would be a good option.
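If such a flag were added, declaring a trace-only key might look like this (purely hypothetical; neither ProfilingKeyFlags.TracingOnly nor this exact constructor call is guaranteed to match the final API):

// Hypothetical: ProfilingKeyFlags.TracingOnly does not exist (yet).
// Keys flagged like this would show up in the exported trace but stay
// hidden from the in-game GameProfilingSystem overlay.
static readonly ProfilingKey ThreadPoolWorkItemKey =
    new ProfilingKey("ThreadPool.WorkItem", ProfilingKeyFlags.TracingOnly);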

I think my other issues were really about Profiler itself and shouldn't affect this PR; your PR has already improved it, and the rest can wait for a follow-up.
I'm also not a fan of needing to add/remove ProfilingKeys even for temporary profiling, but it's a small price to pay.

@manio143 (Member) commented:

I'm curious how we ensure profiling call sites get stripped for release/shipping builds?

They don't need to. When the profiler is off the JIT can inline the check and skip the calls to it quite well.

If the Profiler perf is not great, it would be good to determine why and see if we can improve it. Maybe it would make sense to gut it even more and use something closer to the runtime? What we need is basically a simple solution to start/stop scopes, gather perf info, and output it to some stream once per frame - either consumed by in-game UI or by external tools or written to a file, etc.

@jazzay commented Sep 18, 2023

They don't need to. When the profiler is off the JIT can inline the check and skip the calls to it quite well.

In the current implementation of TracingProfiler/Scope I don't see any 'is active' checks, so how would it avoid that work?

Additionally, what about AOT platforms like iOS, etc.?

@froce (Collaborator, Author) commented Oct 4, 2023

I think I've finally figured out what I want to do with this. As a first step, replace the Profiler internal buffers with Channels to get rid of the locking and to get support for multiple consumers. Then I can convert the TraceProfiler call sites to the existing functions/datatypes and add the JSON writer as a second consumer. The only functional change I would want to make here is to stop sending Begin events; they just get filtered out on the receiving side (GameProfilingSystem) anyway, while basically doubling the event count for no benefit.
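Assuming Subscribe() hands out a channel reader of ProfilingEvents (the exact signature isn't spelled out in this thread), a second consumer would then look roughly like this:

// Sketch: the return type of Subscribe() and the Unsubscribe() argument are assumptions.
var reader = Profiler.Subscribe();

var writerTask = Task.Run(async () =>
{
    await foreach (var evt in reader.ReadAllAsync())
    {
        // forward evt to a custom sink (file, network, in-game UI, ...)
    }
});

// When done capturing:
Profiler.Unsubscribe(reader);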

Then, as a second PR, we could take another look at the existing API and datatypes:

  • Add caller info
  • I haven't measured the impact, but ProfilingState and ProfilingEvent are currently both 296 bytes; that's bound to have a perf impact (a quick way to check such sizes is sketched after this list). Attributes make up 144 bytes in each and are never used; custom messages are 112 bytes but at least sometimes used. So we can probably slim that down, or split things up so only what's used is actually recorded.
  • Mark events are very rarely used and could basically be replaced with 0-duration timespans. Maybe we should add an API for counters instead.
  • Per-frame grouping of events (okay, maybe that's a good use for markers)
  • Event filtering (e.g. GameProfilingSystem does not need to handle and string-format very short scopes)
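A quick way to check such struct sizes, independent of this PR (Unsafe.SizeOf reports the managed size in bytes):

using System;
using System.Runtime.CompilerServices;
using Stride.Core.Diagnostics;

Console.WriteLine(Unsafe.SizeOf<ProfilingEvent>()); // e.g. 296 as quoted above
Console.WriteLine(Unsafe.SizeOf<ProfilingState>());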

And then, as a last step, take another look at GPU profiling. Usually I would expect GPU information per queue (on modern APIs), but I don't think that's currently exposed. But I might very well be wrong about that; I didn't want to dig too deeply into it for now.

As an alternative to rolling our own, we could also look at integrating https://github.com/clibequilibrium/Tracy-CSharp; it's definitely well optimized and feature-rich, but of course it's another (non-C#) dependency (+ imgui), and I don't know what limitations the .NET integration might have.

@froce (Collaborator, Author) commented Oct 4, 2023

Other than that, I see mainly that you're adding thread info and maintaining a per-thread stack to stop the Stopwatch for Begin/End pairs (already done by ProfilingState).

The stack is only there to support profiling across function boundaries; with ProfilingState you have to pass it along yourself somehow if Begin and End are not in the same scope. (That also means my attempt at limiting event capturing by stack depth was not working, oops. I will remove it; it did not seem useful in practice anyway.)

They don't need to. When the profiler is off the JIT can inline the check and skip the calls to it quite well.

In the current implementation of TracingProfiler/Scope I don't see any 'is active' checks, so how would it avoid that work?

Additionally, what about AOT platforms like iOS, etc.?

Yeah, no checks in the current version. I will add runtime checks / use the existing ones in Profiler/ProfilingKey. We could add a compile-time way of disabling profiling entirely if it's needed, but I'd prefer having the profiler available in all builds by default.
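For reference, those two options roughly correspond to a cheap runtime check plus, if ever needed, a [Conditional] attribute for compile-time stripping. A sketch, with the enable check as a placeholder for whatever Profiler/ProfilingKey already exposes:

using System.Diagnostics;
using Stride.Core.Diagnostics;

public static class TraceProfilerSketch
{
    // Runtime: bail out early when profiling is disabled for this key.
    public static void BeginScope(ProfilingKey key)
    {
        if (!Profiler.IsEnabled(key)) // placeholder for the existing enabled check
            return;
        // ...record the begin event...
    }

    // Compile time: calls to this method disappear entirely unless the
    // caller is compiled with the STRIDE_PROFILING symbol defined.
    [Conditional("STRIDE_PROFILING")]
    public static void Mark(ProfilingKey key)
    {
        // ...record a marker event...
    }
}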

@manio143 (Member) commented Oct 4, 2023

I think we might be able to separate the message logging from the profiler. There isn't really a good reason for them to be tied together (except being enabled when profiling is enabled), and separating them will simplify things a bit.

For having variants of the same profiling key, attributes should be used.
And for things like the GC counts, we may need a counter primitive instead, as you mentioned.

Commits pushed:
- Flip ordering of Profiling results, to sort from longest to shortest.
- Only record `End` events in the Histogram, Begin/Mark events don't have a duration.
- …ternally.
- Channels simplified the addition of multiple subscribers and should be a big perf win when collecting events from many threads.
- Adds thread information to `ProfilingEvent` and `ProfilingState`
- Adapted `GameProfilingSystem` to the changes
- … to output profiling events in 'chrome://tracing' JSON format, which can be analyzed by ui.perfetto.dev and other tools
@froce (Collaborator, Author) commented Oct 6, 2023

I am basically done with my first pass. If the general approach is okay to merge I will add more tests and fix the comments/docs. Everything else can probably be addressed during review or in follow-up PRs.
The remaining problems are mainly around GPU events, but none of it should affect existing functionality.
Currently I map all GPU events to a single thread id, which is certainly not correct, but I also don't see an easy fix for now.
Syncing CPU and GPU timestamps also isn't working correctly; I need to investigate whether it's just my logic that's wrong or something else:
[screenshot omitted]

Then there's this:
[screenshot omitted]
Total time is 16 ms, but threadpool work takes 123 ms combined ;) Should I add sorting by avg time?

For having variants of the same profiling key, attributes should be used. And for things like the GC counts, we may need a counter primitive instead, as you mentioned.

Do you have an example? I still don't think I get it.

ProfilingKeys are particularly awkward for me when profiling around polymorphism, but I don't have a suggestion for how to improve it:

// Before:
using (Profiler.Begin($"{child.GetType().Name}.Draw"))
{
    child.Draw(drawContext);
}
// Good. Every subclass gets its own key.

// Now:
using (Profiler.Begin(DrawChildKey, $"{child.GetType().Name}.Draw"))
{
    child.Draw(drawContext);
}
// Sadness.

Doing it properly now requires some Dictionary<Type, ProfilingKey>, which seems like a lot of effort for something that should be quick to add and remove again. I guess I'll do it for the base in-engine cases where it makes sense, but it's not ideal.
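For the record, the cache itself is only a few lines; it's just boilerplate one would rather not write for throwaway profiling (a sketch, not code from this PR):

using System;
using System.Collections.Concurrent;
using Stride.Core.Diagnostics;

static class ChildDrawKeys
{
    private static readonly ConcurrentDictionary<Type, ProfilingKey> Keys = new();

    // One key per concrete child type, created on first use.
    public static ProfilingKey For(object child) =>
        Keys.GetOrAdd(child.GetType(), t => new ProfilingKey($"{t.Name}.Draw"));
}

// Usage:
// using (Profiler.Begin(ChildDrawKeys.For(child)))
//     child.Draw(drawContext);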

I think we might be able to separate the message logging from the profiler. There isn't really a good reason for it to be tied together (except being enabled when profiling is enabled) and it will simplify things a bit.

It doesn't have to be tied together, but having an easy way to log some specific profiling events (e.g. asset loading time) to a file/console is nice. I'd just like to do it without bloating all other profiling in the process.

First look at performance:

BenchmarkDotNet=v0.13.2, OS=Windows 10 (10.0.19045.3448)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.7.23376.3
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT AVX2
  Job-MBWYOB : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT AVX2
  Job-YPSZPM : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT AVX2

| Method | Job        | NuGetReferences     | NUM_ENTITIES | WithProfiling | Mean      | Error     | StdDev    | Median    | Allocated |
|------- |----------- |-------------------- |------------- |-------------- |----------:|----------:|----------:|----------:|----------:|
| Run    | Job-MBWYOB | Stride.Core 4.1.1.1 | 2000         | False         |  6.412 ms | 0.1267 ms | 0.2645 ms |  6.387 ms |  12.08 KB |
| Run    | Job-YPSZPM | Stride.Core 4.1.2.1 | 2000         | False         |  6.348 ms | 0.1249 ms | 0.1870 ms |  6.262 ms |  12.08 KB |
| Run    | Job-MBWYOB | Stride.Core 4.1.1.1 | 2000         | True          | 18.580 ms | 0.3451 ms | 0.3059 ms | 18.606 ms |  12.57 KB |
| Run    | Job-YPSZPM | Stride.Core 4.1.2.1 | 2000         | True          |  7.184 ms | 0.1436 ms | 0.2766 ms |  7.052 ms |  12.39 KB |

4.1.1.1 is master with the added profiling keys, 4.1.2.1 is this PR, so we can now add profiling to work happening on the threadpool without trashing performance (on desktop at least...).

@froce marked this pull request as ready for review on October 6, 2023 19:04
@manio143 (Member) commented Oct 7, 2023

For having variants of the same profiling key, attributes should be used. And for things like the GC counts, we may need a counter primitive instead, as you mentioned.

Do you have an example? I still don't think I get it.

Probably could have been more explicit here.
Let's take this as an example:

var url = newUrl ?? reference.Url;
using (var profile = Profiler.Begin(ContentProfilingKeys.ContentReload, url))
{
    DeserializeObject(reference.Url, url, obj.GetType(), obj, settings);
}

For many assets with different URLs we run the same profiling key called ContentReload. If we're interested in measuring which asset takes the longest to reload, we should be looking at variants of that key with, for example, an attribute ("URL", url).
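For comparison, this is how the same variant would be expressed in the System.Diagnostics activity/tag model that OpenTelemetry builds on (illustrative only; the ActivitySource name is made up and Stride's profiler does not go through ActivitySource here):

using System.Diagnostics;

static class ContentReloadTracing
{
    // Illustrative source name, not an existing Stride instrument.
    private static readonly ActivitySource Source = new("Stride.Content");

    public static void Reload(object obj, string url)
    {
        using var activity = Source.StartActivity("ContentReload");
        activity?.SetTag("URL", url); // the attribute / "dimension" for this variant
        // DeserializeObject(reference.Url, url, obj.GetType(), obj, settings) would go here
    }
}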

Whereas in this example

using (processor.UpdateProfilingState = Profiler.Begin(processor.UpdateProfilingKey, "Entities: {0}", entities.Count))
{
    processor.Update(gameTime);
}

There isn't any variation in the parameters. UpdateProfilingKey is annotated with a message on entity count.
This should potentially be a separate counter measurement in a tracing system. It's useful to correlate entity count with how long an Update takes, but the number of entities is not inherently a variation of the update process.
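In the .NET metrics API that #1774 started hooking into, such a measurement would be a separate instrument rather than a message on the key; roughly (a sketch, meter and instrument names are illustrative):

using System.Diagnostics.Metrics;

static class EntityMetrics
{
    // Illustrative names; not existing Stride instruments.
    private static readonly Meter Meter = new("Stride.Engine");
    private static readonly Counter<long> EntitiesUpdated =
        Meter.CreateCounter<long>("entities.updated");

    // Called from the update loop, alongside the existing profiling scope:
    public static void Record(int entityCount) => EntitiesUpdated.Add(entityCount);
}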

using (var profile = Profiler.Begin(GameProfilingKeys.GameSystemLoadContent, gameSystemName))
    contentable.LoadContent();

Another example where the parameter passed is making this a variant, because we want to distinctly identify the load time of different game systems.


So variants are basically a quick way to dynamically separate a profiling key into a few subkeys.
The way the system is designed currently, there is a parent-child relationship between profiling keys.
In practice, however, that relationship is used more for turning things on and off than for grouping results logically.
So each profiling key results in a separate measurement entry in the buffer.
With a more OpenTelemetry-like approach, a single measurement with attributes can be looked at either as a whole (not caring about variants) or divided into variant specifics (where each attribute creates a "dimension" the measurement can be partitioned by).

Hope this makes a bit more sense.

@manio143 (Member) left a review comment:

Looks very promising. The perf improvements sound lovely and I see you found that I made a mistake with where I was emitting events to the meter.

Thanks to you I'm now more familiar with built-in channels.
For interested folks, have a look at An Introduction to System.Threading.Channels

Review comment threads (now outdated/resolved) on:
sources/core/Stride.Core.Tests/TestProfiler.cs
sources/core/Stride.Core/Diagnostics/Profiler.cs
sources/core/Stride.Core/Diagnostics/ProfilingState.cs
sources/core/Stride.Core/Threading/ThreadPool.cs
sources/engine/Stride.Rendering/Rendering/RenderSystem.cs
@manio143 (Member) commented Oct 7, 2023

Total time is 16 ms, but threadpool work takes 123 ms combined ;) Should I add sorting by avg time?

This is why there's a concept of a Histogram in meters. It basically looks at the data in terms of duration and lets you view it by max, min, average, and percentiles.
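For context, a histogram instrument in that same API records every duration and lets a listener derive min/max/average/percentiles (sketch; the instrument name is illustrative):

using System.Diagnostics.Metrics;

static class ScopeMetrics
{
    private static readonly Meter Meter = new("Stride.Profiler"); // illustrative name
    private static readonly Histogram<double> ScopeDuration =
        Meter.CreateHistogram<double>("profiling.scope.duration", unit: "ms");

    // Called when a profiling scope ends:
    public static void Record(System.TimeSpan elapsed) => ScopeDuration.Record(elapsed.TotalMilliseconds);
}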

I think adding a display mode with average aggregation over a second would be a great option!

And for the threadpool specifically wouldn't the issue also be around it running on multiple threads in parallel? Thus squeezing more compute time into the same ms?

@froce (Collaborator, Author) commented Oct 8, 2023

And for the threadpool specifically wouldn't the issue also be around it running on multiple threads in parallel? Thus squeezing more compute time into the same ms?

Yes, exactly that.

This is why there's a concept of a Histogram in meters. It basically looks at the data in terms of duration and allows to view it by max, min, average and percentiles.

GameProfilingSystem is already aggregating over RefreshTime (KP_Plus/KP_Minus by default to change it) and then rescaling to per-frame where it makes sense.

@froce (Collaborator, Author) commented Oct 14, 2023

That should be it for now.

I've marked it as a breaking change, but unless I've missed something, it would only be breaking if someone replaced GameProfilingSystem with their own solution.
Performance is okay - not amazing, but better than before, especially when capturing events from the ThreadPool.
CPU and GPU timelines are not synced correctly, but that only affects the JSON output so far. D3D12 and Vulkan have dedicated APIs for clock calibration; on older APIs we'd need to do it ourselves.
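For reference, the calibration those APIs provide boils down to one simultaneous (CPU, GPU) timestamp pair plus the GPU tick frequency, after which GPU ticks can be mapped onto the CPU timeline. Generic arithmetic, not code from this PR:

// calibCpuTicks / calibGpuTicks: a pair sampled "at the same instant"
// (e.g. ID3D12CommandQueue::GetClockCalibration or vkGetCalibratedTimestampsEXT).
// gpuFrequency: GPU timestamp ticks per second; Stopwatch.Frequency: CPU ticks per second.
static long GpuToCpuTicks(long gpuTicks, long calibGpuTicks, long calibCpuTicks, long gpuFrequency)
{
    long deltaGpu = gpuTicks - calibGpuTicks;
    return calibCpuTicks + (long)(deltaGpu * ((double)System.Diagnostics.Stopwatch.Frequency / gpuFrequency));
}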

And I imagine the option to output tracing .json and view it externally should also be mentioned in the profiling section of the docs.

@froce froce requested a review from manio143 October 14, 2023 15:01
@manio143 (Member) left a review comment:

Looks good. I could nitpick the whitespace here and there, but it's fine.

@manio143 (Member) commented Oct 15, 2023

I ran the Starbreach project to test these changes. I'm happy to merge it.
I also verified that no changes are needed to the GameProfiler script - the new sorting mode works straight away.

However, you may need to look into the performance of ChromeTracingProfileWriter. Having it enabled had a big impact on the system, with the frame rate dropping by around 10 FPS (from 95-100 to 85-90).
Edit: I tried a handful of tricks and nothing seems to improve things - I think it may just be that this system is only meant for capturing small traces (a few frames) rather than running for seconds to minutes.

@manio143 changed the title from "Add a timeline / tracing profiler" to "[Core] Enable multiple profiler consumers and add a timeline/tracing profiler" on Oct 15, 2023
@manio143 merged commit f16c3c7 into stride3d:master on Oct 15, 2023
1 check passed
@xen2 (Member) commented Oct 16, 2023

It seems the Android build is broken.
https://teamcity.stride3d.net/buildConfiguration/Engine_BuildAndroid/21765?guest=yes

FYI, builds don't auto-run on PRs from people who are not yet marked as collaborators or contributors (for security reasons).
However, it seems @froce is a collaborator; I'm not sure if he wasn't at the time the commits were pushed, or if something is wrong with the TeamCity config. Anyhow, if tests don't run automatically, I would like to add a simple bot command that contributors could use in a comment on GitHub to trigger build/tests.

@sandsalamand commented:
It would be nice to mention this new ability in the documentation somewhere.
