
[Core] Enable multiple profiler consumers and add a timeline/tracing profiler #1788

Merged
merged 11 commits into stride3d:master on Oct 15, 2023

Conversation

@froce (Collaborator) commented Sep 18, 2023

PR Details

Description

This PR adds a profiler outputting data in chrome://tracing format (for now) and changes Profiler to make that possible.

ChromeTracingProfileWriter consumes events and dumps them to a file which can then be analyzed with tools like chrome://tracing, perfetto or speedscope.

Usage

// Start
ChromeTracingProfileWriter.Start(string outputPath, bool indentOutput = false)
// Stop
ChromeTracingProfileWriter.Stop()

// To add another output format/client and receive events, use
// Profiler.Subscribe() and Profiler.Unsubscribe()
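For example, wrapping a specific scenario might look like this (a minimal sketch assuming the API above; the output path and RunStressTest() are placeholders, not part of this PR):

// Illustrative usage only; "trace.json" and RunStressTest() are placeholders.
ChromeTracingProfileWriter.Start("trace.json", indentOutput: false);

RunStressTest(); // the workload to capture

ChromeTracingProfileWriter.Stop();
// Open trace.json in ui.perfetto.dev, chrome://tracing or speedscope.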

Related Issue

N/A

Motivation and Context

I had trouble pinpointing performance issues with both the existing Profiler and external tools like dotTrace, so I started experimenting with adding something in the vein of chrome://tracing or Tracy. In particular, the existing Profiler is quite heavyweight and doesn't play well with multithreading, while everything happening inside the ThreadPool / Dispatcher is difficult to understand with sampling profilers because a lot of work happens in anonymous functions.

The TraceProfiler lets you see work on all threads and lets you look at individual frames to analyze spikes instead of just averages. For me this was also very helpful in understanding the structure of the engine.

An example trace from a rendering stress test opened in perfetto:
[screenshot omitted]

The example trace:
tracing_example.json.gz

Open Questions / Todo

  • Decide what to do with this. I've already found this very useful compared with the existing Profiler, so I'd like to see it added, but of course having multiple internal profilers is not ideal. Merging it with Profiler seemed quite challenging, but maybe it's a bit easier after the refactoring in Refactor Profiler and submit metrics to .NET Diagnostics API #1774 and I should take another look.
  • Add tests.
  • Update outdated comments and add new ones where needed.
  • Verify performance.
  • Update docs.

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My change requires a change to the documentation.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@manio143 (Member) commented:

That is very interesting. I want to see this as an opt-in. And given the implementation is not large I think we could merge it into the source code.
What do you think about implementing a hook into the existing Profiler?
Something like

public delegate void EventProcessedHandler(ref ProfilingEvent profilingEvent, ProfilingEventType eventType);
public static event EventProcessedHandler EventProcessed;

// at the end of Profiler.ProcessEvent
EventProcessed?.Invoke(ref profilingEvent, eventType);

// In your implementation (TraceProfiler), new static methods:
public static void StartCollectingEvents() => Profiler.EventProcessed += OnProfilingEventProcessed;
public static void StopCollectingEvents() => Profiler.EventProcessed -= OnProfilingEventProcessed;

And then just add more ProfilingKeys to places where they were missing - like you did with your implementation specifically?

And if there's something in the current Profiler implementation that prevents it from being that simple - maybe we can look at why that is?
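For illustration, the consuming side of such a hook might look roughly like this (a sketch only; the handler name and body are assumptions, not part of the proposal above):

// Sketch of a subscriber to the proposed Profiler.EventProcessed hook.
private static void OnProfilingEventProcessed(ref ProfilingEvent profilingEvent, ProfilingEventType eventType)
{
    if (eventType != ProfilingEventType.CpuProfilingEvent)
        return;

    // Convert the event into a chrome://tracing record and hand it to a
    // background writer (e.g. via a queue) instead of writing inline here.
}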

@manio143 (Member) commented:

We can easily add caller info into the Profiler. Other than that, I see mainly that you're adding thread info and maintaining a per-thread stack to stop the Stopwatch for Begin/End pairs (already done by ProfilingState).
The last difference: instead of using a per-frame ring buffer to store events, you use async channels to stream them to a file.
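For readers unfamiliar with the pattern, streaming profiling events through a channel to a single writer looks roughly like this (a generic System.Threading.Channels sketch, not the PR's actual code):

using System.Threading.Channels;
using System.Threading.Tasks;
using Stride.Core.Diagnostics;

// Many producers (profiled threads) write without taking a lock:
var channel = Channel.CreateUnbounded<ProfilingEvent>(
    new UnboundedChannelOptions { SingleReader = true, SingleWriter = false });

void Emit(ProfilingEvent evt) => channel.Writer.TryWrite(evt);

// A single consumer drains the channel and serializes events to a file:
async Task WriteLoopAsync()
{
    await foreach (var evt in channel.Reader.ReadAllAsync())
    {
        // serialize evt as a chrome://tracing record here
    }
}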

@jazzay commented Sep 18, 2023

I'm curious how we ensure profiling call sites get stripped for release/shipping builds?

@froce (Collaborator, Author) commented Sep 18, 2023

And then just add more ProfilingKeys to places where they were missing - like you did with your implementation specifically?

And if there's something in the current Profiler implementation that prevents it from being that simple - maybe we can look at why that is?

My first concern here would be performance. How big a deal that is really depends on what you're profiling: a few hundred scopes spread around should be fine; thousands of scopes in a small function that gets called by worker threads is perf death by synchronization overhead.

As a quick best-case vs. worst-case example, this is just collecting 100k events, without any output, either in a regular for loop or in a Parallel.For:

BenchmarkDotNet=v0.13.2, OS=Windows 10 (10.0.19045.3324)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.7.23376.3
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT AVX2
  Job-JPLRAA : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT AVX2

InvocationCount=1  UnrollFactor=1  
| Method                 | Mean      | Error     | StdDev    | Ratio | RatioSD |
|----------------------- |----------:|----------:|----------:|------:|--------:|
| StrideProfiler         | 42.977 ms | 0.7726 ms | 0.6032 ms |  3.85 |    0.35 |
| Trace                  | 11.706 ms | 0.6019 ms | 1.7556 ms |  1.00 |    0.00 |
| StrideProfilerParallel | 77.534 ms | 1.5432 ms | 2.8985 ms |  6.86 |    0.75 |
| TraceParallel          |  2.490 ms | 0.1216 ms | 0.3566 ms |  0.22 |    0.05 |

I also thought that adding lower level profiling keys would just clutter the existing profiling UI.
Maybe adding a ProfilingKeyFlags.TracingOnly would be a good option.
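If such a flag were added, declaring a trace-only key might look like this (purely hypothetical; neither ProfilingKeyFlags.TracingOnly nor this exact constructor call is guaranteed to match the final API):

// Hypothetical: ProfilingKeyFlags.TracingOnly does not exist (yet).
// Keys flagged like this would show up in the exported trace but stay
// hidden from the in-game GameProfilingSystem overlay.
static readonly ProfilingKey ThreadPoolWorkItemKey =
    new ProfilingKey("ThreadPool.WorkItem", ProfilingKeyFlags.TracingOnly);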

I think my other issues were really about Profiler itself and shouldn't affect this PR; your PR has already improved it, and the rest can wait for a follow-up.
I'm also not a fan of needing to add/remove ProfilingKeys even for temporary profiling, but it's a small price to pay.

@manio143 (Member) commented:

I'm curious how we ensure profiling call sites get stripped for release/shipping builds?

They don't need to. When the profiler is off the JIT can inline the check and skip the calls to it quite well.

If the Profiler perf is not great, it would be good to determine why and see if we can improve it. Maybe it would make sense to gut it even more and use something closer to the runtime? What we need is basically a simple solution to start/stop scopes, gather perf info, and output it to some stream once per frame - either consumed by in-game UI or by external tools or written to a file, etc.

@jazzay commented Sep 18, 2023

They don't need to. When the profiler is off the JIT can inline the check and skip the calls to it quite well.

In the current implementation of TracingProfiler/Scope I don't see any 'is active' checks, so how would it avoid that work?

Additionally, what about AOT platforms like iOS, etc.?

@froce (Collaborator, Author) commented Oct 4, 2023

I think I've finally figured out what I want to do with this. As a first step, replace the Profiler internal buffers with Channels to get rid of the locking and to get support for multiple consumers. Then I can convert the TraceProfiler call sites to the existing functions/datatypes and add the JSON writer as a second consumer. The only functional change I would want to make here is to stop sending Begin events; they just get filtered out on the receiving side (GameProfilingSystem) anyway, while basically doubling the event count for no benefit.
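Assuming Subscribe() hands out a channel reader of ProfilingEvents (the exact signature isn't spelled out in this thread), a second consumer would then look roughly like this:

// Sketch: the return type of Subscribe() and the Unsubscribe() argument are assumptions.
var reader = Profiler.Subscribe();

var writerTask = Task.Run(async () =>
{
    await foreach (var evt in reader.ReadAllAsync())
    {
        // forward evt to a custom sink (file, network, in-game UI, ...)
    }
});

// When done capturing:
Profiler.Unsubscribe(reader);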

Then, as a second PR, we could take another look at the existing API and datatypes:

  • Add caller info
  • I haven't measured the impact, but ProfilingState and ProfilingEvent are currently both 296 bytes; that's bound to have a perf impact (a quick way to check such sizes is sketched after this list). Attributes make up 144 bytes in each and are never used; custom messages are 112 bytes but at least sometimes used. So we can probably slim that down, or split things up so only what's used is actually recorded.
  • Mark events are very rarely used and could basically be replaced with 0-duration timespans. Maybe we should add an API for counters instead.
  • Per-frame grouping of events (okay, maybe that's a good use for markers)
  • Event filtering (e.g. GameProfilingSystem does not need to handle and string-format very short scopes)
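A quick way to check such struct sizes, independent of this PR (Unsafe.SizeOf reports the managed size in bytes):

using System;
using System.Runtime.CompilerServices;
using Stride.Core.Diagnostics;

Console.WriteLine(Unsafe.SizeOf<ProfilingEvent>()); // e.g. 296 as quoted above
Console.WriteLine(Unsafe.SizeOf<ProfilingState>());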

And then, as a last step, take another look at GPU profiling. Usually I would expect GPU information per queue (on modern APIs), but I don't think that's currently exposed. But I might very well be wrong about that; I didn't want to dig too deeply into it for now.

As an alternative to rolling our own, we could also look at integrating https://github.com/clibequilibrium/Tracy-CSharp; it's definitely well optimized and feature-rich, but of course it's another (non-C#) dependency (+ imgui), and I don't know what limitations the .NET integration might have.

@froce (Collaborator, Author) commented Oct 4, 2023

Other than that, I see mainly that you're adding thread info and maintaining a per-thread stack to stop the Stopwatch for Begin/End pairs (already done by ProfilingState).

The stack is only there to support profiling across function boundaries; with ProfilingState you have to pass it along yourself somehow if Begin and End are not in the same scope. (That also means my attempt at limiting event capturing by stack depth was not working, oops. I will remove it; it did not seem useful in practice anyway.)

They don't need to. When the profiler is off the JIT can inline the check and skip the calls to it quite well.

In the current implementation of TracingProfiler/Scope I don't see any 'is active' checks, so how would it avoid that work?

Additionally, what about AOT platforms like iOS, etc.?

Yeah, no checks in the current version. I will add runtime checks / use the existing ones in Profiler/ProfilingKey. We could add a compile-time way of disabling profiling entirely if it's needed, but I'd prefer having the profiler available in all builds by default.
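For reference, those two options roughly correspond to a cheap runtime check plus, if ever needed, a [Conditional] attribute for compile-time stripping. A sketch, with the enable check as a placeholder for whatever Profiler/ProfilingKey already exposes:

using System.Diagnostics;
using Stride.Core.Diagnostics;

public static class TraceProfilerSketch
{
    // Runtime: bail out early when profiling is disabled for this key.
    public static void BeginScope(ProfilingKey key)
    {
        if (!Profiler.IsEnabled(key)) // placeholder for the existing enabled check
            return;
        // ...record the begin event...
    }

    // Compile time: calls to this method disappear entirely unless the
    // caller is compiled with the STRIDE_PROFILING symbol defined.
    [Conditional("STRIDE_PROFILING")]
    public static void Mark(ProfilingKey key)
    {
        // ...record a marker event...
    }
}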

@manio143 (Member) commented Oct 4, 2023

I think we might be able to separate the message logging from the profiler. There isn't really a good reason for them to be tied together (except being enabled when profiling is enabled), and separating them will simplify things a bit.

For having variants of the same profiling key, attributes should be used.
And for things like the GC counts, we may need a counter primitive instead, as you mentioned.

Commits pushed:
- Flip ordering of Profiling results, to sort from longest to shortest.
- Only record `End` events in the Histogram, Begin/Mark events don't have a duration.
- …ternally.
- Channels simplified the addition of multiple subscribers and should be a big perf win when collecting events from many threads.
- Adds thread information to `ProfilingEvent` and `ProfilingState`
- Adapted `GameProfilingSystem` to the changes
- … to output profiling events in 'chrome://tracing' JSON format, which can be analyzed by ui.perfetto.dev and other tools
@froce (Collaborator, Author) commented Oct 6, 2023

I am basically done with my first pass. If the general approach is okay to merge I will add more tests and fix the comments/docs. Everything else can probably be addressed during review or in follow-up PRs.
The remaining problems are mainly around GPU events, but none of it should affect existing functionality.
Currently I map all GPU events to a single thread id, which is certainly not correct, but I also don't see an easy fix for now.
Syncing CPU and GPU timestamps also isn't working correctly; I need to investigate whether it's just my logic that's wrong or something else:
[screenshot omitted]

Then there's this:
[screenshot omitted]
Total time is 16 ms, but threadpool work takes 123 ms combined ;) Should I add sorting by avg time?

For having variants of the same profiling key, attributes should be used. And for things like the GC counts, we may need a counter primitive instead, as you mentioned.

Do you have an example? I still don't think I get it.

ProfilingKeys are particularly awkward for me when profiling around polymorphism, but I don't have a suggestion for how to improve it:

// Before:
using (Profiler.Begin($"{child.GetType().Name}.Draw"))
{
    child.Draw(drawContext);
}
// Good. Every subclass gets its own key.

// Now:
using (Profiler.Begin(DrawChildKey, $"{child.GetType().Name}.Draw"))
{
    child.Draw(drawContext);
}
// Sadness.

Doing it properly now requires some Dictionary<Type, ProfilingKey>, which seems like a lot of effort for something that should be quick to add and remove again. I guess I'll do it for the base in-engine cases where it makes sense, but it's not ideal.
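For the record, the cache itself is only a few lines; it's just boilerplate one would rather not write for throwaway profiling (a sketch, not code from this PR):

using System;
using System.Collections.Concurrent;
using Stride.Core.Diagnostics;

static class ChildDrawKeys
{
    private static readonly ConcurrentDictionary<Type, ProfilingKey> Keys = new();

    // One key per concrete child type, created on first use.
    public static ProfilingKey For(object child) =>
        Keys.GetOrAdd(child.GetType(), t => new ProfilingKey($"{t.Name}.Draw"));
}

// Usage:
// using (Profiler.Begin(ChildDrawKeys.For(child)))
//     child.Draw(drawContext);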

I think we might be able to separate the message logging from the profiler. There isn't really a good reason for it to be tied together (except being enabled when profiling is enabled) and it will simplify things a bit.

It doesn't have to be tied together, but having an easy way to log some specific profiling events (e.g. asset loading time) to a file/console is nice. I'd just like to do it without bloating all other profiling in the process.

First look at performance:

BenchmarkDotNet=v0.13.2, OS=Windows 10 (10.0.19045.3448)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.7.23376.3
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT AVX2
  Job-MBWYOB : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT AVX2
  Job-YPSZPM : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT AVX2

| Method | Job        | NuGetReferences     | NUM_ENTITIES | WithProfiling | Mean      | Error     | StdDev    | Median    | Allocated |
|------- |----------- |-------------------- |------------- |-------------- |----------:|----------:|----------:|----------:|----------:|
| Run    | Job-MBWYOB | Stride.Core 4.1.1.1 | 2000         | False         |  6.412 ms | 0.1267 ms | 0.2645 ms |  6.387 ms |  12.08 KB |
| Run    | Job-YPSZPM | Stride.Core 4.1.2.1 | 2000         | False         |  6.348 ms | 0.1249 ms | 0.1870 ms |  6.262 ms |  12.08 KB |
| Run    | Job-MBWYOB | Stride.Core 4.1.1.1 | 2000         | True          | 18.580 ms | 0.3451 ms | 0.3059 ms | 18.606 ms |  12.57 KB |
| Run    | Job-YPSZPM | Stride.Core 4.1.2.1 | 2000         | True          |  7.184 ms | 0.1436 ms | 0.2766 ms |  7.052 ms |  12.39 KB |

4.1.1.1 is master with the added profiling keys, 4.1.2.1 is this PR, so we can now add profiling to work happening on the threadpool without trashing performance (on desktop at least...).

@froce marked this pull request as ready for review on October 6, 2023 19:04
@manio143 (Member) commented Oct 7, 2023

For having variants of the same profiling key, attributes should be used. And for things like the GC counts, we may need a counter primitive instead, as you mentioned.

Do you have an example? I still don't think I get it.

Probably could have been more explicit here.
Let's take this as an example:

var url = newUrl ?? reference.Url;
using (var profile = Profiler.Begin(ContentProfilingKeys.ContentReload, url))
{
    DeserializeObject(reference.Url, url, obj.GetType(), obj, settings);
}

For many assets with different URLs we run the same profiling key called ContentReload. If we're interested in measuring which asset takes the longest to reload, we should be looking at variants of that key with, for example, an attribute ("URL", url).
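For comparison, this is how the same variant would be expressed in the System.Diagnostics activity/tag model that OpenTelemetry builds on (illustrative only; the ActivitySource name is made up and Stride's profiler does not go through ActivitySource here):

using System.Diagnostics;

static class ContentReloadTracing
{
    // Illustrative source name, not an existing Stride instrument.
    private static readonly ActivitySource Source = new("Stride.Content");

    public static void Reload(object obj, string url)
    {
        using var activity = Source.StartActivity("ContentReload");
        activity?.SetTag("URL", url); // the attribute / "dimension" for this variant
        // DeserializeObject(reference.Url, url, obj.GetType(), obj, settings) would go here
    }
}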

Whereas in this example

using (processor.UpdateProfilingState = Profiler.Begin(processor.UpdateProfilingKey, "Entities: {0}", entities.Count))
{
    processor.Update(gameTime);
}

There isn't any variation in the parameters. UpdateProfilingKey is annotated with a message on entity count.
This should potentially be a separate counter measurement in a tracing system. It's useful to correlate entity count with how long an Update takes, but the number of entities is not inherently a variation of the update process.
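In the .NET metrics API that #1774 started hooking into, such a measurement would be a separate instrument rather than a message on the key; roughly (a sketch, meter and instrument names are illustrative):

using System.Diagnostics.Metrics;

static class EntityMetrics
{
    // Illustrative names; not existing Stride instruments.
    private static readonly Meter Meter = new("Stride.Engine");
    private static readonly Counter<long> EntitiesUpdated =
        Meter.CreateCounter<long>("entities.updated");

    // Called from the update loop, alongside the existing profiling scope:
    public static void Record(int entityCount) => EntitiesUpdated.Add(entityCount);
}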

using (var profile = Profiler.Begin(GameProfilingKeys.GameSystemLoadContent, gameSystemName))
    contentable.LoadContent();

Another example where the parameter passed is making this a variant, because we want to distinctly identify the load time of different game systems.


So variants are basically a quick way to dynamically separate a profiling key into a few subkeys.
The way the system is designed currently, there is a parent-child relationship between profiling keys.
In practice, however, that relationship is used more for turning things on and off than for grouping results logically.
So each profiling key results in a separate measurement entry in the buffer.
With a more OpenTelemetry-like approach, a single measurement with attributes can be looked at either as a whole (not caring about variants) or divided into variant specifics (where each attribute creates a "dimension" the measurement can be partitioned by).

Hope this makes a bit more sense.

@manio143 (Member) left a review comment:

Looks very promising. The perf improvements sound lovely and I see you found that I made a mistake with where I was emitting events to the meter.

Thanks to you I'm now more familiar with built-in channels.
For interested folks, have a look at An Introduction to System.Threading.Channels

Review comment threads (now outdated/resolved) on:
sources/core/Stride.Core.Tests/TestProfiler.cs
sources/core/Stride.Core/Diagnostics/Profiler.cs
sources/core/Stride.Core/Diagnostics/ProfilingState.cs
sources/core/Stride.Core/Threading/ThreadPool.cs
sources/engine/Stride.Rendering/Rendering/RenderSystem.cs
@manio143 (Member) commented Oct 7, 2023

Total time is 16 ms, but threadpool work takes 123 ms combined ;) Should I add sorting by avg time?

This is why there's a concept of a Histogram in meters. It basically looks at the data in terms of duration and lets you view it by max, min, average, and percentiles.
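For context, a histogram instrument in that same API records every duration and lets a listener derive min/max/average/percentiles (sketch; the instrument name is illustrative):

using System.Diagnostics.Metrics;

static class ScopeMetrics
{
    private static readonly Meter Meter = new("Stride.Profiler"); // illustrative name
    private static readonly Histogram<double> ScopeDuration =
        Meter.CreateHistogram<double>("profiling.scope.duration", unit: "ms");

    // Called when a profiling scope ends:
    public static void Record(System.TimeSpan elapsed) => ScopeDuration.Record(elapsed.TotalMilliseconds);
}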

I think adding a display mode with average aggregation over a second would be a great option!

And for the threadpool specifically wouldn't the issue also be around it running on multiple threads in parallel? Thus squeezing more compute time into the same ms?

@froce (Collaborator, Author) commented Oct 8, 2023

And for the threadpool specifically wouldn't the issue also be around it running on multiple threads in parallel? Thus squeezing more compute time into the same ms?

Yes, exactly that.

This is why there's a concept of a Histogram in meters. It basically looks at the data in terms of duration and allows to view it by max, min, average and percentiles.

GameProfilingSystem is already aggregating over RefreshTime (KP_Plus/KP_Minus by default to change it) and then rescaling to per-frame where it makes sense.

@froce (Collaborator, Author) commented Oct 14, 2023

That should be it for now.

I've marked it as a breaking change, but unless I've missed something, it would only be breaking if someone replaced GameProfilingSystem with their own solution.
Performance is okay - not amazing, but better than before, especially when capturing events from the ThreadPool.
CPU and GPU timelines are not synced correctly, but that only affects the JSON output so far. D3D12 and Vulkan have dedicated APIs for clock calibration; on older APIs we'd need to do it ourselves.
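For reference, the calibration those APIs provide boils down to one simultaneous (CPU, GPU) timestamp pair plus the GPU tick frequency, after which GPU ticks can be mapped onto the CPU timeline. Generic arithmetic, not code from this PR:

// calibCpuTicks / calibGpuTicks: a pair sampled "at the same instant"
// (e.g. ID3D12CommandQueue::GetClockCalibration or vkGetCalibratedTimestampsEXT).
// gpuFrequency: GPU timestamp ticks per second; Stopwatch.Frequency: CPU ticks per second.
static long GpuToCpuTicks(long gpuTicks, long calibGpuTicks, long calibCpuTicks, long gpuFrequency)
{
    long deltaGpu = gpuTicks - calibGpuTicks;
    return calibCpuTicks + (long)(deltaGpu * ((double)System.Diagnostics.Stopwatch.Frequency / gpuFrequency));
}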

And I imagine the option to output tracing .json and view it externally should also be mentioned in the profiling section of the docs.

@froce froce requested a review from manio143 October 14, 2023 15:01
@manio143 (Member) left a review comment:

Looks good. I could nitpick the whitespace here and there, but it's fine.

@manio143 (Member) commented Oct 15, 2023

I ran the Starbreach project to test these changes. I'm happy to merge it.
I also verified that no changes are needed to the GameProfiler script - the new sorting mode works straight away.

However, you may need to look into the performance of ChromeTracingProfileWriter. Having it enabled had a big impact on the system, with the frame rate dropping by around 10 FPS (from 95-100 to 85-90).
Edit: I tried a handful of tricks and nothing seems to improve things - I think it may just be that this system is only meant for capturing small traces (a few frames) rather than running for seconds to minutes.

@manio143 changed the title from "Add a timeline / tracing profiler" to "[Core] Enable multiple profiler consumers and add a timeline/tracing profiler" on Oct 15, 2023
@manio143 merged commit f16c3c7 into stride3d:master on Oct 15, 2023
1 check passed
@xen2 (Member) commented Oct 16, 2023

It seems the Android build is broken.
https://teamcity.stride3d.net/buildConfiguration/Engine_BuildAndroid/21765?guest=yes

FYI, builds don't auto-run on PRs from people who are not yet marked as collaborators or contributors (for security reasons).
However, it seems @froce is a collaborator; I'm not sure if he wasn't at the time the commits were pushed, or if something is wrong with the TeamCity config. Anyhow, if tests don't run automatically, I would like to add a simple bot command that contributors could use in a comment on GitHub to trigger build/tests.

@sandsalamand commented:
It would be nice to mention this new ability in the documentation somewhere.
