Description
We (RavenDB team) are investigating a GC-related issue where a RavenDB cluster node running Windows Server 2022 (WIN) experiences significantly more frequent GCs and higher GC pause times than its counterpart running Ubuntu 24.04.2 (ARM), despite both nodes having the same workload and configuration.
This is a replica cluster that receives replicated data from the main cluster, with no external requests. Both nodes have identical databases, configurations, and GC settings specified in the Raven.Server.runtimeconfig.json file:
{
  "runtimeOptions": {
    "tfm": "net8.0",
    "includedFrameworks": [
      {
        "name": "Microsoft.NETCore.App",
        "version": "8.0.16"
      },
      {
        "name": "Microsoft.AspNetCore.App",
        "version": "8.0.16"
      }
    ],
    "configProperties": {
      "System.GC.Concurrent": true,
      "System.GC.Server": true,
      "System.GC.RetainVM": true,
      "System.Reflection.Metadata.MetadataUpdater.IsSupported": false,
      "System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization": false,
      "System.Runtime.TieredPGO": true
    }
  }
}
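Since the comparison rests on both nodes using the same GC settings, one way to double-check that claim is to diff the System.GC.* entries of the two runtimeconfig.json files. The following is only a minimal sketch: the helper names gc_settings and diff_settings are hypothetical, and the embedded sample strings reproduce just the GC-related part of the file quoted above.

```python
import json


def gc_settings(runtimeconfig_text):
    """Return only the System.GC.* entries of a runtimeconfig.json document."""
    config = json.loads(runtimeconfig_text)
    props = config.get("runtimeOptions", {}).get("configProperties", {})
    return {k: v for k, v in props.items() if k.startswith("System.GC.")}


def diff_settings(first, second):
    """Map every key whose value differs (or is missing on one side) to (first, second)."""
    keys = sorted(set(first) | set(second))
    return {k: (first.get(k), second.get(k)) for k in keys if first.get(k) != second.get(k)}


# Sample input: a trimmed copy of the configuration shown above (assumption:
# in practice each string would be read from the file on the respective node).
WIN_CONFIG = """{"runtimeOptions": {"configProperties": {
    "System.GC.Concurrent": true,
    "System.GC.Server": true,
    "System.GC.RetainVM": true,
    "System.Runtime.TieredPGO": true
}}}"""
ARM_CONFIG = WIN_CONFIG  # the report states both nodes use the same file

print(gc_settings(WIN_CONFIG))
# Prints {} when the GC settings match.
print(diff_settings(gc_settings(WIN_CONFIG), gc_settings(ARM_CONFIG)))
```

Note that GC settings can also be supplied through environment variables (for example DOTNET_gcServer), which this file-only comparison does not cover.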
Both nodes have identical hardware configurations (2 cores, 8 GB memory). The difference lies in their operating systems and processor architectures:
- WIN: Windows Server 2022 on x64
- ARM: Ubuntu 24.04.2 on ARM64
The issue becomes noticeable after a restart of the cluster nodes (regular RavenDB updates), although it does not reproduce on every restart. Initially, GC behavior is similar on both nodes, but within a few minutes the WIN node begins triggering GCs much more frequently. Analyzing the GC traces, we consistently observe the WIN node triggering GCs due to "Internal Tuning", whereas the ARM node does not exhibit this behavior.
Configuration
- 2 cores, 8 GB memory
- WIN: Windows Server 2022 (x64)
- ARM: Ubuntu 24.04.2 (ARM64)
- .NET 8.0.16
Analysis
Initially, both nodes show similar GC activity. After a few minutes, however, GC on the WIN node becomes significantly more frequent, leading to smaller heap sizes for Gen0 and Gen1 and, subsequently, a much higher PauseTimePercentage:
- GC traces collected using
  dotnet-trace collect --profile gc-verbose --name Raven.Server --duration 00:05:00
  revealed:
17 hours later, the issue persists, with the WIN node still showing higher GC activity:
- Pause Time Percentage (PauseTimePercentage):
  - WIN node: 4.69%
  - ARM node: Stable and much lower
- GC traces for the WIN node: there are "Internal Tuning" reasons, but there are also GCs where the reason isn't specified.
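To put numbers on observations like these, the two quantities discussed here can be derived from a list of GC events: the share of wall-clock time spent paused, and a tally of trigger reasons. The sketch below is illustrative only. It assumes the pause percentage is total pause time divided by the length of the observation window, the event record layout (reason, pause_ms) is hypothetical, and the sample events are made up rather than taken from the traces in this report.

```python
from collections import Counter


def pause_time_percentage(pauses_ms, window_ms):
    """Percentage of the observation window spent in GC pauses (assumed definition)."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return 100.0 * sum(pauses_ms) / window_ms


def count_reasons(events):
    """Tally GC trigger reasons; events without a reason are counted as 'Unspecified'."""
    return Counter(e.get("reason") or "Unspecified" for e in events)


# Made-up sample events in a hypothetical record layout.
events = [
    {"reason": "AllocSmall", "pause_ms": 12.0},
    {"reason": "Internal Tuning", "pause_ms": 9.0},
    {"reason": None, "pause_ms": 6.0},
    {"reason": "Internal Tuning", "pause_ms": 3.0},
]
window_ms = 1000.0  # length of the (made-up) observation window

pct = pause_time_percentage([e["pause_ms"] for e in events], window_ms)
reasons = count_reasons(events)

print(f"pause time: {pct:.2f}%")
for reason, count in reasons.most_common():
    print(f"{reason}: {count}")
```

Running the same summary over equal-length windows from both nodes would make the difference in GC frequency and the share of "Internal Tuning" triggers directly comparable.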
Regression?
This does not appear to be a regression, as we have observed similar behavior in the past.
We are seeking assistance in understanding:
- What does "Internal Tuning" mean in this context, and why might it disproportionately affect the WIN node?
- Are there GC or runtime optimizations that are platform/architecture-specific that could explain this behavior?
- Are there any additional steps we should take to investigate or mitigate this problem?
Detailed insights on "Internal Tuning" triggers and how they differ between platforms would be greatly appreciated.