Frequent GC Triggered by 'Internal Tuning' on Windows node compared to Linux/ARM node in .NET 8.0.16 #115879

Open
@arekpalinski

Description

We (RavenDB team) are investigating an issue related to GC where a RavenDB cluster node running Windows Server 2022 (WIN) experiences significantly more frequent GCs and higher GC pause times compared to its counterpart running on Ubuntu 24.04.2 (ARM), despite both nodes having the same workload and configuration.

This is a replica cluster that receives replicated data from the main cluster, with no external requests. Both nodes have identical databases, configurations, and GC settings specified in the Raven.Server.runtimeconfig.json file:

{
  "runtimeOptions": {
    "tfm": "net8.0",
    "includedFrameworks": [
      {
        "name": "Microsoft.NETCore.App",
        "version": "8.0.16"
      },
      {
        "name": "Microsoft.AspNetCore.App",
        "version": "8.0.16"
      }
    ],
    "configProperties": {
      "System.GC.Concurrent": true,
      "System.GC.Server": true,
      "System.GC.RetainVM": true,
      "System.Reflection.Metadata.MetadataUpdater.IsSupported": false,
      "System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization": false,
      "System.Runtime.TieredPGO": true
    }
  }
}

Both nodes have identical hardware configurations (2 cores, 8 GB memory). The difference lies in their operating systems and processor architecture:

  • WIN: Windows Server 2022 on x64.
  • ARM: Ubuntu 24.04.2 on ARM64.

The issue becomes noticeable after the cluster nodes are restarted (during regular RavenDB updates), although it does not reproduce on every restart. Initially, GC behavior is similar on both nodes, but within a few minutes the WIN node begins triggering GCs much more frequently. In the GC traces, we consistently see the WIN node triggering GCs with the reason "Internal Tuning", whereas the ARM node does not exhibit this behavior.

Configuration

  • 2 cores, 8 GB memory

  • WIN: Windows Server 2022 (x64)

  • ARM: Ubuntu 24.04.2 (ARM64)

  • .NET 8.0.16

Analysis

Initially, both nodes show similar GC activity. After a few minutes, however, the GC on WIN becomes significantly more frequent, leading to smaller heap sizes for Gen0 and Gen1, and subsequently, a much higher PauseTimePercentage:

Image

  • GC traces collected using dotnet-trace collect --profile gc-verbose --name Raven.Server --duration 00:05:00 revealed:

    • WIN node: 163 GCs in 5 minutes
      Image

    • ARM node: 47 GCs in 5 minutes
      Image

    • The primary reason for GC on the WIN node is "Internal Tuning", while this is absent on the ARM node.
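
To put the two trace counts above on a common scale, here is a minimal sketch that converts them into per-minute rates and average spacing between GCs. It uses only the numbers reported above; the function and variable names are illustrative and not part of RavenDB or any .NET tooling.

```python
# GC counts taken from the two 5-minute dotnet-trace captures described above.
TRACE_MINUTES = 5
gc_counts = {"WIN": 163, "ARM": 47}


def gcs_per_minute(count, minutes=TRACE_MINUTES):
    """Average number of GCs per minute over the capture window."""
    return count / minutes


def mean_seconds_between_gcs(count, minutes=TRACE_MINUTES):
    """Average spacing between consecutive GCs, in seconds."""
    return minutes * 60 / count


for node, count in gc_counts.items():
    print(
        f"{node}: {gcs_per_minute(count):.1f} GCs/min, "
        f"one GC every {mean_seconds_between_gcs(count):.2f} s on average"
    )

# Relative GC frequency of the WIN node compared to the ARM node.
print(f"WIN/ARM ratio: {gc_counts['WIN'] / gc_counts['ARM']:.2f}x")
```

In other words, during this capture the WIN node collected roughly 3.5 times as often as the ARM node, about once every 1.8 seconds versus about once every 6.4 seconds.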


17 hours later, the issue persists with the WIN node still showing higher GC activity:

Image

  • Pause Time Percentage (PauseTimePercentage):

    • WIN node: 4.69%
    • ARM node: Stable and much lower
  • GC traces for the WIN node: some GCs are still attributed to "Internal Tuning", but there are also GCs for which no reason is specified:
    Image

Regression?

This does not appear to be a regression, as we have observed similar behavior in the past.


We are seeking assistance in understanding:

  1. What does "Internal Tuning" mean in this context, and why might it disproportionately affect the WIN node?
  2. Are there GC or runtime optimizations that are platform/architecture-specific that could explain this behavior?
  3. Are there any additional steps we should take to investigate or mitigate this problem?

Detailed insights on "Internal Tuning" triggers and how they differ between platforms would be greatly appreciated.
