Skip to content

[libraries-pgo] Microsoft.Extensions.Hosting.Unit.Tests fails in OSR stress #116212

@jakobbotsch

Description

@jakobbotsch
Member

Pipeline run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1056248&view=results
Example console log: https://helixr1107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-47f137d3961544ab84/Microsoft.Extensions.Hosting.Unit.Tests/1/console.2a54c506.log?helixlogtype=result

set DOTNET_TieredCompilation=1
set DOTNET_TC_OnStackReplacement=1
set DOTNET_TC_QuickJitForLoops=1
set DOTNET_TC_OnStackReplacement_InitialCounter=1
set DOTNET_OSR_HitLimit=1
...
Starting:    Microsoft.Extensions.Hosting.Unit.Tests (parallel test collections = on [4 threads], stop on fail = off)
    Microsoft.Extensions.Hosting.Tests.LifecycleTests.CallbackOrder(concurrently: True) [FAIL]
      System.InvalidOperationException : Unable to activate type 'Microsoft.Extensions.Logging.LoggerFactory'. The following constructors are ambiguous:
      Void .ctor(System.Collections.Generic.IEnumerable`1[Microsoft.Extensions.Logging.ILoggerProvider], Microsoft.Extensions.Options.IOptionsMonitor`1[Microsoft.Extensions.Logging.LoggerFilterOptions], Microsoft.Extensions.Options.IOptions`1[Microsoft.Extensions.Logging.LoggerFactoryOptions], Microsoft.Extensions.Logging.IExternalScopeProvider)
      Void .ctor(System.Collections.Generic.IEnumerable`1[Microsoft.Extensions.Logging.ILoggerProvider], Microsoft.Extensions.Logging.LoggerFilterOptions)

This fails on multiple platforms under OSR stress. cc @dotnet/jit-contrib

Activity

removed
untriagedNew issue has not been triaged by the area owner
on Jun 2, 2025
added this to the 10.0.0 milestone on Jun 2, 2025
AndyAyersMS

AndyAyersMS commented on Jun 3, 2025

@AndyAyersMS
Member

I isolated this to

Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteFactory:CreateArgumentCallSites(Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceIdentifier,System.Type,Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteChain,System.Reflection.ParameterInfo[],ubyte):Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceCallSite[]:this [Tier1-OSR @0x179 with Synthesized PGO, IL size=389, code size=2114, hash=0xec0ac54f]

via bisecting DOTNET_JitEnablePatchpointRange. With OSR suppressed for this hash all the test fail.

From there it appears it is related somehow to a non-phi-based jump-thread we do; we have a redundant dominated compare

--- Trying RBO in BB76 ---
Relop [000198] BB76 value unknown, trying inference

Dominator BB52 of BB76 has relop with same liberal VN
N003 (  3,  3) [000730] J----+-N---                         *  EQ        int    $338
N001 (  1,  1) [000728] -----+-----                         +--*  LCL_VAR   int    V66 tmp47        u:2 $1d5
N002 (  1,  1) [000729] -----+-----                         \--*  CNS_INT   int    0 $c0
 Redundant compare; current relop:
N003 (  3,  3) [000198] J----+-N---                         *  EQ        int    $338
N001 (  1,  1) [000735] -----+-----                         +--*  LCL_VAR   int    V66 tmp47        u:2 (last use) $1d5
N002 (  1,  1) [000197] -----+-----                         \--*  CNS_INT   int    0 $c0
Both successors of idom BB52 reach BB76 -- attempting jump threading
BB52 is a true pred
BB62 is a false pred
BB74 is a false pred
BB75 is a false pred
BB73 is a false pred
Optimizing via jump threading
Jump flow from pred BB52 -> BB76 implies predicate true; we can safely redirect flow to be BB52 -> BB78
Jump flow from pred BB62 -> BB76 implies predicate false; we can safely redirect flow to be BB62 -> BB77
Jump flow from pred BB74 -> BB76 implies predicate false; we can safely redirect flow to be BB74 -> BB77
Jump flow from pred BB75 -> BB76 implies predicate false; we can safely redirect flow to be BB75 -> BB77
Jump flow from pred BB73 -> BB76 implies predicate false; we can safely redirect flow to be BB73 -> BB77
RBO: BB76 is now unreachable, and flow into its successors needs to be removed. Data was already inconsistent.

and if we allow this jump threading then the test fails.

I have not yet determined whether the issue is upstream and the VN/SSA is incorrect (though that seems unlikely) or the issue is in RBO (also not too likely as this is just a structural opt) or somewhere downstream. We are bypassing a global PHI in BB76 so perhaps that is a contributing factor.

Also not yet clear why this just started happening, though perhaps the recent changes in inlining are involved. There is no stack allocation or physical promotion in this method.

AndyAyersMS

AndyAyersMS commented on Jun 4, 2025

@AndyAyersMS
Member

This is a bad interaction between RBO and Assertion Prop (or arguably just a bug in AP).

Before RBO we have this flow graph. Here colored edges are false outcomes. You can see V08.9 can be either null or non-null depending on how the flow evolves from BB51.

Image

RBO jump threads through BB76 since it has the same pred as BB52, giving the following flow graph, and AP runs over this graph

Image

Here BB76 is a now-unreachable pred of BB78 (red edge). For unreachable blocks AP makes all assertions available, that means BB76's out assertion set makes contradictory claims: V08.7 is null and also V08.7 is not null.

Also note there is no PHI arg for the new pred BB52.

AP then walks the PHI in BB78, looking to see if each pred asserts that V08 is non-null, and they all do, and so AP optimizes the branch in BB78.

As for a fix:

  • I don't think we can fix SSA here, we have to tolerate this situation somehow
  • We could look for contradictory assertions and bail out; that might work if unreachable blocks are rare.
  • The "witness" to V08.7 being non-null is BB52 which we never even considered. So perhaps we can bail out of pred phi analysis if the pred list has blocks that are not in the phi pred list.

I am going to try looking for contradictions first.

AndyAyersMS

AndyAyersMS commented on Jun 4, 2025

@AndyAyersMS
Member

FYI @EgorBo subtle issue in phi analysis during AP.
cc @dotnet/jit-contrib

added a commit that references this issue on Jun 5, 2025
51925ba
jakobbotsch

jakobbotsch commented on Jun 5, 2025

@jakobbotsch
MemberAuthor

For unreachable blocks AP makes all assertions available, that means BB76's out assertion set makes contradictory claims: V08.7 is null and also V08.7 is not null.

Is it possible to fix this part instead, for a more future proof fix? For example by making AP only consider generated assertions available out of unreachable blocks that have been jump threaded.

The bug here reminds me of the bug that led to BBF_NO_CSE_IN. We should really consider if we can run RBO as one of the last phases to avoid these "the data is outdated" issues.

AndyAyersMS

AndyAyersMS commented on Jun 6, 2025

@AndyAyersMS
Member

For unreachable blocks AP makes all assertions available, that means BB76's out assertion set makes contradictory claims: V08.7 is null and also V08.7 is not null.

Is it possible to fix this part instead, for a more future proof fix? For example by making AP only consider generated assertions available out of unreachable blocks that have been jump threaded.

The bug here reminds me of the bug that led to BBF_NO_CSE_IN. We should really consider if we can run RBO as one of the last phases to avoid these "the data is outdated" issues.

Yes, the whole thing seems fragile.

I don't quite follow what you're suggesting. Seems like the problematic assertion set could come from some (also unreachable) descendant of a jump threaded block so we'd need to somehow track which unreachable blocks were tainted.

13 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

Labels

area-CodeGen-coreclrCLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMIblocking-clean-ci-optionalBlocking optional rolling runs

Type

No type

Projects

No projects

Relationships

None yet

    Development

    Participants

    @jakobbotsch@AndyAyersMS@JulieLeeMSFT

    Issue actions

      [libraries-pgo] Microsoft.Extensions.Hosting.Unit.Tests fails in OSR stress · Issue #116212 · dotnet/runtime