Skip to content

[Flaky test] Microsoft.Build.UnitTests.BackEnd.TaskBuilder_Tests.CanceledTasksDoNotLogMSB4181 #9638

@surayya-MS

Description

@surayya-MS
Member

Task: Implement MSBuild Cancellation Hang Detection and Diagnostics

Background

The MSBuild test CanceledTasksDoNotLogMSB4181() is experiencing intermittent failures on build machines where the cancellation operation doesn't complete within the expected 2-second timeout. The test validates that when a build is cancelled, MSBuild logs the correct warning (MSB5021) and doesn't log an incorrect error (MSB4181).

Problem Statement

The current test fails with:

Shouldly.ShouldAssertException : isSubmissionCompleted should be True but was False
Additional Info: Waiting for that the build submission is completed failed in the timeout period 2000 ms.

This could indicate either:

  1. Test environment issue: Build machines are slower and need more time
  2. MSBuild bug: Genuine hang or deadlock in the cancellation logic

Objective

Enhance the test to distinguish between timing issues and genuine MSBuild hangs by implementing comprehensive diagnostics and intelligent retry mechanisms.

Requirements

1. Intelligent Timeout Strategy

  • Phase 1: Normal timeout (2-3 seconds) for typical scenarios
  • Phase 2: Extended monitoring (10-15 seconds) with hang detection
  • Adaptive timeouts: Adjust based on system characteristics (CI environment, CPU cores, memory pressure)

2. MSBuild Process Monitoring

Track all MSBuild-related processes during cancellation:

  • msbuild.exe, dotnet.exe, MSBuild.exe
  • VBCSCompiler.exe, csc.exe (compiler processes)
  • cmd.exe, powershell.exe (execution processes)

For each process, monitor:

  • Process ID, name, start time
  • Memory usage (working set)
  • Thread count and thread states
  • CPU time consumption
  • Responsiveness status
  • Command line arguments

3. Hang Pattern Detection

Implement automated detection for common hang scenarios:

Process-related patterns:

  • Process count explosion (too many new processes)
  • Unresponsive processes (not responding to Windows messages)
  • Memory spikes (processes consuming >500MB unexpectedly)
  • Thread explosion (processes with >50 threads)

Performance patterns:

  • High CPU usage without progress indication
  • Consistent timing vs. erratic timing between checks
  • BuildResult state changes (or lack thereof)

4. Diagnostic Data Collection

Thread Analysis:

  • Current process thread states (waiting, running, blocked)
  • Wait reasons for blocked threads
  • Thread count changes over time
  • High ratio of waiting threads (potential deadlock indicator)

Process Dumps:

  • Automatic dump creation at 6-second and 10-second marks
  • Use dotnet-dump tool if available
  • Final dumps if operation never completes
  • Store dumps with meaningful names including PID, timestamp, and context

Event Tracing:

  • Comprehensive event logging with precise timestamps
  • State transitions during cancellation process
  • BuildResult state changes
  • Process lifecycle events

5. Enhanced Failure Analysis

When the test fails, provide:

  • Root cause classification: Timing issue vs. genuine hang
  • Hang pattern summary: Which patterns were detected
  • Process forensics: What processes were involved and their states
  • Actionable recommendations: Retry with longer timeout vs. file MSBuild bug

6. Comprehensive Logging

All diagnostic information should be logged to _testOutput with clear structure:

  • Event timeline with elapsed time markers
  • Process state summaries at key intervals
  • Hang detection results with explanations
  • File paths for created dumps
  • System environment context (CI/local, CPU cores, memory)

Technical Implementation Details

Core Method Signature

private bool WaitWithMSBuildHangDetection(
    BuildSubmission asyncResult, 
    string operationName,
    List<(DateTime Time, string Event, string Details)> eventLog)

Key Data Structures

public class MSBuildProcessInfo
{
    public int Id { get; set; }
    public string Name { get; set; }
    public DateTime StartTime { get; set; }
    public long WorkingSetMB { get; set; }
    public int ThreadCount { get; set; }
    public TimeSpan CpuTime { get; set; }
    public bool IsResponding { get; set; }
    public string CommandLine { get; set; }
}

Required Dependencies

  • System.Diagnostics for process monitoring
  • System.Management for Windows command line detection (may require NuGet package)
  • dotnet-dump tool for dump collection (external dependency)

Success Criteria

Functional Requirements

  1. Reliable hang detection: Correctly identifies when MSBuild is genuinely hung vs. just slow
  2. Comprehensive diagnostics: Collects sufficient data to analyze MSBuild issues
  3. Automatic dump collection: Creates process dumps when hangs are detected
  4. Backward compatibility: Test still validates original MSB4181/MSB5021 behavior

Quality Requirements

  1. Clear failure messages: When test fails, the message clearly indicates if it's a hang or timing issue
  2. Forensic data: Sufficient diagnostic information for MSBuild team to investigate
  3. Performance impact: Diagnostic overhead doesn't significantly slow down normal test execution
  4. Cross-platform support: Works on Windows and Unix-like systems (with appropriate fallbacks)

Expected Outcomes

For Timing Issues

  • Test succeeds on retry with longer timeout
  • Logs show normal BuildResult progression
  • No hang patterns detected
  • Recommendation: Increase base timeout for environment

For Genuine MSBuild Hangs

  • Consistent hang patterns detected across runs
  • Process dumps available for analysis
  • Thread analysis shows blocked/waiting threads
  • BuildResult remains unchanged during extended monitoring
  • Clear evidence for filing MSBuild bug report

Diagnostic Artifacts

  • Process dump files (.dmp) with timestamps and context
  • Detailed execution logs with timing analysis
  • Hang pattern analysis with specific root cause indicators
  • System environment context for reproduction

Acceptance Tests

  1. Normal completion: Test passes quickly when cancellation works properly
  2. Slow environment: Test succeeds with longer timeout on slow systems, logs explain timing
  3. Genuine hang: Test fails with comprehensive diagnostics, creates dumps, identifies hang patterns
  4. Process monitoring: All MSBuild-related processes are tracked throughout execution
  5. Dump creation: Process dumps are successfully created when hangs are detected
  6. Cross-platform: Basic functionality works on both Windows and Linux build agents

This enhanced test will provide definitive evidence to determine whether the timeout failures represent a test infrastructure issue or a genuine MSBuild cancellation bug.

This dump data should be available in build artifacts

Activity

added
Area: Our Own BuildProblems affecting the build or build infrastructure of the MSBuild repo itself.
flaky-testTests that have failed but passed on rerun.
on Jan 12, 2024
self-assigned this
on Jan 12, 2024
surayya-MS

surayya-MS commented on Jan 15, 2024

@surayya-MS
MemberAuthor

This test was previously reported as flaky and there was an attempt to fix it #9298

added
Priority:2Work that is important, but not critical for the release
on Jan 16, 2024
removed their assignment
on Feb 1, 2024
JaynieBai

JaynieBai commented on Mar 11, 2024

@JaynieBai
Member

Still failed in latest build https://dev.azure.com/dnceng-public/public/_build/results?buildId=588018&view=ms.vss-test-web.build-test-results-tab&runId=14181820&resultId=104008

Error message
Shouldly.ShouldAssertException : isSubmissionComplated\r\n should be\r\nTrue\r\n but was\r\nFalse\r\n\r\nAdditional Info:\r\n Waiting for that the build submission is completed failed in the timeout period 2000 ms.

Stack trace
at Microsoft.Build.UnitTests.BackEnd.TaskBuilder_Tests.CanceledTasksDoNotLogMSB4181() in D:\a\1\s\src\Build.UnitTests\BackEnd\TaskBuilder_Tests.cs:line 189
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAtt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

Labels

Area: Our Own BuildProblems affecting the build or build infrastructure of the MSBuild repo itself.Priority:2Work that is important, but not critical for the releasebacklogflaky-testTests that have failed but passed on rerun.triaged

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

    Participants

    @JaynieBai@AR-May@f-alizada@surayya-MS

    Issue actions

      [Flaky test] Microsoft.Build.UnitTests.BackEnd.TaskBuilder_Tests.CanceledTasksDoNotLogMSB4181 · Issue #9638 · dotnet/msbuild