-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Description
Task: Implement MSBuild Cancellation Hang Detection and Diagnostics
Background
The MSBuild test CanceledTasksDoNotLogMSB4181()
is experiencing intermittent failures on build machines where the cancellation operation doesn't complete within the expected 2-second timeout. The test validates that when a build is cancelled, MSBuild logs the correct warning (MSB5021) and doesn't log an incorrect error (MSB4181).
Problem Statement
The current test fails with:
Shouldly.ShouldAssertException : isSubmissionCompleted should be True but was False
Additional Info: Waiting for that the build submission is completed failed in the timeout period 2000 ms.
This could indicate either:
- Test environment issue: Build machines are slower and need more time
- MSBuild bug: Genuine hang or deadlock in the cancellation logic
Objective
Enhance the test to distinguish between timing issues and genuine MSBuild hangs by implementing comprehensive diagnostics and intelligent retry mechanisms.
Requirements
1. Intelligent Timeout Strategy
- Phase 1: Normal timeout (2-3 seconds) for typical scenarios
- Phase 2: Extended monitoring (10-15 seconds) with hang detection
- Adaptive timeouts: Adjust based on system characteristics (CI environment, CPU cores, memory pressure)
2. MSBuild Process Monitoring
Track all MSBuild-related processes during cancellation:
msbuild.exe
,dotnet.exe
,MSBuild.exe
VBCSCompiler.exe
,csc.exe
(compiler processes)cmd.exe
,powershell.exe
(execution processes)
For each process, monitor:
- Process ID, name, start time
- Memory usage (working set)
- Thread count and thread states
- CPU time consumption
- Responsiveness status
- Command line arguments
3. Hang Pattern Detection
Implement automated detection for common hang scenarios:
Process-related patterns:
- Process count explosion (too many new processes)
- Unresponsive processes (not responding to Windows messages)
- Memory spikes (processes consuming >500MB unexpectedly)
- Thread explosion (processes with >50 threads)
Performance patterns:
- High CPU usage without progress indication
- Consistent timing vs. erratic timing between checks
- BuildResult state changes (or lack thereof)
4. Diagnostic Data Collection
Thread Analysis:
- Current process thread states (waiting, running, blocked)
- Wait reasons for blocked threads
- Thread count changes over time
- High ratio of waiting threads (potential deadlock indicator)
Process Dumps:
- Automatic dump creation at 6-second and 10-second marks
- Use
dotnet-dump
tool if available - Final dumps if operation never completes
- Store dumps with meaningful names including PID, timestamp, and context
Event Tracing:
- Comprehensive event logging with precise timestamps
- State transitions during cancellation process
- BuildResult state changes
- Process lifecycle events
5. Enhanced Failure Analysis
When the test fails, provide:
- Root cause classification: Timing issue vs. genuine hang
- Hang pattern summary: Which patterns were detected
- Process forensics: What processes were involved and their states
- Actionable recommendations: Retry with longer timeout vs. file MSBuild bug
6. Comprehensive Logging
All diagnostic information should be logged to _testOutput
with clear structure:
- Event timeline with elapsed time markers
- Process state summaries at key intervals
- Hang detection results with explanations
- File paths for created dumps
- System environment context (CI/local, CPU cores, memory)
Technical Implementation Details
Core Method Signature
private bool WaitWithMSBuildHangDetection(
BuildSubmission asyncResult,
string operationName,
List<(DateTime Time, string Event, string Details)> eventLog)
Key Data Structures
public class MSBuildProcessInfo
{
public int Id { get; set; }
public string Name { get; set; }
public DateTime StartTime { get; set; }
public long WorkingSetMB { get; set; }
public int ThreadCount { get; set; }
public TimeSpan CpuTime { get; set; }
public bool IsResponding { get; set; }
public string CommandLine { get; set; }
}
Required Dependencies
System.Diagnostics
for process monitoringSystem.Management
for Windows command line detection (may require NuGet package)dotnet-dump
tool for dump collection (external dependency)
Success Criteria
Functional Requirements
- Reliable hang detection: Correctly identifies when MSBuild is genuinely hung vs. just slow
- Comprehensive diagnostics: Collects sufficient data to analyze MSBuild issues
- Automatic dump collection: Creates process dumps when hangs are detected
- Backward compatibility: Test still validates original MSB4181/MSB5021 behavior
Quality Requirements
- Clear failure messages: When test fails, the message clearly indicates if it's a hang or timing issue
- Forensic data: Sufficient diagnostic information for MSBuild team to investigate
- Performance impact: Diagnostic overhead doesn't significantly slow down normal test execution
- Cross-platform support: Works on Windows and Unix-like systems (with appropriate fallbacks)
Expected Outcomes
For Timing Issues
- Test succeeds on retry with longer timeout
- Logs show normal BuildResult progression
- No hang patterns detected
- Recommendation: Increase base timeout for environment
For Genuine MSBuild Hangs
- Consistent hang patterns detected across runs
- Process dumps available for analysis
- Thread analysis shows blocked/waiting threads
- BuildResult remains unchanged during extended monitoring
- Clear evidence for filing MSBuild bug report
Diagnostic Artifacts
- Process dump files (
.dmp
) with timestamps and context - Detailed execution logs with timing analysis
- Hang pattern analysis with specific root cause indicators
- System environment context for reproduction
Acceptance Tests
- Normal completion: Test passes quickly when cancellation works properly
- Slow environment: Test succeeds with longer timeout on slow systems, logs explain timing
- Genuine hang: Test fails with comprehensive diagnostics, creates dumps, identifies hang patterns
- Process monitoring: All MSBuild-related processes are tracked throughout execution
- Dump creation: Process dumps are successfully created when hangs are detected
- Cross-platform: Basic functionality works on both Windows and Linux build agents
This enhanced test will provide definitive evidence to determine whether the timeout failures represent a test infrastructure issue or a genuine MSBuild cancellation bug.
This dump data should be available in build artifacts
Activity
surayya-MS commentedon Jan 15, 2024
This test was previously reported as flaky and there was an attempt to fix it #9298
JaynieBai commentedon Mar 11, 2024
Still failed in latest build https://dev.azure.com/dnceng-public/public/_build/results?buildId=588018&view=ms.vss-test-web.build-test-results-tab&runId=14181820&resultId=104008
Error message
Shouldly.ShouldAssertException : isSubmissionComplated\r\n should be\r\nTrue\r\n but was\r\nFalse\r\n\r\nAdditional Info:\r\n Waiting for that the build submission is completed failed in the timeout period 2000 ms.
Stack trace
at Microsoft.Build.UnitTests.BackEnd.TaskBuilder_Tests.CanceledTasksDoNotLogMSB4181() in D:\a\1\s\src\Build.UnitTests\BackEnd\TaskBuilder_Tests.cs:line 189
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAtt
f-alizada commentedon Apr 18, 2024
Another record of failed test: https://dev.azure.com/dnceng-public/public/_build/results?buildId=646890&view=ms.vss-test-web.build-test-results-tab