20170717 try surf (junk) #1

Closed
rfay wants to merge 26 commits into master from 20170717_try_surf

Conversation

Owner

@rfay rfay commented Jul 17, 2017

The Problem:

The Fix:

The Test:

Automation Overview:

Related Issue Link(s):

Release/Deployment notes:

@rfay rfay force-pushed the 20170717_try_surf branch 3 times, most recently from 89ace32 to ed6399b Compare July 17, 2017 23:45
@rfay rfay force-pushed the 20170717_try_surf branch from ed6399b to ea7908b Compare July 18, 2017 16:49
@rfay rfay force-pushed the 20170717_try_surf branch from b9381e8 to daeefaf Compare July 18, 2017 16:59
@rfay rfay force-pushed the 20170717_try_surf branch from daeefaf to 41ec69c Compare July 18, 2017 17:01
@rfay rfay force-pushed the 20170717_try_surf branch from 42c8dca to 7a64108 Compare July 18, 2017 18:03
@rfay rfay closed this Aug 7, 2017
@rfay rfay deleted the 20170717_try_surf branch January 4, 2018 21:30
rfay pushed a commit that referenced this pull request Apr 30, 2018
Problem:
```
The current host header value does not match the configured trusted hosts pattern! Check the pattern defined in $GLOBALS['TYPO3_CONF_VARS']['SYS']['trustedHostsPattern'] and adapt it, if you want to allow the current host header 'website.ddev.local:8080' for your installation.
```
rfay added a commit that referenced this pull request Apr 1, 2026
…eContainer 30m hangs

## The Issue

- Related to ddev#8265 - ContainerInspect polling still produces 30m hangs on Lima/Colima-VZ

On macOS with Lima (both lima-VZ and colima-VZ), `RunSimpleContainer` calls for
trivial commands (cat a file, ls, push traefik config) are timing out at exactly
30 minutes with: "timed out after 30m0s waiting for container X to stop"

PR ddev#8265 replaced `ContainerWait` (which hung indefinitely) with `ContainerInspect`
polling and a 30-minute context deadline. The 30m timeout now fires where previously
we saw 4h hangs, but the root cause is not yet resolved.

## How This PR Solves The Issue

Adds `util.Debug` logging around the `ContainerInspect` polling loop to distinguish
between two failure modes:

**Mode A**: `ContainerInspect` blocks on the socket proxy (Lima/Colima) and doesn't
return until the 30-minute context deadline fires. Each individual call blocks.
Symptom: "attempt #1" is logged, but "returned after" is not logged until ~30m later.

**Mode B**: `ContainerInspect` returns quickly but reports `Running=true` for 30
minutes because the Docker daemon on Lima has stale/incorrect container state.
Symptom: rapid "returned after Xms" messages all showing Running=true.

With `DDEV_DEBUG=true` the logs will show which mode is occurring, enabling a
targeted fix.
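
As a rough illustration of how the two symptoms differ, the sketch below classifies a series of polling attempts by how long each call took and what it reported. The `observation` type, the `classify` function, and the `slowCall` threshold are hypothetical names introduced for this example; they are not part of ddev or of the logging added here.

```go
package main

import (
	"fmt"
	"time"
)

// observation is a hypothetical record of one polling attempt.
type observation struct {
	elapsed time.Duration // how long the inspect call took to return
	running bool          // the Running value the call reported
}

// classify guesses the failure mode from a series of attempts.
// slowCall is an assumed threshold for "this call blocked".
func classify(obs []observation, slowCall time.Duration) string {
	if len(obs) == 0 {
		return "no data"
	}
	for _, o := range obs {
		if o.elapsed >= slowCall {
			return "Mode A" // at least one individual call blocked
		}
	}
	for _, o := range obs {
		if !o.running {
			return "healthy" // quick calls, and the container was seen stopped
		}
	}
	return "Mode B" // quick calls that always report Running=true
}

func main() {
	blocked := []observation{{elapsed: 30 * time.Minute, running: true}}
	stale := []observation{
		{elapsed: 4 * time.Millisecond, running: true},
		{elapsed: 3 * time.Millisecond, running: true},
	}
	fmt.Println(classify(blocked, 10*time.Second))
	fmt.Println(classify(stale, 10*time.Second))
}
```

In practice the same judgment would be made by reading the `DDEV_DEBUG=true` output by eye; the code only spells out the decision rule.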

## Candidate Fixes (to be applied once root cause is confirmed)

### If Mode A (ContainerInspect blocks):

The fix is a per-call short timeout using goroutines, since Go context cancellation
may not unblock a stuck OS-level socket read on Lima's proxy:

```go
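const perCallTimeout = 10 * time.Second

poll: // label so the inner select can leave the for loop
for {
    // Buffered so an abandoned goroutine can still send and exit.
    ch := make(chan inspectResult, 1)
    go func() {
        callCtx, cancel := context.WithTimeout(context.Background(), perCallTimeout)
        defer cancel()
        info, err := apiClient.ContainerInspect(callCtx, c.ID, ...)
        ch <- inspectResult{info, err}
    }()
    select {
    case <-waitCtx.Done(): return timeout error
    case res := <-ch:
        if res.err == nil && !res.info.State.Running {
            break poll // a bare break would only exit the select
        }
        // err or still running: fall through to tick
    case <-time.After(perCallTimeout):
        // call is stuck despite callCtx: abandon it and fall through to tick
    }
    select {
    case <-waitCtx.Done(): return timeout error
    case <-tickChan.C:
    }
}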
```

The goroutine leak is bounded (at most timeout/perCallTimeout goroutines per
`RunSimpleContainer` call, i.e. 180 with a 30-minute deadline and a 10-second
per-call timeout) and acceptable. If the container exits and `ContainerInspect`
subsequently hangs once, the goroutine for that call leaks but the next call
returns quickly and we proceed.
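
To make the per-call timeout pattern concrete, here is a minimal self-contained sketch that can be run without Docker. Everything in it is an assumption for illustration: `inspectFunc` is a stub standing in for `ContainerInspect`, `waitForStop` and `simulate` are hypothetical names, and the 20ms/5ms durations are chosen only so the demo finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// inspectFunc stands in for a ContainerInspect call (hypothetical stub).
type inspectFunc func(ctx context.Context) (running bool, err error)

type inspectResult struct {
	running bool
	err     error
}

// waitForStop polls inspect until it reports running=false or waitCtx ends.
// Each call is abandoned after perCall, even if it ignores its own context.
func waitForStop(waitCtx context.Context, inspect inspectFunc, perCall, tick time.Duration) error {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		ch := make(chan inspectResult, 1) // buffered: abandoned goroutines can still send
		go func() {
			callCtx, cancel := context.WithTimeout(context.Background(), perCall)
			defer cancel()
			running, err := inspect(callCtx)
			ch <- inspectResult{running, err}
		}()
		select {
		case <-waitCtx.Done():
			return errors.New("timed out waiting for container to stop")
		case res := <-ch:
			if res.err == nil && !res.running {
				return nil
			}
		case <-time.After(perCall):
			// Stuck call: leave its goroutine behind and poll again.
		}
		select {
		case <-waitCtx.Done():
			return errors.New("timed out waiting for container to stop")
		case <-ticker.C:
		}
	}
}

// simulate runs waitForStop against a stub whose first stuckCalls calls hang
// (ignoring their context); later calls report a stopped container.
func simulate(stuckCalls, totalMs int) error {
	var calls int32
	inspect := func(_ context.Context) (bool, error) {
		if int(atomic.AddInt32(&calls, 1)) <= stuckCalls {
			time.Sleep(time.Hour) // behaves like a stuck socket read
		}
		return false, nil
	}
	waitCtx, cancel := context.WithTimeout(context.Background(), time.Duration(totalMs)*time.Millisecond)
	defer cancel()
	return waitForStop(waitCtx, inspect, 20*time.Millisecond, 5*time.Millisecond)
}

func main() {
	fmt.Println("one stuck call:", simulate(1, 2000))
	fmt.Println("always stuck:", simulate(1000, 100))
}
```

The first simulation should recover after one abandoned call, while the second should end with the overall timeout error, leaving only a handful of sleeping goroutines behind.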

### If Mode B (stale Running=true):

The Docker daemon on Lima isn't getting container exit events. Options:
- Use `docker` CLI via `exec.CommandContext` to check state (fresh socket connection each call)
- Force-kill the container after a shorter threshold (e.g. 60s) if it's still "Running"
  but was started for a trivial command
- Investigate Lima's Docker daemon event propagation
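
The first option could look roughly like the sketch below, which shells out to `docker inspect --format '{{.State.Running}}'` for each check. The function names `isRunningViaCLI` and `parseRunning`, the 10-second timeout, and the container name in `main` are assumptions for illustration, not existing ddev code.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// parseRunning interprets the output of
// `docker inspect --format '{{.State.Running}}' <id>`.
func parseRunning(out string) (bool, error) {
	switch strings.TrimSpace(out) {
	case "true":
		return true, nil
	case "false":
		return false, nil
	default:
		return false, fmt.Errorf("unexpected docker inspect output: %q", out)
	}
}

// isRunningViaCLI asks the docker CLI for the container state. Each call
// starts a new process, so it does not reuse a long-lived client connection.
func isRunningViaCLI(ctx context.Context, containerID string) (bool, error) {
	callCtx, cancel := context.WithTimeout(ctx, 10*time.Second) // assumed timeout
	defer cancel()
	cmd := exec.CommandContext(callCtx, "docker", "inspect",
		"--format", "{{.State.Running}}", containerID)
	out, err := cmd.Output()
	if err != nil {
		return false, fmt.Errorf("docker inspect %s: %w", containerID, err)
	}
	return parseRunning(string(out))
}

func main() {
	// "example-container" is a placeholder name.
	running, err := isRunningViaCLI(context.Background(), "example-container")
	fmt.Println(running, err)
}
```

Whether a fresh CLI process actually sees correct state on Lima is exactly what the Mode B investigation would need to confirm.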

### Other considerations

- Both failures have been seen on lima-VZ and colima-VZ builds, not on other platforms
- The commands involved are trivial: read a file, list directory contents, push traefik config
- A container running `cat file && exit` should complete in <100ms

## Manual Testing Instructions

1. On a Lima or Colima-VZ Mac: `DDEV_DEBUG=true ddev start` for a project that triggers `GetExistingDBType` or Traefik config push
2. Look for `RunSimpleContainer: ContainerInspect attempt #1` in output
3. Check if "returned after" appears immediately or only after 30m
4. Report which mode is occurring

## Automated Testing Overview

No new tests - this is a diagnostic-only change to gather information for the fix.

## Release/Deployment Notes

Debug-only logging - no behavior change. Logs only appear with `DDEV_DEBUG=true`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rfay added a commit that referenced this pull request Apr 2, 2026
…eContainer 30m hangs

rfay added a commit that referenced this pull request Apr 11, 2026
…ip ci]

On Windows-native, findWindowsPortProcesses returning empty was treated
as "IN USE (unable to identify)" because the isPortFree check was gated
behind !nodeps.IsWindows(). This produced false positives for every free
port. Add an explicit isPortFree check on Windows before the
"unable to identify" path so that genuinely free ports report Available.
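
The ordering described above can be sketched as follows. This is an illustration under assumptions, not the ddev implementation: `isPortFree` here is a stand-in that simply tries to open a TCP listener on 127.0.0.1, `portStatus` is a hypothetical name, and the "IN USE by ..." message format is invented for the example.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// isPortFree reports whether a TCP listener can be opened on the port.
// Stand-in for illustration; the real check may differ.
func isPortFree(port int) bool {
	ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	_ = ln.Close()
	return true
}

// portStatus checks for a free port before falling back to
// "unable to identify", so an empty process list is not a false positive.
// The free parameter is injected to keep the function testable.
func portStatus(port int, processes []string, free func(int) bool) string {
	if len(processes) > 0 {
		return "IN USE by " + strings.Join(processes, ", ") // hypothetical format
	}
	if free(port) {
		return "Available"
	}
	return "IN USE (unable to identify)"
}

func main() {
	// Empty process list, as when process lookup finds nothing.
	fmt.Println(portStatus(8080, nil, isPortFree))
}
```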

Found during Windows native manual testing (test matrix scenario #1).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rfay added a commit that referenced this pull request Apr 15, 2026
…ip ci]
