
Some integration tests hanging on dashboard PRs #398

Closed · bobcatfish opened this issue May 20, 2020 · 1 comment

@bobcatfish (Contributor):

Expected Behavior

Prow integration test statuses should always eventually get updated instead of hanging indefinitely.

Actual Behavior

Notably in tektoncd/dashboard#1406 and tektoncd/dashboard#1403, the integration tests would just hang. Symptoms:

  • The status stays "pending" no matter how many /test commands are issued
  • Clicking "details" says the build is running and always shows only 36 lines of output

You can find the underlying pods by looking at the "hook" component's logs (see the Prow architecture docs).
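As a rough sketch, assuming hook runs as a deployment named "hook" in the default namespace of the prow cluster (both names are assumptions here, not confirmed from this setup), pulling those logs looks something like:

# Hypothetical: tail the hook component's logs to find the pod IDs
# associated with a PR (deployment name and namespace are assumptions)
kubectl --context prow -n default logs deploy/hook | grep dashboard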

The pods in question were stuck initializing until they eventually disappeared, apparently cleaned up by something, but the status was never updated.

For example:

(⎈ |euca:default)➜  Downloads kubectl --context prow get pod c90a44ca-9abe-11ea-b60d-a22e91c6f8b8
NAME                                   READY   STATUS     RESTARTS   AGE
c90a44ca-9abe-11ea-b60d-a22e91c6f8b8   0/2     Init:0/3   0          17m

Running kubectl describe on the pod shows events like:

  Type     Reason                  Age                  From                                          Message
  ----     ------                  ----                 ----                                          -------
  Normal   Scheduled               21m                  default-scheduler                             Successfully assigned default/c90a44ca-9abe-11ea-b60d-a22e91c6f8b8 to gke-prow-highmem-pool-45b2fab2-01f9
  Warning  FailedCreatePodSandBox  2m35s (x8 over 18m)  kubelet, gke-prow-highmem-pool-45b2fab2-01f9  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "c90a44ca-9abe-11ea-b60d-a22e91c6f8b8": operation timeout: context deadline exceeded
  Warning  FailedSync              7s (x11 over 2m21s)  kubelet, gke-prow-highmem-pool-45b2fab2-01f9  error determining status: rpc error: code = Unknown desc = Error: No such container: 287318f1a2d49db1df7c2737e3ef3a3ca0d721022cc15e91a827bff1c8bb5093
@bobcatfish (Contributor, Author):

Googling for failed to create a sandbox for pod "c90a44ca-9abe-11ea-b60d-a22e91c6f8b8": operation timeout: context deadline exceeded led me to kubernetes/kubernetes#79451, so I got the bright idea that maybe if I updated the nodes, they would come with a fix for this issue.

However, the fact that the FailedCreatePodSandBox error keeps repeating seems to hint that the kubelet is retrying, and it also looks like the fix shipped in the 1.14.4 release (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.14.md#v1144). The version of the nodes is somewhat lost in the sands of time, but I think it was at least 1.14.10, which would already include that fix.
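For the record, a quick way to check what version the nodes are actually running (using the same prow context as above):

# List nodes along with their kubelet versions; anything >= v1.14.4
# should already include the fix from kubernetes/kubernetes#79451
kubectl --context prow get nodes -o wide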

I upgraded the nodes anyway, and now it seems like things are working (@eddycharly plz reopen if I'm wrong), so either that fixed something or whatever it was stopped happening (for now).

(When I looked at the node this was running on, it looked like it had recently restarted, so maybe that was somehow the problem.)
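A sketch of how to inspect that node's state, using the node name from the events above:

# Describe the suspect node; the conditions and recent events here would
# show something like a recent kubelet restart or a NotReady period
kubectl --context prow describe node gke-prow-highmem-pool-45b2fab2-01f9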

Looking at https://tekton-releases.appspot.com/builds/tekton-prow/pr-logs/directory/pull-tekton-dashboard-integration-tests, the last few runs have actually completed, though you can see all the ones that never got updated as well:

[Screenshot: recent pull-tekton-dashboard-integration-tests runs, showing the last few completed alongside the ones that never got updated]

I'm gonna close this for now 🤞 but we can reopen if this keeps happening.
