This repository has been archived by the owner on Mar 17, 2021. It is now read-only.

Provide a better error message when PV failed to be attached #557

Closed
l0rd opened this issue Mar 5, 2018 · 10 comments

@l0rd
Contributor

l0rd commented Mar 5, 2018

This is the message that is shown to the user:
[screenshot: the user only sees "PV error"]

In che-server logs there are no details:

{"@timestamp":"2018-02-01T14:56:20.083+00:00","@version":1,"message":"Failed to get the ID of the container running in the OpenShift pod","logger_name":"org.eclipse.che.api.workspace.server.WorkspaceManager","thread_name":"WorkspaceSharedPool-5","level":"ERROR","level_value":40000,"stack_trace":"org.eclipse.che.api.core.ServerException: Failed to get the ID of the container running in the OpenShift pod
	at org.eclipse.che.api.environment.server.CheEnvironmentEngine.startInstance(CheEnvironmentEngine.java:1005)
	at org.eclipse.che.api.environment.server.CheEnvironmentEngine.startEnvironmentQueue(CheEnvironmentEngine.java:797)
	at org.eclipse.che.api.environment.server.CheEnvironmentEngine.start(CheEnvironmentEngine.java:263)
	at org.eclipse.che.api.workspace.server.WorkspaceRuntimes.startEnvironmentAndPublishEvents(WorkspaceRuntimes.java:694)
	at org.eclipse.che.api.workspace.server.WorkspaceRuntimes.access$100(WorkspaceRuntimes.java:103)
	at org.eclipse.che.api.workspace.server.WorkspaceRuntimes$StartTask.call(WorkspaceRuntimes.java:986)
	at org.eclipse.che.api.workspace.server.WorkspaceRuntimes$StartTask.call(WorkspaceRuntimes.java:948)
	at org.eclipse.che.commons.lang.concurrent.CopyThreadLocalCallable.call(CopyThreadLocalCallable.java:31)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.eclipse.che.api.core.ServerException: Failed to get the ID of the container running in the OpenShift pod
	at org.eclipse.che.plugin.docker.machine.MachineProviderImpl.startService(MachineProviderImpl.java:401)
	at org.eclipse.che.api.environment.server.CheEnvironmentEngine.lambda$startEnvironmentQueue$6(CheEnvironmentEngine.java:768)
	at org.eclipse.che.api.environment.server.CheEnvironmentEngine.startInstance(CheEnvironmentEngine.java:942)
	... 11 common frames omitted
Caused by: org.eclipse.che.plugin.openshift.client.exception.OpenShiftException: Failed to get the ID of the container running in the OpenShift pod
	at org.eclipse.che.plugin.openshift.client.OpenShiftConnector.createContainer(OpenShiftConnector.java:706)
	at org.eclipse.che.plugin.docker.machine.MachineProviderImpl.createContainer(MachineProviderImpl.java:648)
	at org.eclipse.che.plugin.docker.machine.MachineProviderImpl.startService(MachineProviderImpl.java:357)
	... 13 common frames omitted
"}

But in the OpenShift event logs we can see that the problem is "Unable to mount volumes for pod"

[screenshot: OpenShift event "Unable to mount volumes for pod"]
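As a rough illustration of what a better error message could include, here is a minimal sketch of collecting the warning events of the workspace pod so they can be appended to the exception message instead of the generic "Failed to get the ID of the container running in the OpenShift pod". It assumes the fabric8 Kubernetes client that the OpenShift connector already uses; the helper name and wiring are hypothetical:

import io.fabric8.kubernetes.api.model.Event;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: gather the warning events of the workspace pod so that the
// OpenShiftException can carry the real cause (e.g. "FailedMount: Unable to mount
// volumes for pod ...") instead of a generic message.
public class PodEventReporter {

    public static String describePodFailure(String namespace, String podName) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            List<Event> events = client.events()
                    .inNamespace(namespace)
                    .withField("involvedObject.name", podName) // only events of this pod
                    .list()
                    .getItems();
            return events.stream()
                    .filter(e -> "Warning".equals(e.getType()))
                    .map(e -> e.getReason() + ": " + e.getMessage())
                    .collect(Collectors.joining("; "));
        }
    }
}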

@l0rd
Contributor Author

l0rd commented Mar 5, 2018

We need to verify how Che 6 behaves. Maybe the error is handled better in Che 6.

@ibuziuk
Member

ibuziuk commented Apr 5, 2018

I was able to reproduce the issue by starting a pod that executes a blocking operation before creating the workspace:

apiVersion: "v1"
kind: "Pod"
metadata:
  name: "mypod"
  labels:
    name: "mypod"
spec:
  containers:
    -
      name: "mypod"
      image: "registry.access.redhat.com/rhel7-atomic"
      command:
        - /bin/bash
        - '-c'
        - >-
          echo "Sleeping for a long time..."; sleep 10000; echo "Sleep finished"
      ports:
        -
          containerPort: 80
          name: "http-server"
      volumeMounts:
        -
          mountPath: "/projects"
          subPath: "worksworkspacecf3x7fobl5q5ults/projects"
          name: "claim-che-workspace"
  volumes:
    -
      name: "claim-che-workspace"
      persistentVolumeClaim:
        claimName: "claim-che-workspace"

OpenShift events:

[screenshot: OpenShift events]

Che server logs:

{"@timestamp":"2018-04-05T16:53:23.286+00:00","@version":1,"message":"Workspace 'ibuziuk@redhat.com/nbbiz' with id 'workspaceaicvnnmikpxrw1sa' created by user 'ibuziuk@redhat.com'","logger_name":"org.eclipse.che.api.workspace.server.WorkspaceManager","thread_name":"http-nio-8080-exec-8","level":"INFO","level_value":20000,"req_id":"6f9dadae-135a-43bf-82c1-ebd5a6eeb175","identity_id":"4f871a"}
{"@timestamp":"2018-04-05T16:53:25.335+00:00","@version":1,"message":"Starting workspace 'ibuziuk@redhat.com/nbbiz' with id 'workspaceaicvnnmikpxrw1sa' by user 'ibuziuk@redhat.com'","logger_name":"org.eclipse.che.api.workspace.server.WorkspaceRuntimes","thread_name":"http-nio-8080-exec-5","level":"INFO","level_value":20000,"req_id":"6f9dadae-135a-43bf-82c1-ebd5a6eeb175","identity_id":"4f871a"}
{"@timestamp":"2018-04-05T17:00:05.067+00:00","@version":1,"message":"Current reconnect backoff is 1000 milliseconds (T0)","logger_name":"io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager","thread_name":"scheduleReconnect|Executor for Watch 9692063","level":"INFO","level_value":20000}
{"@timestamp":"2018-04-05T17:01:25.756+00:00","@version":1,"message":"Failed to start Kubernetes runtime of workspace workspaceaicvnnmikpxrw1sa. Cause: Waiting for Kubernetes environment 'default' of the workspace'workspaceaicvnnmikpxrw1sa' reached timeout","logger_name":"org.eclipse.che.workspace.infrastructure.kubernetes.KubernetesInternalRuntime","thread_name":"WorkspaceSharedPool-6","level":"WARN","level_value":30000,"req_id":"6f9dadae-135a-43bf-82c1-ebd5a6eeb175","identity_id":"71a"}
{"@timestamp":"2018-04-05T17:01:26.372+00:00","@version":1,"message":"Workspace 'ibuziuk@redhat.com:nbbiz' with id 'workspaceaicvnnmikpxrw1sa' start failed","logger_name":"org.eclipse.che.api.workspace.server.WorkspaceRuntimes","thread_name":"WorkspaceSharedPool-6","level":"INFO","level_value":20000,"req_id":"6f9dadae-135a-43bf-82c1-ebd5a6eeb175","identity_id":"54f871a"}

[screenshot]

BTW, this approach of mounting the volume in a dummy pod and executing a blocking operation does not reproduce the problem on minishift (it is possible to mount the same volume from different pods without any problems).

@garagatyi

BTW, this approach of mounting the volume in a dummy pod and executing a blocking operation does not reproduce the problem on minishift (it is possible to mount the same volume from different pods without any problems).

Minishift is a single node installation. You can mount the same volume to any number of pods there.

@ghost

ghost commented Apr 7, 2018

We used to have events in machine logs in Che 6. @akorneta @garagatyi should we maybe bring them back? It is a critical piece for Che 6 on prem as well. It makes support's life easier.

@ibuziuk
Member

ibuziuk commented Apr 7, 2018

Hmmm... I did not know about that. Why were OpenShift events removed?

@ibuziuk
Member

ibuziuk commented Apr 9, 2018

It looks like publishing OpenShift events and tracking the pod "Failed" state were removed as part of the following PR [1]. The related upstream issue for bringing this functionality back is [2].

[1] eclipse-che/che#7629
[2] eclipse-che/che#7653

@ghost

ghost commented Apr 9, 2018

Raised priority on this one

@ibuziuk
Member

ibuziuk commented Apr 9, 2018

@eivantsov thanks, those issues are coupled but not fully related. E.g. AbnormalStopHandler will not be able to handle the volume mount / attach case, since it tracks only the state of the pod and throws an exception only when the pod has the "Failed" status. In case of FailedAttachVolume / FailedMount the pod status is still "ContainerCreating". It looks like there should be some UnrecoverableEventHandler which would track unrecoverable events and fail fast if such an event occurs before the timeout, as in the sketch below.
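A rough sketch of what such an UnrecoverableEventHandler could look like (a sketch only, not the actual implementation: it assumes the fabric8 Watcher API, and the exact set of unrecoverable reasons and the way the failure is propagated are made up for illustration):

import io.fabric8.kubernetes.api.model.Event;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Fails the workspace start as soon as an unrecoverable event is seen, instead of
// waiting for the pod to reach the "Failed" status (it never does on mount failures,
// it stays in "ContainerCreating") or for the start timeout to expire.
public class UnrecoverableEventHandler implements Watcher<Event> {

    // Hypothetical set of reasons treated as unrecoverable.
    private static final Set<String> UNRECOVERABLE_REASONS =
            new HashSet<>(Arrays.asList("FailedMount", "FailedAttachVolume", "FailedScheduling"));

    private final CompletableFuture<Void> startFuture;

    public UnrecoverableEventHandler(CompletableFuture<Void> startFuture) {
        this.startFuture = startFuture;
    }

    @Override
    public void eventReceived(Action action, Event event) {
        if (UNRECOVERABLE_REASONS.contains(event.getReason())) {
            // Fail fast with the real cause instead of the generic timeout message.
            startFuture.completeExceptionally(new RuntimeException(
                    event.getReason() + ": " + event.getMessage()));
        }
    }

    @Override
    public void onClose(KubernetesClientException cause) {
        // The watch is closed when the workspace start completes or fails; nothing to do here.
    }
}

The watch would be registered before the pod is created, e.g. with client.events().inNamespace(namespace).watch(new UnrecoverableEventHandler(startFuture)), and closed once the machine is running.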

@ghost

ghost commented Apr 9, 2018

@ibuziuk I see.

Yes, the idea is to get the OpenShift events related to a workspace pod (pulling image, container created, etc.), something that you have in the OpenShift console and that helps to understand why a pod failed to start.

@akorneta @sleshchenko you are welcome to share your thoughts

@ibuziuk
Member

ibuziuk commented Apr 19, 2018

The PR with the fix has been merged to master: eclipse-che/che#9426
The issue is going to be closed once the fix is on osio.

@l0rd l0rd closed this as completed May 11, 2018