Processes stuck forever #4974

Closed
jloisel opened this issue Jun 1, 2016 · 15 comments
Labels
area/storage, kind/bug, status/autoclosed

Comments


jloisel commented Jun 1, 2016

Rancher Version: v1.0.2

Docker Version: 1.11.0

OS and where are the hosts located? (cloud, bare metal, etc): Amazon AWS with Ubuntu 14.04.4LTS (3.13.0-77-generic)

Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB) Single node rancher, with both server and node on the same machine.

Environment Type: (Cattle/Kubernetes/Swarm/Mesos) Cattle

Steps to Reproduce: When rebooting the machine, Rancher systematically fails to reconnect the convoy volume. I must stop the service whose container requires the convoy volume, delete the container, and restart the service to create a new container. Only then is the convoy volume bound correctly. We're using convoy 0.4.

Because of this issue, one process stuck forever stacks up on every reboot. Some of them have been stuck for months now.

Results:
[screenshot: rancher processes stuck]

The Rancher server logs are full of entries like these:

time="2016-06-01T09:05:02Z" level=info msg="Purging Machine" eventId=a1407962-047e-4ed4-b59e-9dc474e22ec1 resourceId=1ph1742
time="2016-06-01T09:05:02Z" level=info msg="Extracting /var/lib/cattle/machine/digital-fluid/config.json"
time="2016-06-01T09:05:02Z" level=error msg="Error processing event" err="Error reinitializing config (OpenFile). Config Dir: /var/lib/cattle/machine. File: config.json. Error: open /var/lib/cattle/machine/digital-fluid/config.json: no such file or directory" eventId=a1407962-047e-4ed4-b59e-9dc474e22ec1 eventName="physicalhost.remove;handler=goMachineService" resourceId=1ph1742
2016-06-01 09:05:02,863 ERROR [:] [] [] [] [ServiceReplay-3] [c.p.e.p.i.DefaultProcessInstanceImpl] final ExitReason is null, should not be
2016-06-01 09:05:02,864 ERROR [:] [] [] [] [ServiceReplay-3] [i.c.p.e.e.i.ProcessEventListenerImpl] Unknown exception running process [physicalhost.remove:118224] on [1742] java.lang.IllegalStateException: Attempt to cancel when process is still transitioning
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:191) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:158) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:108) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:74) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.server.impl.ProcessInstanceParallelDispatcher$1.runInContext(ProcessInstanceParallelDispatcher.java:27) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_101]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_101]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_101]

Expected:
Get a way to kill / stop zombie processes.

@deniseschannon added kind/bug and area/storage labels Jun 20, 2016

pujan14 commented Jul 11, 2016

Same here with
Rancher - 1.1.0
Cattle - 0.165.4
Docker - 1.11.2

@marioapardo

Hi, is there a command to stop the process, or a fix?

Rancher - 1.1.2
Docker - 1.11.2

[screenshot: 27-07-2016 02:02:29]


sujaisd commented Aug 20, 2016

Same here...
Rancher 1.1.2
Docker 1.11.2

[screenshot: 2016-08-20 7:50:22 PM]


sujaisd commented Aug 29, 2016

Is there any plan to fix this?
The zombie processes keep filling the logs on the server.


CBR09 commented Sep 11, 2016

Same with me. It's terrible and makes the system slow.

@whiteadam

Same here. It's happened to me on several versions of Rancher, so I don't think it's a new bug. I'd like to know how to manually clear them if anyone has an idea.

@will-chan modified the milestone: Unscheduled Oct 8, 2016

zlberto commented Oct 23, 2016

Same with me, containers are stuck removing forever. Is there any news about this issue?



sujaisd commented Oct 23, 2016

I am not sure whether this is a proper fix, but it did solve my problem.

I turned off the Rancher server and used the following MySQL query to fix the zombie processes:

UPDATE process_instance SET exit_reason='DONE', end_time=NOW()  WHERE exit_reason = 'UNKNOWN_EXCEPTION';

Further, for cases where there was no exit reason and the process never ended, I tried the following, which works for all cases:

UPDATE process_instance SET exit_reason='DONE', end_time=NOW()  WHERE end_time is NULL;

Starting the Rancher server again kept the log normal, and the background process never restarted for these failed ones.

Caution: This may have side effects, and a few unexpected behaviors may pop up because of the manipulated data. I took the risk because the situation was bad, but a proper solution through the product would be the right one.
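
Before running either UPDATE, it may help to preview which rows would be touched. A minimal sketch: only exit_reason and end_time are confirmed by the queries above; the other column names (id, process_name, resource_type, resource_id, start_time) are assumptions about the Cattle schema and may differ in your version.

-- Preview the stuck process instances before marking them DONE.
-- Column names other than exit_reason and end_time are assumptions.
SELECT id, process_name, resource_type, resource_id, start_time, exit_reason
FROM process_instance
WHERE end_time IS NULL;

Running the SELECT first makes it easier to see how many rows the UPDATE would mark as DONE and whether any of them belong to processes that are still legitimately running.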


zlberto commented Oct 23, 2016

The processes have been removed after running

UPDATE process_instance SET exit_reason='DONE', end_time=NOW() WHERE end_time is NULL;

But the containers are still stuck removing, just as before running this query.



sujaisd commented Oct 23, 2016

@Modomu - Can you confirm all your hosts are connected and you are able to reach them?
Check the hosts screen; this can occur if there is any connectivity issue.

Also, there may be Docker filesystem issues at times, which can cause the container to never start properly again.

Your case is different than mine.


zlberto commented Oct 23, 2016

I have only one host, and it is active. When I deploy the application, Rancher needs several tries to do it. The containers that throw an error then stay stuck in the removing state.

This is my host screen:

I need to solve this in order to deploy the application.


sujaisd commented Oct 23, 2016

@Modomu - I think you need the Rancher team or an expert to look into it; I am not sure about the cause.


zlberto commented Oct 23, 2016

Thanks @sujaisd

Author

jloisel commented Oct 23, 2016

It's probably difficult for the Rancher team to fix this, as there can be multiple root causes for a process getting stuck. I believe they tend to work on fixing whichever issue causes the process to get stuck, rather than providing a kill switch to silence processes that are saying "hey, there is something wrong here!".

@deniseschannon

With the release of Rancher 2.0, development on v1.6 is limited to critical bug fixes and security patches.
