Processes stuck forever #4974

Closed
jloisel opened this issue Jun 1, 2016 · 15 comments
Labels
area/storage, kind/bug, status/autoclosed

Comments


jloisel commented Jun 1, 2016

Rancher Version: v1.0.2

Docker Version: 1.11.0

OS and where are the hosts located? (cloud, bare metal, etc): Amazon AWS with Ubuntu 14.04.4LTS (3.13.0-77-generic)

Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB) Single node rancher, with both server and node on the same machine.

Environment Type: (Cattle/Kubernetes/Swarm/Mesos) Cattle

Steps to Reproduce: When rebooting the machine, Rancher systematically fails to reconnect the convoy volume. I must stop the service whose container requires the convoy volume, delete the container, and restart the service to create a new container. Only then is the convoy volume bound correctly. We're using convoy 0.4.

Because of this issue, one process stuck forever stacks up on every reboot. Some of them have been stuck for months now.

Results:
[screenshot: rancher processes stuck]

The Rancher server logs are full of entries like these:

time="2016-06-01T09:05:02Z" level=info msg="Purging Machine" eventId=a1407962-047e-4ed4-b59e-9dc474e22ec1 resourceId=1ph1742
time="2016-06-01T09:05:02Z" level=info msg="Extracting /var/lib/cattle/machine/digital-fluid/config.json"
time="2016-06-01T09:05:02Z" level=error msg="Error processing event" err="Error reinitializing config (OpenFile). Config Dir: /var/lib/cattle/machine. File: config.json. Error: open /var/lib/cattle/machine/digital-fluid/config.json: no such file or directory" eventId=a1407962-047e-4ed4-b59e-9dc474e22ec1 eventName="physicalhost.remove;handler=goMachineService" resourceId=1ph1742
2016-06-01 09:05:02,863 ERROR [:] [] [] [] [ServiceReplay-3] [c.p.e.p.i.DefaultProcessInstanceImpl] final ExitReason is null, should not be
2016-06-01 09:05:02,864 ERROR [:] [] [] [] [ServiceReplay-3] [i.c.p.e.e.i.ProcessEventListenerImpl] Unknown exception running process [physicalhost.remove:118224] on [1742] java.lang.IllegalStateException: Attempt to cancel when process is still transitioning
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:191) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:158) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:108) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:74) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.server.impl.ProcessInstanceParallelDispatcher$1.runInContext(ProcessInstanceParallelDispatcher.java:27) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_101]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_101]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_101]

Expected:
Get a way to kill / stop zombie processes.

@deniseschannon added kind/bug and area/storage labels Jun 20, 2016

pujan14 commented Jul 11, 2016

Same here with
Rancher - 1.1.0
Cattle - 0.165.4
Docker - 1.11.2

@marioapardo

Hi, is there a command to stop the process, or a fix?

Rancher - 1.1.2
Docker - 1.11.2

[screenshot: 27-07-2016 02:02:29]


sujaisd commented Aug 20, 2016

Same here...
Rancher 1.1.2
Docker 1.11.2

[screenshot: 2016-08-20 7:50:22 PM]


sujaisd commented Aug 29, 2016

Is there any plan to fix this?
The zombie processes keep filling the logs on the server.


CBR09 commented Sep 11, 2016

Same with me. It's terrible and makes the system slow.

@whiteadam

Same here. It's happened to me on several versions of Rancher, so I don't think it's a new bug. I'd like to know how to manually clear them if anyone has an idea.

@will-chan modified the milestone: Unscheduled Oct 8, 2016

zlberto commented Oct 23, 2016

Same with me, containers are stuck removing forever. Is there any news about this issue?



sujaisd commented Oct 23, 2016

I am not sure whether this is a proper fix, but it did solve my problem.

I turned off the Rancher server and used the following MySQL query to fix the zombie processes:

UPDATE process_instance SET exit_reason='DONE', end_time=NOW()  WHERE exit_reason = 'UNKNOWN_EXCEPTION';

Further, for cases where there was no exit reason and the process never ended, I tried the following, which works for all cases:

UPDATE process_instance SET exit_reason='DONE', end_time=NOW()  WHERE end_time is NULL;

Starting the Rancher server again kept the log normal, and the background process never restarted for these failed ones.

Caution: This may have side effects, and a few unexpected behaviors may pop up because of the manipulated data. I took the risk because the situation was bad, but a proper solution through the product would be the right one.
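
Before running either UPDATE, it may help to preview which rows would be touched. A minimal sketch: only exit_reason and end_time are confirmed by the queries above; the other column names (id, process_name, resource_type, resource_id, start_time) are assumptions about the Cattle schema and may differ in your version.

-- Preview the stuck process instances before marking them DONE.
-- Column names other than exit_reason and end_time are assumptions.
SELECT id, process_name, resource_type, resource_id, start_time, exit_reason
FROM process_instance
WHERE end_time IS NULL;

Running the SELECT first makes it easier to see how many rows the UPDATE would mark as DONE and whether any of them belong to processes that are still legitimately running.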


zlberto commented Oct 23, 2016

The processes have been removed after running

UPDATE process_instance SET exit_reason='DONE', end_time=NOW() WHERE end_time is NULL;

But the containers are still stuck removing, just as before running this query.



sujaisd commented Oct 23, 2016

@Modomu - Can you confirm all your hosts are connected and you are able to reach them?
Check the hosts screen; this can occur if there is any connectivity issue.

Also, there may be Docker filesystem issues at times, which can cause the container to never start properly again.

Your case is different than mine.


zlberto commented Oct 23, 2016

I have only one host, and it is active. When I deploy the application, Rancher needs several tries to do it. The containers that throw an error then stay stuck in the removing state.

This is my host screen:

I need to solve this in order to deploy the application.


sujaisd commented Oct 23, 2016

@Modomu - I think you need the Rancher team or an expert to look into it; I am not sure about the cause.


zlberto commented Oct 23, 2016

Thanks @sujaisd

Author

jloisel commented Oct 23, 2016

It's probably difficult for the Rancher team to fix this, as there can be multiple root causes for a process getting stuck. I believe they tend to work on fixing whichever issue causes the process to get stuck, rather than providing a kill switch to silence processes that are saying "hey, there is something wrong here!".

@deniseschannon

With the release of Rancher 2.0, development on v1.6 is limited to critical bug fixes and security patches.
