Skip to content
This repository has been archived by the owner on Sep 21, 2021. It is now read-only.

Cleanup Thread "crashes" #1156

Closed
jannickHo opened this issue Feb 13, 2020 · 1 comment
Closed

Cleanup Thread "crashes" #1156

jannickHo opened this issue Feb 13, 2020 · 1 comment

Comments

@jannickHo
Copy link

🐛 Bug Report

After some time(it varies from 1 - 30 days) the zalenium hub stops to cleanup its nodes.

To Reproduce

Cant be reproduced/triggered manually.

Expected behavior

After a test run, the node should be cleaned up and removed by the hub

Environment

OS: centos 7.6.1810
Zalenium Image Version(s): 3.141.59v
Docker version: Docker version 18.09.1, build 4c52b90
Docker-Compose version (if applicable): docker-compose version 1.23.2, build 1110ad01
Usage: two sequences every 5 minutes
Exact Docker command to start the containers:

  zalenium:
    privileged: True
    image: dosel/zalenium:3.141.59v
    container_name: zalenium
    restart: always
    command: "start --maxDockerSeleniumContainers 10 --desiredContainers 5 --debugEnabled true"
    ports:
      - 4444:4444
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
      - "/tmp/videos:/home/seluser/videos"
    environment:
      - CHECK_CONTAINERS_INTERVAL=1000
      - ZALENIUM_EXTRA_JVM_PARAMS=-Xmx10g

Detailed Description:

After analysing the debug logs I suspect the "copying files" to be the cause/trigger.
Befor the cleanup process stops working you have following log entries several times:

21:19:43.742 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.proxy.AutoStartProxySet - Checking containers...
	21:19:43.742 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.proxy.AutoStartProxySet - Checking 5 containers.
	21:19:44.749 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.proxy.AutoStartProxySet - Checking containers...
	21:19:44.749 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.proxy.AutoStartProxySet - Checking 5 containers.

Then the "copying files" throws an warning (the video from that node has been copied allready):

21:19:44.493 [http://172.30.0.11:40761] WARN  d.z.e.z.p.DockerSeleniumRemoteProxy - Error while copying the video
	java.nio.file.FileAlreadyExistsException: /home/seluser/videos/zalenium_SELPWKESL01_GRID_firefox_LINUX_20200101211850539_COMPLETED.mp4
			at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88)
			at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
			at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
			at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
			at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
			at java.nio.file.Files.newOutputStream(Files.java:216)
			at java.nio.file.Files.copy(Files.java:3016)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.copyVideos(DockerSeleniumRemoteProxy.java:644)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.processContainerAction(DockerSeleniumRemoteProxy.java:614)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.processContainerAction(DockerSeleniumRemoteProxy.java:601)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.videoRecording(DockerSeleniumRemoteProxy.java:565)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.cleanupNode(DockerSeleniumRemoteProxy.java:759)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.afterSession(DockerSeleniumRemoteProxy.java:400)
			at org.openqa.grid.internal.TestSlot.performAfterSessionEvent(TestSlot.java:253)
			at de.zalando.ep.zalenium.registry.ZaleniumRegistry._release(ZaleniumRegistry.java:193)
			at de.zalando.ep.zalenium.registry.ZaleniumRegistry.lambda$0(ZaleniumRegistry.java:177)
			at java.lang.Thread.run(Thread.java:748)

After that warning no "checking containers" log entries pop up anymore.

And here the logs for a node:

16:29:26.788 [http://172.27.0.11:40001] DEBUG d.z.e.z.c.DockerContainerClient - http://172.27.0.11:40001 video-rec: stopped
	--40001LOG 16:29:23:986912588 Trapped SIGTERM or SIGINT so shutting down ffmpeg gracefully...
	--40001LOG 16:29:23:991593194 Will kill VID_TOOL_PID=647 ...
	--40001LOG 16:29:24:998601310 Tried to kill -SIGTERM VID_TOOL_PID=647
	--40001LOG 16:29:25:003496489 Waiting up to 6s for VID_TOOL_PID=647 to end with SIGTERM...
	--40001LOG 16:29:25:012840069 wait_pid successfully managed to SIGTERM:VID_TOOL_PID=647 within less than 6s
	--40001LOG 16:29:25:017483677 Will try to fix the videos...
	--40001LOG 16:29:25:530947578 Fixing perms for /home/seluser/videos/vid_chrome_or_firefox_40001.mkv*
	--40001LOG 16:29:25:542825162 Changing video encoding from mkv to mp4...
	--40001LOG 16:29:26:467943099 Conversion from mkv to mp4 succeeded!
	--40001LOG 16:29:26:472634368 Cleaning up /home/seluser/videos/vid_chrome_or_firefox_40001.mkv ...
	--40001LOG 16:29:26:479091184 Optimizing /home/seluser/videos/vid_chrome_or_firefox_40001.mp4 for HTTP streaming...
	--40001LOG 16:29:26:500525487 Succeeded to mp4box_retry.sh within 8s
	--40001LOG 16:29:26:505331225 ffmpeg shutdown complete.
	Input #0, matroska,webm, from '/home/seluser/videos/vid_chrome_or_firefox_40001.mkv':
	Output #0, mp4, to '/home/seluser/videos/vid_chrome_or_firefox_40001.mp4':
	16:29:26.940 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Video file copied to: /home/seluser/videos/zalenium_SELPWKESL01_GRID_firefox_LINUX_20191120162848196_COMPLETED.mp4
	16:29:26.944 [http://172.27.0.11:40001] DEBUG d.z.e.z.c.DockerContainerClient - http://172.27.0.11:40001 [bash, -c, transfer-logs.sh]
	16:29:27.052 [http://172.27.0.11:40001] DEBUG d.z.e.z.c.DockerContainerClient - http://172.27.0.11:40001 bash: transfer-logs.sh: command not found
	16:29:27.110 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Logs copied to: /home/seluser/videos/logs/zalenium_SELPWKESL01_GRID_firefox_LINUX_20191120162848196_COMPLETED
	16:29:27.116 [http://172.27.0.11:40001] DEBUG d.z.e.z.c.DockerContainerClient - http://172.27.0.11:40001 [bash, -c, cleanup-container.sh]
	16:29:27.277 [http://172.27.0.11:40001] DEBUG d.z.e.z.c.DockerContainerClient - http://172.27.0.11:40001

this part is not happening after the above descriped error which normally does for "working" nodes:

	16:29:27.943 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Proxy has reached max test sessions.
	16:29:27.943 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Shutting down node due to proxy has reached max test sessions.
	16:29:30.348 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Marking the node as down because it was stopped after 1 tests.
	16:29:30.348 [http://172.27.0.11:40001] DEBUG d.z.e.z.registry.ZaleniumRegistry - Cleaning up stale test sessions on the unregistered node http://172.27.0.11:40001
	16:29:30.348 [http://172.27.0.11:40001] DEBUG d.z.e.z.proxy.AutoStartProxySet - Stopping removed container [http://172.27.0.11:40001
	16:29:30.352 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - stopPolling() deactivated
	16:29:30.994 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.c.DockerContainerClient - Container ContainerCreationStatus [isCreated=true, containerName=zalenium_bpjaWa, containerId=e22214646faed56e2016a86b53b8955bf1e3edb21bb4036630091119cc5c49b3, nodePort=40001] not found - it is terminated.

Workaround

Restarting the hub

Possible Solution

surround the copying file command with an if "file exists" and not only if the folder exists (DockerSeleniumRemoteProxy:658)

diemol added a commit that referenced this issue Mar 21, 2020
diemol added a commit that referenced this issue Mar 21, 2020
@diemol
Copy link
Contributor

diemol commented Mar 21, 2020

Followed your suggestion and should be fixed by #1165

@diemol diemol closed this as completed Mar 21, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants