Cleanup Thread "crashes" #1156

jannickHo · 2020-02-13T08:24:23Z

🐛 Bug Report

After some time(it varies from 1 - 30 days) the zalenium hub stops to cleanup its nodes.

To Reproduce

Cant be reproduced/triggered manually.

Expected behavior

After a test run, the node should be cleaned up and removed by the hub

Environment

OS: centos 7.6.1810
Zalenium Image Version(s): 3.141.59v
Docker version: Docker version 18.09.1, build 4c52b90
Docker-Compose version (if applicable): docker-compose version 1.23.2, build 1110ad01
Usage: two sequences every 5 minutes
Exact Docker command to start the containers:

  zalenium:
    privileged: True
    image: dosel/zalenium:3.141.59v
    container_name: zalenium
    restart: always
    command: "start --maxDockerSeleniumContainers 10 --desiredContainers 5 --debugEnabled true"
    ports:
      - 4444:4444
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
      - "/tmp/videos:/home/seluser/videos"
    environment:
      - CHECK_CONTAINERS_INTERVAL=1000
      - ZALENIUM_EXTRA_JVM_PARAMS=-Xmx10g

Detailed Description:

After analysing the debug logs I suspect the "copying files" to be the cause/trigger.
Befor the cleanup process stops working you have following log entries several times:

21:19:43.742 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.proxy.AutoStartProxySet - Checking containers...
	21:19:43.742 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.proxy.AutoStartProxySet - Checking 5 containers.
	21:19:44.749 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.proxy.AutoStartProxySet - Checking containers...
	21:19:44.749 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.proxy.AutoStartProxySet - Checking 5 containers.

Then the "copying files" throws an warning (the video from that node has been copied allready):

21:19:44.493 [http://172.30.0.11:40761] WARN  d.z.e.z.p.DockerSeleniumRemoteProxy - Error while copying the video
	java.nio.file.FileAlreadyExistsException: /home/seluser/videos/zalenium_SELPWKESL01_GRID_firefox_LINUX_20200101211850539_COMPLETED.mp4
			at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88)
			at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
			at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
			at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
			at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
			at java.nio.file.Files.newOutputStream(Files.java:216)
			at java.nio.file.Files.copy(Files.java:3016)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.copyVideos(DockerSeleniumRemoteProxy.java:644)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.processContainerAction(DockerSeleniumRemoteProxy.java:614)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.processContainerAction(DockerSeleniumRemoteProxy.java:601)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.videoRecording(DockerSeleniumRemoteProxy.java:565)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.cleanupNode(DockerSeleniumRemoteProxy.java:759)
			at de.zalando.ep.zalenium.proxy.DockerSeleniumRemoteProxy.afterSession(DockerSeleniumRemoteProxy.java:400)
			at org.openqa.grid.internal.TestSlot.performAfterSessionEvent(TestSlot.java:253)
			at de.zalando.ep.zalenium.registry.ZaleniumRegistry._release(ZaleniumRegistry.java:193)
			at de.zalando.ep.zalenium.registry.ZaleniumRegistry.lambda$0(ZaleniumRegistry.java:177)
			at java.lang.Thread.run(Thread.java:748)

After that warning no "checking containers" log entries pop up anymore.

And here the logs for a node:

16:29:26.788 [http://172.27.0.11:40001] DEBUG d.z.e.z.c.DockerContainerClient - http://172.27.0.11:40001 video-rec: stopped
	--40001LOG 16:29:23:986912588 Trapped SIGTERM or SIGINT so shutting down ffmpeg gracefully...
	--40001LOG 16:29:23:991593194 Will kill VID_TOOL_PID=647 ...
	--40001LOG 16:29:24:998601310 Tried to kill -SIGTERM VID_TOOL_PID=647
	--40001LOG 16:29:25:003496489 Waiting up to 6s for VID_TOOL_PID=647 to end with SIGTERM...
	--40001LOG 16:29:25:012840069 wait_pid successfully managed to SIGTERM:VID_TOOL_PID=647 within less than 6s
	--40001LOG 16:29:25:017483677 Will try to fix the videos...
	--40001LOG 16:29:25:530947578 Fixing perms for /home/seluser/videos/vid_chrome_or_firefox_40001.mkv*
	--40001LOG 16:29:25:542825162 Changing video encoding from mkv to mp4...
	--40001LOG 16:29:26:467943099 Conversion from mkv to mp4 succeeded!
	--40001LOG 16:29:26:472634368 Cleaning up /home/seluser/videos/vid_chrome_or_firefox_40001.mkv ...
	--40001LOG 16:29:26:479091184 Optimizing /home/seluser/videos/vid_chrome_or_firefox_40001.mp4 for HTTP streaming...
	--40001LOG 16:29:26:500525487 Succeeded to mp4box_retry.sh within 8s
	--40001LOG 16:29:26:505331225 ffmpeg shutdown complete.
	Input #0, matroska,webm, from '/home/seluser/videos/vid_chrome_or_firefox_40001.mkv':
	Output #0, mp4, to '/home/seluser/videos/vid_chrome_or_firefox_40001.mp4':
	16:29:26.940 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Video file copied to: /home/seluser/videos/zalenium_SELPWKESL01_GRID_firefox_LINUX_20191120162848196_COMPLETED.mp4
	16:29:26.944 [http://172.27.0.11:40001] DEBUG d.z.e.z.c.DockerContainerClient - http://172.27.0.11:40001 [bash, -c, transfer-logs.sh]
	16:29:27.052 [http://172.27.0.11:40001] DEBUG d.z.e.z.c.DockerContainerClient - http://172.27.0.11:40001 bash: transfer-logs.sh: command not found
	16:29:27.110 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Logs copied to: /home/seluser/videos/logs/zalenium_SELPWKESL01_GRID_firefox_LINUX_20191120162848196_COMPLETED
	16:29:27.116 [http://172.27.0.11:40001] DEBUG d.z.e.z.c.DockerContainerClient - http://172.27.0.11:40001 [bash, -c, cleanup-container.sh]
	16:29:27.277 [http://172.27.0.11:40001] DEBUG d.z.e.z.c.DockerContainerClient - http://172.27.0.11:40001

this part is not happening after the above descriped error which normally does for "working" nodes:

	16:29:27.943 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Proxy has reached max test sessions.
	16:29:27.943 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Shutting down node due to proxy has reached max test sessions.
	16:29:30.348 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - Marking the node as down because it was stopped after 1 tests.
	16:29:30.348 [http://172.27.0.11:40001] DEBUG d.z.e.z.registry.ZaleniumRegistry - Cleaning up stale test sessions on the unregistered node http://172.27.0.11:40001
	16:29:30.348 [http://172.27.0.11:40001] DEBUG d.z.e.z.proxy.AutoStartProxySet - Stopping removed container [http://172.27.0.11:40001
	16:29:30.352 [http://172.27.0.11:40001] DEBUG d.z.e.z.p.DockerSeleniumRemoteProxy - stopPolling() deactivated
	16:29:30.994 [AutoStartProxyPoolPoller] DEBUG d.z.e.z.c.DockerContainerClient - Container ContainerCreationStatus [isCreated=true, containerName=zalenium_bpjaWa, containerId=e22214646faed56e2016a86b53b8955bf1e3edb21bb4036630091119cc5c49b3, nodePort=40001] not found - it is terminated.

Workaround

Restarting the hub

Possible Solution

surround the copying file command with an if "file exists" and not only if the folder exists (DockerSeleniumRemoteProxy:658)

The text was updated successfully, but these errors were encountered:

… current flow

… current flow (#1165)

diemol · 2020-03-21T21:48:36Z

Followed your suggestion and should be fixed by #1165

diemol added a commit that referenced this issue Mar 21, 2020

Following #1156 suggestion, maybe it fixes the issue, should not hurt…

6a57515

… current flow

diemol added a commit that referenced this issue Mar 21, 2020

Following #1156 suggestion, maybe it fixes the issue, should not hurt…

c33b808

… current flow (#1165)

diemol closed this as completed Mar 21, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleanup Thread "crashes" #1156

Cleanup Thread "crashes" #1156

jannickHo commented Feb 13, 2020

diemol commented Mar 21, 2020

Cleanup Thread "crashes" #1156

Cleanup Thread "crashes" #1156

Comments

jannickHo commented Feb 13, 2020

🐛 Bug Report

To Reproduce

Expected behavior

Environment

Detailed Description:

Workaround

Possible Solution

diemol commented Mar 21, 2020