Skip to content
This repository has been archived by the owner on Sep 21, 2021. It is now read-only.

Zalenium versions that use Selenium 3.11 intermittently fail re: "No System TLS" #491

Closed
borisivan opened this issue Mar 20, 2018 · 45 comments

Comments

@borisivan
Copy link

borisivan commented Mar 20, 2018

Zalenium Image Version(s):
All versions that use Selenium 3.11

Docker Version:
17.09, others

Docker Command to start Zalenium:
docker, run, -tid, --name=obfuscated, --net=host, -p, 4444:4444, -v, /var/run/docker.sock:/var/run/docker.sock, -v, /usr/bin/docker:/usr/bin/docker, -v, /dev/shm:/dev/shm, -v, /tmp/videos/obfuscated:/home/seluser/videos, -v, /tmp/Downloads/obfuscated:/tmp/node/home/seluser/Downloads, --privileged, dosel/zalenium:latest, start, --chromeContainers, 7, --screenWidth, 1920, --seleniumImageName, elgalu/selenium:latest, --screenHeight, 1080, --firefoxContainers, 0, --keepOnlyFailedTests, true, --videoRecordingEnabled, true, --debugEnabled, false, --maxTestSessions, 150

Expected Behavior -

With 6 tests running in parallel across those 7 static containers, tests run normally. (same thing happens w/ 6 containers, I have 7 to help re: availability during cleanup)

Actual Behavior -

Out of ~240 tests, I will have 1-5 fail with the following footprint:

org.openqa.selenium.WebDriverException: cannot forward the request No System TLS
Command duration or timeout: 0 milliseconds
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.openqa.selenium.remote.ErrorHandler.createThrowable(ErrorHandler.java:214)
	at org.openqa.selenium.remote.ErrorHandler.throwIfResponseFailed(ErrorHandler.java:166)
	at org.openqa.selenium.remote.http.JsonHttpResponseCodec.reconstructValue(JsonHttpResponseCodec.java:40)
	at org.openqa.selenium.remote.http.AbstractHttpResponseCodec.decode(AbstractHttpResponseCodec.java:80)
	at org.openqa.selenium.remote.http.AbstractHttpResponseCodec.decode(AbstractHttpResponseCodec.java:44)
	at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:160)
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:601)
	at org.openqa.selenium.remote.RemoteWebDriver.findElements(RemoteWebDriver.java:398)
	at org.openqa.selenium.remote.RemoteWebDriver.findElementsByClassName(RemoteWebDriver.java:461)
	at org.openqa.selenium.By$ByClassName.findElements(By.java:383)
	at org.openqa.selenium.remote.RemoteWebDriver.findElements(RemoteWebDriver.java:359)
	at (number of frames obfuscated)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
	at org.testng.internal.MethodInvocationHelper$1.runTestMethod(MethodInvocationHelper.java:230)
	at org.springframework.test.context.testng.AbstractTestNGSpringContextTests.run(AbstractTestNGSpringContextTests.java:175)
	at org.testng.internal.MethodInvocationHelper.invokeHookable(MethodInvocationHelper.java:242)
	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:56)
	at org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:44)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.openqa.selenium.WebDriverException: cannot forward the request No System TLS
Build info: version: '3.10.0', revision: '176b4a9', time: '2018-03-02T19:03:16.397Z'
System info: host: 'ubuntu1604', ip: '10.141.56.42', os.name: 'Linux', os.arch: 'amd64', os.version: '4.15.5-041505-generic', java.version: '1.8.0_151'
Driver info: driver.version: unknown
	at org.openqa.grid.web.servlet.handler.RequestHandler.process(RequestHandler.java:143)
	at org.openqa.grid.web.servlet.DriverServlet.process(DriverServlet.java:86)
	at org.openqa.grid.web.servlet.DriverServlet.doPost(DriverServlet.java:70)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.seleniumhq.jetty9.servlet.ServletHolder.handle(ServletHolder.java:860)
	at org.seleniumhq.jetty9.servlet.ServletHandler.doHandle(ServletHandler.java:535)
	at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
	at org.seleniumhq.jetty9.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
	at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
	at org.seleniumhq.jetty9.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
	at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
	at org.seleniumhq.jetty9.servlet.ServletHandler.doScope(ServletHandler.java:473)
	at org.seleniumhq.jetty9.server.session.SessionHandler.doScope(SessionHandler.java:1564)
	at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
	at org.seleniumhq.jetty9.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
	at org.seleniumhq.jetty9.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.seleniumhq.jetty9.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.seleniumhq.jetty9.server.Server.handle(Server.java:530)
	at org.seleniumhq.jetty9.server.HttpChannel.handle(HttpChannel.java:347)
	at org.seleniumhq.jetty9.server.HttpConnection.onFillable(HttpConnection.java:256)
	at org.seleniumhq.jetty9.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
	at org.seleniumhq.jetty9.io.FillInterest.fillable(FillInterest.java:102)
	at org.seleniumhq.jetty9.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
	at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)
	at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)
	at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.seleniumhq.jetty9.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
	at org.seleniumhq.jetty9.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
	at org.seleniumhq.jetty9.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)

I have let the suite run continuously for days, and only 1 out of 15 has no failures with that footprint. All others have 1-5 (always different tests) test fail with that footprint, scattered across the suite.

Two observations:

1) This does not happen when I run the suite with the following versions:

Zalenium 3.10.0c Elgalu 3.10.0-p3 -- I've just run 5 iterations of the suite forcing those versions, and not one failure, though I'm used to seeing an occasional failure there too, just with a less descriptive footprint re: connection refused / unable to forward / etc.. But even those failures (which I haven't seen in 5 suite runs) happened with much less frequency. than the "No System TLS" error above, when using versions that support Selenium 3.11.

2) When the problem happens, it is almost always immediately after another browser is launched.

I have 6 tests running in parallel, each lasting ~30 seconds <-> 4 minutes. I have all webDriver creation, acquisition, and closing via Java ThreadLocal. But I suspect there is some threadsafe problem that happens directly during the time that the RemoteWebDriver is instantiated, and I suspect that the transition to okHttp is either a new problem, of exacerbated the existing intermittent problem.

For example, with my 6 tests running in parallel, I have ~ 10-20 log messages generated per second for the various 6 tests (I have the testNG thread ID in the log format, differentiating). The log message IMMEDIATELY before the problem is more often than not:

[11:22:46 INFO ProtocolHandshake ] [Forwarding newSession on session null to remote] Detected dialect: OSS

The message above is related to the browser being created for a different test than the test that will have the exception. The very next message is the stack trace from the exception:

org.openqa.selenium.WebDriverException: cannot forward the request No System TLS
Command duration or timeout: 0 milliseconds

and that stack trace shows me that it is from a different test -- not the test that was just opening the browser re: the "Detected dialect: OSS" message above.

Also, remember how I said that I use the java thread name in my logging format? ALL of my log messages have the thread name in them, like this:

[11:22:45 INFO obfuscacted ] [obfuscatedTestName-1] click: next

The "obfuscatedTestName-1" is the name testNG gives that thread, and there are 5 others, keeping the log messages straight.

The only other time I ever see a non testNG thread name anywhere in my entire log is that one message from selenium, where you can see that the threadname is different:

[11:22:46 INFO ProtocolHandshake ] [Forwarding newSession on session null to remote] Detected dialect: OSS

... in which selenium (I assume it's selenium) has chosen to rename the actual thread to: "Forwarding newSession on session null to remote" -- that's the name of the thread itself.

That message occurs for every browser created, but when I have the intermittent problem I've been describing in this report, that log message is often the log message right before an existing test (not the one creating a browser) hits that exception.

@diemol
Copy link
Contributor

diemol commented Mar 21, 2018

Hi @borisivan

That sounds like it could be related to root certificates or something similar? I am not 100% sure, but I see the exception comes from the HttpRequest, we would need to investigate.

@borisivan
Copy link
Author

Hi @diemol, thanks. I wasn't sure if that is coming from container<->container communication/forwarding, or from the selenium server back to the test, etc. Still not really sure.

I have successfully placed the root CA certificates on both the zalenium and the elgalu container, ran update-ca-certificates, and tagged this container (based on latest from both, today). In the suite, I specify that version:

... 6 other elgalu containers
2cf4badf90e5 elgalu/selenium:3.11.0-p4-999 "entry.sh" 24 minutes ago Up 24 minutes obfuscated_40006
4795e2be77c7 dosel/zalenium:3.11.0b-999 "entry.sh start --ch…" 24 minutes ago Up 24 minutes obfuscated
root#

I've verified that from the containers, I can now successfully reach the website I'm testing via https, which I couldn't do before.

before, from within the container, only curl http://the-site would succeed.... https would not.

now, from within the container, the curl command works with both http and https.

I also changed the tests to only attempt to access the site via http, instead of https... same thing. Occasional failure with:

org.openqa.selenium.WebDriverException: cannot forward the request No System TLS

I've also verified that it's added to the java keystore in the containers.

@matthew-horrocks
Copy link

matthew-horrocks commented Mar 26, 2018

We have also been finding tests fail intermittently with the "No System TLS" error (along with a bunch of other Java timeouts; not sure if they're related? https://gist.github.com/matthew-horrocks/b947646aacd07b5c4a4ad4751cb5d01a)

The failure seems to occur at a random point through the test. At the point of failure, screenshots of the test show the browser displaying a perfectly rendered page. Out of 1900 tests, the failures will occur in ~5% of tests.

We're using the dotnet bindings for Selenium with the tests being run from a Windows 2012R2 server.

Rolling back to dosel/zalenium:3.10.0c and the problem goes away.

Let me know what else you need.

Here is our docker info

sudo docker info
Containers: 22
 Running: 8
 Paused: 0
 Stopped: 14
Images: 8
Server Version: 17.12.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.13.0-1011-azure
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 55.02GiB
Name: nwdocker
ID: V273:AKHC:RNLR:DFAQ:LUCR:OWYA:BYCR:5YGF:427H:OXOK:ZBDX:UJNS
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Here is our docker-compose.yaml

cat docker-compose.yaml
 # Usage:
    #   docker-compose up --force-recreate
    version: '2.1'

    services:
      #--------------#
      zalenium:
        privileged: true
        restart: always
        image: dosel/zalenium:3.10.0c
        container_name: zalenium
        hostname: zalenium
        tty: true
        volumes:
          - /tmp/videos:/home/seluser/videos
          - /home/seluser/mounted:/tmp/node/home/seluser/mounted
          - /var/run/docker.sock:/var/run/docker.sock
          - /usr/bin/docker:/usr/bin/docker
        ports:
          - 4444:4444
        command: >
          start --desiredContainers 32
                --maxDockerSeleniumContainers 100
                --maxTestSessions 150
                --screenWidth 1920
                --screenHeight 1080
                --timeZone "Europe/London"
                --videoRecordingEnabled false
                --sauceLabsEnabled false
                --browserStackEnabled false
                --testingBotEnabled false
                --startTunnel false

@borisivan
Copy link
Author

borisivan commented Mar 26, 2018

Hi @diemol, taking a step back and looking at this, I think this is simply a different footprint for what has existed all along, which I still can't figure out. There are occasional failures where pre selenium 3.11 the footprint was along the lines of:

org.openqa.selenium.WebDriverException: cannot forward the request null
Command duration or timeout: 0 milliseconds
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.openqa.selenium.remote.ErrorHandler.createThrowable(ErrorHandler.java:214)
	at org.openqa.selenium.remote.ErrorHandler.throwIfResponseFailed(ErrorHandler.java:166)
	at org.openqa.selenium.remote.http.JsonHttpResponseCodec.reconstructValue(JsonHttpResponseCodec.java:40)
	at org.openqa.selenium.remote.http.AbstractHttpResponseCodec.decode(AbstractHttpResponseCodec.java:80)
	at org.openqa.selenium.remote.http.AbstractHttpResponseCodec.decode(AbstractHttpResponseCodec.java:44)
	at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:160)
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:601)
	at org.openqa.selenium.remote.RemoteWebElement.execute(RemoteWebElement.java:279)
	at org.openqa.selenium.remote.RemoteWebElement.getAttribute(RemoteWebElement.java:128)

and after going to selenium 3.11 with the switch to okHttp the footprint is:

org.openqa.selenium.WebDriverException: cannot forward the request No System TLS
Command duration or timeout: 0 milliseconds
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.openqa.selenium.remote.ErrorHandler.createThrowable(ErrorHandler.java:214)
	at org.openqa.selenium.remote.ErrorHandler.throwIfResponseFailed(ErrorHandler.java:166)
	at org.openqa.selenium.remote.http.JsonHttpResponseCodec.reconstructValue(JsonHttpResponseCodec.java:40)
	at org.openqa.selenium.remote.http.AbstractHttpResponseCodec.decode(AbstractHttpResponseCodec.java:80)
	at org.openqa.selenium.remote.http.AbstractHttpResponseCodec.decode(AbstractHttpResponseCodec.java:44)
	at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:158)
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:545)
	at org.openqa.selenium.remote.RemoteWebElement.execute(RemoteWebElement.java:279)
	at org.openqa.selenium.remote.RemoteWebElement.getText(RemoteWebElement.java:160)

This happens randomly in that it is never the same test, and sometimes all 240 tests will succeed without hitting it.

A couple others have logged issues from time to time re: the 'cannot forward' intermittent problem, and I think we need to find a way to narrow down what's happening.

I think the biggest question I have is: how can I tell if this error is relevant to zalenium / docker, versus a problem on the site I'm testing (or a load balancer that sits in front of it)? Specifically, when we get the "cannot forward" error, I'm not sure what to conclude from the stack trace.

If I think of the overall system as:

[a] the test (webDriver client) on a jenkins slave -> maven -> java -> testNG suite
[b] selenium-hub (zalenium master container)
[c] selenium-node (zalenium child container elgalu/selenium)
[d] the endpoint that services the URL (load balancer in front of pivotal cloud foundry)
[e] the application under test itself

And of course there is [x] -- the host that runs docker, which is where [a] [b] [c] live.

... where is the footprint failure message coming from, and what is is trying to tell us? Who can't forward a request to whom? I'm not sure if everything is working fine re: [a] [b] [c], and the failure message is relative to the load balancer / application saying that it's having a problem, or if the can't forward error is relative to some point of [b] or [c] saying that it can't forward something. Similar question for the "No System TLS" -- what entity is noticing a problem and making this declaration -- who is saying this, with regards to what attempted communication?

chrome_driver.log doesn't show a problem. Things appear fine, the test is moving along and then as soon as the problem is hit, the next set of messages in chrome_driver.log appear to be me setting the cookie re: test failed.

re: pre selenium 3.11 / 3.11, I can say that w/ 3.11 the problem happens about 2x more frequently, and I haven't adjusted the logic in my suite to successfully catch that footprint and mark the test as skipped yet, so at this point other than an experimental node that uses zalenium:latest, I've bound my production to use 3.10 (rebuilt with certs, though the certs have made no impact on anything regarding the frequency of this failure).

@borisivan
Copy link
Author

borisivan commented Mar 26, 2018

Hi @matthew-horrocks -- I also see the footprint (w 3.11) sometimes appear as:

unexpected end of stream on Connection, as in your github link above.

With that many containers I assume you're running tests in parallel -- do you happen to know if the problem seems to occur to session "x" immediately after session "y" either launches a new browser or closes the one it was using?

I wonder if when zalenium goes through it's logic of marking a container as in-use due to a new session being created (or freeing up on browser close), there is a small window of time in which there is some kind of lock or race condition leads to this problem.

edited due to me constantly forgetting that github treats my comments as escape characters....

@diemol
Copy link
Contributor

diemol commented Mar 26, 2018

Yes, I think I wrote something similar in Slack. I believe the change comes from upstream, because Apache Http is being replaced in different parts with OkHttp.
It seems that Apache Http retries connections automatically and you need to explicitly set that in OkHttp, we need to check a bit more to see if that is the case, and if so, send a PR upstream.

I'll keep you posted.

@matthew-horrocks
Copy link

@borisivan sorry, I completely missed your comment at #491 (comment)

Yes, all tests run in parallel, using all available containers. The error seems to occur at a seemingly random point through a test; unfortunately I haven't been able to match the failure time to a browser in another container being torn down.

@borisivan
Copy link
Author

@diemol, regarding the "It seems that Apache Http retries connections automatically and you need to explicitly set that in OkHttp" -- if I understand correctly, you'll be looking at that configuration change re: the okHttp usage, sounds good.

But I was trying to understand more about the problem -- if this is happening mid-test like @matthew-horrocks and I are seeing, which connection is this referring to, and does this mean that a connection that was previously established is getting disconnected when we're not expecting it? Sounds like the 'retry connection' might help something keep moving along if it were to get disconnected, but I'm looking to understand what's getting disconnected, and the typical reason why. That way, maybe I can reduce the likelihood of occurrence via other change.

I was hoping to understand which connection is getting broken -- do we know if it's the TCP connection to the webserver under test? Or is this a TCP connection between zalenium to elgalu (container to container), etc? Knowing which might also help us understand if there is any tuning re: keepalive / intervals that can be tweaked that would help reduce the likelihood of occurrence.

@diemol
Copy link
Contributor

diemol commented Apr 1, 2018

@borisivan @matthew-horrocks, a couple of questions:

  • To understand a bit more the situation, are you both testing with root CA certificates added to the containers?
  • It is not 100% clear for me if the error happens when the test starts, or when the page under test is being loaded, do you have an idea around when the problem shows up?
  • Is it possible to see the Zalenium log in debug mode when the bug happens? If not, at least in normal mode?
  • Are you always using a fixed number of containers and a big number of maxTestSessions? Or are containers being created on the fly as tests are executed?

Regarding the retries parameter in OkHttp, I was checking the Selenium source code, and the OkHttp builder has by default the retry parameter set to true. So it might be that @borisivan is right, that this is the same stack trace we were getting before, but just that OkHttp prompts a readable message.

Based on the feedback we can get from the two previous questions, we can:

  • I found a way to use ApacheHttp instead of OkHttp, so we could configure that and release. If the problem goes away, we know that we can focus on tweaking OkHttp.
  • If the problem does not go 100% away, the problem might be that when a container is being created, it is not 100% ready (some system component didn't load completely) and then even if it registered to the hub it needs a bit more to be up completely. We could add a check for that if we are able to diagnose what has to be checked.

@diemol
Copy link
Contributor

diemol commented Apr 1, 2018

One more question, are you both running tests only on Chrome? Or more specifically, do the errors happen only when Chrome is used?

@borisivan
Copy link
Author

Hi @diemol, a couple of the questions might take some digging, but answering the ones I know the answers to now:

To understand a bit more the situation, are you both testing with root CA certificates added to the containers? I saw the problems before adding the root CA cert to the containers, and continue to see the problem now that I rebuilt with containers with them added -- no impact.

It is not 100% clear for me if the error happens when the test starts, or when the page under test is being loaded, do you have an idea around when the problem shows up? I think I saw it at initial URL get before, but not sure. Latest examples I've looked at are mid test -- initial URL loaded, many selenium steps performed successfully, can happen anywhere in the test.

Is it possible to see the Zalenium log in debug mode when the bug happens? If not, at least in normal mode? At one point I was saving the Zalenium log by default, let me see if I can re-enable that, and I can enable debug. But I wasn't obtaining the various logs from the elgalu containers. If I were to obtain the elgalu container, is there a way to enable that with debug too, or will the debug option only enable debug in the master zalenium container?

Are you always using a fixed number of containers and a big number of maxTestSessions? Or are containers being created on the fly as tests are executed? I certainly see it with 6 containers that have a large # of maxTestSessions, I don't run with the on-the-fly mode, but I can try that if needed.

I found a way to use ApacheHttp instead of OkHttp, so we could configure that and release. If the problem goes away, we know that we can focus on tweaking OkHttp: When I use the 3.10 version of the container, I do see occasional problems that I suspect might be just the different footprint of the same issue that's occurring, though it also happens less frequently. The footprint seen with the 3.11+ version of the container happens 4-5x more frequently (around once every 250 tests).

If the problem does not go 100% away, the problem might be that when a container is being created, it is not 100% ready (some system component didn't load completely) and then even if it registered to the hub it needs a bit more to be up completely. We could add a check for that if we are able to diagnose what has to be checked: problem happens mid-test, so I doubt that's the issue.

@diemol
Copy link
Contributor

diemol commented Apr 2, 2018

https://github.com/zalando/zalenium/releases/tag/3.11.0c was just released, there are some fixes that solve some NPE introduced when adding more logging in the 3.11.0b release. I was bumping into similar (but not the same one) stack traces when running a similar scenario as yours @borisivan.

After reading the source code in Selenium and comparing it to the reported stack trace, the error pops up in middle test, so it is not related to a container being ready or not. And it is unlikely related to be using Apache Http or OkHttp.

My only guess right now is a TSL handshake timeout. Since it works for most of the tests in the same container but not for some.

Before continuing, could you please try the latest release and see if we get a better error message?

@diemol
Copy link
Contributor

diemol commented Apr 2, 2018

Also, would it be possible to try it without root CA?

@sri85
Copy link

sri85 commented Apr 9, 2018

@diemol I am bumping into the issue as well.

I have 3 tests running in parallel each test taking less than 3 seconds to execute. I am randomly seeing issue this issue where the tests fail with the above mentioned error .

# Usage:
    #   docker-compose up --force-recreate
    version: '2.1'
    volumes:
        e2e-videos:
    
    services:
      #--------------#
      elagalu:
        image: "elgalu/selenium"

      zalenium:
        image: "dosel/zalenium:3.11.0d"
        container_name: zalenium
        hostname: zalenium
        privileged: true
        volumes:
          - e2e-videos:/home/seluser/videos
          - /var/run/docker.sock:/var/run/docker.sock
          - ./testfiles:/tmp/node/home/seluser/testfiles
        ports:
          - 4444:4444
        command: >
          start --desiredContainers 5
                --maxDockerSeleniumContainers 8
                --screenWidth 1920 --screenHeight 1080
                --timeZone "Europe/London"
                --videoRecordingEnabled true
                --sauceLabsEnabled false
                --browserStackEnabled false
                --testingBotEnabled false
                --startTunnel false
                --keepOnlyFailedTests true
     

This is how my docker compose file looks like

@pearj
Copy link
Collaborator

pearj commented May 11, 2018

This is not really a solution, but I discovered that you can switch the http client from okhttp back to apache via a system property, if somebody wants to try setting this environment variable on their zalenium container, then it'd be interesting to know if it's more stable:

ZALENIUM_EXTRA_JVM_PARAMS="-Dwebdriver.http.factory=apache"

@diemol says this isn't a long term solution as apache will be removed.

But thanks to @gsaimadhu on #568 he posted a more complete stack trace with a curious caused by:

Caused by: java.security.KeyStoreException: problem accessing trust storejava.io.IOException: Invalid keystore format
	at sun.security.ssl.TrustManagerFactoryImpl.engineInit(TrustManagerFactoryImpl.java:74)
	at javax.net.ssl.TrustManagerFactory.init(TrustManagerFactory.java:250)
	at okhttp3.OkHttpClient.systemDefaultTrustManager(OkHttpClient.java:284)

It seems when the okhttp client initialises it configures TLS (which we don't actually need) and randomly initialising TLS is failing because something is screwing up reading the java trust store.

Maybe it ran out of file handles or something strange like that.

It might be possible to cache the trust manager so that all instances of okhttp client use the same trust manager (since we don't actually even need to use it as communication to the nodes is not https).

@pearj
Copy link
Collaborator

pearj commented May 11, 2018

I found where the "Invalid keystore format" exception comes from: http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8-b132/sun/security/provider/JavaKeyStore.java#650

But it doesn't really help us, basically, it sounds like as it's reading the file the "magic" file headers aren't matching.

@gsaimadhu
Copy link

With this issue unresolved for so many weeks, the development team is losing confidence in automated testing because of regular false alarms from our qa side.
Till the issue is resolved at Zalenium, can Zalenium give workaround in options to enable the direct docker selenium to proceed our testing activities without this issue.

@borisivan
Copy link
Author

This has prevented me from using any new Zalenium builds as well, though we still use Zalenium. I provide the arguments to the docker run to force the specific version of Zalenium as well as the elgalu container to be the version that existed before they made this change (several months old at this point), and obviously don't see the TLS problem.

But I don't get the benefit of other bug fixes or latest versions of chrome/FF/selenium inside the container, either, which is a serious problem, but not as a serious as having 1-2 random tests fail each suite execution with the TLS error message.

@diemol
Copy link
Contributor

diemol commented May 14, 2018

@gsaimadhu @borisivan

I understand what you guys mean, but that is the current status. Of course we want to fix it, and if we had found what the exact reason is and how it could be fixed, we would have done it already.

Please also understand that we have jobs and other tasks, and therefore we cannot spend all the time we want in this.

With the recent info added in the duplicated issue we will investigate more about that stacktrace and hopefully find a fix.

@borisivan
Copy link
Author

Hi @diemol --understood. It also wouldn't surprise me if this ends up being something like what @pearj mentions re: file handles or some such resource, which might be more affected with the okHTTP library instead of apache, etc, and in reality an overall docker-host-infrastructure problem as opposed to any zalenium-specific code. But alas, the effect is pretty severe.

@gsaimadhu
Copy link

@diemol, i understand and appreciate you and your team initiatives and hardwork for community. My intention is not to put pressure on you. My intention is to request you an alternate way to unblock us with an option if possible to use direct docker selenium using Zalenium.

@diemol
Copy link
Contributor

diemol commented May 14, 2018

@gsaimadhu one alternative to try for now was to do what @pearj suggested in this comment

@gsaimadhu
Copy link

gsaimadhu commented May 15, 2018

Sure, I want to try this. But, need clarification on where and how to use this variable setting:ZALENIUM_EXTRA_JVM_PARAMS="-Dwebdriver.http.factory=apache"

Do i need to run as part of Zalenium command like below?
docker run --rm -t --name zalenium -p 4444:4444 -e ZALENIUM_EXTRA_JVM_PARAMS="-Dwebdriver.http.factory=apache" -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/videos:/home/seluser/videos --add-host "t1.l7tech.com:$ipa" --add-host "tenant2.l7tech.com:$ipa" --privileged dosel/zalenium start --desiredContainers 12 --maxDockerSeleniumContainers 30 --maxTestSessions 100 > ZaleniumStartLogs.log &

@gsaimadhu
Copy link

Thanks diemol. I have run the command as specified in the above comment.

This issue is NOT observed with the setting ZALENIUM_EXTRA_JVM_PARAMS="-Dwebdriver.http.factory=apache"

For your reference, i am attaching the Zalenium logs. Please go through and let me know if any further investigation required.
ZaleniumStartLogs.log

@diemol
Copy link
Contributor

diemol commented May 16, 2018

Thanks for the feedback @gsaimadhu!

I think that means that the issue is in the Zalenium container and not in the docker-selenium ones. We will keep checking there.

@borisivan Could you please try the same workaround to confirm?

@amalsgit
Copy link

I was facing the same issue and the workaround did the trick. I am not facing the TLS error now. @diemol

@diemol
Copy link
Contributor

diemol commented May 26, 2018

Thank you @amalsgit @gsaimadhu for confirming that using Apache makes the issue go away. Nevertheless, this is just a workaround since in the long term the Grid will use only OkHttp.

Therefore, I have been working on this one most of the day and I have not been able to figure out the root cause. The provided logs show that this is happening after a session has been created and using nodes that have already executed more than one test. This kind of discards that there is an issue with the containers or their configuration.

What I also tried was to run this code inside the containers to see what SSL supported and enabled protocols they have:

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;
import java.io.IOException;
import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;

public class ProtocolTest {

  public static void main (String[] args) throws NoSuchAlgorithmException, KeyManagementException, IOException {
    SSLContext context = SSLContext.getInstance("TLS");
    context.init(null,null,null);

    SSLSocketFactory factory = context.getSocketFactory();
    SSLSocket socket = (SSLSocket)factory.createSocket();

    String[] protocols = socket.getSupportedProtocols();

    System.out.println("Supported Protocols: " + protocols.length);
    for(int i = 0; i < protocols.length; i++)
    {
      System.out.println(" " + protocols[i]);
    }

    protocols = socket.getEnabledProtocols();

    System.out.println("Enabled Protocols: " + protocols.length);
    for(int i = 0; i < protocols.length; i++)
    {
      System.out.println(" " + protocols[i]);
    }
  }

}

And the result in both (Zalenium and docker-selenium) was:

Supported Protocols: 5
 SSLv2Hello
 SSLv3
 TLSv1
 TLSv1.1
 TLSv1.2
Enabled Protocols: 3
 TLSv1
 TLSv1.1
 TLSv1.2

Which confirms that the underlaying configuration looks fine.

I haven't been able to reproduce it constantly (which makes sense since the bug reports show that it does not happen always). So what we could try is to overwrite the getHttpClient method in the registry and retry there in case the error is present, do a release and wait for feedback.

Since I have not seen this bug reported anywhere else under similar conditions (OkHttp in docker) I am going into the direction of thinking that there is something not 100% clear and retry could work. I will prepare a PR and then we can release soon to get feedback.

@pearj
Copy link
Collaborator

pearj commented May 27, 2018

@diemol for retry you’d probably want to try an exponential backoff type scenario with a random wait, since it might be related to file handles you need to spread out the times that they are retrying.

But I think we could get a long way by reimplementing the okhttp client in the registry, but caching the ssl bits across all requests, as they are the ones that failed. From my quick research they appear thread safe. Plus the regular docker proxies won’t use ssl anyway. Only the cloud proxies I’m guessing? Or maybe not even them?

@diemol
Copy link
Contributor

diemol commented May 27, 2018

@pearj for the first approach I want to implement a very simple retry (3 times) with a random delay between them and see what feedback we get. If that actually solves the problem, then we can evolve it to something more sophisticated I would say.

The good thing is that I was able to reproduce it using brute force, by this I mean that I executed 100 times a small test suite with two threads and three tests per thread, starting Zalenium from scratch and stopping it afterwards. I got the TLS thing two times.

So I'll implement the retry and see what happens today.

@diemol
Copy link
Contributor

diemol commented May 27, 2018

About reimplementing the OkHttp client, I don't know, I would not like to divert too much from the upstream implementation unless it is really needed. If there is really something that needs to be handled, ideally we should implement it, try it, and then send a PR upstream.

@pearj
Copy link
Collaborator

pearj commented May 27, 2018

@diemol yeah fair enough. I think the main thing we need to check is that SSL to different hosts still works after changing the OkHttp client, although I'm not immediately sure how to run a selenium node under SSL.

@pearj
Copy link
Collaborator

pearj commented May 27, 2018

Hmm, having a look at caching the trustmanager and the ssl connection factory. It would be a bit fiddly, or we'd have to copy/paste init code out of okhttp itself. Can't see the selenium guys accepting something like that...

Maybe we need more information about this problem, ie if @amalsgit @gsaimadhu have unusually low file limits set for docker.

@diemol you said that you'd seen this error before too?

@pearj
Copy link
Collaborator

pearj commented May 27, 2018

Hmm, I think one way to reuse the http client is to use an existing http client to build another one. That might be the way to do it.

https://github.com/square/okhttp/blob/db18f0e404ab35c7b23e8ecc27913d717e96eb84/okhttp/src/main/java/okhttp3/OkHttpClient.java#L234

@pearj
Copy link
Collaborator

pearj commented May 27, 2018

Ahh, you can grab the factories out of an existing client.

I guess that'd be another way to do it. You'd have a build an initial client, and then grab the sslSocketFactory and certificateChainCleaner and cache it.

@diemol are the selenium guys likely to accept something we don't know how to reproduce, ie, we say we're fixing this "No System TLS" problem, so we refactor OkHttpClient usage a bit, but we can't reproduce the original problem? Although they could obviously verify that we haven't broken selenium by our refactor.

@diemol
Copy link
Contributor

diemol commented May 27, 2018

@pearj I was able to reproduce just by running lots of tests, after executing ~500 tests I get the issue once or twice.

The implemented retry is very naive, just catches the AssertionError, logs it and tries again up to 3 times, after that, it gets thrown.

So I executed the ~500 tests a couple of times more, with a local version of Zalenium with the retry, and I can see how the issue pops up but the retry handles it and by the 2nd retry it goes well. I'll run the tests a couple of times more and then make a PR is things keep going well.

@diemol
Copy link
Contributor

diemol commented May 27, 2018

Just created #597 which seems to solve the issue with a very simple retry. If things go well, we'll merge and release soon so you guys can check it.

@pearj
Copy link
Collaborator

pearj commented May 27, 2018

@diemol interesting, how many concurrent threads did you have running? I presume this was docker and not minikube?

Also, just read up a bit and noticed you said you could reproduce before, missed that message, whoops!

@pearj
Copy link
Collaborator

pearj commented May 27, 2018

I wonder if this issue is reproducible on the official docker selenium containers?

@diemol
Copy link
Contributor

diemol commented May 28, 2018

@pearj I only had two threads with 3 tests each one, basically I ran the tests we execute when we want to release elgalu/selenium.

The retry is being logged like this:
Successfully created HttpClient for url http://172.17.0.6:40002/wd/hub, after attempt #2.

I can try to reproduce only with the containers in the next days. Could you have a look at #597 please?

diemol added a commit that referenced this issue May 30, 2018
diemol added a commit that referenced this issue May 30, 2018
Adding retry to getHttpClient when "No System TLS" pops up #491
@diemol
Copy link
Contributor

diemol commented May 31, 2018

Hi all,

This has been fixed in the latest release, the workaround should not be needed anymore. Please try the https://github.com/zalando/zalenium/releases/tag/3.12.0c release and let us know how it went.

@DowneyTung
Copy link

DowneyTung commented Oct 8, 2018

@diemol I am still facing this issue when i run my test in parallel using the either 3.12.0c or the latest zalenium image on MacOS High Sierra version 1-.13.6
Here is how my zalenium setup looks like in docker-compose file:
version: "3" services: zalenium: image: "dosel/zalenium:3.12.0c" container_name: zalenium hostname: zalenium tty: true volumes: - '/tmp/videos:/home/seluser/videos' - '/var/run/docker.sock:/var/run/docker.sock' - '/usr/local/bin/docker:/usr/local/bin/docker' ports: - 4444:4444 command: > start --chromeContainers 4 --firefoxContainers 0 --maxDockerSeleniumContainers 6 --maxTestSessions 200 --screenWidth 1200 --screenHeight 900 --timeZone "America/New_York" --videoRecordingEnabled false --sauceLabsEnabled false --browserStackEnabled false --testingBotEnabled false --startTunnel false environment: - HOST_UID - HOST_GID - SAUCE_USERNAME - SAUCE_ACCESS_KEY - BROWSER_STACK_USER - BROWSER_STACK_KEY - TESTINGBOT_KEY - TESTINGBOT_SECRET

And I have actually about 4 tests running in parallel and about 50 tests running in total. The error also happened randomly during the test, about 1 to 3 tests would randomly fail when I run the test.
Here is the stack trace for the test,
2018-10-08 17:23:11,855 root INFO Navigate to: 'https://www.google.com/'__1ee903ef 2018-10-08 17:23:11,857 selenium.webdriver.remote.remote_connection DEBUG GET http://192.168.208.2:4444/wd/hub/session/f50bdba3f99c0e00e766dbced716118d/url {"sessionId": "f50bdba3f99c0e00e766dbced716118d"} 2018-10-08 17:23:11,857 urllib3.connectionpool DEBUG Starting new HTTP connection (1): 192.168.208.2:4444 2018-10-08 17:23:12,907 urllib3.connectionpool DEBUG http://192.168.208.2:4444 "GET /wd/hub/session/f50bdba3f99c0e00e766dbced716118d/url HTTP/1.1" 500 11367 2018-10-08 17:23:12,908 selenium.webdriver.remote.remote_connection DEBUG Finished Request 2018-10-08 17:23:12,911 root INFO Take screenshot error 2018-10-08 17:23:12,915 selenium.webdriver.remote.remote_connection DEBUG GET http://192.168.208.2:4444/wd/hub/session/f50bdba3f99c0e00e766dbced716118d/screenshot {"sessionId": "f50bdba3f99c0e00e766dbced716118d"} 2018-10-08 17:23:12,916 urllib3.connectionpool DEBUG Starting new HTTP connection (1): 192.168.208.2:4444 2018-10-08 17:23:17,109 urllib3.connectionpool DEBUG http://192.168.208.2:4444 "GET /wd/hub/session/f50bdba3f99c0e00e766dbced716118d/screenshot HTTP/1.1" 200 1092162 2018-10-08 17:23:17,194 selenium.webdriver.remote.remote_connection DEBUG Finished Request 2018-10-08 17:23:17,209 root INFO error__ccc2846b 2018-10-08 17:23:17,250 root ERROR An error ocurred: Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/selenium-3.14.1-py3.6.egg/selenium/webdriver/remote/webdriver.py", line 669, in current_url return self.execute(Command.GET_CURRENT_URL)['value'] File "/usr/local/lib/python3.6/site-packages/selenium-3.14.1-py3.6.egg/selenium/webdriver/remote/webdriver.py", line 321, in execute self.error_handler.check_response(response) File "/usr/local/lib/python3.6/site-packages/selenium-3.14.1-py3.6.egg/selenium/webdriver/remote/errorhandler.py", line 242, in check_response raise exception_class(message, screen, stacktrace) selenium.common.exceptions.WebDriverException: Message: cannot forward the request unexpected end of stream on Connection{192.168.208.6:40002, proxy=DIRECT hostAddress=/192.168.208.6:40002 cipherSuite=none protocol=http/1.1} Stacktrace: at org.openqa.grid.web.servlet.handler.RequestHandler.process (RequestHandler.java:143) at org.openqa.grid.web.servlet.DriverServlet.process (DriverServlet.java:86) at org.openqa.grid.web.servlet.DriverServlet.doGet (DriverServlet.java:64) at javax.servlet.http.HttpServlet.service (HttpServlet.java:687) at javax.servlet.http.HttpServlet.service (HttpServlet.java:790) at org.seleniumhq.jetty9.servlet.ServletHolder.handle (ServletHolder.java:860) at org.seleniumhq.jetty9.servlet.ServletHandler$CachedChain.doFilter (ServletHandler.java:1650) at io.prometheus.client.filter.MetricsFilter.doFilter (MetricsFilter.java:170) at org.seleniumhq.jetty9.servlet.ServletHandler$CachedChain.doFilter (ServletHandler.java:1637) at org.seleniumhq.jetty9.servlet.ServletHandler.doHandle (ServletHandler.java:533) at org.seleniumhq.jetty9.server.handler.ScopedHandler.handle (ScopedHandler.java:143) at org.seleniumhq.jetty9.security.SecurityHandler.handle (SecurityHandler.java:548) at org.seleniumhq.jetty9.server.handler.HandlerWrapper.handle (HandlerWrapper.java:132) at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle (ScopedHandler.java:190) at org.seleniumhq.jetty9.server.session.SessionHandler.doHandle (SessionHandler.java:1595) at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle (ScopedHandler.java:188) at org.seleniumhq.jetty9.server.handler.ContextHandler.doHandle (ContextHandler.java:1253) at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope (ScopedHandler.java:168) at org.seleniumhq.jetty9.servlet.ServletHandler.doScope (ServletHandler.java:473) at org.seleniumhq.jetty9.server.session.SessionHandler.doScope (SessionHandler.java:1564) at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope (ScopedHandler.java:166) at org.seleniumhq.jetty9.server.handler.ContextHandler.doScope (ContextHandler.java:1155) at org.seleniumhq.jetty9.server.handler.ScopedHandler.handle (ScopedHandler.java:141) at org.seleniumhq.jetty9.server.handler.StatisticsHandler.handle (StatisticsHandler.java:169) at org.seleniumhq.jetty9.server.handler.HandlerWrapper.handle (HandlerWrapper.java:132) at org.seleniumhq.jetty9.server.Server.handle (Server.java:530) at org.seleniumhq.jetty9.server.HttpChannel.handle (HttpChannel.java:347) at org.seleniumhq.jetty9.server.HttpConnection.onFillable (HttpConnection.java:256) at org.seleniumhq.jetty9.io.AbstractConnection$ReadCallback.succeeded (AbstractConnection.java:279) at org.seleniumhq.jetty9.io.FillInterest.fillable (FillInterest.java:102) at org.seleniumhq.jetty9.io.ChannelEndPoint$2.run (ChannelEndPoint.java:124) at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.doProduce (EatWhatYouKill.java:247) at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.produce (EatWhatYouKill.java:140) at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.run (EatWhatYouKill.java:131) at org.seleniumhq.jetty9.util.thread.ReservedThreadExecutor$ReservedThread.run (ReservedThreadExecutor.java:382) at org.seleniumhq.jetty9.util.thread.QueuedThreadPool.runJob (QueuedThreadPool.java:708) at org.seleniumhq.jetty9.util.thread.QueuedThreadPool$2.run (QueuedThreadPool.java:626) at java.lang.Thread.run (Thread.java:748)

Looks like there is 500 error when I try to connect to hub, so the request can not be forwarded to the node container....I am not sure why this happened...let me know if you need more info for debugging this...

@pschroeder89
Copy link

I wonder if this issue is reproducible on the official docker selenium containers?

I am seeing this with the standard Docker Selenium images (I've tried with 3.13.0 and 3.14.0). Can I pass the Apache Dwebdriver flag to the standard images, or is that just an option with Zalenium?
Alternatively, could you add the retry logic to the standard Selenium images @diemol?

@diemol
Copy link
Contributor

diemol commented Dec 6, 2018

@pschroeder89 yes, you can pass that option since it is a standard Selenium thing. Bare in mind that 3.14.0 is the last version where the Apache Http Client is present.
So, if you are bumping into that error with 3.141.59, you should create an issue in the main Selenium repo with a reproducible scenario.

@pschroeder89
Copy link

Got it, thanks @diemol!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants