
Errors "code 1006 is reserved" have appeared recently in OSIO Che logs #672

Closed
l0rd opened this issue May 11, 2018 · 15 comments

Comments

@l0rd
Contributor

l0rd commented May 11, 2018

{"@timestamp":"2018-05-09T13:06:09.233+00:00","@version":1,"message":"Current reconnect backoff is 1000 milliseconds (T0)","logger_name":"io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager","thread_name":"scheduleReconnect|Executor for Watch 191281878","level":"INFO","level_value":20000}
{"@timestamp":"2018-05-09T13:06:16.186+00:00","@version":1,"message":"Exec Failure","logger_name":"io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager","thread_name":"OkHttp https://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com/...","level":"WARN","level_value":30000,"stack_trace":"java.net.ProtocolException:  Code 1006 is reserved and may not be used.\n\tat okhttp3.internal.ws.WebSocketReader.readControlFrame(WebSocketReader.java:202)\n\tat okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:99)\n\tat okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)\n\tat okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)\n\tat okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)\n\tat okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"}


@l0rd
Contributor Author

l0rd commented May 11, 2018

We should investigate this during the current sprint.

@davidfestal
Collaborator

From what I can see in the Che server Java code, the error occurs during the call to io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.runWatch(), because it receives a 1006 status code during the WebSocket communication. This is a reserved code indicating that the connection was closed abnormally (see https://tools.ietf.org/html/rfc6455#section-7.4.1).
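For context, here is a minimal sketch (not taken from the Che server, which uses OkHttp on the Java side) of how the same situation surfaces in a Go client using the gorilla/websocket library, the same library used in the reproduction program later in this thread. Per RFC 6455, code 1006 must never appear in a Close frame on the wire; client libraries report it locally when the connection drops without a proper close handshake, which is why OkHttp rejects a frame that explicitly carries it. The readLoop helper name below is illustrative:

package wsdebug

import (
	"log"

	"github.com/gorilla/websocket"
)

// readLoop reads from an already-established WebSocket connection and shows
// how close code 1006 (CloseAbnormalClosure) is reported on the Go side.
func readLoop(c *websocket.Conn) {
	for {
		_, msg, err := c.ReadMessage()
		if err != nil {
			if websocket.IsCloseError(err, websocket.CloseAbnormalClosure) {
				// The connection dropped without a close handshake: the 1006 case.
				log.Printf("abnormal closure (1006): %v", err)
			} else if websocket.IsUnexpectedCloseError(err, websocket.CloseNormalClosure, websocket.CloseGoingAway) {
				log.Printf("unexpected close: %v", err)
			} else {
				log.Printf("read error: %v", err)
			}
			return
		}
		log.Printf("received %d bytes", len(msg))
	}
}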

Corresponding errors can be found in the OSO proxy, as shown by the POD logs and Kibana logs.

Further investigation on prod-preview showed that the errors started just a few minutes after the latest deployment of the OSO proxy.

What we can see on production seems consistent: the last POD (f8osoproxy-17-k8w2p) deployed from the same commit started on May 2nd 2018 at 23:20, and the first code 1006 error occurred on May 2nd 2018 at 23:27.

FTR the same error message is mentioned in several issues opened recently against traefik about WebSocket communication failures, such as traefik/traefik#2828 and traefik/traefik#3165.

@aslakknutsen @nurali-techie Could it be that some changes in commit fabric8-services/fabric8-oso-proxy@d3dd6ee introduced this error? For example, some changes that would prevent WebSocket connections from being correctly received / taken into account?

@nurali-techie

@davidfestal I am looking into it and will update my findings here soon.

@aslakknutsen

@davidfestal @nurali-techie Is the 1006 code something you see on the client side, or something that comes from the server? It would make sense that if the proxy is redeployed you lose the connection to the pod you were connected to and need to reconnect. A bit unsure if you see the failure on reconnect or before. If before, then it sounds like the connection was just dropped server side. If after, could it be trying some 'half way reconnect', assuming some server-side state that is no longer there?

@aslakknutsen

@davidfestal @nurali-techie While possible, I think it's unlikely that it's related to that specific commit on the proxy side. I think it just happens to kinda correlate with the 2 weeks of logs we have.

@davidfestal
Collaborator

It would make sense that if the proxy is redeployed you lose the connection to the pod you were connected to and need to reconnect. A bit unsure if you see the failure on reconnect or before. If before, then it sounds like the connection was just dropped server side. If after, could it be trying some 'half way reconnect', assuming some server-side state that is no longer there?

@aslakknutsen it doesn't seem to be the root cause of the problem, if I understand correctly: the prod-preview Che server was redeployed just yesterday while the oso proxy was still running, and the error came back rapidly in the Che server logs (I would say as soon as a workspace is started and the Che server tries to keep a long-running WebSocket connection with it). This seems to mean that the error is not caused by the Che server losing its connection due to a restart of the oso proxy. But tell me if I have misunderstood your point.

While possible, I think it's unlikely that it's related to that specific commit on the proxy side. I think it just happens to kinda correlate with the 2 weeks of logs we have.

I didn't just use the logs, but also precise Kibana queries and the dates of the OpenShift deployments of both the OSO proxy and the Che server. The errors started happening precisely a few minutes after the deployment of this OSO proxy commit, on both production and prod-preview.

Shouldn't we find a way to monitor WebSocket connections between the Che server and the OSO proxy? Something like running an additional container with tcpdump and analyzing the result with Wireshark...

Or is there a way to enable more debugging on the oso proxy side to dump detailed WebSocket traffic?

@nurali-techie

@davidfestal yesterday this link was showing some logs on 1-May-2018. Today it's not showing any record; maybe Kibana keeps only the last 15 days of logs. But we are quite sure that the issue is not with that git commit, as we have observed older logs, from before the commit, with the same issue.

To cross-check that Kibana only keeps 15 days of logs, see this link: it has no filter other than the date range "30-apr to 1-may", and no records are shown.

@davidfestal
Collaborator

@nurali-techie anyway, as detailed in my previous comment and from what I see in the Java code of the Che server, I don't think it is related to the Che server losing the connection to the oso proxy because of an oso proxy restart. Afaict, there is a WebSocket communication that cannot be established correctly between the two, and the Che server schedules a reconnect again and again on its side. Is there any option to enable more logs / traces of WebSocket communications on the proxy side?

@nurali-techie

nurali-techie commented May 17, 2018

@davidfestal we were able to reproduce this error with a small program that calls the OSO WebSocket API directly (no oso-proxy in between). It seems that the OSO server gives this error to the oso-proxy, which logs it (that is what we observed in Kibana) and returns it to the client (in your case the Che server).

Here is that small program.
OSO WS Client:

package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
	"os"
	"time"

	"github.com/gorilla/websocket"
)

// Help: example run with arguments
// go run main.go api.starter-us-east-2a.openshift.com nvirani-preview xxxx
func main() {
	osoServer := os.Args[1]    // api.starter-us-east-2a.openshift.com
	osoNamespace := os.Args[2] // nvirani-preview
	osoToken := os.Args[3]

	addr := osoServer
	path := fmt.Sprintf("/oapi/v1/namespaces/%s/buildconfigs", osoNamespace) // the API that shows the error (see output below)
	// path := fmt.Sprintf("/oapi/v1/namespaces/%s/builds", osoNamespace)   // this API works fine (see later comments)

	u := url.URL{Scheme: "wss", Host: addr, Path: path, RawQuery: "watch=true"}
	reqHeader := http.Header{
		"Authorization": {fmt.Sprintf("Bearer %s", osoToken)},
		"Origin":        {"localhost:8080"},
	}

	log.Printf("Connecting to url:%s", u.String())
	c, _, err := websocket.DefaultDialer.Dial(u.String(), reqHeader)
	if err != nil {
		log.Fatal("dial:", err)
	}
	defer c.Close()

	// Read watch events until the connection drops, timing each read.
	for i := 0; ; i++ {
		start := time.Now()
		_, msg, err := c.ReadMessage()
		if err != nil {
			log.Printf("Got err:%s, after=%f sec", err, time.Since(start).Seconds())
			break
		}
		// log.Printf("msg-%d:%s", (i + 1), string(msg))
		log.Printf("Got msg-%d, with len=%d, after=%f sec", (i + 1), len(msg), time.Since(start).Seconds())
	}
}

Run Command: go run main.go api.starter-us-east-2a.openshift.com nvirani-preview xxxx
Output:

2018/05/17 13:01:26 Connecting to url:wss://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/nvirani-preview/buildconfigs?watch=true
2018/05/17 13:01:28 Got msg-1, with len=1135, after=0.000074 sec
2018/05/17 13:01:28 Got msg-2, with len=1121, after=0.000017 sec
2018/05/17 13:01:28 Got msg-3, with len=1121, after=0.000028 sec
2018/05/17 13:01:28 Got msg-4, with len=1121, after=0.000044 sec
2018/05/17 13:08:08 Got err:websocket: close 1006 (abnormal closure): unexpected EOF, after=399.570379 sec

This program calls one of the OSO WebSocket APIs and calls c.ReadMessage() in a loop. After getting 4 messages, the call to c.ReadMessage() blocks for ~400 seconds and then returns the same error: websocket: close 1006 (abnormal closure): unexpected EOF

Conclusion:
The OSO WebSocket API returns 'close 1006', which the oso-proxy propagates to the Che server. We should reach out to the OSO team (i.e. the OpenShift team) to find out the details of this behavior.

@nurali-techie

nurali-techie commented May 17, 2018

@davidfestal I have tried another OSO WebSocket API which doesn't give this close 1006 error and waits on c.ReadMessage() indefinitely.
API Call 2 - wss://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/nvirani-preview/builds?watch=true
It would be great if you could find out which OSO WebSocket API your Che server calls.

@riuvshin The issue seems to be with a specific OpenShift WebSocket API, not all of them.

@riuvshin
Contributor

riuvshin commented May 17, 2018

@nurali-techie so it does not work properly with wss://api.starter-us-east-2a.openshift.com, and which one was working properly for you?
cc: @jfchevrette

@nurali-techie

nurali-techie commented May 17, 2018

@riuvshin
API with the issue (returns the error) - wss://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/nvirani-preview/buildconfigs?watch=true

API that works - wss://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/nvirani-preview/builds?watch=true

In one line: the 'buildconfigs' API gives the error while the 'builds' API works.

@davidfestal
Collaborator

@riuvshin @nurali-techie I debugged the prod-preview Che server to find out the context in which this error occurs, and more precisely which OpenShift resource the Che server is watching when it happens.

It showed that this error only occurs in a very specific use case, related to a specific type of POD that is started asynchronously to clean up obsolete directories on the PV (with an rm -rf command) after the user has deleted a workspace from the Che server.

When looking into this specific code and debugging it, it seems to me that there might be something incorrect in the way the POD is created, run, watched, and waited for. This ultimately results in the POD watcher trying to reconnect its WebSocket again and again to a POD that has in fact already been deleted: a sort of race condition related to the fact that these PODs run for a very short amount of time.
This behavior could easily explain the 1006 code that we receive.

I'll investigate more to find out where in the implementation the bug that leads to this situation is.
However, I can already say that I now consider this to be a Che server problem.

@jfchevrette
Contributor

Could this be a problem with the proxy? I'm told that k8s/openshift has no code that returns this to the client.

Ref https://bugzilla.redhat.com/show_bug.cgi?id=1579965

@davidfestal
Collaborator

Let me summarize the root cause of this error and the fix: the error was happening only when the PV-cleaning POD could not be created (presumably due to quotas). In this case, the Che server was still trying to delete the POD, and thus also endlessly trying to watch for the end of a POD deletion that in fact never occurred. This is the root cause of the regular WebSocket reconnect attempts that always led to the 1006 code.

The fix, provided by PR eclipse-che/che#9932, adds more robustness to the POD deletion method inside the Che server.
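For illustration only, here is a minimal sketch of that robustness pattern written in Go with client-go; the actual fix in eclipse-che/che#9932 lives in the Che server's Java code built on the fabric8 Kubernetes client. The idea: check whether the POD still exists before watching for its deletion, bound the watch with a timeout, and always stop the watcher so it cannot keep reconnecting forever. Names such as waitForPodDeletion are illustrative, not from the Che code base.

package podcleanup

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// waitForPodDeletion waits until the named POD is gone, but never watches a
// POD that does not exist and never leaves the watch open indefinitely.
func waitForPodDeletion(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	// If the POD was never created (e.g. because of quotas) or is already
	// gone, there is nothing to watch for: return immediately instead of
	// scheduling endless WebSocket reconnects.
	if _, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{}); apierrors.IsNotFound(err) {
		return nil
	} else if err != nil {
		return err
	}

	w, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return err
	}
	defer w.Stop() // always close the underlying watch connection

	deadline := time.After(timeout)
	for {
		select {
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before POD %s/%s was deleted", ns, name)
			}
			if ev.Type == watch.Deleted {
				return nil
			}
		case <-deadline:
			return fmt.Errorf("timed out waiting for POD %s/%s to be deleted", ns, name)
		}
	}
}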

davidfestal added a commit to eclipse-che/che that referenced this issue Jun 5, 2018
Watch connection manager never closed when trying to delete a non-existing POD (#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>
riuvshin pushed a commit to eclipse-che/che that referenced this issue Jun 6, 2018
Watch connection manager never closed when trying to delete a non-existing POD (#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>
@l0rd l0rd mentioned this issue Jun 6, 2018
davidfestal added a commit to davidfestal/che that referenced this issue Jun 14, 2018
Watch connection manager never closed when trying to delete a non-existing POD (eclipse-che#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>
riuvshin pushed a commit to eclipse-che/che that referenced this issue Jun 14, 2018
* Watch connection manager never closed when trying to delete a non-existing POD (#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>

* CHE-5918 Add an ability to interrupt Kubernetes/OpenShift runtime start

Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>

* CHE-5918 Add checking of start interruption by KubernetesBootstrapper

It is needed to avoid '403 Pod doesn't exists' errors, which happen when the start is interrupted while any of the machines is in the bootstrapping phase. As a result, a connection leak happens. TODO: create an issue for fabric8-client

* Improve ExecWatchdog to rethrow exception when it occurred while establishing a WebSocket connection

* CHE-5918 Fix K8s/OS runtime start failing when an unrecoverable event occurs
hbhargav pushed a commit to hbhargav/che that referenced this issue Dec 5, 2018
Watch connection manager never closed when trying to delete a non-existing POD (eclipse-che#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>