
Errors "code 1006 is reserved" have appeared recently in OSIO Che logs #672

Closed
l0rd opened this issue May 11, 2018 · 15 comments

Comments

@l0rd
Contributor

l0rd commented May 11, 2018

{"@timestamp":"2018-05-09T13:06:09.233+00:00","@version":1,"message":"Current reconnect backoff is 1000 milliseconds (T0)","logger_name":"io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager","thread_name":"scheduleReconnect|Executor for Watch 191281878","level":"INFO","level_value":20000}
{"@timestamp":"2018-05-09T13:06:16.186+00:00","@version":1,"message":"Exec Failure","logger_name":"io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager","thread_name":"OkHttp https://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com/...","level":"WARN","level_value":30000,"stack_trace":"java.net.ProtocolException:  Code 1006 is reserved and may not be used.\n\tat okhttp3.internal.ws.WebSocketReader.readControlFrame(WebSocketReader.java:202)\n\tat okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:99)\n\tat okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)\n\tat okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)\n\tat okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)\n\tat okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"}


@l0rd
Contributor Author

l0rd commented May 11, 2018

We should investigate this during the current sprint.

@davidfestal
Collaborator

From what I can see in the Che server Java code, the error occurs during the call to io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.runWatch(), because it receives a 1006 status code during the WebSocket communication. This is a reserved code indicating that the connection was closed abnormally (see https://tools.ietf.org/html/rfc6455#section-7.4.1).
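For context, here is a minimal sketch (not taken from the Che server, which uses OkHttp on the Java side) of how the same situation surfaces in a Go client using the gorilla/websocket library, the same library used in the reproduction program later in this thread. Per RFC 6455, code 1006 must never appear in a Close frame on the wire; client libraries report it locally when the connection drops without a proper close handshake, which is why OkHttp rejects a frame that explicitly carries it. The readLoop helper name below is illustrative:

package wsdebug

import (
	"log"

	"github.com/gorilla/websocket"
)

// readLoop reads from an already-established WebSocket connection and shows
// how close code 1006 (CloseAbnormalClosure) is reported on the Go side.
func readLoop(c *websocket.Conn) {
	for {
		_, msg, err := c.ReadMessage()
		if err != nil {
			if websocket.IsCloseError(err, websocket.CloseAbnormalClosure) {
				// The connection dropped without a close handshake: the 1006 case.
				log.Printf("abnormal closure (1006): %v", err)
			} else if websocket.IsUnexpectedCloseError(err, websocket.CloseNormalClosure, websocket.CloseGoingAway) {
				log.Printf("unexpected close: %v", err)
			} else {
				log.Printf("read error: %v", err)
			}
			return
		}
		log.Printf("received %d bytes", len(msg))
	}
}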

Corresponding errors can be found in the OSO proxy, as shown by the POD logs and Kibana logs.

Further investigation on prod-preview showed that the errors started just a few minutes after the latest deployment of the OSO proxy.

What we can see on production seems consistent: the last POD (f8osoproxy-17-k8w2p) deployed from the same commit started on May 2nd 2018 at 23:20, and the first code 1006 error occurred on May 2nd 2018 at 23:27.

FTR the same error message is mentioned in several issues opened recently against traefik about WebSocket communication failures, such as traefik/traefik#2828 and traefik/traefik#3165.

@aslakknutsen @nurali-techie Could it be that some changes in commit fabric8-services/fabric8-oso-proxy@d3dd6ee introduced this error? For example, some changes that would prevent WebSocket connections from being correctly received / taken into account?

@nurali-techie

@davidfestal I am looking into it and will update my findings here soon.

@aslakknutsen

@davidfestal @nurali-techie Is the 1006 code something you see on the client side, or something that comes from the server? It would make sense that if the proxy is redeployed you lose the connection to the pod you were connected to and need to reconnect. A bit unsure if you see the failure on reconnect or before. If before, then it sounds like the connection was just dropped server side. If after, could it be trying some 'half way reconnect', assuming some server-side state that is no longer there?

@aslakknutsen

@davidfestal @nurali-techie While possible, I think it's unlikely that it's related to that specific commit on the proxy side. I think it just happens to kinda correlate with the 2 weeks of logs we have.

@davidfestal
Collaborator

It would make sense that if the proxy is redeployed you lose the connection to the pod you were connected to and need to reconnect. A bit unsure if you see the failure on reconnect or before. If before, then it sounds like the connection was just dropped server side. If after, could it be trying some 'half way reconnect', assuming some server-side state that is no longer there?

@aslakknutsen it doesn't seem to be the root cause of the problem, if I understand correctly: the prod-preview Che server was redeployed just yesterday while the oso proxy was still running, and the error came back rapidly in the Che server logs (I would say as soon as a workspace is started and the Che server tries to keep a long-running WebSocket connection with it). This seems to mean that the error is not caused by the Che server losing its connection due to a restart of the oso proxy. But tell me if I have misunderstood your point.

While possible, I think it's unlikely that it's related to that specific commit on the proxy side. I think it just happens to kinda correlate with the 2 weeks of logs we have.

I didn't just use the logs, but also precise Kibana queries and the dates of the OpenShift deployments of both the OSO proxy and the Che server. The errors started happening precisely a few minutes after the deployment of this OSO proxy commit, on both production and prod-preview.

Shouldn't we find a way to monitor WebSocket connections between the Che server and the OSO proxy? Something like running an additional container with tcpdump and analyzing the result with Wireshark...

Or is there a way to enable more debugging on the oso proxy side to dump detailed WebSocket traffic?

@nurali-techie

@davidfestal yesterday this link was showing some logs on 1-May-2018. Today it's not showing any record; maybe Kibana keeps only the last 15 days of logs. But we are quite sure that the issue is not with that git commit, as we have observed older logs, from before the commit, with the same issue.

To cross-check that Kibana only keeps 15 days of logs, see this link: it has no filter other than the date range "30-apr to 1-may", and no records are shown.

@davidfestal
Collaborator

@nurali-techie anyway, as detailed in my previous comment and from what I see in the Java code of the Che server, I don't think it is related to the Che server losing the connection to the oso proxy because of an oso proxy restart. Afaict, there is a WebSocket communication that cannot be established correctly between the two, and the Che server schedules a reconnect again and again on its side. Is there any option to enable more logs / traces of WebSocket communications on the proxy side?

@nurali-techie

nurali-techie commented May 17, 2018

@davidfestal we were able to reproduce this error with a small program that calls the OSO WebSocket API directly (no oso-proxy in between). It seems that the OSO server gives this error to the oso-proxy, which logs it (that is what we observed in Kibana) and returns it to the client (in your case the Che server).

Here is that small program.
OSO WS Client:

package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
	"os"
	"time"

	"github.com/gorilla/websocket"
)

// Help: example run with arguments
// go run main.go api.starter-us-east-2a.openshift.com nvirani-preview xxxx
func main() {
	osoServer := os.Args[1]    // api.starter-us-east-2a.openshift.com
	osoNamespace := os.Args[2] // nvirani-preview
	osoToken := os.Args[3]

	addr := osoServer
	path := fmt.Sprintf("/oapi/v1/namespaces/%s/buildconfigs", osoNamespace) // the API that shows the error (see output below)
	// path := fmt.Sprintf("/oapi/v1/namespaces/%s/builds", osoNamespace)   // this API works fine (see later comments)

	u := url.URL{Scheme: "wss", Host: addr, Path: path, RawQuery: "watch=true"}
	reqHeader := http.Header{
		"Authorization": {fmt.Sprintf("Bearer %s", osoToken)},
		"Origin":        {"localhost:8080"},
	}

	log.Printf("Connecting to url:%s", u.String())
	c, _, err := websocket.DefaultDialer.Dial(u.String(), reqHeader)
	if err != nil {
		log.Fatal("dial:", err)
	}
	defer c.Close()

	// Read watch events until the connection drops, timing each read.
	for i := 0; ; i++ {
		start := time.Now()
		_, msg, err := c.ReadMessage()
		if err != nil {
			log.Printf("Got err:%s, after=%f sec", err, time.Since(start).Seconds())
			break
		}
		// log.Printf("msg-%d:%s", (i + 1), string(msg))
		log.Printf("Got msg-%d, with len=%d, after=%f sec", (i + 1), len(msg), time.Since(start).Seconds())
	}
}

Run Command: go run main.go api.starter-us-east-2a.openshift.com nvirani-preview xxxx
Output:

2018/05/17 13:01:26 Connecting to url:wss://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/nvirani-preview/buildconfigs?watch=true
2018/05/17 13:01:28 Got msg-1, with len=1135, after=0.000074 sec
2018/05/17 13:01:28 Got msg-2, with len=1121, after=0.000017 sec
2018/05/17 13:01:28 Got msg-3, with len=1121, after=0.000028 sec
2018/05/17 13:01:28 Got msg-4, with len=1121, after=0.000044 sec
2018/05/17 13:08:08 Got err:websocket: close 1006 (abnormal closure): unexpected EOF, after=399.570379 sec

This program calls one of the OSO WebSocket APIs and calls c.ReadMessage() in a loop. After getting 4 messages, the call to c.ReadMessage() blocks for ~400 seconds and then returns the same error: websocket: close 1006 (abnormal closure): unexpected EOF

Conclusion:
The OSO WebSocket API returns 'close 1006', which the oso-proxy propagates to the Che server. We should reach out to the OSO team (i.e. the OpenShift team) to find out the details of this behavior.

@nurali-techie

nurali-techie commented May 17, 2018

@davidfestal I have tried another OSO WebSocket API which doesn't give this close 1006 error and waits on c.ReadMessage() indefinitely.
API Call 2 - wss://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/nvirani-preview/builds?watch=true
It would be great if you could find out which OSO WebSocket API your Che server calls.

@riuvshin The issue seems to be with a specific OpenShift WebSocket API, not all of them.

@riuvshin
Contributor

riuvshin commented May 17, 2018

@nurali-techie so it does not work properly with wss://api.starter-us-east-2a.openshift.com, and which one was working properly for you?
cc: @jfchevrette

@nurali-techie

nurali-techie commented May 17, 2018

@riuvshin
API with the issue (returns the error) - wss://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/nvirani-preview/buildconfigs?watch=true

API that works - wss://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/nvirani-preview/builds?watch=true

In one line: the 'buildconfigs' API gives the error while the 'builds' API works.

@davidfestal
Collaborator

@riuvshin @nurali-techie I debugged the prod-preview Che server to find out the context in which this error occurs, and more precisely which OpenShift resource the Che server is watching when it happens.

It showed that this error only occurs in a very specific use case, related to a specific type of POD that is started asynchronously to clean up obsolete directories on the PV (with an rm -rf command) after the user has deleted a workspace from the Che server.

When looking into this specific code and debugging it, it seems to me that there might be something incorrect in the way the POD is created, run, watched, and waited for. This ultimately results in the POD watcher trying to reconnect its WebSocket again and again to a POD that has in fact already been deleted: a sort of race condition related to the fact that these PODs run for a very short amount of time.
This behavior could easily explain the 1006 code that we receive.

I'll investigate more to find out where in the implementation the bug that leads to this situation is.
However, I can already say that I now consider this to be a Che server problem.

@jfchevrette
Contributor

Could this be a problem with the proxy? I'm told that k8s/openshift has no code that returns this to the client.

Ref https://bugzilla.redhat.com/show_bug.cgi?id=1579965

@davidfestal
Collaborator

Let me summarize the root cause of this error and the fix: the error was happening only when the PV-cleaning POD could not be created (presumably due to quotas). In this case, the Che server was still trying to delete the POD, and thus also endlessly trying to watch for the end of a POD deletion that in fact never occurred. This is the root cause of the regular WebSocket reconnect attempts that always led to the 1006 code.

The fix, provided by PR eclipse-che/che#9932, adds more robustness to the POD deletion method inside the Che server.
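For illustration only, here is a minimal sketch of that robustness pattern written in Go with client-go; the actual fix in eclipse-che/che#9932 lives in the Che server's Java code built on the fabric8 Kubernetes client. The idea: check whether the POD still exists before watching for its deletion, bound the watch with a timeout, and always stop the watcher so it cannot keep reconnecting forever. Names such as waitForPodDeletion are illustrative, not from the Che code base.

package podcleanup

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// waitForPodDeletion waits until the named POD is gone, but never watches a
// POD that does not exist and never leaves the watch open indefinitely.
func waitForPodDeletion(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	// If the POD was never created (e.g. because of quotas) or is already
	// gone, there is nothing to watch for: return immediately instead of
	// scheduling endless WebSocket reconnects.
	if _, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{}); apierrors.IsNotFound(err) {
		return nil
	} else if err != nil {
		return err
	}

	w, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return err
	}
	defer w.Stop() // always close the underlying watch connection

	deadline := time.After(timeout)
	for {
		select {
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before POD %s/%s was deleted", ns, name)
			}
			if ev.Type == watch.Deleted {
				return nil
			}
		case <-deadline:
			return fmt.Errorf("timed out waiting for POD %s/%s to be deleted", ns, name)
		}
	}
}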

davidfestal added a commit to eclipse-che/che that referenced this issue Jun 5, 2018
Watch connection manager never closed when trying to delete a non-existing POD (#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>
riuvshin pushed a commit to eclipse-che/che that referenced this issue Jun 6, 2018
Watch connection manager never closed when trying to delete a non-existing POD (#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>
@l0rd l0rd mentioned this issue Jun 6, 2018
davidfestal added a commit to davidfestal/che that referenced this issue Jun 14, 2018
Watch connection manager never closed when trying to delete a non-existing POD (eclipse-che#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>
riuvshin pushed a commit to eclipse-che/che that referenced this issue Jun 14, 2018
* Watch connection manager never closed when trying to delete a non-existing POD (#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>

* CHE-5918 Add an ability to interrupt Kubernetes/OpenShift runtime start

Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>

* CHE-5918 Add checking of start interruption by KubernetesBootstrapper

It is needed to avoid '403 Pod doesn't exists' errors, which happen when the start is interrupted while any of the machines is in the bootstrapping phase. As a result, a connection leak happens. TODO: create an issue for fabric8-client

* Improve ExecWatchdog to rethrow exception when it occurred while establishing a WebSocket connection

* CHE-5918 Fix K8s/OS runtime start failing when an unrecoverable event occurs
hbhargav pushed a commit to hbhargav/che that referenced this issue Dec 5, 2018
Watch connection manager never closed when trying to delete a non-existing POD (eclipse-che#9932)

Fix the root cause of a recurring 1006 web-socket error.

The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
Signed-off-by: David Festal <dfestal@redhat.com>