Errors "code 1006 is reserved" have appeared recently in OSIO Che logs #672
Comments
We should investigate that during the current sprint.
From what I looked into in the Che server Java code, the error occurs during a call the Che server makes over a websocket. Corresponding errors can be found in the OSO proxy, as shown by the POD logs and Kibana logs. Further investigation on prod-preview showed that:
What we can see on production seems consistent: the errors started with the start of the last POD. FTR, the same error message is mentioned in several issues opened recently against traefik about websocket communication failures, such as traefik/traefik#2828 and traefik/traefik#3165. @aslakknutsen @nurali-techie Could it be that some changes in commit fabric8-services/fabric8-oso-proxy@d3dd6ee introduced this error? For example, some changes that would prevent websocket connections from being correctly received / taken into account?
@davidfestal I am looking into it and I will soon update the findings here.
@davidfestal @nurali-techie Is the 1006 code something you see on the client side? Or something that comes from the server? It would make sense that if the proxy is redeployed you lose the connection to the pod you were connected to and need to reconnect. I'm a bit unsure whether you see the failure on reconnect or before. If before, then it sounds like the connection was just dropped server side. If after... could it be trying some 'half-way reconnect', assuming some server-side state that is no longer there?
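One detail worth keeping in mind for the client-vs-server question above: per RFC 6455 §7.4.1, code 1006 is a reserved status code that must never appear in a Close frame on the wire; a websocket library synthesizes it locally when the connection drops without a proper close handshake. A minimal helper illustrating which codes are reserved (the logic here is a sketch, not code from Che or the proxy):

```go
package main

import "fmt"

// reservedCloseCode reports whether a WebSocket close status code is
// reserved by RFC 6455 §7.4.1 and must not be sent in a Close frame.
// 1005 (no status received) and 1006 (abnormal closure) are codes a
// client library synthesizes locally, so seeing "code 1006" in logs
// means the TCP connection dropped without a WebSocket close handshake.
func reservedCloseCode(code int) bool {
	switch code {
	case 1004, 1005, 1006, 1015:
		return true
	}
	return false
}

func main() {
	fmt.Println(reservedCloseCode(1006)) // true: never sent on the wire
	fmt.Println(reservedCloseCode(1000)) // false: normal closure
}
```

This is why "code 1006 is reserved" is reported on whichever side observed the drop, not necessarily the side that caused it.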
@davidfestal @nurali-techie While possible, I think it's unlikely that it's related to that specific commit on the proxy side. I think it just happens to kinda correlate with the 2 weeks of logs we have.
@aslakknutsen it doesn't seem to be the root cause of the problem, if I understand correctly, because prod-preview Che server was redeployed just yesterday while the oso proxy was still running, and the error came back rapidly in the Che server logs (I would say as soon as a workspace is started and the Che server tries to keep a long-running websocket connection with it). This seems to mean that the error doesn't come from the Che server side due to a restart of the oso proxy. But tell me if I have misunderstood your point.
I didn't just use the logs, but also precise Kibana queries and the dates of the OpenShift deployments for both the OSO proxy and the Che server. And the errors started happening precisely some minutes after the deployment of this OSO proxy commit, for both production and prod-preview. Shouldn't we find a way to monitor websocket connections between the Che server and the OSO proxy? Something like running an additional container with tcpdump and analyzing the result with Wireshark... Or is there a way to enable more debugging on the oso proxy side to dump detailed websocket traffic?
@davidfestal yesterday this link was showing some logs on 1-May-2018. Today, it's not showing any records; maybe Kibana keeps only the last 15 days of logs. But we are quite sure that the issue is not related to that git commit, as we observed older logs from before the commit with the same issue. To cross-check that Kibana only keeps 15 days of logs, check this link: it has no filter other than the date range "30-apr to 1-may", and no records are shown.
@nurali-techie anyway, as detailed in my previous comment, and from what I see in the Java code of the Che server, I don't think it is related to the Che server losing the connection with the oso proxy because of an oso proxy restart. Afaict, there's a websocket communication that cannot be completed correctly between the two, and the Che server schedules a reconnect again and again on its side. Is there any option to enable more logs / traces of websocket communications on the proxy side?
@davidfestal we are able to reproduce this error with a small program which calls the OSO WebSocket API directly (no oso-proxy in between). It seems that the OSO server gives this error to the oso-proxy, which logs it (that's what we observed in Kibana) and returns it to the client (in your case the Che server). Here is that small program.
Run command: `go run main.go api.starter-us-east-2a.openshift.com nvirani-preview xxxx`
This program calls one of the OSO WebSocket APIs and reproduces the error. Conclusion:
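For readers without access to the original `main.go`, here is a sketch of how such a reproducer typically builds the watch URL it dials. The `/oapi/v1/...` path layout, the `watch=true` query, and the `watchURL` helper name are assumptions about the original program, not copied from it; the token would be passed separately as a Bearer header when dialing:

```go
package main

import (
	"fmt"
	"net/url"
)

// watchURL builds the wss:// URL for watching an OpenShift v3 resource,
// in the spirit of the main.go reproducer mentioned above. The exact API
// path is an assumption; adjust it for the cluster's API version.
func watchURL(host, namespace, resource string) string {
	u := url.URL{
		Scheme:   "wss",
		Host:     host,
		Path:     fmt.Sprintf("/oapi/v1/namespaces/%s/%s", namespace, resource),
		RawQuery: "watch=true",
	}
	return u.String()
}

func main() {
	// Per the thread, watching 'buildconfigs' reproduces the 1006 error
	// on this cluster while watching 'builds' does not.
	fmt.Println(watchURL("api.starter-us-east-2a.openshift.com", "nvirani-preview", "buildconfigs"))
}
```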
@davidfestal I have tried with another OSO WebSocket API which doesn't give this error. @riuvshin The issue seems to be with specific OpenShift WebSocket APIs, not all of them.
@nurali-techie so it does not work properly with this URL: wss://api.starter-us-east-2a.openshift.com. And which one was working properly for you?
@riuvshin In one line: the 'buildconfigs' API gives the error while the 'builds' API works.
@riuvshin @nurali-techie I debugged the prod-preview Che server to find out the context in which this error occurs, and more precisely which OpenShift resource the Che server is watching when it occurs. It showed that this error only occurs in a very specific use-case related to a specific type of POD that is started asynchronously to clean the PV of obsolete directories. When looking into this specific code and debugging it, it seems to me that there might be something incorrect in the way the POD is created, run, watched and waited for. This finally results in the POD watcher trying to reconnect websockets again and again to a POD that, in fact, has already been deleted. A sort of race condition related to the fact that these PODs run for a very short amount of time. I'll investigate more to find out where, in the implementation, the bug that leads to this situation is.
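The reconnect-to-a-deleted-POD behaviour described above can be sketched as follows. This is a minimal illustration in Go (the Che server itself is Java), and the `podClient` interface, `watchUntilGone`, and `fakePods` names are all hypothetical; the point is only that a retry loop must re-check that the pod still exists before reconnecting, otherwise a short-lived pod produces an endless stream of abnormal (1006) disconnects:

```go
package main

import (
	"errors"
	"fmt"
)

// podClient is a hypothetical, minimal stand-in for the Kubernetes client
// the Che server uses; only what this sketch needs.
type podClient interface {
	Exists(pod string) bool
	Watch(pod string) error // blocks until the connection drops
}

var errGone = errors.New("pod no longer exists")

// watchUntilGone retries the watch after each abnormal disconnect, but
// re-checks that the pod still exists before reconnecting, so a pod that
// was already deleted stops the loop instead of feeding it forever.
func watchUntilGone(c podClient, pod string, maxRetries int) error {
	for i := 0; i < maxRetries; i++ {
		if !c.Exists(pod) {
			return errGone // deletion finished: stop reconnecting
		}
		if err := c.Watch(pod); err != nil {
			continue // connection dropped (the 1006 case): retry
		}
		return nil
	}
	return fmt.Errorf("gave up after %d retries", maxRetries)
}

// fakePods simulates a short-lived pod: it "exists" for the first few
// checks, and every watch attempt ends in an abnormal disconnect.
type fakePods struct{ alive int }

func (f *fakePods) Exists(string) bool { f.alive--; return f.alive >= 0 }
func (f *fakePods) Watch(string) error { return errors.New("abnormal closure (1006)") }

func main() {
	fmt.Println(watchUntilGone(&fakePods{alive: 2}, "pv-cleanup", 10))
}
```

Without the `Exists` check, the same loop would burn through every retry against a pod that is already gone, which matches the repeated 1006 entries seen in the logs.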
Could this be a problem with the proxy? I'm told that k8s/openshift has no code that returns this to the client. |
Let me summarize the root cause of this error and the fix. The error was happening only when the PV-cleaning POD could not be created (presumably due to quotas). In that case, the Che server was still trying to delete the POD, and thus also endlessly trying to watch for the end of a POD deletion that in fact never occurred. This is the root cause of the regular websocket reconnect attempts that always led to the 1006 code. Adding more robustness to the POD deletion method inside the Che server is the fix provided by PR eclipse-che/che#9932.
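The hardened deletion flow can be sketched like this. Again a Go illustration of the idea (the actual fix is in the Che server's Java code, eclipse-che/che#9932), with hypothetical names (`pods`, `deletePod`, `errNotFound`): when the pod was never created, return immediately and never open the deletion-watch websocket, so no reconnect loop can start:

```go
package main

import (
	"errors"
	"fmt"
)

// pods is a hypothetical minimal client interface for this sketch.
type pods interface {
	Delete(name string) error      // errNotFound if the pod was never created
	WaitDeleted(name string) error // opens a watch until the pod is gone
}

var errNotFound = errors.New("pod not found")

// deletePod only opens the deletion watch when a deletion actually
// started. If the pod does not exist (e.g. the PV-cleaning pod could not
// be created because of quotas), it returns at once: no watch, no
// endless 1006 reconnects.
func deletePod(c pods, name string) error {
	if err := c.Delete(name); err != nil {
		if errors.Is(err, errNotFound) {
			return nil // nothing was running: skip the watch entirely
		}
		return err
	}
	return c.WaitDeleted(name)
}

// fake simulates the failure mode from the thread: the pod was never
// created, so Delete reports not-found and the watch must never open.
type fake struct{}

func (fake) Delete(string) error      { return errNotFound }
func (fake) WaitDeleted(string) error { return errors.New("watch opened: bug") }

func main() {
	fmt.Println(deletePod(fake{}, "pv-cleaner")) // nil: no watch opened
}
```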
Watch connection manager never closed when trying to delete a non-existing POD (#9932) Fix the root cause of a recurring 1006 web-socket error. The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672 Signed-off-by: David Festal <dfestal@redhat.com>
* Watch connection manager never closed when trying to delete a non-existing POD (#9932)
  Fix the root cause of a recurring 1006 web-socket error. The fixed bug is described / discussed in the following issue: redhat-developer/rh-che#672
  Signed-off-by: David Festal <dfestal@redhat.com>
* CHE-5918 Add an ability to interrupt Kubernetes/OpenShift runtime start
  Signed-off-by: Sergii Leshchenko <sleshche@redhat.com>
* CHE-5918 Add checking of start interruption by KubernetesBootstrapper
  It is needed to avoid '403 Pod doesn't exists' errors, which happen when the start is interrupted while any of the machines is in the bootstrapping phase; as a result a connection leak happens.
  TODO Create an issue for fabric8-client
* Improve ExecWatchdog to rethrow an exception when it occurred while establishing a WebSocket connection
* CHE-5918 Fix K8s/OS runtime start failing when an unrecoverable event occurs