Websocket response are not complete #4446

Open
farodin91 opened this issue Jan 31, 2019 · 17 comments
Labels
area/websocket kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. priority/P2 need to be fixed in the future

Comments

@farodin91

Do you want to request a feature or report a bug?

Bug

What did you do?

I tried to run JanusGraph behind Traefik and simulate our load against Traefik. JanusGraph is a graph database whose clients connect to its server component over WebSockets.
Simulating load means running a lot of WebSocket requests with large requests and responses. Responses are often too large for a single WebSocket message and are therefore sent as multiple messages. The protocol is explained in detail here.

What did you expect to see?

The same behaviour as with nginx.

We currently run JanusGraph behind nginx as a proxy, with the following config:

upstream upstream_writeable {
    server janusgraph:8182;
}

server {
    listen 8182 ssl;

    server_name janusgraph.tld;

    ssl_certificate     /etc/nginx/cert.crt;
    ssl_certificate_key /etc/nginx/cert.key;

    location / {
        proxy_pass http://upstream_writeable;
        proxy_http_version 1.1;
    }
}

What did you see instead?

Incomplete transport of WebSocket packets.

Output of traefik version: (What version of Traefik are you using?)

Version:      v1.7.7
Codename:     maroilles
Go version:   go1.11.4
Built:        2019-01-08_10:21:03AM
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

defaultEntryPoints = ["http","https"]
checkNewVersion = false
ProvidersThrottleDuration = 5
MaxIdleConnsPerHost = 400
RootCAs = [ "/usr/local/share/ca-certificates/root.crt" ]

[traefikLog]
  filePath = "/dev/stdout"
  format   = "common"

[entryPoints]
  [entryPoints.gremlin]
  address = ":8182"
    [entryPoints.gremlin.tls]
      [[entryPoints.gremlin.tls.certificates]]
      CertFile = "/etc/ssl/sec.crt"
      KeyFile = "/etc/ssl/sec.key"

[docker]
endpoint = "unix:///var/run/docker.sock"
swarmmode = true
watch = true

[lifeCycle]
requestAcceptGraceTimeout = "60s"
graceTimeOut = "720s"

[respondingTimeouts]
idleTimeout = "720s"

[forwardingTimeouts]
dialTimeout = "720s"

docker-compose.yml for janusgraph

version: '3.4'

services:
  janusgraph:
    image: janusgraph:0.3.1
    networks:
      default:
      traefik:
    deploy:
      labels:
       - traefik.port=8182
       - traefik.docker.network=management
       - traefik.frontend.rule=HostRegexp:janusgraph.{domain:.+}
       - traefik.frontend.entryPoints=gremlin
      resources:
        reservations:
          memory: 1500M
          cpus: '0.5'
      mode: replicated
      replicas: 1

networks:
  default:
  traefik:
    external:
      name: management

docker-compose.yml for traefik

version: '3.3'
services:
  traefik:
    image: "docker_traefik_image"
    command: --configfile=/etc/traefik/traefik.toml --debug
    networks:
     - traefik
    ports:
      - target: 80
        published: 80
        protocol: tcp
        mode: host
      - target: 443
        published: 443
        protocol: tcp
        mode: host
      - target: 8182
        published: 8182
        protocol: tcp
        mode: host
    volumes:
     - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      mode: global
      update_config:
        parallelism: 1
        delay: 5s
      placement:
        constraints:
         - node.role == manager

networks:
  traefik:
    external:
      name: management

docker-compose.yml for nginx

  nginx-proxy:
    image: "docker_nginx_proxy_image"
    ports:
      - 8182:8182
    networks:
      - default
    deploy:
      mode: global
      update_config:
        parallelism: 1
        delay: 5s
      placement:
        constraints:
         - node.role == manager

If applicable, please paste the log output in DEBUG level (--logLevel=DEBUG switch)

I see a lot of these log lines.

time="2019-01-31T07:43:55Z" level=debug msg="vulcand/oxy/forward: completed ServeHttp on request" Request="{\"Method\":\"GET\",\"URL\":{\"Scheme\":\"http\",\"Opaque\":\"\",\"User\":null,\"Host\":\"10.0.0.95:8182\",\"Path\":\"\",\"RawPath\":\"\",\"ForceQuery\":false,\"RawQuery\":\"\",\"Fragment\":\"\"},\"Proto\":\"HTTP/1.1\",\"ProtoMajor\":1,\"ProtoMinor\":1,\"Header\":{\"Connection\":[\"Upgrade\"],\"Sec-Websocket-Key\":[\"L2wtu2TZLEiVtaeiuCAR1A==\"],\"Sec-Websocket-Version\":[\"13\"],\"Upgrade\":[\"websocket\"]},\"ContentLength\":0,\"TransferEncoding\":null,\"Host\":\"janusgraph.tld:8182\",\"Form\":null,\"PostForm\":null,\"MultipartForm\":null,\"Trailer\":null,\"RemoteAddr\":\"172.18.0.1:51780\",\"RequestURI\":\"/gremlin\",\"TLS\":null}"
@mmatur mmatur added priority/P2 need to be fixed in the future area/websocket kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. and removed status/0-needs-triage labels Feb 4, 2019
@farodin91
Author

Any progress?

@theolampert

I can confirm I'm having the same issue. In my case I tried running Traefik (v1.7.10) in front of Couchbase's Sync Gateway. When certain documents attempt to replicate (the larger ones, I'm guessing), the packets arrive at Sync Gateway incomplete.

@sinosoidal

sinosoidal commented May 21, 2019

I'm having the same problem. I can't use Traefik in front of Couchbase Sync Gateway to upload attachments.

Is this something Traefik 2.0 fixes?

@meodemsao

I have the same issue with sockets:

level=error msg="vulcand/oxy/forward/websocket: Error when copying from backend to client: websocket: close 1006 (abnormal closure): unexpected EOF"

@sinosoidal

@meodemsao @theolampert @farodin91 I have been trying to understand what could be wrong here. Since document replication works, why do attachments fail? Is it because of a custom header? Is it because of some packet size limit in routing? Does anyone have any leads?

@meodemsao

my system can't subscription when i change from nginx to traefik

@sinosoidal

my system can't subscription when i change from nginx to traefik

What do you mean by "can't subscription"?

@theolampert

@sinosoidal I experienced the issue with larger documents, not just attachments. I've since swapped Traefik for nginx and everything seems to be working fine again.

@meodemsao

@sinosoidal I'm using WebSockets for pub/sub in my system.

@meodemsao

@theolampert I was using nginx with docker-compose before, but it seems harder to configure than Traefik.
Do you have any solution?

@theolampert

@meodemsao

@theolampert Yes, I also found this solution. I will try it.

@sinosoidal

sinosoidal commented May 24, 2019

@theolampert so the size matters. Someone from Couchbase shared this with me:

https://github.com/couchbase/couchbase-lite-core/issues/503

Someone using an Azure reverse proxy to serve Sync Gateway was having the same kind of issue. It was related to the maximum packet size allowed. Maybe the problem with Traefik is similar.

Thoughts?

@ewoudwerkman

ewoudwerkman commented Jul 25, 2019

I'm having the same issue with WebSockets using a Flask-SocketIO backend: a large data stream is split into multiple chunks that are not recognized by the Flask-SocketIO server. Without Traefik (latest stable version) in between, it works flawlessly. Can we set Traefik's chunk size somewhere?

It seems to be chunked at 4096 bytes.
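
For illustration, here is a minimal Python sketch of what splitting one WebSocket message into 4096-byte fragments looks like on the wire, following the frame layout in RFC 6455. It is not Traefik's code: the names `encode_frame` and `fragment` are hypothetical, and the 4096-byte default only mirrors the chunk size observed above (an assumption for the example, not a documented Traefik setting).

```python
import struct
from typing import List

# Data-frame opcodes from RFC 6455, section 5.2.
OP_CONT, OP_TEXT, OP_BINARY = 0x0, 0x1, 0x2


def encode_frame(opcode: int, payload: bytes, fin: bool) -> bytes:
    """Encode one unmasked (server-to-client style) WebSocket frame."""
    first = (0x80 if fin else 0x00) | opcode  # FIN bit plus opcode
    n = len(payload)
    if n < 126:
        header = struct.pack("!BB", first, n)
    elif n < 65536:
        header = struct.pack("!BBH", first, 126, n)  # 16-bit extended length
    else:
        header = struct.pack("!BBQ", first, 127, n)  # 64-bit extended length
    return header + payload


def fragment(payload: bytes, opcode: int = OP_BINARY,
             max_fragment: int = 4096) -> List[bytes]:
    """Split one message into frames of at most max_fragment payload bytes.

    The first frame carries the real opcode, later frames use the
    continuation opcode, and only the last frame has FIN set.
    """
    chunks = [payload[i:i + max_fragment]
              for i in range(0, len(payload), max_fragment)] or [b""]
    last = len(chunks) - 1
    return [
        encode_frame(opcode if i == 0 else OP_CONT, chunk, fin=(i == last))
        for i, chunk in enumerate(chunks)
    ]
```

With this sketch, a 10000-byte binary message becomes three frames: two with 4096 payload bytes and FIN unset, then one continuation frame with the remaining 1808 bytes and FIN set. A receiver that treats each frame as a whole message would see three truncated messages.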

@parra28

parra28 commented Sep 11, 2019

Is there a solution to this problem?

@ti-mo

ti-mo commented Oct 21, 2019

For us, this issue popped up when putting Traefik in between a Go server and an Elixir client. Traefik (or the Go runtime, or gorilla/websocket, or the websocket stdlib, we're not sure) seems to adapt the size of continuation frames as it sees fit, likely to make optimal use of buffers.

As far as I understand, it does this in an RFC-compliant manner, but the behaviour seems sufficiently rare for it to be ignored by quite a few websocket clients. If you're having disconnection issues through Traefik, try consuming the websocket using another library or runtime (Go always dealt with this just fine), or use a packet dump of the traffic received by your client to debug its websocket library.

The RFC states:

A sender MAY create fragments of any size for non-control messages.

We fixed this in websockex here: Azolo/websockex#79.

@ewoudwerkman was right to suggest that this is related to fragment size, but I'm not sure it makes sense to expose this as configuration, as the size may change based on context (available memory, system load, etc.).

Either way, this is likely an issue with websocket message receivers (whether they're clients or servers) that should be fixed, since any sender could fragment traffic in this way.
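
To make the receiver-side requirement concrete, here is a minimal Python sketch of reassembling fragmented messages as RFC 6455 describes: frames are collected until one arrives with the FIN bit set. The names `parse_frames` and `reassemble` are hypothetical, the parser assumes the buffer holds only complete frames, and control frames are skipped rather than answered, so treat it as a debugging aid for a packet dump and not as a full implementation.

```python
import struct
from typing import Iterable, Iterator, List, Optional, Tuple

OP_CONT = 0x0  # continuation opcode (RFC 6455, section 5.2)


def parse_frames(data: bytes) -> Iterator[Tuple[bool, int, bytes]]:
    """Yield (fin, opcode, payload) for each frame in data.

    Assumes data contains only complete frames.
    """
    pos = 0
    while pos < len(data):
        b0, b1 = data[pos], data[pos + 1]
        fin, opcode = bool(b0 & 0x80), b0 & 0x0F
        masked, length = bool(b1 & 0x80), b1 & 0x7F
        pos += 2
        if length == 126:  # 16-bit extended payload length
            (length,) = struct.unpack_from("!H", data, pos)
            pos += 2
        elif length == 127:  # 64-bit extended payload length
            (length,) = struct.unpack_from("!Q", data, pos)
            pos += 8
        mask = b""
        if masked:  # client-to-server frames carry a 4-byte masking key
            mask = data[pos:pos + 4]
            pos += 4
        payload = data[pos:pos + length]
        pos += length
        if masked:
            payload = bytes(b ^ mask[i % 4] for i, b in enumerate(payload))
        yield fin, opcode, payload


def reassemble(frames: Iterable[Tuple[bool, int, bytes]]) -> List[bytes]:
    """Join fragments into whole messages, whatever size each fragment has."""
    messages: List[bytes] = []
    parts: Optional[List[bytes]] = None
    for fin, opcode, payload in frames:
        if opcode >= 0x8:
            continue  # control frames may sit between fragments; ignored here
        if opcode == OP_CONT:
            if parts is None:
                raise ValueError("continuation frame without a started message")
            parts.append(payload)
        else:
            if parts is not None:
                raise ValueError("new data frame before previous message ended")
            parts = [payload]
        if fin:
            messages.append(b"".join(parts))
            parts = None
    return messages
```

Running `reassemble(parse_frames(dump))` over bytes captured on the receiving side shows whether the sender's fragments form valid messages; if they do, the bug is most likely in the receiving library's handling of continuation frames.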

@ewoudwerkman

@ti-mo Thanks for your response.
It's good to know that it is the WebSocket server that needs to be fixed in this case (to be RFC compliant) and not Traefik (although Traefik might be the source of the problem by fragmenting the packet in the first place ;-))

Since our Flask app is using uwsgi in production, I now know where to look for a fix (#unbit/uwsgi#1350) and for a solution (#unbit/uwsgi#1853) ;-)


10 participants