
Uploads don't start until a client connects, fail in multi-instance environment #3538

Closed
gabiganam opened this issue Mar 7, 2022 · 18 comments

@gabiganam (Contributor) commented Mar 7, 2022

I'm not sure I understand the logic behind this mechanism; IMO getting progress events is not worth delaying the start of the transfer.

From upload.js:

      // wait till the client has connected to the socket, before starting
      // the download, so that the client can receive all download/upload progress.
      logger.debug('Waiting for socket connection before beginning remote download/upload.', null, req.id)
      await uploader.awaitReady()
      logger.debug('Socket connection received. Starting remote download/upload.', null, req.id)

From Uploader.js:

  async awaitReady () {
    // TODO timeout after a while? Else we could leak emitters
    logger.debug('waiting for socket connection', 'uploader.socket.wait', this.shortToken)
    await new Promise((resolve) => emitter().once(`connection:${this.token}`, resolve))
    logger.debug('socket connection received', 'uploader.socket.wait', this.shortToken)
  }

I'm setting up multiple companion pods in a k8s cluster, so it can happen that an upload starts on one instance while the client's socket connection is received by a different instance. As a result, the first instance is stuck in awaitReady forever and the upload never starts.

So basically my questions are:

  1. What am I missing in my multiple-instances setup?
  2. Can we please make this logic of waiting for the client connection configurable? Or have a configurable timeout, as the TODO comment suggests?
  3. I set up Redis as the documentation suggests; why doesn't this allow the second instance (the one that receives the connection) to commence the upload?
@Murderlon (Member) commented:

Uppy and companion instances are 'married' to each other, meaning you can't do a multi-instance setup where an upload started on one companion instance is picked up later by a different one. Companion isn't stateless; it stores a lot of data in memory. We talked about this extensively at Transloadit, but the conclusion was not to pursue a huge rewrite to make this happen.

Perhaps @kvz can shed some light on how one of Transloadit's customers solved their scaling issue in a similar way.

@Mosharush commented Mar 8, 2022

Hey @Murderlon,
I also tried this, and it seems the companion server can't be scaled.
But I don't understand why we need Redis at all: if there is some data ("a lot of" it) in memory, why not save it in Redis so Companion can be fully stateless?

Murderlon assigned kvz and mifi Mar 8, 2022
@mifi (Contributor) commented Mar 8, 2022

Hi

IMO getting progress events is not worth delaying the start of the transfer

I agree, but it's not just about progress: it's about synchronisation between client and server. If we did not wait here and the upload finished before the client had connected to the socket, the client would never know that the upload finished, so it would hang forever in the "uploading" state. This could probably be fixed, but it could require a new major version of both uppy and companion.

  • What am I missing in my multiple-instances setup?

For now, if you want to do a multi-instance setup, I think it should still be possible with sticky sessions.

  • Can we please make this logic of waiting for client connection configurable? or have a configurable timeout as the TODO comment suggests?

Actually by chance, unrelated to this, I fixed this just now in #3544 by implementing a timeout.
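
Not the actual #3544 change, but a minimal sketch of what such a timeout could look like, reusing the connection:${token} event from the Uploader.js snippet above (the timeoutMs parameter and the error message are assumptions):

const { EventEmitter } = require('node:events')

const emitter = new EventEmitter() // stand-in for companion's emitter()

// Resolve once the client's socket connects, or give up after timeoutMs
// so the listener doesn't leak forever (the TODO in awaitReady)
function awaitReady (token, timeoutMs = 60000) {
  return new Promise((resolve, reject) => {
    const onConnection = () => {
      clearTimeout(timer)
      resolve()
    }
    const timer = setTimeout(() => {
      emitter.removeListener(`connection:${token}`, onConnection)
      reject(new Error('Timed out waiting for socket connection'))
    }, timeoutMs)
    emitter.once(`connection:${token}`, onConnection)
  })
}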

As for the redis configuration and logic, I'm not 100% sure what the purpose is, but I can see from the docs:

redisUrl (optional) - URL to a running Redis server. If this is set, the state of uploads will be stored temporarily. This helps with resuming uploads after a browser crash on the client. The stored upload will be sent back to the client on reconnection.

From the code I can see what happens when redis is enabled in the companion config. However, I don't see how this helps much when uploads cannot be started on one server and continued on a different one.

@gabiganam (Contributor, Author) commented:

Thanks @Murderlon, I'd love to hear more details about how your customers configured scaling!
@kvz any info you can share will be highly appreciated.

@mifi (Contributor) commented Mar 15, 2022

Looking at uppy.io, I can see that companion keeps trying to send the response header Set-Cookie: connect.sid=xyz on every request, and a new session ID is created for every request. This is because the Uppy client does not send back the Cookie header with the cookie it has received. So in the default configuration, I think scaling with sessions does not work.

I think what needs to be done to support cookies (and thereby sticky sessions in a load balancer) is to enable withCredentials: true and add Cookie to the CORS allowed headers, as well as some other things. Only when the client sends a cookie with a fixed connect.sid on every request can this cookie be used by the load balancer to enable stickiness.
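
As a sketch of what the client side of that could look like (companionCookiesRule and the companion URL are assumptions here; check the current Uppy provider docs before relying on this):

import Uppy from '@uppy/core'
import GoogleDrive from '@uppy/google-drive'

const uppy = new Uppy()

uppy.use(GoogleDrive, {
  companionUrl: 'https://companion.example.com',
  // Assumed option: include cookies (withCredentials) on every companion
  // request, so a load balancer could key stickiness on connect.sid
  companionCookiesRule: 'include',
})

On the companion side, the CORS configuration would then also need to allow credentials and the Cookie header, as described above.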

@mifi (Contributor) commented Mar 16, 2022

I've done some more research and I've found that companion indeed does not support true fault-tolerant, non-sticky load balancing with multiple companion instances behind one (sub)domain (where any client request can go to any instance).

Instead it supports a different type of scaling where each companion instance needs to have its own domain name, e.g. companion1.example.com, companion2.example.com, ...
There will be one "master" companion instance that handles all authentication, in order to avoid having to set up separate callback URLs for each instance's subdomain on the OAuth providers' admin pages. This master instance is configured using the options oauthDomain and validHosts.

This means that you don't need to set up sticky sessions, redis or anything like that, because Uppy/Companion will handle stickiness internally (actually the express session code isn't really in use, AFAIK).

The only thing you need to do is set companionUrl in Uppy to the URL of a randomly chosen companion instance from the list of subdomains, as sketched below.
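
A minimal sketch of that client-side random pick, continuing the Uppy setup from the sketch above (the host names are placeholders):

const companionHosts = [
  'companion1.example.com',
  'companion2.example.com',
]

// Pin this page load to one randomly chosen companion instance
const host = companionHosts[Math.floor(Math.random() * companionHosts.length)]

uppy.use(GoogleDrive, { companionUrl: `https://${host}` })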

I found this information in #1845 and #2321

I will update the documentation because this is a frequently requested feature.

@mifi (Contributor) commented Mar 16, 2022

Actually, when I look more closely at the code (unless my brain is too tired), I can see that socket.js does indeed proxy/forward all websocket messages to redis:

emitter().emit(`${data.action}:${token}`)

This would imply that we do actually support fault-tolerant scaling between multiple companion instances on the same domain 🤯

Consider this:

  • we have 2 companion instances connected over redis
  • the client connects to companion instance 1 and calls /get on a file
  • companion 1 stops at await uploader.awaitReady(), which waits for the connection:${token} event sent over redis
  • the client connects via websocket to companion instance 2, and this instance broadcasts the connection:${token} event over redis
  • companion 1 continues the upload and emits progress/success events over redis
  • companion 2 relays these events from redis back over the websocket

So if this theory is correct, then I think it should just work if you enable redis in your setup. In that case we need to update our docs as well.
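
As a rough sketch of that relay mechanism (this uses ioredis pub/sub directly; companion's real redis emitter differs, and the channel name and message format here are assumptions):

const Redis = require('ioredis')

// A subscribed redis connection cannot publish, so use two connections
const pub = new Redis('redis://localhost:6379')
const sub = new Redis('redis://localhost:6379')

// Instance 2: when the websocket connects, broadcast the event to all instances
function onWebsocketConnect (token) {
  pub.publish('companion-events', `connection:${token}`)
}

// Instance 1: awaitReady resolves when any instance broadcasts the event
function awaitReady (token) {
  return new Promise((resolve) => {
    sub.subscribe('companion-events')
    sub.on('message', (channel, message) => {
      if (message === `connection:${token}`) resolve()
    })
  })
}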

The explanation for why express sessions are being used (even though they don't seem to work over XHR) is this:
Grant uses express sessions for passing data between redirects, so that we don't need to expose secret tokens in URLs (https://github.com/simov/grant#callback-session) - after that, the session is no longer used. And because these are not XHR requests coming from Uppy, the CORS configuration doesn't matter (the browser makes requests directly to companion in a new tab/window!)

If this really works, I think we should make an integration or e2e test for running with redis and a load balancer, to confirm that switching between two instances on upload really works.

Murderlon added the 💬 Discussion and Companion labels Mar 17, 2022
@Mosharush commented Mar 20, 2022

The problem is that this code waits for a socket connection before starting the file upload:

await uploader.awaitReady()

So even if another pod receives the socket connection, the companion that got the upload request doesn't start handling the file upload.

We need to add a listener on socket connect so the upload can be handled statelessly.

@mifi (Contributor) commented Mar 20, 2022

@Mosharush see my previous comment. Because other instances broadcast a connection:token event over redis when the websocket connects, awaitReady should return when it receives this redis event.

@Murderlon (Member) commented:

Can we turn this issue into something actionable so we work towards closing it? What needs to happen? Code changes? Docs?

@mifi (Contributor) commented Mar 21, 2022

Someone needs to test this whole setup with two companion instances connected over redis, behind a load balancer that balances requests round-robin (every new request goes to a different instance).

Or we must set up an automated system test for this (a lot more work).

@Murderlon (Member) commented:

Perhaps we can quickly test it first ourselves and see if it works. If we decide to support it, there should be a test for it, in my opinion. But that can come later.

@Murderlon (Member) commented:

But AFAIK two people here have already tested it and it didn't work with redis enabled, so I suppose we can go straight to deciding whether we want to support this?

@gabiganam (Contributor, Author) commented:

IMO it's critical to support this in order to have a scalable solution :)

@mifi (Contributor) commented Mar 21, 2022

I agree that it's a critical feature for being able to use companion in real-world applications that need to scale. And because I think we are already so close to supporting it (if it isn't supported already), we should make the small effort of testing whether it works. @gabiganam did you happen to see whether redis events were coming in at all, like connection:token when the websocket connection was established?

@gabiganam (Contributor, Author) commented:

@mifi In my test environment it does not work as expected :(
I run 2 companion instances: one gets the upload request and awaits the connection, the second companion gets the connection, but the first doesn't continue.
We have Redis configured in both, of course.

See logs from companion 1:
(screenshot of companion 1 logs, 2022-03-21 17:53)

Companion 2:
(screenshot of companion 2 logs, 2022-03-21 17:54)

@mifi (Contributor) commented Oct 31, 2022

Hi again. I've done some more testing around this: I've set up a simple load balancer (reverse proxy) based on http-proxy in front of two companion instances, so that every even request goes to one companion and every odd request goes to the other.

I've observed that the /drive/get/x request goes to companion 1 while the websocket /api/y request goes to companion 2.

Companion 1 http://127.0.0.1:3021 logs

companion: 2022-10-31T10:11:42.817Z [debug] 1fdb5dcd-83d0-4e6f-8059-76b3ea7c2579null Instantiating uploader.
companion: 2022-10-31T10:11:42.818Z [debug] 1fdb5dcd-83d0-4e6f-8059-76b3ea7c2579null Starting download stream.
companion: 2022-10-31T10:11:47.047Z [debug] 1fdb5dcd-83d0-4e6f-8059-76b3ea7c2579null Waiting for socket connection before beginning remote download/upload.
companion: 2022-10-31T10:11:47.047Z [debug] 35e0f960 uploader.socket.wait waiting for socket connection
::ffff:127.0.0.1 - - [31/Oct/2022:10:11:47 +0000] "POST /drive/get/fileId HTTP/1.1" 200 48 "http://127.0.0.1:5173/" "Mozilla"
companion: 2022-10-31T10:11:47.077Z [debug] 35e0f960 uploader.socket.wait socket connection received
companion: 2022-10-31T10:11:47.077Z [debug] 1fdb5dcd-83d0-4e6f-8059-76b3ea7c2579null Socket connection received. Starting remote download/upload.
companion: 2022-10-31T10:11:47.077Z [debug] 35e0f960 controller.get.provider.size need to download the whole file first
companion: 2022-10-31T10:11:47.077Z [debug] 35e0f960 uploader.download fully downloading file
companion: 2022-10-31T10:11:47.079Z [debug] 35e0f960 uploader.total.progress 7264 151981 4.78%
companion: 2022-10-31T10:11:48.107Z [debug] 35e0f960 uploader.total.progress 7645 151981 5.03%
companion: 2022-10-31T10:11:50.254Z [debug] 35e0f960 uploader.total.progress 41785 151981 27.49%
companion: 2022-10-31T10:11:51.300Z [debug] 35e0f960 uploader.total.progress 44541 151981 29.31%
companion: 2022-10-31T10:11:53.013Z [debug] 35e0f960 uploader.total.progress 64370 151981 42.35%
companion: 2022-10-31T10:11:55.044Z [debug] 35e0f960 uploader.total.progress 74629 151981 49.10%
companion: 2022-10-31T10:11:55.101Z [debug] 35e0f960 uploader.download finished fully downloading file
companion: 2022-10-31T10:12:02.700Z [debug] 35e0f960 uploader.total.progress 75990 151981 50.00%
companion: 2022-10-31T10:12:04.655Z [debug] 35e0f960 uploader.total.progress 151981 151981 100.00%
companion: 2022-10-31T10:12:04.656Z [debug] 35e0f960 cleanup

Companion 2 http://127.0.0.1:3022 logs

companion: 2022-10-31T10:11:47.073Z [info] socket.connect connection received from 35e0f960-70b3-4656-a469-d0c66543512d

Load balancer http://127.0.0.1:3020 logs

req POST http://127.0.0.1:3021 /drive/get/fileId
upgrade http://127.0.0.1:3022 /api/35e0f960-70b3-4656-a469-d0c66543512d

As you can see, companion 1 gets the first /drive/get request and logs "waiting for socket connection"; then, after companion 2 receives its websocket request, it forwards this connection:token event onto redis, and companion 1 logs "socket connection received" because its promise resolves upon receiving this event from redis. So to me it looks like it's working as intended.

Here is the code of the simple load balancer that I created:

const http = require('http');
const httpProxy = require('http-proxy');

// ws: true enables proxying of websocket connections
const proxy = httpProxy.createProxyServer({ ws: true });

let i = 0;

// Alternate between the two companion instances on every request (round robin)
const getTarget = () => (i % 2 === 0 ? 'http://127.0.0.1:3021' : 'http://127.0.0.1:3022');

const server = http.createServer(function (req, res) {
  const target = getTarget();
  console.log('req', req.method, target, req.url);
  proxy.web(req, res, { target });
  i++;
});

// Balance websocket upgrade requests too, so they can land on either instance
server.on('upgrade', function (req, socket, head) {
  const target = getTarget();
  console.log('upgrade', target, req.url);
  proxy.ws(req, socket, head, { target });
  i++;
});

console.log('listening');
server.listen(3020);

Here is my companion configuration:

companion1.sh:

env \
COMPANION_DATADIR="./output1" \
COMPANION_DOMAIN="localhost:3020" \
COMPANION_PROTOCOL="http" \
COMPANION_PORT=3021 \
COMPANION_CLIENT_ORIGINS="" \
COMPANION_SECRET="development" \
COMPANION_ALLOW_LOCAL_URLS="true" \
COMPANION_REDIS_URL=redis://localhost:6379 \
COMPANION_GOOGLE_KEY="key" \
COMPANION_GOOGLE_SECRET="secret" \
nodemon --watch packages/@uppy/companion/src --exec node ./packages/@uppy/companion/src/standalone/start-server.js

companion2.sh: same as companion1.sh but with COMPANION_PORT=3022

  • Are you running the latest version of Companion?
  • Could you try to run my proxy and see if that works?
  • Could you provide more information about your setup to help me reproduce the problem?

@mifi (Contributor) commented Nov 21, 2022

We have now deployed a multi-instance setup of Companion on uppy.io (it runs behind Heroku's built-in SSL-terminating router / load balancer).

So when you upload a file there, all requests are distributed between two companion instances. This seems to be working nicely, so I consider it proof that multi-instance companion works as expected.

There were a couple of things I had to make sure of for it to work:

  • The two instances obviously need to be connected to the same redis server
  • You need to set COMPANION_SECRET to the same value on both servers (and optionally COMPANION_PREAUTH_SECRET if you use that); see the sketch below
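
Extending the companion1.sh style from my earlier comment, the shared part of both instances' configuration would look something like this (host names and secret values are placeholders):

COMPANION_REDIS_URL="redis://redis.internal:6379"  # must point at the same redis server on both instances
COMPANION_SECRET="one-shared-secret"               # must be identical on both instances
# COMPANION_PREAUTH_SECRET="also-shared"           # only if you use preauth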

I'm going to close this, but I can reopen it if people are still having trouble.
