
graceful shutdown - preliminary attempt #415

Closed
wants to merge 4 commits into from

Conversation

@ghost commented Aug 20, 2020

As requested, here is my PR to resolve #395

I tested this last night and the behavior is as follows:

  • during an upload, the server waits for the active handler to complete (so a chunk can be written) before shutting down
  • if tusd is then started back up (and I resume the upload), it picks back up where it left off

In my testing, I killed tusd with an active upload at 34%, started tusd back up, and it picked right back up at 35% like nothing had happened. Then it completed successfully. This was just a small test, but it showed exactly the desired/expected behavior, I think?

Main questions: what do you think about the TODO markers, the shutdown timeout, etc.? I'm guessing these should become CLI flags.
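For illustration, a minimal sketch of how such a flag could be wired up with Go's standard flag package; the package name, flag name, and default value are placeholders, not part of this PR or of tusd's actual CLI:

```go
package cli

import (
	"flag"
	"time"
)

// ShutdownTimeout would bound how long tusd waits for in-flight requests
// to finish during a graceful shutdown. Name and default are illustrative.
var ShutdownTimeout time.Duration

func init() {
	flag.DurationVar(&ShutdownTimeout, "shutdown-timeout", 10*time.Second,
		"Maximum time to wait for in-flight requests before giving up on a graceful shutdown")
}
```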

@ghost mentioned this pull request Aug 20, 2020

@Acconut (Member) left a comment

Thank you very much for this interesting PR! I have left a few comments for further discussion.

Inline review comments on cmd/tusd/cli/serve.go (4 threads, 2 marked outdated).
@Acconut (Member) commented Aug 26, 2020

I left two comments on your changes. Please check them out :)

On another topic: How does this behave with long-running requests (e.g. a PATCH request transferring multiple GBs for hours)? Is this request interrupted when the shutdown begins, or only after the timeout has been reached? Is the request interrupted "nicely" by closing the request body, or violently by killing the TCP connection? Did you try that out, and can you shed some light on it?

@ghost (Author) commented Aug 28, 2020

> I left two comments on your changes. Please check them out :)
>
> On another topic: How does this behave with long-running requests (e.g. a PATCH request transferring multiple GBs for hours)? Is this request interrupted when the shutdown begins, or only after the timeout has been reached? Is the request interrupted "nicely" by closing the request body, or violently by killing the TCP connection? Did you try that out, and can you shed some light on it?

The only test I performed was the one originally specified in the PR description. The data set was sub-gigabyte (edit: forgot to mention, ~500 MB) and the test was run locally.

The request is never interrupted. Shutdown fails if in-flight requests are not completed prior to the shutdown's context timeout (see my other comments.)

@Acconut (Member) commented Sep 7, 2020

> The request is never interrupted. Shutdown fails if in-flight requests are not completed prior to the shutdown's context timeout

Does that mean that the server cannot shut down while requests are still open to it? I think that this would be a good decision. I would expect tusd, once asked to shut down, to wait a few seconds for all requests to end while not accepting new requests. If the shutdown timeout is reached, it should end all still-active requests by force. Is this behavior implemented by your PR, or am I missing something?

@ghost (Author) commented Sep 8, 2020

> The request is never interrupted. Shutdown fails if in-flight requests are not completed prior to the shutdown's context timeout
>
> Does that mean that the server cannot shut down while requests are still open to it? I think that this would be a good decision. I would expect tusd, once asked to shut down, to wait a few seconds for all requests to end while not accepting new requests. If the shutdown timeout is reached, it should end all still-active requests by force. Is this behavior implemented by your PR, or am I missing something?

This is not how Go's built-in Shutdown API works. Please see the documentation for the best clarification on this matter. Active requests will not be ended by force if the timeout is reached - that would defeat the purpose of graceful shutdown.
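For reference, a minimal sketch of the stock net/http shutdown behavior being described here (this is not the code from this PR; the address, timeout, and logging are illustrative):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		// ListenAndServe returns http.ErrServerClosed once Shutdown has been called.
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Block until SIGINT or SIGTERM arrives.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
	<-sig

	// Shutdown stops accepting new connections and waits for in-flight
	// requests to finish. It does NOT force-close active requests: if the
	// context expires first, it simply returns the context's error.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		// One could call srv.Close() here to drop the remaining connections,
		// but that is exactly the forceful behavior this PR avoids.
		log.Printf("graceful shutdown did not complete in time: %v", err)
	}
}
```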

@Acconut (Member) commented Sep 10, 2020

> Active requests will not be ended by force if the timeout is reached - that would defeat the purpose of graceful shutdown.

Ok, that's good to know. However, it's not optimal if an active request (which might be open for hours) prevents tusd from shutting down. Does your PR already handle that in some way? The usage of run.Group makes those questions very hard for me to answer by looking at the code.

@ghost (Author) commented Sep 14, 2020

> Active requests will not be ended by force if the timeout is reached - that would defeat the purpose of graceful shutdown.
>
> Ok, that's good to know. However, it's not optimal if an active request (which might be open for hours) prevents tusd from shutting down. Does your PR already handle that in some way? The usage of run.Group makes those questions very hard for me to answer by looking at the code.

Remember that part where I was setting ReadTimeout and WriteTimeout and you mentioned you already have timeout handlers in place? What do those timeout handlers do and/or can you point out where they are in the code?

Thanks,
Ryan

@ghost closed this Sep 14, 2020

@ghost (Author) commented Sep 14, 2020

Closed by accident. No coffee yet this morning, hahaha.

@ghost reopened this Sep 14, 2020

@ghost (Author) commented Sep 17, 2020

Hi again,

Just following up. I see ReadTimeout and WriteTimeout in the net.Listener code that tusd uses. This answers your last question, right? There's nothing to do on my end here?

@Acconut (Member) commented Sep 18, 2020

> Remember that part where I was setting ReadTimeout and WriteTimeout and you mentioned you already have timeout handlers in place? What do those timeout handlers do and/or can you point out where they are in the code?

Our timeout handling is located in https://github.com/tus/tusd/blob/master/cmd/tusd/cli/listener.go and its purpose is to reset the read/write timeouts whenever a read/write operation occurs. Go does not support that kind of timeout on its own, so we had to build it ourselves. Let me know if you need a more in-depth explanation.
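For anyone reading along, a simplified sketch of that general approach (this is not the actual listener.go code; the package, type names, and structure are illustrative):

```go
package timeoutconn

import (
	"net"
	"time"
)

// deadlineConn pushes the read/write deadlines forward on every operation,
// so the timeout applies to idle time rather than to the total request duration.
type deadlineConn struct {
	net.Conn
	readTimeout  time.Duration
	writeTimeout time.Duration
}

func (c *deadlineConn) Read(b []byte) (int, error) {
	if c.readTimeout > 0 {
		c.Conn.SetReadDeadline(time.Now().Add(c.readTimeout))
	}
	return c.Conn.Read(b)
}

func (c *deadlineConn) Write(b []byte) (int, error) {
	if c.writeTimeout > 0 {
		c.Conn.SetWriteDeadline(time.Now().Add(c.writeTimeout))
	}
	return c.Conn.Write(b)
}

// Listener wraps Accept so every accepted connection gets the rolling deadlines.
type Listener struct {
	net.Listener
	ReadTimeout  time.Duration
	WriteTimeout time.Duration
}

func (l *Listener) Accept() (net.Conn, error) {
	conn, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	return &deadlineConn{Conn: conn, readTimeout: l.ReadTimeout, writeTimeout: l.WriteTimeout}, nil
}
```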

> This answers your last question, right? There's nothing to do on my end here?

To be honest, I am still very much confused about how this patch behaves when requests are still open. Even with your explanations I am not able to understand it from just looking at the code. I would have to try it on my own and see how it works.

Also, there is still the usage of run.Group. You said that you are not interested in removing it (#415 (comment)), which is absolutely within your rights. It will take some time until I can tackle this on my own.

All in all, this PR is not in a state where I can confidently merge it into master since it has a lot of potential for breakage. I'm sorry that all your efforts end in such an unsatisfying situation.

@ghost (Author) commented Sep 22, 2020

That's fair. Let me see if I can come up with something else on my end to try to address some of the issues remaining. If I can come up with a solid explanation for how the run.Group functionality works, would you be comfortable with merging based on that? Or are there other blockers?

I think there's a reason why rewriting it away from oklog/run may be less than beneficial, but my research will hopefully figure that out. I don't want to be attached to it unnecessarily either if it's introducing additional complexity. Time is tight on my end, though, which is why I'm hoping to avoid rewriting the patch if I can.
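For context, the typical oklog/run pattern for pairing an HTTP server with signal handling looks roughly like this; it's a sketch of the general library usage, not the code in this PR, and the address and timeout are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/oklog/run"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	var g run.Group

	// Actor 1: the HTTP server. Its interrupt function performs the
	// graceful Shutdown once any other actor returns.
	g.Add(func() error {
		return srv.ListenAndServe()
	}, func(error) {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		srv.Shutdown(ctx)
	})

	// Actor 2: signal handling. When SIGINT/SIGTERM arrives, its execute
	// function returns, which causes run.Group to interrupt all actors.
	cancelCh := make(chan struct{})
	g.Add(func() error {
		sig := make(chan os.Signal, 1)
		signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
		select {
		case s := <-sig:
			return fmt.Errorf("received signal %v", s)
		case <-cancelCh:
			return nil
		}
	}, func(error) {
		close(cancelCh)
	})

	// Run blocks until the first actor returns, then interrupts the rest
	// and waits for them to finish.
	if err := g.Run(); err != nil {
		fmt.Println("exiting:", err)
	}
}
```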

@Acconut (Member) commented Sep 29, 2020

> That's fair. Let me see if I can come up with something else on my end to try to address some of the issues remaining.

That's great, thank you for understanding our point of view as well :)

> If I can come up with a solid explanation for how the run.Group functionality works, would you be comfortable with merging based on that?

The chances are low, I would say. The best code is still the one which does not need explanation for a reasonably experienced Go developer. If it's obvious that the code without run.Group is more complex, I am happy to accept it as a dependency. But unless that has been shown, I would want to avoid it.

> I think there's a reason why rewriting it away from oklog/run may be less than beneficial, but my research will hopefully figure that out.

If there is such a reason, I am open to accept run.Group :)

@ssikyou commented Apr 27, 2021

Any update on this feature?

@Acconut (Member) commented Apr 28, 2021

> Any update on this feature?

Sadly no. This PR is still missing a feature where currently open upload requests get gracefully closed down (in addition to the already implemented stop of accepting new upload requests). Also, I disagree with the author about the use of a dependency, and they clearly indicated that they are not interested in updating the patch (which they have every right to do).

However, maybe in the near future I will attempt to implement this on my own, as gracefully closing open requests is a somewhat hairy thing to implement without more knowledge about the internals of tusd.

@ghost closed this by deleting the head repository Mar 16, 2023
This pull request was closed.
Successfully merging this pull request may close these issues: Gracefully terminate on SIGTERM.