Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

first_byte_timeout is ignored when re-using a backend connection (HTTP Keep Alive) #1772

Closed
bsdphk opened this issue Mar 4, 2016 · 3 comments · Fixed by #2504
Closed

first_byte_timeout is ignored when re-using a backend connection (HTTP Keep Alive) #1772

bsdphk opened this issue Mar 4, 2016 · 3 comments · Fixed by #2504

Comments

@bsdphk
Copy link
Contributor

bsdphk commented Mar 4, 2016

Old ticket imported from Trac
See archived copy here: https://varnish-cache.org/trac/ticket/1772

@mbgrydeland
Copy link
Contributor

I've been pondering this and #1806 a bit.

To me it seems that the base problem relates to the inability to forcefully remove a vbc from the waiter code. We used to have that option in the API, but it proved racy and was removed at some point during the 4.1 development cycle.

To fix the race issue, reused vbc's would start off in a STOLEN state. STOLEN here meaning that the waiter has the fd, and has not yet released it from it's in kernel event monitoring. The VBT_Wait function would wait for the waiter to release it, and the fetch code would take care to call VBT_Wait before starting to read or poll on the socket. But for the waiter release to happen, some event would need to register on the fd. The event could either be incoming bytes, a remote hangup or the waiter implemented timeout. The waiter timeout set for vbc's is backend_idle_timeout (default 60s). Lacking any incoming bytes or hangup, this translates to the first_byte_timeout not being used, but instead a timeout up to backend_idle_timeout is used.

To remedy this I suggest taking a slightly different approach. Instead of forcing a waiter exit first and then doing the reads on the vbc, we should use the waiter to do the first read timeout instead. To do this, VBT_Wait would go away, and be replaced with a VBT_Read taking a read timeout as an argument and is tasked with both reading bytes and implementing the read timeout. Fetch code would then call VBT_Read to read bytes, providing first_byte_timeout on the first call, and between_bytes_timeout on subsequent calls.

VBT_Read would upon finding that the vbc is still on the waiter, call into the waiter API to change the timeout. This would require changes to the waiter API, but looks doable (would need to trigger a loop in the waiter thread through the control socket like when inserting new entries). VBT_Read then does a pthread_cond_wait like before waiting for the waiter event. The waiter event would remove the fd from its set, so this slower path would only happen once. When the vbc is not on the waiter, a poll with timeout is done instead.

The waiter API for changing the timeout would have to be able to return failure. This could happen if the call to change the timeout happens at the same time as a socket event happens. The calling code in VBT_Read would need to handle that.

It is my belief that these changes would also fix #1806.

Martin

@bsdphk
Copy link
Contributor Author

bsdphk commented Mar 18, 2016

Yeah, I've had thoughts along the same lines myself, but it's more complexity than I like...

fgaillot added a commit to fgaillot/varnish-cache that referenced this issue Jun 16, 2017
This avoids getting stuck with a stalled backend connection until
reaching backend_idle_timeout.

We now wait for the waiter to release the fd for a limited amount of
time (first_byte_timeout). In case we reach the timeout, the backend
connection is killed to make sure we are not going to reuse it.

Fixes varnishcache#1772.
@fgaillot
Copy link

Hi,

I got into the same issue describe above. I'm not familiar with varnish code but I gave it a try and a wrote a patch for this. I tried a different approach from the one proposed by @mbgrydeland: the VBT_Wait() function now takes a timeout argument (set to first_byte_timeout) to abort before reaching backend_idle_timeout. If the timeout is reached, the backend connection is then killed to avoid reusing it.

This approach looks simpler but may not be convenient for some reason I didn't see. Please fell free to give any feedbacks or remarks on this patch.

See #2347.

Thx.

PS: I tried this patch on 5.1 and 4.1 branch only, but the check-pick on master applies without conflicting. make check is successful on both branches.

daghf added a commit to daghf/varnish-cache that referenced this issue Nov 24, 2017
daghf added a commit to daghf/varnish-cache that referenced this issue Nov 24, 2017
daghf added a commit to daghf/varnish-cache that referenced this issue Nov 24, 2017
daghf added a commit to daghf/varnish-cache that referenced this issue Nov 24, 2017
daghf added a commit to daghf/varnish-cache that referenced this issue Nov 24, 2017
daghf added a commit to daghf/varnish-cache that referenced this issue Nov 26, 2017
dmatetelki pushed a commit to dmatetelki/varnish-cache that referenced this issue Mar 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants