
Delayed response from particular host (only Windows, no timeout) #1879

Closed · hheimbuerger opened this issue May 25, 2020 · 9 comments

Comments

@hheimbuerger

hheimbuerger commented May 25, 2020

I'm investigating a stalling issue I have with some downstream libraries of urllib3 around a very particular host: https://amsmediusw-ak.studios.ms/
Requesting a response with urllib3 1.25.9 on Windows takes over 40s, but then ultimately gives the expected result of a 200 and a GUID in the response body:

urllib3.PoolManager().request('GET', 'https://amsmediusw-ak.studios.ms/').data

The oddity is that various other means of requesting a response from this host return virtually immediately. Here's a subset of what I've tested:

  • Chrome on Windows (both as a regular page request and with fetch in the dev tools): response in under 50ms
  • cURL on Linux (via WSL1): fast
  • urllib3 on Linux (via Glitch): fast

Here's the kicker: the response is also immediate if I set a connection timeout, even though no timeout is actually triggered (at least, no exception is raised).

>>> timeit.timeit("urllib3.PoolManager().request('GET', 'https://amsmediusw-ak.studios.ms/',
                           timeout=urllib3.Timeout(connect=0.1)).data",
                  setup="import urllib3", number=1)
0.377045900000212
>>> timeit.timeit("urllib3.PoolManager().request('GET', 'https://amsmediusw-ak.studios.ms/').data",
                  setup="import urllib3", number=1)
42.16889579999997

(Setting a read timeout but no connection timeout has no effect (40+s).)

Reusing the pool manager also means subsequent responses are immediate. This seems like a major clue to what's going on here, but I can't tell what it means.
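Concretely, what I mean by reuse (a sketch of the measurement, not verified output): with a single shared PoolManager, only the first request pays the ~42s cost; subsequent requests ride the kept-alive pooled connection:

import time
import urllib3

pm = urllib3.PoolManager()
for i in range(3):
    start = time.perf_counter()
    pm.request('GET', 'https://amsmediusw-ak.studios.ms/').data
    # first iteration: ~42s; later iterations reuse the pooled connection
    print(i, time.perf_counter() - start)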

The timing appears very stable, between 42.1 and 42.2s with only minor fluctuations, which makes routing issues even less likely:

>>> timeit.repeat("urllib3.PoolManager().request('GET', 'https://amsmediusw-ak.studios.ms/').data", setup="import urllib3", number=1, repeat=5)
[42.18481859999997, 42.1579299, 42.24814900000001, 42.16824580000002, 42.18917329999999]

Making a request via PowerShell (Invoke-WebRequest https://amsmediusw-ak.studios.ms/ -UseBasicParsing) is also slow, indicating that this is rooted somewhere in the Windows sockets implementation, possibly some kind of default connection timeout?

@hodbn
Member

hodbn commented May 25, 2020

It works virtually immediately on my machine (Windows, urllib3 1.24.3).
Does this happen with HTTP? Can you post a packet capture with HTTP?

@hheimbuerger
Author

Yeah, same effect with HTTP.

But when I tested a few minutes ago, I also got immediate responses, and now it's back to 42s (on both HTTP and HTTPS), so I guess there's some kind of rate limiting going on. I just don't understand how Chrome is immune to it; apparently there's got to be a small difference in how I make my request.

Do you have any idea why setting a timeout might 'fix' the issue? Is the protocol-level request that gets sent in any way different if a timeout is enabled?

@sigmavirus24
Contributor

But when I tested a few minutes ago, I also got immediate responses, and now it's back to 42s (on both HTTP and HTTPS), so I guess there's some kind of rate limiting going on. I just don't understand how Chrome is immune to it; apparently there's got to be a small difference in how I make my request.

User-Agent strings probably, but it wouldn't explain why the connect timeout makes a difference. My best guess for why the connect timeout helps is:

There are multiple DNS entries, and browsers cache the working IP addresses that they've looked up and used previously. urllib3 looks up DNS each time, and the ordering of the DNS entries is non-deterministic, so we get "bad" addresses for that URL more often than not. The connect timeout means we don't spend too much time on those bad hosts.
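For context, urllib3's connection setup boils down to logic like the standard library's socket.create_connection: resolve the name, then try each returned address in order, with the connect timeout bounding each individual attempt. A simplified sketch (not urllib3's actual code):

import socket

def connect_first_working(host, port, timeout=None):
    # getaddrinfo may return several addresses (IPv4/IPv6, multiple A records)
    last_err = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, 0, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        try:
            if timeout is not None:
                sock.settimeout(timeout)  # bounds this one connect attempt
            sock.connect(sockaddr)
            return sock  # first address that answers wins
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err if last_err else OSError('getaddrinfo returned no results')

Under that theory, a blocking connect to a "bad" address early in the list stalls until the OS gives up, while a short connect timeout moves on to a working address almost immediately.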

@hheimbuerger
Author

User-Agent strings probably

I did send Chrome's User-Agent string; that's why I initially dropped down to the level of urllib3. It didn't seem to make a difference, but there might be some finer details I'm missing, and I then got distracted by the timeout 'fix'.

There are multiple DNS entries, and browsers cache the working IP addresses that they've looked up and used previously. urllib3 looks up DNS each time, and the ordering of the DNS entries is non-deterministic, so we get "bad" addresses for that URL more often than not. The connect timeout means we don't spend too much time on those bad hosts.

Interesting theory. Can you think of any experiments I should run to validate this, or break it down further?

I guess that would explain why Chrome isn't affected here, because IIRC it's hardcoded to use Google DNS rather than the system-configured DNS (currently 1.1.1.1).
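One experiment I could run to test that directly (a sketch, untested): resolve the name myself and time a bare TCP connect to each address the resolver returns. If the theory holds, some addresses should be consistently slow or unreachable:

import socket
import time

HOST, PORT = 'amsmediusw-ak.studios.ms', 443

for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
        HOST, PORT, 0, socket.SOCK_STREAM):
    sock = socket.socket(family, socktype, proto)
    start = time.perf_counter()
    try:
        sock.connect(sockaddr)
        outcome = 'connected'
    except OSError as err:
        outcome = f'failed ({err})'
    finally:
        sock.close()
    print(f'{sockaddr[0]}: {outcome} after {time.perf_counter() - start:.3f}s')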

@hheimbuerger
Author

hheimbuerger commented May 27, 2020

One more data point I didn't think about earlier, that I just tested:

The response time is not correlated to the length of the timeout set!

If I set a connection timeout of 0.2s, I get a sub-second total response. But if I set a connection timeout of 10s, I still get sub-second total response times.

My interpretation is that the timeout doesn't cause any internal process to actually time out (e.g. "spending time on a bad host in DNS rotation") and then let the system recover from it; rather, the very act of setting a timeout seems to affect the behavior of the system.

>>> def test(timeout): return timeit.repeat(f"urllib3.PoolManager().request('GET', 'http://amsmediusw-ak.studios.ms/', timeout=urllib3.Timeout(connect={timeout})).data", setup="import urllib3", number=1, repeat=5)
...
>>> test(timeout=0.2)
[0.05902539999806322, 0.055709000007482246, 0.07046360001550056, 0.07770219998201355, 0.06996609998168424]
>>> test(timeout=10)
[0.07713540000258945, 0.06717739999294281, 0.06995289999758825, 0.07092060000286438, 0.07774379997863434]
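That would be consistent with the timeout changing the code path rather than the deadline: as far as I can tell, CPython's settimeout() puts the socket into internal non-blocking mode (connect is then driven by select/poll), whereas the default of no timeout does a plain blocking connect, so the OS-level calls genuinely differ. A raw-socket version of the comparison would also show whether this is urllib3-specific at all (a sketch, untested):

import socket
import time

def timed_connect(host, port, timeout=None):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if timeout is not None:
        # switches CPython's socket into its non-blocking/poll code path
        sock.settimeout(timeout)
    start = time.perf_counter()
    try:
        sock.connect((host, port))
    finally:
        sock.close()
    return time.perf_counter() - start

print('blocking connect:    ', timed_connect('amsmediusw-ak.studios.ms', 80))
print('connect with timeout:', timed_connect('amsmediusw-ak.studios.ms', 80, timeout=10))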

@hodbn
Member

hodbn commented Jun 5, 2020

Can you post a packet capture of the two cases?

@hheimbuerger
Author

hheimbuerger commented Jun 6, 2020

Can you post a packet capture of the two cases?

That's a good idea, and I'm curious too. I'll look into it. Give me a few days, please.

@hheimbuerger
Author

hheimbuerger commented Jun 13, 2020

I have Wireshark set up for capturing now, but I cannot get into the (presumed) rate-limited state anymore. I've downloaded about 10 GiB from this host, but that didn't trigger it.

If anyone has an idea or better reproduction steps, please let me know. I'll also keep trying over the next few days (far-fetched, but maybe there's a day-of-the-week component?).

Otherwise, I'll close this issue in a couple of days.

@hheimbuerger
Author

Thanks to everyone for your suggestions! I'm really curious about this one and would like to capture and investigate further. Unfortunately, I'm currently unable to get the system into this supposed 'rate limited' state.

Therefore, I'm closing this issue for now. Should I ever manage to reproduce it, I'll reopen.
