Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Enforce DNS resolution timeout #2496

Merged
merged 1 commit into from Feb 2, 2017
Merged

Conversation

@redapple
Copy link
Contributor

@redapple redapple commented Jan 13, 2017

Fixes #2461

As @rolando noticed, DNS_TIMEOUT setting was never actually used by Scrapy with 14<=Twisted<=16.6, because a (1, 3, 11, 45) tuple was always passed to the DNS resolver by Twisted internally:

Successfully installed twisted-14.0.2
$ scrapy fetch --headers https://www.example.com -s DNS_TIMEOUT=15
2017-01-13 16:06:33 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
2017-01-13 16:06:33 [scrapy.utils.log] INFO: Overridden settings: {'DNS_TIMEOUT': '15'}
(...)
CachingThreadedResolver.getHostByName(name='www.example.com', timeout=(1, 3, 11, 45))
2017-01-13 16:06:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com> (referer: None)


Successfully installed twisted-15.5.0
$ scrapy fetch --headers https://www.example.com -s DNS_TIMEOUT=15
2017-01-13 16:07:05 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
2017-01-13 16:07:05 [scrapy.utils.log] INFO: Overridden settings: {'DNS_TIMEOUT': '15'}
(...)
CachingThreadedResolver.getHostByName(name='www.example.com', timeout=(1, 3, 11, 45))
2017-01-13 16:07:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com> (referer: None)


Successfully installed twisted-16.6.0
$ scrapy fetch --headers https://www.example.com -s DNS_TIMEOUT=15
2017-01-13 16:07:20 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
2017-01-13 16:07:20 [scrapy.utils.log] INFO: Overridden settings: {'DNS_TIMEOUT': '15'}
(...)
CachingThreadedResolver.getHostByName(name='www.example.com', timeout=(1, 3, 11, 45))
2017-01-13 16:07:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com> (referer: None)

And Twisted 16.7 actually stops passing this default value, but a tuple (or list) of ints is expected.

@redapple
Copy link
Contributor Author

@redapple redapple commented Jan 13, 2017

This is worth a backport on 1.0, 1.1 and 1.2 branches.

@codecov-io
Copy link

@codecov-io codecov-io commented Jan 13, 2017

Codecov Report

Merging #2496 into master will increase coverage by 0.02%.

@@            Coverage Diff             @@
##           master    #2496      +/-   ##
==========================================
+ Coverage   83.46%   83.48%   +0.02%     
==========================================
  Files         161      161              
  Lines        8780     8779       -1     
  Branches     1288     1287       -1     
==========================================
+ Hits         7328     7329       +1     
+ Misses       1204     1203       -1     
+ Partials      248      247       -1
Impacted Files Coverage Δ
scrapy/resolver.py 89.47% <100%> (+9.47%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ca191e...d2e9ea0. Read the comment docs.

@redapple redapple changed the title Enforce DNS resolution timeout [MRG] Enforce DNS resolution timeout Jan 31, 2017
@@ -16,8 +16,7 @@ def __init__(self, reactor, cache_size, timeout):
def getHostByName(self, name, timeout=None):
if name in dnscache:
return defer.succeed(dnscache[name])
if not timeout:
timeout = self.timeout
timeout = (self.timeout,)

This comment has been minimized.

@dangra

dangra Jan 31, 2017
Member

it means getHostByName callers can't not override timeout which is acceptable but it is possible for self.timeout to be a tuple and convert it only if not. This way we don't lost access to existent functionality provided by t.i.b.ThreadedResolver.

This comment has been minimized.

@redapple

redapple Jan 31, 2017
Author Contributor

Sorry, I don't understand.

This comment has been minimized.

@dangra

dangra Jan 31, 2017
Member

np, it's me being obfuscated :)

the question is: Can DNS_TIMEOUT be a tuple?

This comment has been minimized.

@dangra

dangra Jan 31, 2017
Member

the question is: Can DNS_TIMEOUT be at tuple?

I guess not without some other major changes at

timeout=self.settings.getfloat('DNS_TIMEOUT')

This comment has been minimized.

@dangra

dangra Jan 31, 2017
Member

Do I understand that you would like to allow users to pass a tuple of ints/floats as DNS_TIMEOUT setting value?

Yes, that was it.

This comment has been minimized.

@kmike

kmike Jan 31, 2017
Member

@redapple could you please add a comment above this line which explains why timeout argument is ignored?

This comment has been minimized.

@redapple

redapple Jan 31, 2017
Author Contributor

@kmike , done.

@dangra
Copy link
Member

@dangra dangra commented Jan 31, 2017

LGTM.

@redapple redapple force-pushed the redapple:dns-resolver-timeout branch from 1857b4e to d2e9ea0 Jan 31, 2017
@sibiryakov
Copy link
Contributor

@sibiryakov sibiryakov commented Feb 2, 2017

I would vote for accepting tuple or int/float by Scrapy. There are still desperate people on this planet who tries to do broad crawling and they could benefit from this functionality. Cutting it off, you also cut DNS retries management. Someday DNS providers will thank you for having this option.

@redapple
Copy link
Contributor Author

@redapple redapple commented Feb 2, 2017

@sibiryakov , I do not understand why accepting tuple helps with broad crawls (honest question).
This patch is about forcing the DNS resolution timeout to the value set in scrapy settings (prior, only Twisted's default 60s was being used).
Other changes to DNS resolution will probably be based on the new pluggable resolver interface in Twisted 17.

@redapple
Copy link
Contributor Author

@redapple redapple commented Feb 2, 2017

@sibiryakov , are you ok with opening a seperate new issue to discuss DNS resolution implementation and configuration options that would play nicer for broad crawls?

@sibiryakov
Copy link
Contributor

@sibiryakov sibiryakov commented Feb 2, 2017

sure @redapple

@kmike kmike merged commit 245287f into scrapy:master Feb 2, 2017
3 checks passed
3 checks passed
codecov/patch 100% of diff hit (target 83.46%)
Details
codecov/project 83.48% (+0.02%) compared to 4ca191e
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@redapple redapple deleted the redapple:dns-resolver-timeout branch Feb 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

5 participants