Fix handling of None in allowed_domains #4410

Merged: 1 commit into scrapy:master on Mar 11, 2020

Conversation

@Lukas0907 (Contributor) commented on Mar 7, 2020

allowed_domains = [None] leads to the following exception:

2020-03-07 20:40:02 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method OffsiteMiddleware.spider_opened of <scrapy.spidermiddlewares.offsite.OffsiteMiddleware object at 0x7f57c2a22d60>>
Traceback (most recent call last):
  File "/home/lukas/Projects/3rd/scrapy/scrapy/utils/defer.py", line 161, in maybeDeferred_coro
    result = f(*args, **kw)
  File "/home/lukas/.pyenv/versions/3.8.2/envs/scrapy/lib/python3.8/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/lukas/Projects/3rd/scrapy/scrapy/spidermiddlewares/offsite.py", line 67, in spider_opened
    self.host_regex = self.get_host_regex(spider)
  File "/home/lukas/Projects/3rd/scrapy/scrapy/spidermiddlewares/offsite.py", line 58, in get_host_regex
    if url_pattern.match(domain):
TypeError: expected string or bytes-like object

However, None entries in allowed_domains ought to be ignored, and there are tests for that scenario. This commit fixes the handling of None and also repairs the accompanying tests so that they are executed again.
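
For context, a minimal sketch of the shape such a fix can take in OffsiteMiddleware.get_host_regex (the method named in the traceback above). This is illustrative only, not the exact patch; the warning text and the regexes are assumptions:

    import re
    import warnings

    def get_host_regex(self, spider):
        # Method of OffsiteMiddleware: builds the offsite-filtering
        # regex from spider.allowed_domains.
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # no restriction: allow everything
        url_pattern = re.compile(r'^https?://.*$')
        domains = []
        for domain in allowed_domains:
            if domain is None:
                continue  # the fix: skip None instead of regex-matching it
            elif url_pattern.match(domain):
                warnings.warn("allowed_domains accepts domains, not URLs; "
                              "ignoring entry %r" % domain)
            else:
                domains.append(re.escape(domain))
        return re.compile(r'^(.*\.)?(%s)$' % '|'.join(domains))

With the None guard in place, a spider declaring allowed_domains = [None, 'scrapytest.org'] is treated the same as one declaring only 'scrapytest.org'.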

@Lukas0907 changed the title from "Fix handling of None in allowed_domains." to "Fix handling of None in allowed_domains" on Mar 7, 2020
@codecov bot commented on Mar 7, 2020

Codecov Report

Merging #4410 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4410      +/-   ##
==========================================
- Coverage   84.56%   84.56%   -0.01%     
==========================================
  Files         166      166              
  Lines        9869     9872       +3     
  Branches     1467     1467              
==========================================
+ Hits         8346     8348       +2     
  Misses       1268     1268              
- Partials      255      256       +1
Impacted Files                         Coverage       Δ
scrapy/spidermiddlewares/offsite.py    100% <100%>    (ø) ⬆️
scrapy/utils/trackref.py               82.85% <0%>    (-2.86%) ⬇️

 class TestOffsiteMiddleware3(TestOffsiteMiddleware2):

-    def _get_spider(self):
-        return Spider('foo')
+    def _get_spiderargs(self):
+        return dict(name='foo')

@nyov (Contributor) commented on Mar 8, 2020
Just an aside. Does this class even do anything? Or could we clean this up?

@Lukas0907 (Contributor, Author) replied on Mar 8, 2020

This is already the clean up! :)

The purpose of the class TestOffsiteMiddleware3 was to test that not setting the allowed_domains attribute at all in a spider works. However, due to a refactoring, the test was effectively a no-op because _get_spider() was never executed. My change should fix the test so it actually tests that use case again.
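
To make the mechanics concrete, here is a hypothetical sketch (the base class shown is illustrative, not the actual test file): when the shared setUp builds the spider exclusively from _get_spiderargs(), an override of the old _get_spider() hook is dead code.

    from unittest import TestCase

    class Spider:
        # Minimal stand-in for scrapy.Spider, for illustration only.
        def __init__(self, name, **kwargs):
            self.name = name
            self.__dict__.update(kwargs)

    class TestOffsiteMiddleware(TestCase):
        def setUp(self):
            # The refactored base class builds its spider from this hook...
            self.spider = Spider(**self._get_spiderargs())

        def _get_spiderargs(self):
            return dict(name='foo', allowed_domains=['scrapytest.org'])

        def test_spider_created(self):
            self.assertEqual(self.spider.name, 'foo')

    class TestOffsiteMiddleware3(TestOffsiteMiddleware):
        # ...so overriding the old hook name changes nothing:
        def _get_spider(self):  # never called by setUp() -- dead code
            return Spider('foo')

        # Overriding the hook actually in use restores the intent of
        # testing a spider with no allowed_domains attribute at all:
        def _get_spiderargs(self):
            return dict(name='foo')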

@nyov (Contributor) replied on Mar 8, 2020

Ah, of course. I blanked out the subclassing of TestOffsiteMiddleware2 somehow. Looks good.

@nyov (Contributor) commented on Mar 8, 2020

Could you add a testcase for the allowed_domains=[None] case? Otherwise I don't see what you mean by fixing the tests. :P

@nyov approved these changes on Mar 8, 2020, leaving a review comment:

Looks like a clean bugfix. 👍

@Lukas0907 (Contributor, Author) commented on Mar 8, 2020

> Could you add a testcase for the allowed_domains=[None] case? Otherwise I don't see what you mean by fixing the tests. :P

The test case for allowed_domains=[None] was basically already there; it was just never executed, which is why nobody noticed that it was broken. I changed it slightly, and now it should work again: https://github.com/scrapy/scrapy/pull/4410/files#diff-b7fe032065662610160b880a6d821becR62-R72
(Note the naming: _get_spider() vs. _get_spiderargs().)
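
For illustration, a hedged guess at what the restored case can look like (the class name TestOffsiteMiddleware4 and the exact spider arguments are assumptions; the real change is in the diff linked above):

    class TestOffsiteMiddleware4(TestOffsiteMiddleware3):
        # Hypothetical reconstruction: a None entry mixed into
        # allowed_domains must be ignored by get_host_regex()
        # instead of raising the TypeError from the description.
        def _get_spiderargs(self):
            return dict(name='foo',
                        allowed_domains=['scrapytest.org', None])

Because it overrides _get_spiderargs(), the hook the base class actually calls, the case is executed again rather than silently skipped.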

@wRAR approved these changes on Mar 11, 2020 and merged commit 2eb990a into scrapy:master. 2 checks passed.