Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise error when start_url found instead of start_urls. #4170

Merged
merged 2 commits into from Dec 5, 2019

Conversation

mabelvj
Copy link
Contributor

@mabelvj mabelvj commented Nov 18, 2019

Fixes #4133

Raise AttributeError error when empty 'start_urls' and 'start_url' found. Added test.

Copy link
Member

@Gallaecio Gallaecio left a comment

I would avoid including the start_url value in the response, as it may initially mislead the user to think the issue is in the value itself.

Also, the message says 'start_urls' list not found, which I’m not sure is accurate in all cases. Maybe it should say not found or empty? And maybe we should check if it’s empty instead of if it has non-empty values (any), because make_requests_from_url should already fail if there is a start_urls list with a value that evaluates to False.

scrapy/spiders/__init__.py Outdated Show resolved Hide resolved
scrapy/spiders/__init__.py Outdated Show resolved Hide resolved
scrapy/spiders/__init__.py Outdated Show resolved Hide resolved
@mabelvj
Copy link
Contributor Author

@mabelvj mabelvj commented Nov 20, 2019

I am having trouble with the tests since it seems we should start a crawling for getting the result from start_requests -> 96f7e97.

Maybe just checking when trying to set the variable could be a good solution. I added a modified version of __set_attr__ where we can for this condition and warn the user. This way also, if start_requests is redefined, this warning will keep. Similar warnings are used in other parts of the code

scrapy/scrapy/item.py

Lines 102 to 107 in 494f38a

def __setattr__(self, name, value):
if not name.startswith('_'):
raise AttributeError("Use item[%r] = %r to set field value" %
(name, value))
super(DictItem, self).__setattr__(name, value)

@mabelvj mabelvj force-pushed the 4133-handle-start_url branch from 6823ae7 to bd64173 Compare Nov 20, 2019
@codecov
Copy link

@codecov codecov bot commented Nov 20, 2019

Codecov Report

Merging #4170 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4170      +/-   ##
==========================================
- Coverage   83.75%   83.74%   -0.01%     
==========================================
  Files         165      165              
  Lines        9719     9721       +2     
  Branches     1445     1446       +1     
==========================================
+ Hits         8140     8141       +1     
  Misses       1332     1332              
- Partials      247      248       +1
Impacted Files Coverage Δ
scrapy/spiders/__init__.py 98.46% <100%> (+0.04%) ⬆️
scrapy/utils/trackref.py 83.78% <0%> (-2.71%) ⬇️

@mabelvj mabelvj force-pushed the 4133-handle-start_url branch 2 times, most recently from 2c6f076 to a86e69d Compare Nov 20, 2019
@mabelvj mabelvj requested review from Gallaecio and elacuesta Nov 20, 2019
scrapy/spiders/__init__.py Outdated Show resolved Hide resolved
@kmike
Copy link
Member

@kmike kmike commented Nov 20, 2019

Hey! There are real-world spiders which use start_url option, as a way to customize things (found them using our internal code search). Is there a way to distinguish between intended and unintended usage? It is not good to show a warning for a perfectly valid use case.

I wonder if a better fix would be to have a warning in the default start_requests implementation: if start_urls is None, but start_url is present, then show a warning. There won't be a warning if start_requests function is using start_url attribute in this case, as it'd be overridden.

It could make sense to check real spiders which use start_request attribute, to see how any change we make affect them.

@mabelvj
Copy link
Contributor Author

@mabelvj mabelvj commented Nov 21, 2019

You are right @kmike, that was one of my other options. Setting this warning in start_requests will allow overriding it when using start_url.

@mabelvj mabelvj force-pushed the 4133-handle-start_url branch from a86e69d to 96f7e97 Compare Nov 21, 2019
…'start_urls' and 'start_url' found. Added test.
@mabelvj mabelvj force-pushed the 4133-handle-start_url branch from 96f7e97 to 1718e45 Compare Nov 21, 2019
@kmike
Copy link
Member

@kmike kmike commented Dec 5, 2019

Thanks @mabelvj!

@kmike kmike merged commit 250da28 into scrapy:master Dec 5, 2019
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants