
Add note that "allowed_domains" should be a list of domains, not URLs #2250

Closed
redapple opened this issue Sep 15, 2016 · 11 comments
Closed

Add note that "allowed_domains" should be a list of domains, not URLs #2250

redapple opened this issue Sep 15, 2016 · 11 comments

Comments

@redapple
Contributor

redapple commented Sep 15, 2016

(just logging the issue before I forget)

It may seem obvious from the name of the attribute that allowed_domains is about domain names, but it's not uncommon for Scrapy users to make the mistake of writing allowed_domains = ['http://www.example.com'].

I believe it is worth adding a note in http://doc.scrapy.org/en/latest/topics/spiders.html?#scrapy.spiders.Spider.allowed_domains
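
For illustration, a minimal sketch of the wrong and the intended usage (MySpider and the domain are placeholders, not from the thread):

import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    # Wrong: entries must be bare domain names, not URLs
    # allowed_domains = ['http://www.example.com']
    # Right:
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # Requests to hosts outside www.example.com are filtered
        # by the offsite middleware.
        pass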

@eliasdorneles
Member

eliasdorneles commented Sep 15, 2016

What if, instead of simply documenting it, Scrapy detected this case and issued a warning?

Even better, it could extract the domain from the URL and use that, while issuing a warning like:

logging.warning("allowed_domains accepts only domains, not URLs. Using allowed_domains = %r", effective_allowed_domains)
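
A rough sketch of how such a check could work (check_allowed_domains is a hypothetical helper, not Scrapy's actual API):

import logging
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def check_allowed_domains(allowed_domains):
    # Replace URL-like entries with their netloc, warning if anything changed.
    effective_allowed_domains = []
    for entry in allowed_domains or []:
        parsed = urlparse(entry)
        if parsed.scheme and parsed.netloc:
            effective_allowed_domains.append(parsed.netloc)
        else:
            effective_allowed_domains.append(entry)
    if effective_allowed_domains != list(allowed_domains or []):
        logging.warning(
            "allowed_domains accepts only domains, not URLs. "
            "Using allowed_domains = %r", effective_allowed_domains)
    return effective_allowed_domains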

@redapple
Contributor Author

+1 to issuing a warning.

I'm less sure about inferring the domain: for http://www.example.com, for example, should it infer example.com or www.example.com?

@eliasdorneles
Member

I think the user expectation would be www.example.com.

I think it would be reasonable to use whatever urlparse parses as netloc:

>>> import urlparse
>>> urlparse.urlparse('http://www.example.com')
ParseResult(scheme='http', netloc='www.example.com', path='', params='', query='', fragment='')
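
(On Python 3 the equivalent lives in urllib.parse:)

>>> from urllib.parse import urlparse
>>> urlparse('http://www.example.com').netloc
'www.example.com'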

@redapple
Contributor Author

Are you ok with having the new behavior as a separate issue?

@eliasdorneles
Member

Sure, inferring a domain from a URL in allowed_domains looks like a separate feature indeed.

But I think detecting URLs in allowed_domains and warning users about them belongs to the same issue as documenting it (this one).
I mean, it isn't great to point users to the docs when we can give them instant feedback.

Makes sense?

@redapple
Contributor Author

It does

@redapple redapple added this to the v1.3 milestone Nov 16, 2016
@jlong49

jlong49 commented Jan 16, 2017

May I have this issue?

jlong49 pushed a commit to jlong49/scrapy that referenced this issue Jan 18, 2017
@redapple redapple modified the milestones: v1.5, v1.4 Mar 2, 2017
@SiddharthaAnand

Hi,
I am a beginner. Can I pick up this issue to contribute if no one else is looking into it?

@aitoehigie

Sent in a PR for this

@Jane222
Contributor

Jane222 commented Nov 12, 2017

Hey, seeing as the last pull request hasn't been accepted for quite some time, mind if I submit my own? This would be my first one...

lopuhin added a commit that referenced this issue Dec 8, 2017
[MRG+1] Issues a warning when user puts a URL into allowed_domains (#2250)
@dangra
Member

dangra commented Dec 12, 2017

Fixed by #3011

@dangra dangra closed this as completed Dec 12, 2017