[MRG+1] Adding more settings to project template #1073
# Configure a delay for requests for the same website
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
#DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:
nramirezuy
Mar 16, 2015
Contributor
Which one is the default used? 😄
nramirezuy
Mar 16, 2015
Contributor
CONCURRENT_REQUESTS is also missing.
eliasdorneles
Mar 16, 2015
Author
Member
thanks, you're right, lemme update this adding those too.
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = '$project_name (+http://www.yourdomain.com)'

# Configure a delay for requests for the same website
nramirezuy
Mar 16, 2015
Contributor
What about Autothrottle?
eliasdorneles
Mar 16, 2015
Author
Member
it's at the bottom.
nramirezuy
Mar 16, 2015
Contributor
Saw it, but isn't it worth mentioning here? Something like:
Configure a delay for requests (affected by Autothrottle)
Also, "for the same website" depends on the concurrent requests setting you choose; the default is website.
eliasdorneles
Mar 16, 2015
Author
Member
Well, the way I see it, DOWNLOAD_DELAY affects the Autothrottle extension, not the other way around. Also, Autothrottle is disabled by default.
Therefore, I don't think we need to call attention to it here.
Regarding the observation about "for the same website": agreed, that's why I put the two concurrent requests settings (per domain and per IP) right after it.
I was thinking that'd be enough, but I'm open to suggestions. Anything in mind?
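For readers following the thread, here is a rough sketch of the ordering described in the comment above: the delay first, then the two concurrency settings right after it. The values are illustrative only, and the note on per-IP behavior is an assumption based on Scrapy's settings documentation rather than on anything in this PR.

```python
# Sketch of the settings.py ordering discussed above; values are illustrative.

# Delay (in seconds) between requests to the same website.
DOWNLOAD_DELAY = 3

# The download delay honors only one of the two settings below.
CONCURRENT_REQUESTS_PER_DOMAIN = 16
# Assumption (from Scrapy's settings docs): a non-zero per-IP limit takes
# over, so the per-domain limit is ignored and the delay applies per IP.
CONCURRENT_REQUESTS_PER_IP = 0

# AutoThrottle stays off unless explicitly enabled.
AUTOTHROTTLE_ENABLED = False
```

Keeping the three settings adjacent lets the comment on the delay stay short while the following lines show what "website" resolves to.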
# DOWNLOADER_MIDDLEWARES = {
# '$project_name.middlewares.MyCustomDownloaderMiddleware': 543,
# }
nramirezuy
Mar 16, 2015
Contributor
No extensions or pipelines (even though I don't agree with item pipelines 👅)? :(
eliasdorneles
Mar 16, 2015
Author
Member
Right, I'll put those too. :)
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS=32

# Configure a delay for requests for the same website (default: 0)
nramirezuy
Mar 17, 2015
Contributor
"Configure a delay for requests for the same slot", and then they have to dig through the docs to find out what a slot is 👅
kmike
Mar 17, 2015
Member
I think "website" is fine - it could mean either "ip" or "domain" depending on settings. Slots are not documented, so investigation won't help :)
nramirezuy
Mar 17, 2015
Contributor
Oh, I guess "website" will have to do. It is also worded that way in the docs. It's a shame that slots aren't documented 😢
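To illustrate the point that "website" can mean either "ip" or "domain" depending on settings, here is a minimal sketch. The function `throttle_key` and its `resolve_ip` parameter are hypothetical names invented for this example; this is not Scrapy's internal slot logic, which the thread notes is undocumented.

```python
from urllib.parse import urlparse


def throttle_key(url, concurrent_requests_per_ip=0, resolve_ip=None):
    """Hypothetical helper: return the key that requests are grouped by.

    With a non-zero per-IP limit and a resolver available, group by IP;
    otherwise group by the hostname (the "domain" case).
    """
    host = urlparse(url).hostname or ""
    if concurrent_requests_per_ip and resolve_ip is not None:
        return resolve_ip(host)
    return host
```

For example, `throttle_key("http://example.com/page")` groups by `example.com`, while passing a non-zero limit and a resolver groups by whatever address the resolver returns.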
I've changed the detail settings for autothrottle and HTTP cache to the default values. It seems more sensible in these cases that have an on/off setting, because a user enabling all of them together would expect the defaults. Thanks @redapple for the heads up. ;)
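As a sketch of what "default values" looks like for these two groups, assuming Scrapy's documented defaults (the exact lines in the PR may differ): only the on/off switches deviate from the defaults, so enabling everything together behaves as a user would expect.

```python
# Assumed defaults, shown as live assignments for readability;
# in the template these lines would be commented out.
AUTOTHROTTLE_ENABLED = True      # the on/off switch a user would flip
AUTOTHROTTLE_START_DELAY = 5     # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60      # upper bound under high latency, in seconds
AUTOTHROTTLE_DEBUG = False       # no per-response throttling stats

HTTPCACHE_ENABLED = True         # the on/off switch a user would flip
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
```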
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = '$project_name (+http://www.yourdomain.com)'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS=32
kmike
Mar 19, 2015
Member
I think we should be consistent with comments vs commented out code - either write it like this everywhere:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS=32
or like this:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = '$project_name (+http://www.yourdomain.com)'
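One way to check the consistency asked for above is a small script that reports which styles of commented-out settings a template uses. This is only a sketch: the names `COMMENTED_SETTING` and `comment_styles` are made up for this example, and the regex assumes setting names are upper-case identifiers followed by `=`.

```python
import re

# Matches a commented-out assignment such as "#NAME=1" or "# NAME = 1".
# Prose comments do not match because they lack an upper-case NAME followed by "=".
COMMENTED_SETTING = re.compile(r"^#(\s*)[A-Z][A-Z0-9_]*\s*=")


def comment_styles(template_text):
    """Return the styles in use: 'tight' (#NAME) and/or 'spaced' (# NAME)."""
    styles = set()
    for line in template_text.splitlines():
        match = COMMENTED_SETTING.match(line)
        if match:
            styles.add("spaced" if match.group(1) else "tight")
    return styles
```

A template is consistent in this sense when the returned set has at most one element.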
eliasdorneles
Mar 19, 2015
Author
Member
It took me a few seconds to realize you meant the indentation. You're right, I'll fix it :)
eliasdorneles
Mar 19, 2015
Author
Member
done
I don't have an opinion on which settings we should put in the default template. For the existing set of settings the PR looks good.
Hi, folks!
Here is a proposal for addressing issue #665.
I changed some wording, and added a few more settings that I felt were important or commonly missed.
So, what do you think? Does this look good?
Thank you,
Elias