
[MRG+1] Adding more settings to project template #1073

Merged: 4 commits into scrapy:master on Mar 27, 2015

Conversation

@eliasdorneles (Member) commented Mar 14, 2015

Hi, folks!

Here is a proposal for addressing issue #665.
I changed some wording, and added a few more settings that I felt were important or commonly missed.

So, what do you think? Does this look good?

Thank you,
Elias

# Configure a delay for requests for the same website
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
#DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:

@nramirezuy (Contributor), Mar 16, 2015

Which one is used by default? 😄

@nramirezuy (Contributor), Mar 16, 2015

CONCURRENT_REQUESTS is also missing.

@eliasdorneles (Author, Member), Mar 16, 2015

Thanks, you're right, let me update this to add those too.
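
For reference, a minimal sketch of what adding CONCURRENT_REQUESTS next to the delay block could look like in the generated settings.py. CONCURRENT_REQUESTS is a real Scrapy setting whose default is 16; the example value and comment wording here are illustrative, not necessarily the exact lines that landed in the template:

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32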

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = '$project_name (+http://www.yourdomain.com)'


# Configure a delay for requests for the same website

@nramirezuy (Contributor), Mar 16, 2015

What about Autothrottle?

@eliasdorneles (Author, Member), Mar 16, 2015

it's at the bottom.

@nramirezuy (Contributor), Mar 16, 2015

Saw it, but isn't it worth a mention? Something like: "Configure a delay for requests (affected by Autothrottle)".

Also, "for the same website" depends on which concurrent requests settings you choose; the default is per website.

@eliasdorneles (Author, Member), Mar 16, 2015

Well, the way I see it, DOWNLOAD_DELAY affects the Autothrottle extension, not the other way around. Also, Autothrottle is disabled by default.
Therefore, I don't think we need to call attention to it here.

Regarding the observation about "for the same website" -- agreed, that's why I put the two concurrent requests per domain and per IP settings right after it.
I was thinking that would be enough, but I'm open to suggestions. Anything in mind?
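
A sketch of the per-domain/per-IP pair being referred to, as it might appear right after the delay setting. CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP are real Scrapy settings (their actual defaults are 8 and 0 respectively); the values shown are just examples, not necessarily the merged template's exact lines:

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16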

# DOWNLOADER_MIDDLEWARES = {
# '$project_name.middlewares.MyCustomDownloaderMiddleware': 543,
# }

@nramirezuy (Contributor), Mar 16, 2015

No extensions or pipelines (even though I don't agree with item pipelines 👅)? :(

@eliasdorneles (Author, Member), Mar 16, 2015

Right, I'll put those too. :)
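
A rough sketch of what the extensions and item pipelines entries could look like once added. SPIDER_MIDDLEWARES, EXTENSIONS and ITEM_PIPELINES are real Scrapy settings and the doc links follow the pattern already used in the template, but the class paths and priority numbers below are placeholders for illustration, not the exact contents of the merged template:

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    '$project_name.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    '$project_name.extensions.MyCustomExtension': 500,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    '$project_name.pipelines.SomePipeline': 300,
#}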

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS=32

# Configure a delay for requests for the same website (default: 0)

@nramirezuy (Contributor), Mar 17, 2015

"Configure a delay for requests for the same slot" -- and then they'd have to go digging in the docs to find out what a slot is 👅

@kmike (Member), Mar 17, 2015

I think "website" is fine - it could mean either "ip" or "domain" depending on settings. Slots are not documented, so investigation won't help :)

@nramirezuy (Contributor), Mar 17, 2015

Oh, I guess "website" will have to do. It's also worded that way in the docs. It's a shame that slots aren't documented 😢
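
To make the "website" wording concrete, a small illustration of what the delay ends up keyed on. This reflects documented Scrapy behaviour (the delay is enforced per domain by default, and per IP when CONCURRENT_REQUESTS_PER_IP is non-zero); the numbers are just examples:

# Default behaviour: CONCURRENT_REQUESTS_PER_IP is 0, so the delay applies per domain
#DOWNLOAD_DELAY = 3
#CONCURRENT_REQUESTS_PER_DOMAIN = 16

# With a non-zero CONCURRENT_REQUESTS_PER_IP, the delay applies per IP instead
#DOWNLOAD_DELAY = 3
#CONCURRENT_REQUESTS_PER_IP = 16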

@eliasdorneles (Author, Member) commented Mar 17, 2015

I've changed the detailed settings for Autothrottle and the HTTP cache to their default values.

It seems more sensible in these cases that have an on/off setting, because a user enabling all of them together would expect the defaults. Thanks @redapple for the heads-up. ;)
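
A sketch of what the Autothrottle and HTTP cache blocks might look like with the detail settings shown at their defaults. AUTOTHROTTLE_* and HTTPCACHE_* are real Scrapy settings and the values below match their documented defaults to the best of my knowledge; the comment wording is illustrative rather than the template's exact text:

# Enable and configure the AutoThrottle extension (disabled by default)
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# Enable showing throttling stats for every response received
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []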

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = '$project_name (+http://www.yourdomain.com)'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS=32

@kmike (Member), Mar 19, 2015

I think we should be consistent between comments and commented-out code -- either write it like this everywhere:

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS=32

or like this:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = '$project_name (+http://www.yourdomain.com)'

@eliasdorneles (Author, Member), Mar 19, 2015

It took me a few seconds to realize you meant the indentation -- you're right, I'll fix it :)

@eliasdorneles (Author, Member), Mar 19, 2015

done

@kmike (Member) commented Mar 23, 2015

I don't have an opinion on which settings we should put in the default template. For the existing set of settings, the PR looks good.

@kmike changed the title from "Adding more settings to project template" to "[MRG+1] Adding more settings to project template" on Mar 26, 2015
nramirezuy added a commit that referenced this pull request on Mar 27, 2015: [MRG+1] Adding more settings to project template
@nramirezuy merged commit a3702e7 into scrapy:master on Mar 27, 2015
1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)