
Make list of enabled middlewares more readable #1263

Merged: 1 commit into scrapy:master from chekunkov:readable-middlewares-list-in-logs on Sep 10, 2015

Conversation

@chekunkov (Contributor) commented May 29, 2015

Inspired by today's mail from @dangra.

Before:

2015-05-29 09:45:57+0000 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware,
DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, CustomRetryMiddleware,
SlotDelayMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, 
MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, RedirectMiddleware,
CrawleraMiddleware, CustomCookiesMiddleware, CookiesMiddleware, SplashMiddleware,
HttpCompressionMiddleware, CustomHttpProxyMiddleware, RotateUserAgentMiddleware,
ChunkedTransferMiddleware, ResponseStats, DownloaderStats

After:

2015-05-29 09:44:15+0000 [scrapy] INFO: Enabled downloader middlewares:
- HttpAuthMiddleware
- DownloadTimeoutMiddleware
- UserAgentMiddleware
- RetryMiddleware
- CustomRetryMiddleware
- SlotDelayMiddleware
- DefaultHeadersMiddleware
- MetaRefreshMiddleware
- MetaRefreshMiddleware
- HttpCompressionMiddleware
- RedirectMiddleware
- RedirectMiddleware
- CrawleraMiddleware
- CustomCookiesMiddleware
- CookiesMiddleware
- SplashMiddleware
- HttpCompressionMiddleware
- CustomHttpProxyMiddleware
- RotateUserAgentMiddleware
- ChunkedTransferMiddleware
- ResponseStats
- DownloaderStats

A bit easier to read and detect problems IMO
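The formatting change itself is small; a minimal sketch of the before/after difference (variable names here are illustrative, not the actual patch):

```python
# Sketch of the readability change: one middleware name per line with a
# leading dash, instead of a single comma-joined string.
middlewares = ["HttpAuthMiddleware", "DownloadTimeoutMiddleware", "RetryMiddleware"]

# Before: everything on one wrapped line.
before = "Enabled downloader middlewares: " + ", ".join(middlewares)

# After: each name on its own line, markdown-style bullet list.
after = "Enabled downloader middlewares:\n- " + "\n- ".join(middlewares)
print(after)
```

As noted below in the thread, the dashed form doubles as a markdown unordered list when pasted into an editor.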

@Digenis (Member) commented May 29, 2015

I can't imagine how someone would use what I suggest,
but when a log message is this close to the representation of a Python data structure,
I usually take it a step further and make it parseable as such.
E.g. the above could be a tuple, with commas and indentation instead of dashes.

@chekunkov (Contributor, Author) commented May 29, 2015

@Digenis for me the main goal of this PR is to make the list of enabled middlewares easier to read. I'm not against logging the list of middlewares as JSON if that finds support from the core Scrapy developers:

2015-05-29 09:44:15+0000 [scrapy] INFO: Enabled downloader middlewares:
[
    "HttpAuthMiddleware",
    "DownloadTimeoutMiddleware",
    "UserAgentMiddleware",
    ...
]

But on the other hand, Scrapy logging is for humans, not for robots, which is why I tried to make the output more or less pretty, intuitive and easy to read. Plus, you can copy-paste such output into a markdown editor and it will be rendered as an unordered list, which is quite convenient.

@nramirezuy (Contributor) commented May 29, 2015

@chekunkov If you are going to make it so large, why don't you log the whole dict returned by settings, with the full path to the classes?

Instead of using json you can use pprint.pformat.
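The `pprint.pformat` suggestion would produce the wrapped-list style shown in the next comment; a minimal sketch using middleware paths from the sample output above:

```python
from pprint import pformat

mwlist = [
    "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware",
    "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware",
    "scrapy.downloadermiddlewares.retry.RetryMiddleware",
]
# pformat breaks the list across lines once it exceeds the default
# 80-column width, giving one class path per line.
print("Enabled downloader middlewares:\n%s" % pformat(mwlist))
```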

@chekunkov (Contributor, Author) commented May 29, 2015

ok, it would look like this:

2015-05-29 13:45:09+0000 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'project.middlewares.retry.CustomRetryMiddleware',
 'project.middlewares.delay.SlotDelayMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'project.middlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'project.middlewares.redirect.RedirectMiddleware',
 'project.middlewares.crawlera.CrawleraMiddleware',
 'project.middlewares.cookies.CustomCookiesMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'project.middlewares.splash.SplashMiddleware',
 'project.middlewares.httpcompression.HttpCompressionMiddleware',
 'project.middlewares.httpproxy.CustomHttpProxyMiddleware',
 'project.middlewares.useragent.RotateUserAgentMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'project.middlewares.responsestats.ResponseStats',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']

Looks ugly-ish, but if this PR is not rejected and the majority of people vote for this format, I will make it like this. Having the full paths of the middleware classes might be useful.

Regarding the dict from settings - it's not available in this scope; _get_mwlist_from_settings returns a list of class paths sorted by middleware priority, and I didn't want to make significant changes to the code, just change the output format.
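The priority ordering described here can be sketched as sorting the merged middleware dict by its numeric values (the function below is an illustrative stand-in, not Scrapy's actual `_get_mwlist_from_settings`):

```python
def get_mwlist_from_settings(mw_dict):
    """Return middleware class paths sorted by priority (lower runs first).

    mw_dict maps class path -> numeric priority, mirroring the shape of
    the DOWNLOADER_MIDDLEWARES setting; this is a simplified stand-in
    for Scrapy's internal helper.
    """
    return sorted(mw_dict, key=mw_dict.get)

# Hypothetical merged settings: built-in and project middlewares.
mws = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500,
    "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
    "project.middlewares.retry.CustomRetryMiddleware": 550,
}
print(get_mwlist_from_settings(mws))
```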

@Digenis (Member) commented Jun 16, 2015

Aren't they sorted by priority currently?
If the order is undefined/meaningless,
then displaying them sorted by priority has no drawbacks and is a desired change.

I agree with the full path - a local subclass of a middleware could have the same name.

@chekunkov (Contributor, Author) commented Jun 16, 2015

Aren't they sorted by priority currently?

See the comment above:

_get_mwlist_from_settings returns a list of class paths sorted by middleware priority

That's the current behaviour; it's reasonable, and this PR doesn't change it - this PR is only about the output format.

@Digenis (Member) commented Jun 16, 2015

Never mind, I should have looked at the code directly.
So in this case, I'd like to see the last example implemented and merged.

@eliasdorneles (Member) commented Sep 8, 2015

+1 to making the default a list with the full path. I think this sample from the output kinda makes the case for it:

 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'project.middlewares.redirect.MetaRefreshMiddleware',
@dangra (Member) commented Sep 9, 2015

Ok, there is enough consensus to add this feature; my vote is to show the full class path if we go for the multiline format.

@chekunkov chekunkov force-pushed the chekunkov:readable-middlewares-list-in-logs branch from aa60a0f to 563b150 Sep 10, 2015
@chekunkov (Contributor, Author) commented Sep 10, 2015

Updated PR

@codecov-io commented Sep 10, 2015

Current coverage is 82.35%

Merging #1263 into master will not affect coverage as of 5d8dc2b


dangra added a commit that referenced this pull request Sep 10, 2015
…logs

Make list of enabled middlewares more readable
@dangra dangra merged commit a778762 into scrapy:master Sep 10, 2015
2 checks passed:
codecov/patch 100.00% of diff hit (target 100.00%)
continuous-integration/travis-ci/pr The Travis CI build passed
@kmike (Member) commented Sep 11, 2015

I'm a bit worried that by default with INFO log level scrapy crawl now dumps more than a full laptop terminal screen of text before even starting to crawl.

@kmike (Member) commented Sep 11, 2015

What about making these messages DEBUG?

@eliasdorneles (Member) commented Sep 11, 2015

hm, yeah, this is kinda unexpected:

$ scrapy shell
2015-09-11 17:41:38 [scrapy] INFO: Scrapy 1.1.0dev1 started (bot: scrapybot)
2015-09-11 17:41:38 [scrapy] INFO: Optional features available: ssl, http11
2015-09-11 17:41:38 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2015-09-11 17:41:38 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memdebug.MemoryDebugger',
 'scrapy.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.spiderstate.SpiderState',
 'scrapy.extensions.throttle.AutoThrottle']
2015-09-11 17:41:38 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2015-09-11 17:41:38 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2015-09-11 17:41:38 [scrapy] INFO: Enabled item pipelines:
[]
2015-09-11 17:41:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-09-11 17:41:38 [root] DEBUG: Using default logger
2015-09-11 17:41:38 [root] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fcd7c079b50>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x7fcd7c079ad0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>> 
@eliasdorneles (Member) commented Sep 11, 2015

+1 to making this DEBUG

@kmike (Member) commented Sep 11, 2015

DEBUG won't fix the verbosity though, because the log level is DEBUG by default - the scrapy shell example would look the same.

@eliasdorneles (Member) commented Sep 11, 2015

right, to make it less verbose one would have to run scrapy shell -L INFO...
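The point about DEBUG can be demonstrated with plain stdlib logging: a DEBUG-level message is only suppressed if the effective level is raised above DEBUG, which is what `-L INFO` does (this is a minimal sketch, not Scrapy's actual log setup):

```python
import logging

logger = logging.getLogger("scrapy")

# Equivalent of running with -L INFO: DEBUG records are filtered out.
logger.setLevel(logging.INFO)
assert not logger.isEnabledFor(logging.DEBUG)

# Scrapy's default LOG_LEVEL is DEBUG, so demoting the middleware-list
# messages to DEBUG would not hide them in a default run.
logger.setLevel(logging.DEBUG)
assert logger.isEnabledFor(logging.DEBUG)
```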

@nramirezuy (Contributor) commented Sep 11, 2015

this verbosity is helpful most of the time; I guess we can add a setting to disable logging this useful information (:

@kmike (Member) commented Sep 13, 2015

@nramirezuy this information is useful if you're using custom middlewares and trying to debug why they aren't applied and what is applied. If you don't need this (and very often you don't), the extra verbosity makes debugging harder because it makes other logs harder to find - you need to skip all this stuff when scrolling back through the terminal window.

In some cases it is better to always have this verbose information (e.g. in hosted Scrapy environments where it is easy to search logs), but I'd prefer for this extra information to be opt-in, to make Scrapy less scary for people who use it locally for small tasks. 🚲 🎪

@chekunkov (Contributor, Author) commented Sep 14, 2015

So the options we have are:

  1. make it DEBUG. Pros: if you don't need the extra verbosity, use the appropriate logging level. Cons: it doesn't make the shell command less verbose; it prints a lot of info before starting to do something valuable. Is that a big problem? It shows the configuration only once... And I didn't know that saving a number of lines written to the terminal before launching an interactive shell, or before starting a crawl that usually dumps much, much more than a screen of text, has ever been an issue :)

  2. we can add some option (--logconfig?) to enable this whole block of log messages

2015-09-14 11:53:02 [scrapy] INFO: Scrapy 1.0.1 started (bot: obormot)
2015-09-14 11:53:02 [scrapy] INFO: Optional features available: ...
2015-09-14 11:53:02 [scrapy] INFO: Overridden settings: ...
2015-09-14 11:53:03 [scrapy] INFO: Enabled extensions: ...
2015-09-14 11:53:03 [scrapy] INFO: Enabled downloader middlewares: ...
2015-09-14 11:53:03 [scrapy] INFO: Enabled spider middlewares: ...
2015-09-14 11:53:03 [scrapy] INFO: Enabled item pipelines: ...

and keep it disabled by default. But wouldn't that harm the purpose of such logging? It's there to give the user full information on how the crawler is configured.

If you don't need this (and very often you don't) extra verbosity makes debugging harder because it makes other logs harder to find - you need to skip all this stuff when scrolling back terminal window.

well, sometimes a user thinks they don't need this logging, but having it actually helps to find problems. The mail from @dangra I referred to in the PR description was about two copies of the AutoThrottle middleware that were running in the same crawler process and significantly slowed down the crawl. The problem with the old logging was that it wasn't readable at all - just a mix of words that gives no hint about which middlewares are enabled unless you specifically search for one. I think that's why it took so long to detect that problem. I hope this PR made the logging more readable, so that next time finding issues like that will be easier. If you don't see this logging by default and you don't know you have problems with your crawl configuration, it will take more time to detect and resolve the issue.

If you still think that it should be optional - maybe we can also discuss this option:

  3. remove these messages from the logs and add a command which dumps the configuration and exits - say scrapy showconfig. It would:
  • make things consistent - you don't expect this info to appear in your logs; if you need it, you call this command and check your config
  • it can output nicely formatted json
  • a hosted Scrapy environment can show the output of this command in some tab where the user can navigate to check the Scrapy version and see other configuration - I think that's even more convenient than searching for this in the logs
  • this will reduce the number of log messages fired by Scrapy without a spider object in extra - which sounds good and is somehow related to this old discussion #1060 (comment); I hope it will one day be possible to completely separate logs coming from different spiders in the same process...
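The hypothetical showconfig command could be as simple as serializing the assembled configuration to JSON (everything here is an illustrative sketch - the command name, function, and config shape are assumptions, not existing Scrapy APIs):

```python
import json

def show_config(config):
    """Hypothetical `scrapy showconfig`-style dump: render the crawler
    configuration as nicely formatted JSON and return the text."""
    return json.dumps(config, indent=4, sort_keys=True)

# Illustrative configuration, shaped like the log output in this thread.
config = {
    "extensions": ["scrapy.extensions.corestats.CoreStats"],
    "downloader_middlewares": [
        "scrapy.downloadermiddlewares.retry.RetryMiddleware",
    ],
    "item_pipelines": [],
}
print(show_config(config))
```

A hosted environment could render this same JSON in a dedicated tab instead of making users dig through logs.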
redapple added a commit to redapple/scrapy that referenced this pull request Jan 26, 2016
Wrong middlewares list was being pretty-printed
(introduced in scrapy#1263)
@jschilling1 commented Feb 5, 2016

"Overridden settings" could really use the same treatment. That string is pretty long for me.

8 participants