[MRG] AjaxCrawlableMiddleware #343
Conversation
@kmike where is this?
@nramirezuy this is the default behavior of the Request class. Here
@kmike If 4k is insufficient, why are we still using it as a default?
4K is sufficient for MetaRefreshMiddleware, but it is insufficient for AjaxCrawlableMiddleware; that's why AjaxCrawlableMiddleware uses another arbitrary number (32K). I don't think anybody would want to decrease this number because the overhead is not that big. Somebody may want to increase this value when doing broad crawls if some websites don't have the meta tag in the first 32K. I haven't met such websites when working with AjaxCrawlableMiddleware (the largest offset was about 15K if I recall properly); if there are such websites it may be better to increase this value in Scrapy itself, because this condition is very hard to catch and it will most likely go unnoticed by users. Do you prefer a setting or an attribute?
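For concreteness, a minimal sketch of what exposing the limit as a setting could look like. The setting names and the 32K default below are just the values discussed in this thread, not necessarily the merged code:

```python
from scrapy.exceptions import NotConfigured


class AjaxCrawlableMiddleware(object):

    def __init__(self, settings):
        if not settings.getbool('AJAXCRAWLABLE_ENABLED'):
            raise NotConfigured
        # Only the first N bytes of the response body are inspected for the
        # <meta name="fragment" content="!"> tag; 32768 is the arbitrary
        # default discussed above (setting name is hypothetical here).
        self.lookup_bytes = settings.getint('AJAXCRAWLABLE_MAXSIZE', 32768)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)
```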
@kmike I prefer the setting. On the other hand, would some kind of cache be possible? Something like: if a domain doesn't have ajax, avoid doing the lookup.
Yes, you're right: this middleware is useful mostly for broad crawls. I think we should mention it in the 'broad crawls' docs section if it is disabled by default. A per-domain cache looks like a wrong approach because the presence or absence of this meta tag is a feature of a specific web page. It is usually used only for index pages, but nothing prevents using this tag for other pages, so there is no way to check whether a domain uses ajax or not. The main downside of having this middleware enabled by default is the performance overhead. It is not that large (parsing the response to a DOM takes much more time). We could try to reduce the overhead using some fail-fast check (e.g. 'fragment' in body). I prefer to have this middleware enabled by default precisely because it is useful only in some uncommon cases that are not easy to notice. If we leave it disabled by default then it is likely that an edge case will go unnoticed, the middleware will remain disabled and some responses won't be handled properly. Because it is not common to have this meta tag, developers are usually unaware of it: I didn't know about this tag before starting to investigate why some pages weren't parsed properly, and those pages were noticed only by chance.
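To illustrate the fail-fast idea (the helper name and constants here are made up for illustration, not the merged code): run a cheap substring test first, and only fall through to the expensive DOM parsing when the meta tag could actually be present.

```python
AJAXCRAWLABLE_HINTS = ('fragment', 'content')


def might_be_ajax_crawlable(body_text, lookup_bytes=32768):
    """Cheap pre-check before parsing the response into a DOM.

    The tag looks like <meta name="fragment" content="!">, so a body chunk
    that lacks either substring cannot contain it, and the expensive HTML
    parsing can be skipped for that response.
    """
    chunk = body_text[:lookup_bytes]
    return all(hint in chunk for hint in AJAXCRAWLABLE_HINTS)
```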
I didn't know about the tag either. Maybe it would be nice to add it to the Scrapy F.A.Q., the Broad Crawls doc and the middleware documentation, saying that this middleware can handle this case automatically, that it can be avoided by doing X and that it has N drawbacks; maybe adding a note.
For me, the middleware can now process about 2-3k ajax crawlable pages/sec and 50k+ regular pages/sec (if they don't contain the «fragment» or «content» words).
I updated this PR:
For more info see https://developers.google.com/webmasters/ajax-crawling/docs/getting-started.
"""

enabled_setting = 'AJAXCRAWLABLE_ENABLED'
dangra
Dec 20, 2013
Member
what is the advantage of using a class attribute for this setting instead of directly referencing it in the constructor like AJAXCRAWLABLE_MAXSIZE?
kmike
Dec 20, 2013
Author
Member
No advantage. I blindly copied this code from MetaRefreshMiddleware (to make it consistent), but in MetaRefreshMiddleware it serves a purpose (MetaRefreshMiddleware is a subclass of BaseRedirectMiddleware and overrides this attribute), and here it is pointless. I'll fix it.
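Roughly, the pattern being referred to looks like this (simplified, not a verbatim quote of the source): the base class reads self.enabled_setting, so each subclass enables itself under a different setting name just by overriding the attribute. AjaxCrawlableMiddleware is not part of such a hierarchy, so reading 'AJAXCRAWLABLE_ENABLED' directly in the constructor, as suggested above, is simpler.

```python
from scrapy.exceptions import NotConfigured


class BaseRedirectMiddleware(object):
    enabled_setting = 'REDIRECT_ENABLED'

    def __init__(self, settings):
        # The base class checks whatever setting name the subclass points at.
        if not settings.getbool(self.enabled_setting):
            raise NotConfigured


class MetaRefreshMiddleware(BaseRedirectMiddleware):
    # Overriding the attribute is the whole point of the pattern here.
    enabled_setting = 'METAREFRESH_ENABLED'
```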
LGTM /cc @pablohoffman feel free to merge if you are ok.
Btw, are you ok with "up to 1%" in the docs? I've seen 0.85% once, but for other scrapes the number seems to be lower. It seems to depend heavily on industry. I can remove this number or add a "based on industry" remark.
The 1% can become outdated without notice; I think it's OK to keep it with a note saying it is based on empirical data from 2013 or similar.
I'm glad to see this merged, but what do you think about shortening the middleware name (and its corresponding setting) to AjaxCrawl, instead of AjaxCrawlable?
AjaxCrawlable is a bad name indeed, +1 for renaming it. Google calls it "ajax crawling" - https://developers.google.com/webmasters/ajax-crawling/docs/getting-started - what about AjaxCrawlingMiddleware? It can be a bit confusing because people could think this middleware executes javascript, but I think a reference to Google's description is enough.
AjaxCrawling sounds better than AjaxCrawlable, but I still prefer AjaxCrawl for its brevity.
Sorry for not bringing this up before merging, could you send a new PR for the rename? (or just push directly) since we're about to release 0.22
Yep, I was just about to do that.
What do you think about adding support for https://developers.google.com/webmasters/ajax-crawling/docs/getting-started, part 3 ("Handle pages without hash fragments")? In this pull request it is implemented using a downloader middleware.
I don't know if such a middleware should be enabled by default. On one hand, it is almost useless for 'focused' crawls (because usually only main/index pages are marked as AJAX crawlable via the meta tag), and it has a performance impact (about 1/2000 s on average according to my unscientific tests). On the other hand, MetaRefreshMiddleware (which also parses HTML) is enabled by default, and Scrapy handles 'ajax crawlable' pages with '#!' in URLs by default. Another totally unscientific test shows that index pages of about 0.85% of US small business websites indicate themselves as 'AJAX crawlable' using the meta tag.
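A rough sketch of the downloader-middleware approach described above (an illustration under the assumptions stated in this thread, not necessarily the merged code): when a response declares itself AJAX crawlable via the meta tag, hand back a '#!' variant of the request and let Scrapy's existing escaped-fragment handling do the rest.

```python
class AjaxCrawlableMiddleware(object):

    def process_response(self, request, response, spider):
        # Avoid a loop on requests that were already rewritten: Scrapy turns
        # '#!' URLs into ?_escaped_fragment_= requests, so a rewritten request
        # comes back with that query parameter in its URL.
        if '_escaped_fragment_' in request.url:
            return response
        if not self._has_ajax_crawlable_meta(response):
            return response
        # Re-request the '#!' variant; Scrapy already handles such URLs.
        return request.replace(url=request.url + '#!')

    def _has_ajax_crawlable_meta(self, response):
        # Placeholder for the real detection of <meta name="fragment" content="!">;
        # see the fail-fast sketch earlier in this thread.
        chunk = response.body[:32768]
        return b'name="fragment"' in chunk and b'content="!"' in chunk
```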
TODO: