Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] AjaxCrawlableMiddleware #343

Merged
merged 7 commits into from Jan 16, 2014
Merged

[MRG] AjaxCrawlableMiddleware #343

merged 7 commits into from Jan 16, 2014

Conversation

@kmike
Copy link
Member

@kmike kmike commented Jul 9, 2013

What do you think about adding support for https://developers.google.com/webmasters/ajax-crawling/docs/getting-started, part 3 ("Handle pages without hash fragments")? In this pull request it is implemented using downloader middleware.

I don't know if such middleware should be enabled by default. On one hand, it is almost unuseful for 'focused' crawls (because usually only main/index pages are marked as AJAX crawlable via meta tag), and it has a performance impact (about 1/2000s in average according to my unscientific tests). On the other hand, MetaRefreshMiddleware (that also parses HTML) is enabled by default, and Scrapy handles 'ajax crawlable' pages with '#!' in URLs by default.

Another totally unscientific test shows that index pages of about 0.85% US small business websites indicate themselves as 'AJAX crawlable' using meta tag.

TODO:

  • mention it in 'broad crawling' docs;
  • add an undocumented option for changing the size of parsed contents;
  • investigate if fail-fast methods help performance.
@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Oct 14, 2013

@kmike where is this Scrapy handles 'ajax crawlable' pages with '#!' in URLs by default.?

@kmike
Copy link
Member Author

@kmike kmike commented Oct 14, 2013

@nramirezuy this is the default behavior of Request class. Here #! urls are expanded to use _escaped_fragment_:

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Oct 14, 2013

@kmike If 4k is insufficient, why are we still using it as a default?
Don't you think an undocumented setting should fit? It is less overhead for someone that have to workaround it and may he have time to report it. Plus if he have the skills to discover the setting probably have the same skills to override the attribute value.

@kmike
Copy link
Member Author

@kmike kmike commented Oct 14, 2013

4K is sufficient for MetaRefreshMiddleware, but it is insufficient for AjaxCrawlableMiddleware, that's why AjaxCrawlableMiddleware uses another arbitrary number (32K). I don't think anybody would be decreasing this number because the overhead is not that big. It may be the case somebody will want to increase this value when doing broad crawls if some of the websites don't have meta tag in first 32K. I haven't met such websites when working with AjaxCrawlableMiddleware (largest offset was about 15K if I recall properly); if there are such websites it may be better to increase this value in Scrapy itself because this condition is very hard to catch and it will be most likely unnoticed by users.

Do you prefer a setting or an attribute?

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Oct 14, 2013

@kmike I prefer the setting.
I see this middleware useful for broad crawls only or very specific sites, but few where probably is wrong used the tag.
Because if you have to develop a spider for a specific site you will review the site and manually add the fragment.

In other hand it is possible some kind of cache? Something like if domains doesn't have ajax avoid doing the lookup.

@kmike
Copy link
Member Author

@kmike kmike commented Oct 14, 2013

Yes, you're right this middleware is useful mostly for broad crawls. I think we should mention it in 'broad crawls' docs section if it is disabled by default.

Per-domain cache looks like a wrong approach because presence or absence of this meta tag is a feature of a specific web page. It is usually used only for index page, but nothing prevents using this tag for other pages. There is no way to check if domain uses ajax or not.

The main downside of having this middleware enabled by default is the performance overhead. It is not that large (parsing response to DOM takes much more time). We could try to reduce the overhead using some fail-fast check (e.g. 'fragment' in body).

I prefer to have this middleware enabled by default preciously because it is useful only in some uncommon cases that are not easy to notice. If we leave it disabled by default then it is likely that an edge case will go unnoticed, middleware will remain disabled and some responses won't be handled properly. Because it is not common to have this meta tag, developers are usually unaware of it - I didn't know about this tag before starting to investigate why aren't some pages parsed properly, and these pages were noticed only by chance.

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Oct 14, 2013

I didn't know about the tag either. Maybe should be nice to add it to the Scrapy F.A.Q., Broad Crawl doc and in the middleware documentation. Saying that his middleware can auto handle this case, that can be avoidable doing X and have N drawbacks, may adding a note.

kmike added 5 commits Jul 9, 2013
For me middleware now can process about 2-3k ajax crawlable pages/sec and 50k+ regular pages/sec (if they don’t contain «fragment» or «content» words).
@kmike
Copy link
Member Author

@kmike kmike commented Dec 18, 2013

I updated this PR:

  • middleware is still OFF by default;
  • there is a new section in broad crawling docs;
  • fail-fast path makes middleware fast for most regular pages (50k+ pages/sec);
  • lookup body size can be set in settings, but this setting is not documented because users shouldn't use it.
For more info see https://developers.google.com/webmasters/ajax-crawling/docs/getting-started.
"""

enabled_setting = 'AJAXCRAWLABLE_ENABLED'

This comment has been minimized.

@dangra

dangra Dec 20, 2013
Member

what is the advantage of using a class attribute for this settings instead of directly referencing it in the constructor like AJAXCRAWLABLE_MAXSIZE ?

This comment has been minimized.

@kmike

kmike Dec 20, 2013
Author Member

No advantage. I blindly copied this code from MetaRefreshMiddleware (to make it consistent), but in MetaRefreshMiddleware it serves a purpose (MetaRefreshMiddleware is a subclass of BaseRedirectMiddleware and overrides this attribute), and here it is pointless. I'll fix it.

@dangra
Copy link
Member

@dangra dangra commented Dec 20, 2013

LGTM

/cc @pablohoffman feel free to merge if you are ok.

@kmike
Copy link
Member Author

@kmike kmike commented Dec 20, 2013

Btw, are you ok with "up to 1%" in docs? I've seen 0.85% once, but for other scrape number seems to be lower. It seems to depend heavily on industry. I can remove this number or add a "based on industry" remark.

@dangra
Copy link
Member

@dangra dangra commented Dec 20, 2013

the %1 can become outdated without notice, I think it's OK to keep it with a note saying it is based on empirical data from year 2013 or similar.

dangra added a commit that referenced this pull request Jan 16, 2014
[MRG] AjaxCrawlableMiddleware
@dangra dangra merged commit b9bb9be into scrapy:master Jan 16, 2014
1 check passed
1 check passed
default The Travis CI build passed
Details
@pablohoffman
Copy link
Member

@pablohoffman pablohoffman commented Jan 16, 2014

I'm glad to see this merged, but what do you think about shortening the middleware name (and its corresponding setting) to AjaxCrawl, isntead of AjaxCrawlable?

@kmike
Copy link
Member Author

@kmike kmike commented Jan 16, 2014

AjaxCrawlable is a bad name indeed, +1 for renaming it. Google calls it "ajax crawling" - https://developers.google.com/webmasters/ajax-crawling/docs/getting-started - what about AjaxCrawlingMiddleware? It can be a bit confusing because people could think this middleware executes javascript, but I think a reference to Google's description is enough.

@pablohoffman
Copy link
Member

@pablohoffman pablohoffman commented Jan 16, 2014

AjaxCrawling sounds better than AjaxCrawlable, but I still prefer AjaxCrawl for the its brevity.

@pablohoffman
Copy link
Member

@pablohoffman pablohoffman commented Jan 16, 2014

Sorry for not bringing this up before merging, could you send a new PR for the rename? (or just push directly) since we're about to release 0.22

@kmike
Copy link
Member Author

@kmike kmike commented Jan 16, 2014

Yep, I was just about to do that.

@kmike kmike deleted the kmike:ajax-crawlable branch Jun 24, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants