
Revisiting web pages

I found this very interesting answer on Stack Overflow about revisiting crawlers:

http://stackoverflow.com/questions/10331738/strategy-for-how-to-crawl-index-frequently-updated-webpages

The first link I already knew about when designing this: http://crawl-frontier.readthedocs.org/en/opic/topics/scheduler-optimal.html

The main problem with that formulation is that you need to solve an optimization problem, albeit a convex one, so an approximation (clustering similar pages) was made. Maybe if we code it in C it will be efficient enough to scale to billions of pages, as happened with PageRank/HITS.
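
The optimization is roughly of the following shape (an illustrative reconstruction, not necessarily the exact objective of the linked write-up): pick refresh frequencies f_i for the N pages so as to

\max_{f_1,\dots,f_N \ge 0} \sum_{i=1}^{N} w_i \, F(\lambda_i, f_i)
\qquad \text{subject to} \qquad \sum_{i=1}^{N} f_i \le C

where w_i is the value of page i, \lambda_i its estimated change rate, C the total crawl rate, and F an increasing concave freshness measure; for Poisson changes and evenly spaced refreshes, F(\lambda, f) = (f/\lambda)(1 - e^{-\lambda/f}). The concavity of F is what makes the problem convex, but there is one variable per page, hence the clustering.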

The second link, about page change rate estimation, I didn't know about. It seems to arrive at the same estimator I got here, although following a different process: http://crawl-frontier.readthedocs.org/en/opic/topics/scheduler-optimal.html#estimation-of-page-change-rate

I should nevertheless read the paper because it gives a much deeper study of the estimator.
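
For reference, a minimal sketch of the Poisson-based estimator I have in mind; the function name and the regular-interval assumption are mine, and the paper's estimator may use a different correction:

import math

def estimate_change_rate(n_visits, n_changes, interval):
    # n_visits: number of revisits spaced `interval` seconds apart
    # n_changes: number of those visits where the page had changed
    if n_visits == 0:
        return None
    p = float(n_changes) / n_visits
    if p >= 1.0:
        # Every visit saw a change: the rate can only be bounded from below.
        return float('inf')
    # P(at least one change during an interval) = 1 - exp(-rate * interval)
    return -math.log(1.0 - p) / interval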

It seems that other crawlers use simpler approaches, where the wait time between refreshes just gets adjusted depending on whether the page has changed.
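
As an illustration of that kind of rule (a sketch only; the function name, factors and bounds are made up):

def next_interval(current, changed, shrink=0.5, grow=1.5,
                  min_interval=60.0, max_interval=30 * 24 * 3600.0):
    # Shorten the wait when the page changed since the last visit,
    # lengthen it when it did not, keeping it inside sane bounds.
    interval = current * (shrink if changed else grow)
    return max(min_interval, min(interval, max_interval))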

URL compression

URLs take up most of the space in the PageDB. Right now we are using the smaz compression library, and each URL gets compressed individually.

This could be vastly improved since URLs have a lot of common prefixes. For example, picking a random section of the database from a real-world crawl:

http://admob.blogspot.com.es/2015_01_01_archive.html 
http://admob.blogspot.com.es/2015_02_01_archive.html 
http://admob.blogspot.com.es/2015_03_01_archive.html 
http://admob.blogspot.com.es/2015_04_01_archive.html 
http://admob.blogspot.com.es/2015_05_01_archive.html 
http://admob.blogspot.com.es/2015_06_01_archive.html
http://admob.blogspot.com.es/search/label/8%20Things%20You%20Didn't%20Know%20About%20AdMob
http://admob.blogspot.com.es/search/label/Ad%20Network%20Optimization
http://admob.blogspot.com.es/search/label/AdMob%20Developer%20Referral%20Program
http://admob.blogspot.com.es/search/label/Announcement
http://admob.blogspot.com.es/search/label/App%20Developer%20Business%20Kit
http://admob.blogspot.com.es/search/label/App%20Spotlight
http://admob.blogspot.com.es/search/label/Audience%20Builder

Reusing URL prefixes would of course add extra lookups, which might hurt performance, but:

  1. Maybe the tradeoff is worth the effort
  2. Maybe we don't actually lose speed: compressing URLs could allow more chunks of the database to fit in RAM instead of on disk.
  3. Maybe we don't access URLs that often. I think URLs are only used when serving new requests; everything else (the schedule, the links, the PageRank/HITS algorithms) works with either hashes or indexes.
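
A minimal sketch of the prefix-sharing idea (front coding over sorted URLs; this ignores how the PageDB actually lays out its entries and how it would combine with smaz):

def front_code(urls):
    # Store each URL as (length of prefix shared with the previous URL, suffix).
    coded, prev = [], ""
    for url in sorted(urls):
        common = 0
        for a, b in zip(prev, url):
            if a != b:
                break
            common += 1
        coded.append((common, url[common:]))
        prev = url
    return coded

def front_decode(coded):
    # Inverse of front_code: rebuild each URL from the previous one.
    urls, prev = [], ""
    for common, suffix in coded:
        prev = prev[:common] + suffix
        urls.append(prev)
    return urls

On the sample above the prefix http://admob.blogspot.com.es/ would be stored only once per run of similar URLs, at the cost of having to decode the preceding entries of a block to recover a single URL, which is exactly the extra-lookup tradeoff mentioned above.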

Debugging the C library from Python

When debugging the C code from Python it is useful to compile without optimizations by specifying:

CFLAGS=-O0 python setup.py develop

Then to debug:

gdb python
(gdb) b some_function
(gdb) r path_to_scrapy crawl example
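
Note that when the breakpoint is set before running, the C extension is not loaded yet, so gdb will usually report that the function is not defined and offer to make the breakpoint pending on a future shared library load; answering yes works, and the breakpoint becomes active once Python imports the extension module.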

Frequency specification

Specify the frequency for each page using a table with two columns: the first is a regex matched against the URL and the second is the desired recrawl interval. For example:

https://www.reddit\.com/.*      600.0
https://news\.ycombinator\.com  600.0
https?://.*news.*               3600.0
http://techcrunch\.com/.*       x0.8

The last line says that the page should be crawled at 0.8 times its estimated change frequency.
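
A sketch of how such a table could be applied (illustrative only: the rule format follows the example above, intervals are assumed to be in seconds, and the function and variable names are made up):

import re

RULES = [
    (re.compile(r"https://www\.reddit\.com/.*"), ("interval", 600.0)),
    (re.compile(r"https://news\.ycombinator\.com"), ("interval", 600.0)),
    (re.compile(r"https?://.*news.*"), ("interval", 3600.0)),
    (re.compile(r"http://techcrunch\.com/.*"), ("factor", 0.8)),
]

def recrawl_interval(url, estimated_interval):
    # First matching rule wins; fall back to the estimated change interval.
    for pattern, (kind, value) in RULES:
        if pattern.match(url):
            if kind == "interval":
                return value
            # "x0.8": crawl at 0.8 times the estimated page frequency,
            # i.e. the interval becomes estimated_interval / 0.8.
            return estimated_interval / value
    return estimated_interval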