
Possible recrawl bugs in v2.1 #59

Closed

halfer opened this issue Apr 5, 2017 · 2 comments

Comments

halfer commented Apr 5, 2017

I initially thought this might be a duplicate of issue #3. I have a simple site I am scraping (just for test purposes, it's my own site) and I am seeing some unusual behaviour: URLs are being crawled more than once. I am expecting a successful crawl to record about 10 pages, with one crawl per page.

In the above-mentioned issue, attention was drawn to the method hasAlreadyCrawled(), which does not seem to be present in the 2.x releases. I don't know whether that is relevant.

  • If I use 2.1 mostly as-is, the system calls hasBeenCrawled in the CrawlObserver many times for each URL.
  • If I use 2.1 but keep track of what has already been crawled, and reject duplicates in shouldCrawl(), then the system runs the CPU at 100% for many minutes and needs cancelling via ^C.
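For reference, the deduplication bookkeeping described in the second bullet can be sketched independently of the crawler API. This is a hypothetical, self-contained illustration (the function name and signature are not part of spatie/crawler); the key point, which turns out to matter later in this thread, is to key the visited set on the full URL string, including the query, rather than on the path alone:

```php
<?php
// Hypothetical sketch of a visited-set dedup check. Keys on the FULL
// URL string (including the query), so ?version=v1 and ?version=v2
// are treated as distinct pages.
function shouldCrawl(string $url, array &$visited): bool
{
    if (isset($visited[$url])) {
        return false; // already seen: skip
    }

    $visited[$url] = true; // record and allow the crawl

    return true;
}

$visited = [];
shouldCrawl('http://example.com/page?version=v1', $visited); // true
shouldCrawl('http://example.com/page?version=v1', $visited); // false (duplicate)
shouldCrawl('http://example.com/page?version=v2', $visited); // true (different query)
```

If the set were keyed on the path instead, the third call would be wrongly rejected as a duplicate of the first.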

I think 1.3 is mostly OK, with one surprising gotcha:

  • If I use 1.3 as-is, I get the results I expect: 13 URLs scraped in 34 sec.
  • If I use 1.3 but keep track of what has already been crawled, shouldCrawl() is called for all 13, but hasBeenCrawled() is only called for the root page. I guess this is not a bug, and that I should simply not be trying to track duplicates on my end.

Here is the script I am using with 1.3, you can see I've commented out things that are for the 2.x branch. This should run without any modifications being necessary. I've copied CrawlInternalUrls into the script, as this is not included in that release.

Here is the script I am using with 2.1, and this is what the output looks like (from hasBeenCrawled()):

Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
Crawled URL: /en/tutorial/make-your-own-blog/improving-the-installer
Crawled URL: /en/tutorial/make-your-own-blog/commenting-form
Crawled URL: /en/tutorial/make-your-own-blog/adding-a-login-system
Crawled URL: /en/tutorial/make-your-own-blog/tidy-up
Crawled URL: /en/tutorial/make-your-own-blog/new-post-creation
Crawled URL: /en/tutorial/make-your-own-blog/post-editing
Crawled URL: /en/tutorial/make-your-own-blog/all-posts-page
Crawled URL: /en/tutorial/make-your-own-blog/comment-admin
Crawled URL: /en/tutorial/make-your-own-blog/adding-some-polish
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
^C

As you can see, there are a lot of duplicates in there. That's version 2.1.2, on Ubuntu, using PHP 7.1.x.

I will fall back to 1.3.x for the time being, but do let me know if you want me to do any testing for you - happy to help if I can.

halfer changed the title from "Possible recrawl bugs in 1.3 or 2.1" to "Possible recrawl bugs in v2.1" on Apr 5, 2017
AlexVanderbist (Member) commented Apr 6, 2017

Hi @halfer

It looks like you forgot about the different versions of your tutorial ;)

The ->path() method on the Url object doesn't include the query string (like ?version=v1). Try the following:

public function hasBeenCrawled(Url $url, $response, Url $foundOnUrl = null)
{
    // Casting $url to a string includes the query string, unlike $url->path()
    echo sprintf("Crawled URL: %s\n", $url);
}

You'll see that every page including every different version is being crawled as intended!

hasBeenCrawled: http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction?version=v1 - found on http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction
hasBeenCrawled: http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction?version=v2 - found on http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction
hasBeenCrawled: http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction?version=v3 - found on http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction
hasBeenCrawled: http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction?version=v4 - found on http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction
hasBeenCrawled: http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction?version=v5 - found on http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction
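The path-versus-full-URL distinction can be reproduced with nothing but PHP's built-in parse_url(); the URLs below are illustrative stand-ins, not the actual tutorial URLs:

```php
<?php
// Two URLs that differ only in their query string.
$a = 'http://example.com/en/tutorial/introduction?version=v1';
$b = 'http://example.com/en/tutorial/introduction?version=v2';

// Comparing only the path component makes them look identical...
var_dump(parse_url($a, PHP_URL_PATH) === parse_url($b, PHP_URL_PATH)); // true

// ...but the full URLs are distinct crawl targets.
var_dump($a === $b); // false
```

So a log that prints only the path will show apparent duplicates even though each full URL was crawled exactly once.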

Hopefully this clears some things up for you.

halfer (Author) commented Apr 6, 2017

Ha ha, thanks @AlexVanderbist - oops! I did completely forget about that.

That raises an interesting question, then: v1.3 does not behave the same way, so does it not take query strings into account when differentiating URLs?
