
Possible recrawl bugs in v2.1 #59

Closed

halfer opened this issue Apr 5, 2017 · 2 comments

Comments

halfer commented Apr 5, 2017

I initially thought this might be a duplicate of issue #3. I have a simple site I am scraping (just for test purposes, it's my own site) and I am seeing some unusual behaviour: URLs are being crawled more than once. I am expecting a successful crawl to record about 10 pages, with one crawl per page.

In the above-mentioned issue, attention was drawn to the method hasAlreadyCrawled(), which does not seem to be present in the 2.x releases. I don't know whether that is relevant.

  • If I use 2.1 mostly as-is, the system calls hasBeenCrawled in the CrawlObserver many times for each URL.
  • If I use 2.1 but keep track of what has already been crawled, and reject duplicates in shouldCrawl(), then the system runs the CPU at 100% for many minutes and needs cancelling via ^C.
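For reference, the deduplication bookkeeping described in the second bullet can be sketched independently of the crawler API. This is a hypothetical, self-contained illustration (the function name and signature are not part of spatie/crawler); the key point, which turns out to matter later in this thread, is to key the visited set on the full URL string, including the query, rather than on the path alone:

```php
<?php
// Hypothetical sketch of a visited-set dedup check. Keys on the FULL
// URL string (including the query), so ?version=v1 and ?version=v2
// are treated as distinct pages.
function shouldCrawl(string $url, array &$visited): bool
{
    if (isset($visited[$url])) {
        return false; // already seen: skip
    }

    $visited[$url] = true; // record and allow the crawl

    return true;
}

$visited = [];
shouldCrawl('http://example.com/page?version=v1', $visited); // true
shouldCrawl('http://example.com/page?version=v1', $visited); // false (duplicate)
shouldCrawl('http://example.com/page?version=v2', $visited); // true (different query)
```

If the set were keyed on the path instead, the third call would be wrongly rejected as a duplicate of the first.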

I think 1.3 is mostly OK, with one surprising gotcha:

  • If I use 1.3 as-is, I get the results I expect: 13 URLs scraped in 34 sec.
  • If I use 1.3 but keep track of what has already been crawled, shouldCrawl() is called for all 13, but hasBeenCrawled() is only called for the root page. I guess this is not a bug, and that I should simply not be trying to track duplicates on my end.

Here is the script I am using with 1.3, you can see I've commented out things that are for the 2.x branch. This should run without any modifications being necessary. I've copied CrawlInternalUrls into the script, as this is not included in that release.

Here is the script I am using with 2.1, and this is what the output looks like (from hasBeenCrawled()):

Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
Crawled URL: /en/tutorial/make-your-own-blog/improving-the-installer
Crawled URL: /en/tutorial/make-your-own-blog/commenting-form
Crawled URL: /en/tutorial/make-your-own-blog/adding-a-login-system
Crawled URL: /en/tutorial/make-your-own-blog/tidy-up
Crawled URL: /en/tutorial/make-your-own-blog/new-post-creation
Crawled URL: /en/tutorial/make-your-own-blog/post-editing
Crawled URL: /en/tutorial/make-your-own-blog/all-posts-page
Crawled URL: /en/tutorial/make-your-own-blog/comment-admin
Crawled URL: /en/tutorial/make-your-own-blog/adding-some-polish
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
^C

As you can see, there are a lot of duplicates in there. That's version 2.1.2, on Ubuntu, using PHP 7.1.x.

I will fall back to 1.3.x for the time being, but do let me know if you want me to do any testing for you - happy to help if I can.

halfer changed the title from "Possible recrawl bugs in 1.3 or 2.1" to "Possible recrawl bugs in v2.1" on Apr 5, 2017
AlexVanderbist (Member) commented Apr 6, 2017

Hi @halfer

It looks like you forgot about the different versions of your tutorial ;)

The ->path() method on the Url object doesn't include the query string (like ?version=v1). Try the following:

public function hasBeenCrawled(Url $url, $response, Url $foundOnUrl = null)
{
    // Casting $url to a string includes the query string, unlike $url->path()
    echo sprintf("Crawled URL: %s\n", $url);
}

You'll see that every page including every different version is being crawled as intended!

hasBeenCrawled: http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction?version=v1 - found on http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction
hasBeenCrawled: http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction?version=v2 - found on http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction
hasBeenCrawled: http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction?version=v3 - found on http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction
hasBeenCrawled: http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction?version=v4 - found on http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction
hasBeenCrawled: http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction?version=v5 - found on http://ilovephp.jondh.me.uk/en/tutorial/make-your-own-blog/introduction
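The path-versus-full-URL distinction can be reproduced with nothing but PHP's built-in parse_url(); the URLs below are illustrative stand-ins, not the actual tutorial URLs:

```php
<?php
// Two URLs that differ only in their query string.
$a = 'http://example.com/en/tutorial/introduction?version=v1';
$b = 'http://example.com/en/tutorial/introduction?version=v2';

// Comparing only the path component makes them look identical...
var_dump(parse_url($a, PHP_URL_PATH) === parse_url($b, PHP_URL_PATH)); // true

// ...but the full URLs are distinct crawl targets.
var_dump($a === $b); // false
```

So a log that prints only the path will show apparent duplicates even though each full URL was crawled exactly once.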

Hopefully this clears some things up for you.

halfer (Author) commented Apr 6, 2017

Ha ha, thanks @AlexVanderbist - oops! I did completely forget about that.

That raises an interesting question, then: v1.3 does not behave the same way, so does it not take query strings into account when differentiating URLs?
