Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster default crawl queue implementation #250

Merged
merged 2 commits into from Aug 7, 2019

Conversation

@BenMorel
Copy link
Contributor

commented Aug 6, 2019

馃悓 The issue with the collection crawl queue

The default crawl queue is currently based on Illuminate Collections, and suffers from many flaws that make it slow: it loops through entire collections to find objects, and makes ultra-heavy use of Uri-to-string conversions:

foreach ($collection as $crawlUrl) {
    if ((string) $crawlUrl->url === (string) $searchCrawlUrl->url) {
        return true;
    }
}

This implementation works fine for crawling small websites (< 1000 pages); after that, the crawler starts to be CPU-bound (~100% CPU usage) and starts slowing down.

After a few thousand pages, it almost comes to a halt, and takes something like 15 seconds to move from one page to another. This delay raises exponentially with the number of pages in the queue.

馃殌 NEW: the array crawl queue

This PR brings a 100% compatible implementation:

  • based on native arrays; no more Collection overhead (wrapping, copying, filtering);
  • indexed by URL: we can instantly find a given URL with a lookup, without having to loop through the collection;
  • CrawlUrl ids are strings, using the full URL and no more artificial integer keys to lookup;
  • Shortcuts: no more contains(), we can now use isset() directly.

馃挩 Is is faster?

It is faster. Much, MUCH faster. When the original implementation almost came to a halt after a few thousand pages, the new implementation is still performing 10 requests per second after 40,000 pages, and is not even CPU bound.

That's what I suggest we make this new crawl queue the default right away, and deprecate the other one.


This PR is 100% backwards compatible, and as such does not need to wait for v5.

In v5, we might want to remove the dependency on Illuminate Collections entirely. This was a noble idea, but Collection is overkill for this use case.

BenMorel added some commits Aug 6, 2019

@freekmurze

This comment has been minimized.

Copy link
Member

commented Aug 7, 2019

After only having read the description of this PR, it already sounds very promising. 馃憤 I'll go through the code now.

@freekmurze

This comment has been minimized.

Copy link
Member

commented Aug 7, 2019

Pretty cool, thanks for your work on this. 馃檶

I'll pull it in, polish it a bit and tag a new release.

@freekmurze freekmurze merged commit 6e1b6fa into spatie:master Aug 7, 2019

3 checks passed

Scrutinizer 1 new issues, 8 updated code elements
Details
continuous-integration/styleci/pr The analysis has passed
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
{
$url = (string) $url->url;
return ! isset($this->pendingUrls[$url]) && isset($this->pendingUrls[$url]);

This comment has been minimized.

Copy link
@freekmurze

freekmurze Aug 7, 2019

Member

Maybe it's because I didn't have coffee yet, but I don't understand this line. Could you explain what you're checking here?

This comment has been minimized.

Copy link
@freekmurze

freekmurze Aug 7, 2019

Member

Added some tests around this and fixed this on master

This comment has been minimized.

Copy link
@AbdelkaderBah

AbdelkaderBah Aug 7, 2019

  • fix line 61:
    return ! isset($this->pendingUrls[$url]) && isset($this->processedUrls[$url]);

This comment has been minimized.

Copy link
@BenMorel

BenMorel Aug 7, 2019

Author Contributor

Looks like you figured it out in 9078d28 ! This line is the strict translation of:

return ! $this->contains($this->pendingUrls, $url) && $this->contains($this->urls, $url);

But less formatted for readability.

This comment has been minimized.

Copy link
@juancho48

juancho48 Aug 7, 2019

isset($this->pendingUrls[$url]) is repeated twice? doesn't this return false always?

This comment has been minimized.

Copy link
@BenMorel

BenMorel Aug 7, 2019

Author Contributor

OMG I'm just seing it now, someone had to put it twice right under my eyes before I could see it. Thanks for pointing this out, it should have been && isset($this->urls[$url]). This has been fixed in 9078d28 anyway!

This comment has been minimized.

Copy link
@juancho48

juancho48 Aug 7, 2019

I just wanna help out... cool thing is fixed

* @var CrawlUrl[]
*/
protected $pendingUrls = [];

This comment has been minimized.

Copy link
@AbdelkaderBah

AbdelkaderBah Aug 7, 2019

Add
protected $processedUrls = []

This comment has been minimized.

Copy link
@BenMorel

BenMorel Aug 7, 2019

Author Contributor

There is no real justification for an additional $processedUrls here: processed URLs are those that are in $urls and not in $pendingUrls, the check is easy and with a very low performance cost.

OTOH, adding an extra property to store these would double the memory cost of the crawl queue.

What could be done is replacing $urls with $processedUrls; so that pending URLs are only stored in $pendingUrls, and moved to $processedUrls once done. I'm not sure if this is a big improvement, but it can be done.

This comment has been minimized.

Copy link
@freekmurze

freekmurze Aug 7, 2019

Member

Ok, let's leave it like it is for now. Thanks for your feedback.

public function markAsProcessed(CrawlUrl $crawlUrl)
{
$url = (string) $crawlUrl->url;

This comment has been minimized.

Copy link
@AbdelkaderBah

AbdelkaderBah Aug 7, 2019

  • add $url to $this->processedUrls

This comment has been minimized.

Copy link
@freekmurze

freekmurze Aug 7, 2019

Member

@AbdelkaderBah could you PR your proposals (with tests) to master?

@BenMorel

This comment has been minimized.

Copy link
Contributor Author

commented Aug 7, 2019

Thanks for adding some tests and for releasing, @freekmurze! 馃憤

Note that I'm not fond of adding an InvalidUrl exception: this is not an exception that's meant to be caught in userland, it can only happen if there is a bug in the code. For this reason, I think using built-in exceptions (InvalidArgumentException, or better yet in PHP 7, TypeError for consistency with PHP's built-in type checker) is the way to go.

If the justification is for moving the exception message logic out of the main code flow, I think a separate factory method for the exception would be appropriate.

Anyway it's released now, let's discuss this later in v5!

@freekmurze

This comment has been minimized.

Copy link
Member

commented Aug 7, 2019

@BenMorel just wanted to let you know that already use the changes by this PR over at Oh Dear!. We see that our app now uses dramatically less CPU. Thanks to this change we can run more checks concurrently. 馃憤

@BenMorel

This comment has been minimized.

Copy link
Contributor Author

commented Aug 7, 2019

@freekmurze Wow, happy to hear that! 馃帀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can鈥檛 perform that action at this time.