Crawl all unique internal links found on a given website

Arachnid Web Crawler

This library will crawl all unique internal links found on a given website up to a specified maximum page depth.
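
To make "unique internal links up to a maximum depth" concrete, here is a minimal sketch of a depth-limited, breadth-first traversal over an in-memory link graph. It is not Arachnid's implementation: the `$site` array, the `crawlSketch` function, the rule that only root-relative paths count as internal, and the convention that the start page sits at depth 0 are all assumptions made for illustration.

```php
<?php
// Hypothetical in-memory "site": each page maps to the links found on it.
// A real crawler fetches pages over HTTP; this only illustrates the traversal.
$site = [
    '/'            => ['/about', '/blog', 'https://external.example/'],
    '/about'       => ['/', '/team'],
    '/blog'        => ['/blog/post-1', '/about'],
    '/team'        => ['/team/alice'],
    '/blog/post-1' => [],
    '/team/alice'  => [],
];

/**
 * Breadth-first traversal collecting unique internal links up to $maxDepth.
 * Returns an array of path => depth at which the path was first seen.
 */
function crawlSketch(array $site, string $start, int $maxDepth): array
{
    $seen = [$start => 0];
    $queue = [$start];

    while ($queue) {
        $page = array_shift($queue);
        $depth = $seen[$page];
        if ($depth >= $maxDepth) {
            continue; // do not follow links beyond the depth limit
        }
        foreach ($site[$page] ?? [] as $link) {
            // Only root-relative paths count as internal; skip links already seen.
            if (strpos($link, '/') !== 0 || isset($seen[$link])) {
                continue;
            }
            $seen[$link] = $depth + 1;
            $queue[] = $link;
        }
    }

    return $seen;
}

$result = crawlSketch($site, '/', 2);
print_r($result);
```

With a depth limit of 2, `/team/alice` is never reached because it is only linked from a page that already sits at the limit, and the external URL is skipped entirely.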

This library is based on the original blog post by Zeid Rashwani here:

Josh Lockhart adapted the original blog post's code (with permission) for Composer and Packagist and updated the syntax to conform with the PSR-2 coding standard.


How to Install

You can install this library with Composer. Add this to your composer.json manifest file:

    {
        "require": {
            "zrashwani/arachnid": "dev-master"
        }
    }

Then run composer install.

Getting Started

Here's a quick demo to crawl a website:

require 'vendor/autoload.php';

$url = '';
$linkDepth = 3;
// Initiate crawl
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->traverse();

// Get link data
$links = $crawler->getLinksArray(); // to get links as objects, use the getLinks() method

Advanced Usage

The crawler supports several other options:

Pass additional options to the underlying Guzzle client, either as an array in the constructor or through setCrawlerOptions:

    // The third parameter is an array of options used to configure the Guzzle client
    $crawler = new \Arachnid\Crawler('', 2,
                             ['auth' => array('username', 'password')]);

    // Or set the options separately with `setCrawlerOptions`
    $options = array(
        'curl' => array(
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_SSL_VERIFYPEER => false,
        ),
        'timeout' => 30,
        'connect_timeout' => 30,
    );

    $crawler->setCrawlerOptions($options);

You can inject a PSR-3 compliant logger object (such as Monolog) to monitor crawler activity:

$crawler = new \Arachnid\Crawler($url, $linkDepth); // ... initialize crawler

// Set a logger for crawler activity (compatible with PSR-3)
$logger = new \Monolog\Logger('crawler logger');
$logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
$crawler->setLogger($logger); // assumes the crawler exposes a PSR-3 style setLogger()

You can restrict the crawler to pages that match specific criteria by passing a callback closure to the filterLinks method:

// Filter links with a callback closure
$links = $crawler->filterLinks(function($link){
                    // crawl only blog links
                    return (bool)preg_match('/.*\/blog.*$/u',$link);
                })
                ->traverse()   // assumes filterLinks() returns the crawler for chaining
                ->getLinks();
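
To see which URLs the pattern above accepts, the predicate can be run on its own, without the crawler. This is only a sketch: the `$isBlogLink` name and the sample URLs are hypothetical, and the callback is assumed to receive the link as a string.

```php
<?php
// Same regular expression as in the filterLinks example, as a standalone closure.
$isBlogLink = function ($link): bool {
    return (bool) preg_match('/.*\/blog.*$/u', $link);
};

// Hypothetical sample URLs, used only to show what the pattern accepts.
$samples = [
    'https://example.com/blog',
    'https://example.com/blog/2019/post',
    'https://example.com/about',
    'https://example.com/weblog',
];

// Keep only the URLs for which the predicate returns true.
$matched = array_values(array_filter($samples, $isBlogLink));
print_r($matched);
```

The pattern matches any URL containing `/blog`, so a path such as `/blogroll` would also pass. Anchor the expression more tightly if that is not wanted.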

You can use the LinksCollection class to get simple statistics about the links, as follows:

$links = $crawler->traverse()
                 ->getLinks();
$collection = new LinksCollection($links);

//getting broken links
$brokenLinks = $collection->getBrokenLinks();

//getting links for specific depth
$depth2Links = $collection->getByDepth(2);
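
The idea behind these two queries can be sketched with plain arrays. This is not how LinksCollection is implemented: the record layout (`status_code` and `depth` keys) and the rule that any 4xx or 5xx response counts as broken are assumptions made for this example.

```php
<?php
// Hypothetical link records; the 'status_code' and 'depth' keys are assumptions
// made for this sketch, not a documented Arachnid format.
$links = [
    '/'         => ['status_code' => 200, 'depth' => 0],
    '/about'    => ['status_code' => 200, 'depth' => 1],
    '/old-page' => ['status_code' => 404, 'depth' => 1],
    '/team'     => ['status_code' => 500, 'depth' => 2],
];

// Treat any 4xx/5xx response as broken (an assumption about what "broken" means).
$broken = array_filter($links, function (array $info): bool {
    return $info['status_code'] >= 400;
});

// Keep only the links discovered at depth 2.
$depth2 = array_filter($links, function (array $info): bool {
    return $info['depth'] === 2;
});

print_r(array_keys($broken));
print_r(array_keys($depth2));
```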

How to Contribute

  1. Fork this repository
  2. Create a new branch for each feature or improvement
  3. Apply your code changes along with corresponding unit tests
  4. Send a pull request from each feature branch

It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.

All pull requests must adhere to the PSR-2 standard.

System Requirements

  • PHP 7.1.0+



MIT Public License