Advanced intelligent web scraper for Laravel with caching, rate limiting, middleware, monitoring, and much more. Built on Puppeteer with smart features that make web scraping professional, efficient, and reliable.
- ✅ Intelligent Caching - Automatic caching to avoid redundant requests
- ✅ Rate Limiting - Prevent overwhelming target websites
- ✅ User-Agent Rotation - Rotate user agents automatically to avoid detection
- ✅ Middleware System - Extensible middleware for request manipulation
- ✅ Automatic Retry - Exponential-backoff retry logic for failed requests
- ✅ Screenshot & PDF - Capture screenshots and generate PDFs
- ✅ Proxy Support - Full proxy support with authentication
- ✅ Monitoring & Logging - Comprehensive monitoring and logging
- ✅ Schema Validation - Validate extracted data against schemas
- ✅ Concurrent Scraping - Scrape multiple URLs concurrently
- ✅ Queue Support - Process scraping jobs in the background
- ✅ Error Handling - Robust error handling and recovery
- ✅ Smart Site Detection - Automatically detect the site type and use appropriate selectors
- ✅ Multi-Site Support - Intelligently handle multiple websites with different HTML structures
Install the package via Composer:

```bash
composer require shammaa/laravel-smart-scraper
```

Publish the configuration file:

```bash
php artisan vendor:publish --tag=smart-scraper-config
```

Install the Node.js dependencies (required for Puppeteer):

```bash
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```

Note: Make sure Node.js is installed and available in your PATH. If you use NVM, see the NVM Configuration section below.
Generate a scraper class using the Artisan command:

```bash
php artisan make:scraper ProductScraper
```

This creates a file at app/Scrapers/ProductScraper.php:

```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter('h1')->text(''),
            'price' => $crawler->filter('.price')->text(''),
            'description' => $crawler->filter('.description')->text(''),
        ];
    }
}
```

Run the scraper:

```php
use App\Scrapers\ProductScraper;

$data = ProductScraper::scrape('https://example.com/product/123')
    ->timeout(10000)
    ->run();

dd($data);
```

The simplest call needs only a URL:

```php
use App\Scrapers\ProductScraper;

$data = ProductScraper::scrape('https://example.com/product/123')->run();
```

Chain additional options as needed:

```php
$data = ProductScraper::scrape('https://example.com/product/123')
    ->timeout(20000)                       // 20-second timeout
    ->proxy('ip:port', 'user', 'pass')     // Use a proxy
    ->headers(['Accept-Language' => 'en']) // Custom headers
    ->retry(3, 5)                          // Retry 3 times, wait 5 seconds
    ->cache(false)                         // Disable caching
    ->run();
```

You can pass parameters to the handle() method:
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(string $selector = 'h1'): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter($selector)->text(''),
        ];
    }
}
```

Then use it:

```php
$data = ProductScraper::scrape('https://example.com/product/123')
    ->run(selector: '.product-title');
```

The scraper automatically caches results to avoid redundant requests:
```php
// Enable caching (default)
$data = ProductScraper::scrape('https://example.com/product/123')
    ->cache(true)
    ->run();

// Disable caching
$data = ProductScraper::scrape('https://example.com/product/123')
    ->cache(false)
    ->run();
```

The cache TTL can be configured in config/smart-scraper.php:

```php
'cache' => [
    'ttl' => 3600, // 1 hour
],
```

Prevent overwhelming target websites with rate limiting:
```php
// Rate limiting is enabled by default
$data = ProductScraper::scrape('https://example.com/product/123')->run();

// Disable rate limiting
$data = ProductScraper::scrape('https://example.com/product/123')
    ->rateLimit(false)
    ->run();
```

Configure rate limits in config/smart-scraper.php:

```php
'rate_limit' => [
    'enabled' => true,
    'max_requests' => 10, // Max 10 requests
    'per_seconds' => 60,  // Per 60 seconds
],
```

User agents are automatically rotated to avoid detection:
```php
// Rotation is enabled by default
$data = ProductScraper::scrape('https://example.com/product/123')->run();
```

Configure user agents in config/smart-scraper.php:

```php
'user_agent' => [
    'rotation_enabled' => true,
    'agents' => [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
        // Add more user agents
    ],
],
```

Full proxy support with authentication:
```php
// Without authentication
$data = ProductScraper::scrape('https://example.com/product/123')
    ->proxy('200.20.14.84:40200')
    ->run();

// With authentication
$data = ProductScraper::scrape('https://example.com/product/123')
    ->proxy('200.20.14.84:40200', 'username', 'password')
    ->run();
```

Automatic retry with exponential backoff:
```php
// Retry up to 3 times, starting with a 1-second delay between attempts
$data = ProductScraper::scrape('https://example.com/product/123')
    ->retry(3, 1)
    ->run();
```

Configure the default retry settings in config/smart-scraper.php:

```php
'retry' => [
    'enabled' => true,
    'max_attempts' => 3,
    'initial_delay' => 1, // seconds
    'max_delay' => 60, // seconds
    'backoff_multiplier' => 2, // delays grow as 1s, 2s, 4s, ... capped at max_delay
    'retryable_status_codes' => [408, 429, 500, 502, 503, 504],
],
```

Capture screenshots of web pages:
```php
// Save a screenshot to a file
$data = ProductScraper::scrape('https://example.com/product/123')
    ->screenshot(true, storage_path('app/screenshots/product.png'))
    ->run();

// Get the screenshot as base64
$data = ProductScraper::scrape('https://example.com/product/123')
    ->screenshot(true)
    ->run();

$screenshotBase64 = $data['screenshot'] ?? null;
```

Generate PDFs from web pages:
```php
// Save a PDF to a file
$data = ProductScraper::scrape('https://example.com/product/123')
    ->pdf(true, storage_path('app/pdfs/product.pdf'))
    ->run();

// Get the PDF as base64
$data = ProductScraper::scrape('https://example.com/product/123')
    ->pdf(true)
    ->run();

$pdfBase64 = $data['pdf'] ?? null;
```

Add custom headers to requests:
```php
$data = ProductScraper::scrape('https://example.com/product/123')
    ->headers([
        'Accept-Language' => 'en-US,en;q=0.9',
        'Accept' => 'text/html,application/xhtml+xml',
        'X-Custom-Header' => 'value',
    ])
    ->run();
```

Validate extracted data against a schema:
```php
use Shammaa\LaravelSmartScraper\Services\SchemaValidatorService;

$data = ProductScraper::scrape('https://example.com/product/123')
    ->validate(function ($data) {
        $validator = new SchemaValidatorService();

        return $validator->validate($data, [
            'title' => ['required' => true, 'type' => 'string'],
            'price' => ['required' => true, 'type' => 'string'],
            'description' => ['required' => false, 'type' => 'string'],
        ]);
    })
    ->run();
```

Create custom middleware to modify requests:
```php
use Shammaa\LaravelSmartScraper\Contracts\MiddlewareInterface;

class CustomHeaderMiddleware implements MiddlewareInterface
{
    public function handle(array $options): array
    {
        $options['headers']['X-Custom'] = 'value';

        return $options;
    }
}

// Use the middleware
$data = ProductScraper::scrape('https://example.com/product/123')
    ->middleware(new CustomHeaderMiddleware())
    ->run();
```

Generate a new scraper class:

```bash
php artisan make:scraper ProductScraper
```

List the registered scrapers:

```bash
php artisan list:scrapers
```

Test a scraper against a URL:

```bash
php artisan scraper:test "App\Scrapers\ProductScraper" "https://example.com/product/123"
```

The scraper can intelligently detect different websites and automatically use the appropriate selectors for each one. This means you can scrape multiple websites with different HTML structures using the same scraper class!
- Automatic Site Detection - The scraper detects the site type from URL patterns or HTML patterns
- Smart Selectors - Uses site-specific selectors when available, falling back to generic selectors
- Fallback System - If a selector fails, the next one is tried automatically
Use the --smart flag when creating a scraper:

```bash
php artisan make:scraper ProductScraper --smart
```

This creates a scraper with smart selectors:
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(): array
    {
        $smart = $this->smart();

        return [
            // Smart extraction - tries multiple selectors automatically
            'title' => $smart->extract('title', [
                'h1',
                '.title',
                '[itemprop="name"]',
                'title',
            ]),
            'price' => $smart->extract('price', [
                '.price',
                '[itemprop="price"]',
                '.amount',
                '.cost',
            ]),
            'image' => $smart->extractAttribute('image', [
                'img.main-image',
                '.product-image img',
                '[itemprop="image"]',
                'img',
            ], 'src'),
            'description' => $smart->extract('description', [
                '.description',
                '[itemprop="description"]',
                '.content',
                'p',
            ]),
        ];
    }
}
```

`extract()` tries multiple selectors until one works:
```php
$title = $smart->extract('title', [
    'h1.product-title',
    'h1',
    '.title',
    '[itemprop="name"]',
], 'Default Title');
```

`extractAttribute()` tries multiple selectors to extract an attribute:
```php
$image = $smart->extractAttribute('image', [
    'img.main-image',
    '.product-image img',
    '[itemprop="image"]',
], 'src', 'default.jpg');
```

`extractMultiple()` extracts multiple elements:
```php
$tags = $smart->extractMultiple('tags', [
    '.tag',
    '.tags a',
    '[itemprop="keywords"]',
], function ($node) {
    return $node->text();
});
```

Define site profiles in config/smart-scraper.php:
```php
'site_profiles' => [
    'amazon' => [
        'url_patterns' => [
            '/amazon\.(com|co\.uk|de|fr|it|es|ca|com\.au)/',
        ],
        'html_patterns' => [
            '#nav-logo' => null,
            '[data-asin]' => null,
        ],
        'selectors' => [
            'title' => [
                '#productTitle',
                'h1.a-size-large',
                'h1',
            ],
            'price' => [
                '.a-price .a-offscreen',
                '#priceblock_dealprice',
                '#priceblock_saleprice',
            ],
        ],
    ],
    'ebay' => [
        'url_patterns' => [
            '/ebay\.(com|co\.uk|de|fr|it|es|ca|com\.au)/',
        ],
        'html_patterns' => [
            '#gh-logo' => null,
            '[data-testid="x-item-title-label"]' => null,
        ],
        'selectors' => [
            'title' => [
                'h1[data-testid="x-item-title-label"]',
                'h1.it-ttl',
                'h1',
            ],
            'price' => [
                '.notranslate',
                '.u-flL.condText',
            ],
        ],
    ],
],
```

How detection works:

- URL Pattern Matching - First, the scraper tries to match the URL against each profile's URL patterns
- HTML Pattern Matching - If the URL doesn't match, it analyzes the HTML structure
- Selector Priority - Site-specific selectors are used first, then the generic fallbacks
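That lookup order can be sketched as follows (a hedged illustration in Python, not the package's implementation: `detect_site_type` is invented for this example, and the HTML patterns are simplified to plain substring markers; the profile data mirrors the config shown above):

```python
import re

# Simplified profiles mirroring config/smart-scraper.php.
# html_patterns are reduced to substring markers for this sketch.
SITE_PROFILES = {
    "amazon": {
        "url_patterns": [r"amazon\.(com|co\.uk|de|fr|it|es|ca|com\.au)"],
        "html_markers": ["nav-logo", "data-asin"],
    },
    "ebay": {
        "url_patterns": [r"ebay\.(com|co\.uk|de|fr|it|es|ca|com\.au)"],
        "html_markers": ["gh-logo", "x-item-title-label"],
    },
}

def detect_site_type(url, html):
    # 1. URL pattern matching comes first
    for site, profile in SITE_PROFILES.items():
        if any(re.search(p, url) for p in profile["url_patterns"]):
            return site
    # 2. Otherwise, fall back to inspecting the HTML structure
    for site, profile in SITE_PROFILES.items():
        if any(marker in html for marker in profile["html_markers"]):
            return site
    # 3. No profile matched: generic selectors will be used
    return None
```

Checking URL patterns before HTML patterns is the cheap path: it avoids parsing the page at all when the hostname already identifies the site.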
```php
use App\Scrapers\ProductScraper;

// Works with Amazon
$amazonData = ProductScraper::scrape('https://amazon.com/product/123')->run();

// Works with eBay
$ebayData = ProductScraper::scrape('https://ebay.com/itm/123')->run();

// Works with any e-commerce site
$genericData = ProductScraper::scrape('https://example-shop.com/product/123')->run();
```

The same scraper automatically adapts to each site's structure!
You can also manually set or check the site type:
```php
protected function handle(): array
{
    $siteType = $this->getSiteType(); // 'amazon', 'ebay', null, etc.

    if ($siteType === 'amazon') {
        // Amazon-specific logic
    } elseif ($siteType === 'ebay') {
        // eBay-specific logic
    }

    // Or set it manually
    $this->setSiteType('custom-site');

    return [];
}
```

Monitoring is enabled by default, and all scraping activity is logged:
```php
// Logs are created automatically
$data = ProductScraper::scrape('https://example.com/product/123')->run();
```

Check the logs in storage/logs/laravel.log:

```
[2024-01-01 12:00:00] local.INFO: Scraping started {"url":"https://example.com/product/123",...}
[2024-01-01 12:00:02] local.INFO: Scraping completed {"url":"https://example.com/product/123","duration":"2.5s",...}
```

Configure monitoring in config/smart-scraper.php:

```php
'monitoring' => [
    'enabled' => true,
    'log_channel' => 'stack',
    'track_metrics' => true,
],
```

All configuration options are available in config/smart-scraper.php:
```php
return [
    'cache' => [
        'driver' => 'file',
        'ttl' => 3600,
        'prefix' => 'smart_scraper',
    ],

    'rate_limit' => [
        'enabled' => true,
        'max_requests' => 10,
        'per_seconds' => 60,
    ],

    'puppeteer' => [
        'node_path' => 'node',
        'script_path' => __DIR__ . '/../resources/js/scraper.js',
        'timeout' => 30000,
        'headless' => true,
    ],

    // ... more options
];
```

If you install Node.js via NVM and run scrapers from scheduled tasks, Node may not be available on the PATH. To fix this:
- Edit your `~/.bash_profile`:

```bash
nano ~/.bash_profile
```

- Add this at the top:

```bash
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"
```

- Reload the file:

```bash
source ~/.bash_profile
```

Note: Using NVM in production environments is not recommended.
Issue: Puppeteer execution failed

Solution: Make sure Node.js and the Puppeteer dependencies are installed:

```bash
node --version
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```

Issue: Rate limit exceeded

Solution: Adjust the rate limit settings or disable rate limiting:

```php
->rateLimit(false)
```

Issue: Data validation failed

Solution: Check your validation schema and ensure the extracted data matches the expected types.
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter('h1.product-title')->text(''),
            'price' => $crawler->filter('.price')->text(''),
            'currency' => $crawler->filter('.currency')->text(''),
            'description' => $crawler->filter('.product-description')->text(''),
            'images' => $crawler->filter('.product-images img')->each(function ($node) {
                return $node->attr('src');
            }),
            'rating' => $crawler->filter('.rating')->text(''),
            'reviews_count' => $crawler->filter('.reviews-count')->text(''),
        ];
    }
}

// Usage
$data = ProductScraper::scrape('https://example.com/product/123')
    ->timeout(15000)
    ->retry(3, 2)
    ->run();
```

```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class NewsScraper extends Scraper
{
    protected function handle(): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter('h1.article-title')->text(''),
            'author' => $crawler->filter('.article-author')->text(''),
            'published_at' => $crawler->filter('.article-date')->attr('datetime'),
            'content' => $crawler->filter('.article-content')->html(),
            'tags' => $crawler->filter('.article-tags a')->each(function ($node) {
                return $node->text();
            }),
            'image' => $crawler->filter('.article-image img')->attr('src'),
        ];
    }
}

// Usage with a screenshot
$data = NewsScraper::scrape('https://example.com/news/article-123')
    ->screenshot(true, storage_path('app/screenshots/article.png'))
    ->run();
```

```php
use App\Scrapers\ProductScraper;
use Shammaa\LaravelSmartScraper\Services\ConcurrentScraperService;

$urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3',
];

$concurrentScraper = new ConcurrentScraperService(maxConcurrent: 5);

$results = $concurrentScraper->scrape($urls, function ($url) {
    return ProductScraper::scrape($url)->run();
});

foreach ($results as $url => $data) {
    echo "Scraped: {$url}\n";
    print_r($data);
}
```

Contributions are welcome! Please feel free to submit a Pull Request.
This package is open-sourced software licensed under the MIT license.
Built with ❤️ by Shadi Shammaa

Made with Laravel Smart Scraper - Professional web scraping made easy! 🚀