v1.0.0 - Official release, the start of a growing database

@ZA1815 ZA1815 released this 29 Oct 22:31

🎉 caniscrape v1.0.0 Release Notes

The first major release! v1.0.0 marks caniscrape's transition from a standalone CLI tool to a complete cloud-connected platform for tracking website protections over time.

🚀 What's New in v1.0.0

1. Cloud Integration ☁️

Connect your local CLI to caniscrape Cloud for persistent scan history, team collaboration, and protection change tracking.

Key features:

  • Project Management: Create projects to organize scans by purpose (e.g., "E-commerce Scraper", "News Aggregator")
  • Automatic Sync: Enable auto-upload to push every scan to the cloud instantly
  • Scan History: Track how site protections change over time
  • Smart Diffing: Automatically compare new scans against previous ones to detect protection changes
  • Offline Support: Scans are cached locally when you are offline; push them later with caniscrape push

New commands:

caniscrape init          # Initialize a cloud project
caniscrape link          # Connect to an existing project
caniscrape push          # Upload cached scans
caniscrape config set    # Configure auto-upload settings

Example workflow:

# One-time setup
caniscrape init

# Every scan now automatically syncs to cloud
caniscrape scan https://example.com

# View history at https://caniscrape.org/projects

2. Privacy-First Telemetry 📊

v1.0.0 adds two separate, opt-in telemetry systems to help improve caniscrape.

Usage Telemetry (anonymous):

  • CLI version, Python version, OS type
  • Commands used and success/failure rates
  • Error types (no URLs or personal data)
  • Completely anonymous with device ID only

Public Scan Database (like Shodan for anti-bot defenses):

  • Opt-in contribution of scan results to a searchable public database
  • See how site protections change over time across all users
  • Compare different sites' protection strategies
  • Currently free while building the database

Full control:

caniscrape telemetry usage on/off     # Toggle usage telemetry
caniscrape telemetry scans on/off     # Toggle scan contributions
caniscrape telemetry delete           # GDPR data deletion
caniscrape telemetry status           # View current settings

What we DON'T collect:

  • Your name, email, or IP address (usage telemetry)
  • Authentication tokens or credentials (scan telemetry)
  • Any personally identifiable information

3. Scan Comparison & Change Detection 🔄

Automatically detect when site protections change between scans.

Detected changes:

  • Difficulty score increases/decreases
  • New protections added (WAF, CAPTCHA, fingerprinting)
  • Protections removed or disabled
  • Status changes for existing protections

Example output:

📊 Changes Since Last Scan (2025-10-15 14:30)

⚠️  Difficulty Score: +3 points (site got harder to scrape)

⚠️  New Protections Detected:
  + WAF: DataDome
  + Canvas Fingerprinting

✅ Protections Removed:
  - CAPTCHA: reCAPTCHA v2
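
The comparison behind a report like this can be sketched in a few lines of Python. This is only an illustration: the ScanResult class, its fields, and the sample scores are hypothetical and are not caniscrape's actual scan schema or diff engine.

```python
from dataclasses import dataclass, field


@dataclass
class ScanResult:
    """Hypothetical, simplified scan record (not caniscrape's real schema)."""
    score: int
    protections: set[str] = field(default_factory=set)


def diff_scans(previous: ScanResult, current: ScanResult) -> dict:
    """Report the score delta and which protections were added or removed."""
    return {
        "score_delta": current.score - previous.score,
        "added": sorted(current.protections - previous.protections),
        "removed": sorted(previous.protections - current.protections),
    }


# Made-up sample data shaped like the example output above.
old = ScanResult(score=4, protections={"CAPTCHA: reCAPTCHA v2"})
new = ScanResult(score=7, protections={"WAF: DataDome", "Canvas Fingerprinting"})
changes = diff_scans(old, new)
print(changes)
```

Set difference keeps the sketch short; a real diff would also need to handle status changes for protections present in both scans.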

4. Improved CLI Structure ⚡

The CLI has been restructured for better organization and extensibility.

New command structure:

caniscrape scan <url>           # Analyze a website (replaces direct URL)
caniscrape init                 # Initialize cloud project
caniscrape link                 # Link to existing project
caniscrape push                 # Push cached scans
caniscrape config set/show      # Manage configuration
caniscrape telemetry            # Manage telemetry settings

Backward compatibility note:

# No longer works (old v0.3.0 syntax)
caniscrape <url>
# Works (new syntax)
caniscrape scan <url>

5. Improved Error Handling & UX ✨

  • Better error messages with actionable guidance
  • Clear prompts for authentication and setup
  • Informative status messages during long operations
  • Graceful handling of network failures and timeouts
  • Improved progress indicators for multi-step operations

6. Configuration Management ⚙️

Fine-grained control over CLI behavior:

# Enable/disable auto-upload
caniscrape config set auto-upload on
caniscrape config set auto-upload off
# View current configuration
caniscrape config show

Configuration hierarchy:

  • Searches parent directories for .caniscrape/config
  • Allows different projects in subdirectories
  • Works like git's configuration system
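
A git-style lookup like this is typically a walk from the current directory up to the filesystem root. The sketch below shows one way it could work; the find_config name is hypothetical, and it assumes .caniscrape/config is a regular file, which the notes do not state.

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path) -> Optional[Path]:
    """Return the nearest .caniscrape/config at or above `start`, or None."""
    start = start.resolve()
    # Check `start` itself first, then each parent up to the filesystem root.
    for directory in (start, *start.parents):
        candidate = directory / ".caniscrape" / "config"
        if candidate.is_file():  # assumption: the config is a plain file
            return candidate
    return None


print(find_config(Path.cwd()))
```

Because the nearest match wins, a subdirectory with its own .caniscrape/config takes precedence over one higher up the tree, which is what allows different projects in subdirectories.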

🔧 Technical Improvements

API Client

  • Robust error handling with retry logic
  • Rate limit detection and user-friendly messages
  • Token expiration handling with re-authentication prompts
  • Proper timeout management
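
As a rough illustration of the retry behavior described above, here is a minimal exponential-backoff helper. It is not the caniscrape API client: with_retries, RateLimited, and the default delays are all hypothetical, and it assumes rate-limit errors should be surfaced immediately rather than retried.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error for a rate-limit response from the server."""


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient network errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except RateLimited:
            # Assumption: do not retry when asked to slow down; let the
            # caller show a friendly message instead.
            raise
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # delay doubles on each retry
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the helper easy to test without real delays.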

Caching System

  • Local scan results cache in .caniscrape/cache/
  • Automatic cache cleanup after successful push
  • Metadata tracking (timestamp, CLI version, URL)
  • Works as fallback when offline or rate-limited
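
The cache-then-push flow could look roughly like the sketch below. The file naming, the JSON layout, and the cache_scan and push_cached helpers are assumptions made for illustration; only the .caniscrape/cache/ location and the metadata fields (timestamp, CLI version, URL) come from the notes above.

```python
import json
import time
import uuid
from pathlib import Path
from typing import Callable

CLI_VERSION = "1.0.0"


def cache_scan(cache_dir: Path, url: str, result: dict) -> Path:
    """Write one scan result plus metadata to its own JSON file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = {
        "url": url,
        "timestamp": time.time(),
        "cli_version": CLI_VERSION,
        "result": result,
    }
    path = cache_dir / f"scan-{uuid.uuid4().hex}.json"  # hypothetical naming
    path.write_text(json.dumps(entry))
    return path


def push_cached(cache_dir: Path, upload: Callable[[dict], None]) -> int:
    """Upload every cached scan, deleting each file only after it succeeds."""
    pushed = 0
    for path in cache_dir.glob("scan-*.json"):
        upload(json.loads(path.read_text()))  # assumed to raise on failure
        path.unlink()  # cleanup happens only after a successful upload
        pushed += 1
    return pushed
```

Deleting a file only after its upload returns means a failed push leaves the remaining scans in the cache for the next attempt.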

Diff Engine

  • Intelligent comparison of scan results
  • Handles schema changes between versions
  • Filters out noise (e.g., duplicate Cloudflare detections)
  • Clear visualization of changes

📊 Updated Scoring System

The difficulty scoring from v0.3.0 remains unchanged, but now integrates with cloud tracking:

Detection                               Impact
Known bot detection service detected    +2 points
Canvas fingerprinting signal            +1 point
Browser function modifications          +1 point
CAPTCHA on page load                    +5 points
CAPTCHA after rate limit                +4 points
DataDome/PerimeterX WAF                 +4 points
Akamai/Imperva WAF                      +3 points
Aggressive rate limiting                +3 points
Cloudflare WAF                          +2 points
Honeypot traps detected                 +2 points
TLS fingerprinting active               +1 point

Score interpretation remains the same:

  • 0-2: Easy (basic scraping will work)
  • 3-4: Medium (need some precautions)
  • 5-7: Hard (requires advanced techniques)
  • 8-10: Very Hard (consider using a service)
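
To make the arithmetic concrete, the sketch below adds up the point values listed above and maps the total to a difficulty band. The dictionary keys and function names are invented for illustration. The cap at 10 is an assumption based on the bands ending at 10; the notes do not say how totals above 10 are handled.

```python
# Point values come from the scoring list above; the key names are made up.
POINTS = {
    "bot_detection_service": 2,
    "canvas_fingerprinting": 1,
    "browser_function_modifications": 1,
    "captcha_on_page_load": 5,
    "captcha_after_rate_limit": 4,
    "waf_datadome_perimeterx": 4,
    "waf_akamai_imperva": 3,
    "aggressive_rate_limiting": 3,
    "waf_cloudflare": 2,
    "honeypot_traps": 2,
    "tls_fingerprinting": 1,
}


def difficulty_score(detections: set[str]) -> int:
    """Sum the points for each detection, capped at 10 (assumed cap)."""
    return min(10, sum(POINTS[name] for name in detections))


def interpret(score: int) -> str:
    """Map a 0-10 score to the bands from the release notes."""
    if score <= 2:
        return "Easy"
    if score <= 4:
        return "Medium"
    if score <= 7:
        return "Hard"
    return "Very Hard"


score = difficulty_score({"waf_cloudflare", "tls_fingerprinting"})
print(score, interpret(score))
```

For example, a site with only a Cloudflare WAF and TLS fingerprinting scores 2 + 1 = 3, which falls in the Medium band.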

🎓 New Use Cases

For Consultants

  • Historical data: Show clients how site protections evolved
  • Professional presentation: The cloud dashboard is more polished than raw CLI output

For Long-Term Monitoring

  • Protection tracking: See when sites add/remove defenses
  • Seasonal patterns: Identify when sites tighten security (e.g., Black Friday)
  • Regression detection: Get alerted when sites become easier to scrape

🚧 Breaking Changes

Command Structure
Old (v0.3.0):

caniscrape https://example.com

New (v1.0.0):

caniscrape scan https://example.com

πŸ“ Coming in v1.1.0

Scheduled Scans: Automatic re-scanning on a schedule

πŸ› Bug Fixes

  • Fixed double-counting of Cloudflare in both WAF and fingerprinting detection
  • Improved proxy handling in CAPTCHA solver integration
  • Better error messages when wafw00f is missing
  • Fixed edge cases in diff engine when comparing old scan formats
  • Corrected behavioral detector link counting logic

πŸ™ Acknowledgments

  • Community feedback: Thank you to everyone who tested v0.3.0
  • Dependencies: Built on wafw00f, Playwright, curl_cffi, and other amazing open-source projects

📬 Feedback & Support

GitHub Issues: https://github.com/ZA1815/caniscrape/issues
Documentation: https://docs.caniscrape.org (coming soon)
Cloud Dashboard: https://caniscrape.org


Happy scraping! 🎉