BE-5: Website Availability & Broken Link Detection System Research #5

@tecnodeveloper

Description:
Research how to detect whether a website is live or broken before scraping. Understand HTTP status codes, uptime checks, broken link detection methods, and automated monitoring systems. This is required for Price WatchDog to avoid scraping invalid product pages.


User Story

Given I have a product URL
When the system checks the link
Then it should confirm if the website is live, broken, or inaccessible before scraping


Tasks


Website Availability Basics

  1. Understand Website Status Checking

    • What is website uptime
    • What is downtime
    • What causes site failure
  2. HTTP Status Codes

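    • 200 → OK (page is live)
    • 301/302 → Redirect (follow the Location header to the new URL)
    • 403 → Forbidden (server refuses the request, often bot or region blocking)
    • 404 → Not Found (page removed or URL wrong)
    • 500 → Server Error (failure on the site's side; may be temporary)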
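
One possible way to map the status codes above to a decision is a small classifier. This is a sketch only: the `LinkState` names and the extra codes (303, 307, 308, 401, 410) are assumptions that go beyond the list in this issue.

```python
from enum import Enum


class LinkState(Enum):
    LIVE = "live"
    REDIRECT = "redirect"
    BLOCKED = "blocked"
    BROKEN = "broken"
    SERVER_ERROR = "server_error"
    UNKNOWN = "unknown"


def classify_status(code: int) -> LinkState:
    """Map an HTTP status code to a coarse link state."""
    if 200 <= code < 300:
        return LinkState.LIVE
    if code in (301, 302, 303, 307, 308):  # 303/307/308 are assumed additions
        return LinkState.REDIRECT
    if code in (401, 403):                 # 401 is an assumed addition
        return LinkState.BLOCKED
    if code in (404, 410):                 # 410 (Gone) is an assumed addition
        return LinkState.BROKEN
    if 500 <= code < 600:
        return LinkState.SERVER_ERROR
    return LinkState.UNKNOWN
```

A scraper would then proceed only on `LinkState.LIVE` and treat `SERVER_ERROR` as retryable rather than permanently broken.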

Broken Link Detection

  1. What is a Broken Link

    • Dead URL
    • Removed product page
    • Redirect loop
  2. Detection Strategy

    • Send HTTP request
    • Check status code
    • Validate page content

Request-Based Checking

  1. Basic URL Validation

    • Send GET request
    • Check response status
    • Timeout handling
  2. Header Inspection

    • Content-Type check
    • Server response validation
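
The request-based check above could look roughly like this. The sketch uses only the standard library `urllib` so it runs without extra dependencies; the same steps map onto `requests`. `CheckResult`, `is_html`, `check_url`, and the User-Agent string are hypothetical names, not part of any existing Price WatchDog code. Note that `urlopen` follows redirects on its own, so the status recorded is that of the final response.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    url: str
    status: Optional[int]          # None when no HTTP response was received
    content_type: str
    error: Optional[str] = None


def is_html(content_type: str) -> bool:
    """True when the Content-Type header describes an HTML document."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def check_url(url: str, timeout: float = 10.0) -> CheckResult:
    """Send one GET request and record the status, Content-Type, or the failure."""
    try:
        request = urllib.request.Request(
            url, headers={"User-Agent": "PriceWatchDog-LinkCheck/0.1"}  # assumed UA
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            return CheckResult(url, response.status, content_type)
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status code and headers.
        content_type = exc.headers.get("Content-Type", "") if exc.headers else ""
        return CheckResult(url, exc.code, content_type)
    except (OSError, ValueError) as exc:
        # DNS failure, refused connection, timeout, or a malformed URL:
        # no HTTP response at all.
        return CheckResult(url, None, "", error=str(exc))
```

A product page would normally be expected to pass `is_html(result.content_type)`; a JSON or image response for a product URL is a hint that something is off.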

Advanced Detection Methods

  1. Content Validation

    • Detect “Product Not Found” text
    • Detect empty pages
    • Detect error templates
  2. Redirect Handling

    • Follow redirects
    • Detect infinite redirect loops
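
A minimal sketch of both advanced checks, under stated assumptions: the phrase list and the 200-character "empty page" threshold are placeholders that would need tuning per store, and `next_hop` is a hypothetical callable that returns the redirect target of a URL (or `None` when the response is not a redirect).

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical phrases; real stores need their own list.
NOT_FOUND_PHRASES = ("product not found", "page not found", "no longer available")
MIN_TEXT_LENGTH = 200  # assumed threshold for an "empty" page


def visible_text(html: str) -> str:
    """Crude text extraction: drop scripts, styles, and tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def content_problem(html: str) -> Optional[str]:
    """Return a reason string when a 200 response still looks broken."""
    text = visible_text(html).lower()
    for phrase in NOT_FOUND_PHRASES:
        if phrase in text:
            return f"error phrase: {phrase}"
    if len(text) < MIN_TEXT_LENGTH:
        return "empty page"
    return None


def follow_redirects(
    start: str,
    next_hop: Callable[[str], Optional[str]],
    max_hops: int = 10,
) -> Tuple[str, List[str]]:
    """Follow redirects one hop at a time; raise on a loop or an overly long chain."""
    seen = [start]
    current = start
    while True:
        target = next_hop(current)
        if target is None:
            return current, seen          # final URL and the chain that led to it
        if target in seen:
            raise RuntimeError(f"redirect loop at {target}")
        if len(seen) > max_hops:
            raise RuntimeError("too many redirects")
        seen.append(target)
        current = target
```

Keeping the redirect logic independent of the HTTP client makes it easy to test with a plain dictionary of hops, for example `follow_redirects("a", {"a": "b"}.get)`.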

Tools for Link Checking

  1. HTTP Libraries

    • requests (Python)
    • aiohttp (async requests)
  2. Automation Tools

    • Playwright
    • Selenium for rendering pages

Uptime Monitoring Concepts

  1. Website Monitoring System

    • Periodic URL checks
    • Store status history
    • Alert on failure
  2. Polling Strategy

    • Check every 15 min
    • Increase interval for stable sites
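
The polling strategy could be sketched as an interval that starts at 15 minutes and grows while a site stays healthy. The doubling rule (every 4 consecutive successes) and the 4-hour ceiling are assumptions chosen for illustration, not values from this issue.

```python
from datetime import datetime, timedelta

BASE_INTERVAL = timedelta(minutes=15)  # from the polling strategy above
MAX_INTERVAL = timedelta(hours=4)      # assumed ceiling


def next_interval(consecutive_ok: int) -> timedelta:
    """Double the interval after every 4 consecutive successful checks, up to the cap.

    A failed check should reset `consecutive_ok` to 0, which brings the
    interval back to the base value.
    """
    doublings = min(consecutive_ok // 4, 10)
    return min(BASE_INTERVAL * (2 ** doublings), MAX_INTERVAL)


def next_check_time(last_checked: datetime, consecutive_ok: int) -> datetime:
    """When the URL is due for its next check."""
    return last_checked + next_interval(consecutive_ok)
```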

Error Handling

  1. Common Failures

    • Timeout errors
    • DNS failures
    • Server overload
  2. Retry Logic

    • Retry failed requests
    • Exponential backoff
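
Retry with exponential backoff might be sketched as below. The defaults (3 retries, 1 s base, factor 2, 30 s cap) and the jitter are assumptions; `action` stands for any zero-argument check function. Catching `OSError` covers timeouts, connection errors, and `urllib.error.URLError`, which are all subclasses of it.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(
    retries: int, base: float = 1.0, factor: float = 2.0, cap: float = 30.0
) -> Iterator[float]:
    """Yield the wait before each retry: base, base*factor, ... capped at `cap`."""
    for attempt in range(retries):
        yield min(cap, base * factor ** attempt)


def with_retries(
    action: Callable[[], T],
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`; on a transient error wait and retry, re-raising after the last attempt."""
    delays = list(backoff_delays(retries))
    for attempt in range(retries + 1):
        try:
            return action()
        except OSError:
            if attempt == retries:
                raise
            # Jitter (an assumed addition) keeps many checkers from retrying in lockstep.
            sleep(delays[attempt] * random.uniform(0.5, 1.0))
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting.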

Performance Optimization

  1. Efficient Checking System

    • Batch URL checks
    • Async requests
    • Cache results
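
The three optimizations above can be combined in one small asyncio sketch. `checker` is a hypothetical coroutine that takes a URL and returns its HTTP status (in practice it might wrap an `aiohttp` request); the 900-second TTL and the concurrency limit of 20 are assumed defaults.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Iterable, Optional, Tuple

Checker = Callable[[str], Awaitable[int]]  # hypothetical: URL -> HTTP status


class TTLCache:
    """Remember recent results so a URL is not re-checked within `ttl` seconds."""

    def __init__(self, ttl: float = 900.0) -> None:
        self.ttl = ttl
        self._items: Dict[str, Tuple[float, int]] = {}

    def get(self, url: str) -> Optional[int]:
        entry = self._items.get(url)
        if entry is not None and time.monotonic() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, url: str, status: int) -> None:
        self._items[url] = (time.monotonic(), status)


async def check_batch(
    urls: Iterable[str],
    checker: Checker,
    cache: TTLCache,
    max_concurrency: int = 20,
) -> Dict[str, int]:
    """Check a batch of URLs concurrently, skipping duplicates and cached results."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def check_one(url: str) -> Tuple[str, int]:
        cached = cache.get(url)
        if cached is not None:
            return url, cached
        async with semaphore:  # bound the number of in-flight requests
            status = await checker(url)
        cache.put(url, status)
        return url, status

    unique = list(dict.fromkeys(urls))  # de-duplicate while keeping order
    pairs = await asyncio.gather(*(check_one(u) for u in unique))
    return dict(pairs)
```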

Scalability Design

  1. Large Scale Monitoring

    • Handle 1000+ URLs
    • Queue-based system
    • Distributed checking
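
As a single-process stand-in for the queue-based design, the sketch below drains a shared queue with a pool of worker threads. In a distributed setup the in-memory queue would presumably be replaced by an external broker; that part is an assumption and is not shown. `check` is a hypothetical function from URL to status code, and the `0` sentinel for a failed check is likewise assumed.

```python
import queue
import threading
from typing import Callable, Dict, Iterable


def run_workers(
    urls: Iterable[str],
    check: Callable[[str], int],
    worker_count: int = 8,
) -> Dict[str, int]:
    """Drain a shared queue of URLs with a small pool of worker threads."""
    jobs: "queue.Queue[str]" = queue.Queue()
    for url in urls:
        jobs.put(url)

    results: Dict[str, int] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                url = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            try:
                status = check(url)
            except Exception:
                status = 0  # assumed sentinel: the check itself failed
            with lock:
                results[url] = status

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```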

Data Storage Design

  1. Store URL Status

    • URL
    • Status code
    • Last checked time
    • Availability flag
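
The four fields above fit a single table. This SQLite sketch is one possible layout; the table name `url_status`, the column names, and the rule "available means a 2xx status" are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS url_status (
    url          TEXT PRIMARY KEY,
    status_code  INTEGER,            -- NULL when no HTTP response arrived
    last_checked TEXT NOT NULL,      -- ISO 8601 timestamp in UTC
    is_available INTEGER NOT NULL    -- 1 = safe to scrape, 0 = skip
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the status database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_check(conn: sqlite3.Connection, url: str, status_code: Optional[int]) -> None:
    """Insert or overwrite the latest check result for a URL."""
    available = int(status_code is not None and 200 <= status_code < 300)
    conn.execute(
        "INSERT OR REPLACE INTO url_status "
        "(url, status_code, last_checked, is_available) VALUES (?, ?, ?, ?)",
        (url, status_code, datetime.now(timezone.utc).isoformat(), available),
    )
    conn.commit()


def is_available(conn: sqlite3.Connection, url: str) -> bool:
    """True only when the URL has been checked and was available."""
    row = conn.execute(
        "SELECT is_available FROM url_status WHERE url = ?", (url,)
    ).fetchone()
    return bool(row and row[0])
```

This keeps only the latest result per URL; storing a status history would need a second table keyed by URL and check time.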

Real-World Use Cases

  1. E-Commerce Scenarios

    • Product removed
    • Out-of-stock page
    • Region-based blocking

False Positives Handling

  1. Avoid Wrong Detection

    • JS-rendered pages
    • Lazy loading content
    • Bot-block pages
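
To avoid marking such pages as broken, a triage step could separate "the checker was blocked" and "the page needs a browser" from "the page is fine". The marker lists, the 30-word threshold, and the use of 429 as a blocking signal are all assumptions; a `needs_render` result would be handed to a browser tool such as Playwright or Selenium, which is not shown here.

```python
import re

# Hypothetical markers; they would need tuning against real stores.
BOT_BLOCK_MARKERS = ("captcha", "verify you are human", "access denied", "unusual traffic")
JS_SHELL_MARKERS = ('id="root"', 'id="app"', "<noscript")
MIN_WORDS = 30  # assumed: fewer visible words suggests an unrendered shell


def triage(status: int, html: str) -> str:
    """Return "blocked", "needs_render", or "ok" for a fetched page."""
    lowered = html.lower()
    if status in (403, 429) or any(marker in lowered for marker in BOT_BLOCK_MARKERS):
        # The checker was refused; the product itself may still exist.
        return "blocked"
    # Strip script/style blocks and tags to estimate the visible text.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>|<[^>]+>", " ", lowered)
    if len(text.split()) < MIN_WORDS and any(m in lowered for m in JS_SHELL_MARKERS):
        # Almost no text plus an app mount point: content is likely built by JavaScript.
        return "needs_render"
    return "ok"
```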

Monitoring Strategy

  1. Continuous Monitoring System

    • Scheduled checks
    • Alert system
    • Logging failures
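
Failure logging and alerting could be tied together with a per-URL failure streak, as in this sketch. `FailureTracker`, the logger name `linkcheck`, and the threshold of 3 consecutive failures are assumptions; `alert` is a hypothetical callback (email, chat webhook, and so on).

```python
import logging
from collections import defaultdict
from typing import Callable, DefaultDict

logger = logging.getLogger("linkcheck")  # assumed logger name


class FailureTracker:
    """Log every failed check and alert once a URL fails `threshold` times in a row."""

    def __init__(self, alert: Callable[[str, int], None], threshold: int = 3) -> None:
        self.alert = alert
        self.threshold = threshold
        self.streaks: DefaultDict[str, int] = defaultdict(int)

    def record(self, url: str, ok: bool) -> None:
        if ok:
            self.streaks[url] = 0  # any success clears the streak
            return
        self.streaks[url] += 1
        logger.warning("check failed for %s (streak %d)", url, self.streaks[url])
        # Alert exactly once, when the streak first reaches the threshold.
        if self.streaks[url] == self.threshold:
            self.alert(url, self.streaks[url])
```

Alerting on a streak rather than on a single failure avoids noise from one-off timeouts.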

Acceptance Criteria

  • Website availability concepts understood
  • Broken link detection strategy defined
  • HTTP status handling mapped
  • Tools identified
  • Monitoring system design completed

Testing Steps

  1. Test live URL
  2. Test broken URL (404)
  3. Test redirect URL
  4. Test blocked URL (403)
  5. Simulate timeout

Definition of Done

  • Link detection system fully designed
  • Monitoring strategy defined
  • Tools selected
