# 15. Web Crawling & Scraping Concepts

A **Web Crawler** (spider) systematically browses the World Wide Web for indexing.

## Key Components:
1. **Frontier**: List of URLs to visit.
2. **Fetcher**: Retrieves page content.
3. **Parser**: Extracts links and content.
4. **Politeness Policy**: Respect `robots.txt` and avoid overloading servers.
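
The "avoid overloading servers" part of the politeness policy is often handled with a minimum delay between requests to the same host. The sketch below is one possible way to track that. `HostThrottle`, `min_delay`, and the one-second value are illustrative assumptions, not part of the original notebook, and the clock is passed in as `now` so the example runs without sleeping.

```python
import urllib.parse


class HostThrottle:
    """Hypothetical helper: tracks when each host may be contacted again."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay   # assumed minimum gap per host, in seconds
        self.next_allowed = {}       # host -> earliest time of the next request

    def wait_time(self, url, now):
        """Return the seconds to wait before fetching url, and reserve that slot."""
        host = urllib.parse.urlparse(url).netloc.lower()
        ready_at = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = ready_at + self.min_delay
        return ready_at - now


throttle = HostThrottle(min_delay=1.0)
waits = [
    throttle.wait_time("http://nepal.gov.np/", now=0.0),       # first request to this host
    throttle.wait_time("http://nepal.gov.np/news", now=0.25),  # same host, too soon
    throttle.wait_time("http://news.com/", now=0.25),          # different host
]
print(waits)
```

A real crawler would call `time.sleep()` with the returned value (or schedule a different host in the meantime) before issuing the request.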

In [1]:
from collections import deque
import time
import random
import urllib.parse

# Simulation Setup
# Since we can't crawl the real web, we simulate a mini-web.
WEB_GRAPH = {
    "http://nepal.gov.np": ["http://nepal.gov.np/news", "http://nepal.gov.np/contact"],
    "http://nepal.gov.np/news": ["http://nepal.gov.np/news/article1", "http://news.com/nepal"],
    "http://news.com/nepal": ["http://news.com", "http://nepal.gov.np"],
    "http://news.com": ["http://news.com/world", "http://news.com/tech"],
    "http://example.org": ["http://example.org/about"],
}

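def fetch_url(url):
    """Simulated fetch: return (status_code, outgoing_links) for a URL."""
    # Simulate network delay
    time.sleep(0.1)

    # Only pages listed as keys in WEB_GRAPH exist; everything else is a 404.
    if url in WEB_GRAPH:
        return 200, WEB_GRAPH[url]
    return 404, []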

## 1. BFS Crawler (Breadth-First Search)
Standard crawling strategy: visit all neighbors before going deeper.

In [2]:
def bfs_crawl(start_url, max_pages=10):
    frontier = deque([start_url])
    visited = set()
    crawled_count = 0
    
    print(f"Starting BFS Crawl at: {start_url}")
    
    while frontier and crawled_count < max_pages:
        url = frontier.popleft()
        
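        if url in visited:
            continue
        # Mark the URL as seen before fetching, so a failing URL that is
        # linked from several pages is not requested again.
        visited.add(url)

        # Fetch
        print(f"  Fetching: {url} ...", end="")
        status, links = fetch_url(url)

        if status == 200:
            print(" OK")
            crawled_count += 1

            # Extract links and add unseen ones to the frontier
            for link in links:
                if link not in visited:
                    frontier.append(link)
        else:
            print(f" Failed ({status})")

    # Report only the pages that were fetched successfully.
    print(f"Crawl complete. Visited {crawled_count} pages.")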

bfs_crawl("http://nepal.gov.np")

Starting BFS Crawl at: http://nepal.gov.np
  Fetching: http://nepal.gov.np ... OK
  Fetching: http://nepal.gov.np/news ... OK
  Fetching: http://nepal.gov.np/contact ... Failed (404)
  Fetching: http://nepal.gov.np/news/article1 ... Failed (404)
  Fetching: http://news.com/nepal ... OK
  Fetching: http://news.com ... OK
  Fetching: http://news.com/world ... Failed (404)
  Fetching: http://news.com/tech ... Failed (404)
Crawl complete. Visited 4 pages.
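
To make the "all neighbors before going deeper" ordering visible, the following sketch records the link depth at which each page is first reached and stops expanding at a depth limit. It is a variation for illustration only: `TOY_GRAPH`, `bfs_depths`, and `max_depth` are hypothetical names, and the graph is a made-up example rather than the simulated web used above. Unlike the crawler above, it deduplicates when a link is enqueued rather than when it is dequeued.

```python
from collections import deque

# Hypothetical toy link graph, used only for this illustration.
TOY_GRAPH = {
    "A": ["B", "C"],
    "B": ["D", "A"],
    "C": ["D"],
    "D": ["E"],
}


def bfs_depths(start, graph, max_depth=2):
    """Return {page: depth} for every page within max_depth links of start."""
    depths = {start: 0}            # also serves as the "seen" set
    frontier = deque([start])
    while frontier:
        page = frontier.popleft()
        if depths[page] == max_depth:
            continue               # pages at the depth limit are not expanded
        for link in graph.get(page, []):
            if link not in depths:             # deduplicate when enqueuing
                depths[link] = depths[page] + 1
                frontier.append(link)
    return depths


depths = bfs_depths("A", TOY_GRAPH)
print(depths)
```

Because the queue is first-in, first-out, every page at depth 1 is processed before any page at depth 2, and page `E` (three links away) is never reached with the default limit.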


## 2. Politeness & Robots Exclusion
Real crawlers should honor each site's `robots.txt`, which lists the path prefixes that crawlers are asked not to fetch.

In [3]:
class RobotExclusion:
    def __init__(self):
        self.disallowed = {}
        
    def add_rule(self, domain, path):
        if domain not in self.disallowed:
            self.disallowed[domain] = []
        self.disallowed[domain].append(path)
        
    def can_fetch(self, url):
        parsed = urllib.parse.urlparse(url)
        domain = f"{parsed.scheme}://{parsed.netloc}"
        
        if domain in self.disallowed:
            for path in self.disallowed[domain]:
                if parsed.path.startswith(path):
                    return False
        return True

# Test Robot Rules
robots = RobotExclusion()
robots.add_rule("http://nepal.gov.np", "/admin")
robots.add_rule("http://news.com", "/private")

test_urls = [
    "http://nepal.gov.np/news",
    "http://nepal.gov.np/admin/login",
    "http://news.com/tech",
    "http://news.com/private/docs"
]

print("\nChecking Robot Rules:")
for u in test_urls:
    allowed = robots.can_fetch(u)
    status = "Allowed" if allowed else "Blocked"
    print(f"  {u} -> {status}")


Checking Robot Rules:
  http://nepal.gov.np/news -> Allowed
  http://nepal.gov.np/admin/login -> Blocked
  http://news.com/tech -> Allowed
  http://news.com/private/docs -> Blocked
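
The class above stores rules by hand. For actual `robots.txt` files, Python's standard library provides `urllib.robotparser`, and the sketch below shows roughly how it could be used. The file content in `ROBOTS_TXT` and the `demo-crawler` agent name are made up for illustration. A real crawler would download the file from the site instead of parsing a string.

```python
import urllib.robotparser

# Hypothetical robots.txt content, assumed for this example only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Crawl-delay: 2
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse from lines, no network access

AGENT = "demo-crawler"   # assumed user-agent name
results = {
    path: parser.can_fetch(AGENT, f"http://nepal.gov.np{path}")
    for path in ["/news", "/admin/login"]
}
delay = parser.crawl_delay(AGENT)       # available since Python 3.6

print(results)
print(delay)
```

`Crawl-delay` is a widely used but non-standard directive, so treating it as a hint for the per-host delay is a design choice rather than a requirement.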


## 3. URL Normalization
URLs must be canonicalized to avoid duplicates.
- `http://example.com` == `http://example.com/`
- `http://EXAMPLE.COM` == `http://example.com`

In [4]:
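def normalize_url(url):
    parsed = urllib.parse.urlparse(url)
    # Scheme and host are case-insensitive; the path is not.
    scheme = parsed.scheme.lower()
    netloc = parsed.netloc.lower()
    # An empty path and "/" refer to the same resource.
    path = parsed.path or "/"
    # Keep the query string, since it can select a different page.
    query = f"?{parsed.query}" if parsed.query else ""
    # The fragment is dropped because it is never sent to the server.
    return f"{scheme}://{netloc}{path}{query}"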

examples = [
    "HTTP://Nepal.Gov.NP",
    "http://nepal.gov.np",
    "http://nepal.gov.np/"
]

print("\nNormalization:")
for e in examples:
    print(f"  {e} -> {normalize_url(e)}")


Normalization:
  HTTP://Nepal.Gov.NP -> http://nepal.gov.np/
  http://nepal.gov.np -> http://nepal.gov.np/
  http://nepal.gov.np/ -> http://nepal.gov.np/
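
Lowercasing and the trailing slash are only two of the ways equivalent URLs can differ. The sketch below adds a few more rules that crawlers commonly apply: dropping default ports, dropping fragments, and sorting query parameters. The function name `canonicalize` and the exact rule set are assumptions for illustration. Sorting parameters in particular is a heuristic that may not be safe for sites where parameter order matters, and this version discards any user-info part of the URL.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Sketch of a stricter canonical form; the rule set is an assumption."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


variants = [
    "HTTP://Example.com:80/docs?b=2&a=1#intro",
    "http://example.com/docs?a=1&b=2",
]
canonical = {canonicalize(u) for u in variants}
print(canonical)
```

Storing the canonical form in the `visited` set, rather than the raw URL, lets both variants count as one page.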
