# Web Crawling for Policy Analysis
## GCAP3226: Empowering Citizens through Data

**Learning Objectives:**
1. Understand ethical web crawling practices
2. Analyze robots.txt and sitemaps
3. Identify pages with quantitative data
4. Extract and organize data from government websites
5. Apply these techniques to policy research

**Case Study:** Cyberdefender.hk - Hong Kong Government Cybersecurity Portal

## Part 1: Understanding Web Crawling Ethics

### What is robots.txt?
The `robots.txt` file, served from the root of a website, tells web crawlers which paths they may and may not request. Compliance is voluntary, which is why honoring it is a fundamental part of web crawling ethics.

**Key Rules:**
- Always check robots.txt before crawling
- Respect the directives (Disallow, Allow)
- Implement rate limiting to avoid overloading servers
- Identify your crawler with a descriptive User-Agent
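
The first two rules can be checked in code rather than by eye. The sketch below uses Python's standard `urllib.robotparser` on a made-up robots.txt; the `SAMPLE_ROBOTS` text, the `/private/` path, and the `example.org` URLs are illustrative assumptions, not the contents of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "GCAP3226-Student-Crawler/1.0"

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

def is_allowed(url):
    """Return True if the parsed rules permit USER_AGENT to fetch url."""
    return parser.can_fetch(USER_AGENT, url)

print(is_allowed("https://example.org/reports/2023"))   # path is not disallowed
print(is_allowed("https://example.org/private/notes"))  # path is under /private/
print(parser.crawl_delay(USER_AGENT))                   # delay requested by the site
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would load the real file instead of the sample text.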

Let's check cyberdefender.hk's robots.txt:

In [None]:
import requests
import time
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import re
import pandas as pd
from urllib.parse import urljoin, urlparse

# Target website
BASE_URL = "https://cyberdefender.hk"

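# Fetch robots.txt (time out rather than hang if the server is unresponsive)
robots_url = f"{BASE_URL}/robots.txt"
response = requests.get(robots_url, timeout=10)
response.raise_for_status()
print("robots.txt content:")
print("=" * 50)
print(response.text)
print("=" * 50)

# An empty "Disallow:" value blocks nothing, so only non-empty values count
disallowed = re.findall(r'(?im)^\s*Disallow:[ \t]*(\S+)', response.text)
if disallowed:
    print(f"\n! Restricted paths: {disallowed}")
else:
    print("\n✓ Analysis: No crawling restrictions (no non-empty Disallow rules)")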

### Exercise 1.1: Check robots.txt for other government websites

Try checking robots.txt for these Hong Kong government sites:
- https://www.info.gov.hk
- https://www.censtatd.gov.hk (Census and Statistics Department)
- https://www.epd.gov.hk (Environmental Protection Department)

**Question:** Do they all allow crawling? Are there any restrictions?

In [None]:
# Your code here:
# Try fetching robots.txt from the sites above

## Part 2: Discovering Content with Sitemaps

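Sitemaps are XML files that list the pages a site owner wants crawlers to find. They are not guaranteed to be complete, but they make crawling much more efficient than following links page by page.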
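
A site often advertises its sitemap through a `Sitemap:` line in robots.txt, so the file fetched in Part 1 is a natural place to look first. A minimal sketch, assuming Python 3.8 or later (for `site_maps()`) and a hypothetical robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a Sitemap directive (illustrative only)
SAMPLE_ROBOTS = """\
User-agent: *
Disallow:
Sitemap: https://example.org/sitemap_index.xml
"""

def discover_sitemaps(robots_text):
    """Return the sitemap URLs declared in robots.txt text (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when no Sitemap line is present
    return parser.site_maps() or []

print(discover_sitemaps(SAMPLE_ROBOTS))
```

If the list comes back empty, conventional locations such as `/sitemap.xml` or `/sitemap_index.xml` are worth trying next.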

### Sitemap Index
Large websites often have a `sitemap_index.xml` that points to multiple sitemaps.
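
An index and an ordinary sitemap share the same XML namespace but have different root elements (`sitemapindex` versus `urlset`). The sketch below tells them apart on a hypothetical index; the file names and the `classify_sitemap` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap index (file names are illustrative)
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.org/post-sitemap.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.org/page-sitemap.xml</loc>
  </sitemap>
</sitemapindex>
"""

def classify_sitemap(xml_text):
    """Return ('index', child sitemap URLs) or ('urlset', page URLs)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # ElementTree writes namespaced tags as "{namespace}localname"
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "sitemapindex":
        return "index", [loc.text for loc in root.findall("ns:sitemap/ns:loc", NS)]
    return "urlset", [loc.text for loc in root.findall("ns:url/ns:loc", NS)]

kind, locations = classify_sitemap(SAMPLE_INDEX)
print(kind, locations)
```

Checking the root element first means the same code can handle sites that publish a single sitemap with no index.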

In [None]:
# Fetch sitemap index
sitemap_index_url = f"{BASE_URL}/sitemap_index.xml"
response = requests.get(sitemap_index_url, timeout=10)
response.raise_for_status()  # Stop here if the index does not exist

# Parse XML
root = ET.fromstring(response.content)
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Extract sitemap URLs
sitemaps = []
for sitemap in root.findall('.//ns:sitemap', namespace):
    loc = sitemap.find('ns:loc', namespace)
    lastmod = sitemap.find('ns:lastmod', namespace)
    if loc is not None:
        sitemaps.append({
            'url': loc.text,
            'last_modified': lastmod.text if lastmod is not None else 'N/A',
            'name': loc.text.split('/')[-1]
        })

# Display as DataFrame
df_sitemaps = pd.DataFrame(sitemaps)
print(f"Found {len(sitemaps)} sitemaps:\n")
print(df_sitemaps.to_string(index=False))

### Key Observation
Notice `sdm_downloads-sitemap.xml`: it lists downloadable files, which often include quantitative material such as reports, statistics, and datasets.
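
Once a sitemap's URLs are in hand, a quick filter on the file extension separates likely data files from ordinary pages. This is a sketch under assumptions: the extension set and the sample URLs are hypothetical, and entries in a downloads sitemap may turn out to be landing pages rather than direct file links.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions treated as "data files" here are an assumption; adjust as needed
DATA_EXTENSIONS = {".pdf", ".csv", ".xls", ".xlsx"}

def is_data_file(url):
    """Return True if the URL path ends with one of DATA_EXTENSIONS."""
    # urlparse drops the query string, so "?version=2" does not hide the suffix
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in DATA_EXTENSIONS

# Hypothetical URLs standing in for entries from a downloads sitemap
sample_urls = [
    "https://example.org/downloads/annual-report.PDF",
    "https://example.org/downloads/incidents.csv?version=2",
    "https://example.org/news/security-tips/",
]
data_files = [u for u in sample_urls if is_data_file(u)]
print(data_files)
```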

In [None]:
# Function to fetch and parse a sitemap
def fetch_sitemap(sitemap_url):
    """Fetch and parse a sitemap XML file"""
    response = requests.get(sitemap_url, timeout=10)
    response.raise_for_status()  # Fail loudly instead of parsing an error page
    root = ET.fromstring(response.content)
    namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    
    urls = []
    for url_elem in root.findall('.//ns:url', namespace):
        loc = url_elem.find('ns:loc', namespace)
        if loc is not None:
            urls.append(loc.text)
    
    return urls

# Fetch all URLs from all sitemaps
all_urls = {}
for sitemap in sitemaps:
    time.sleep(1)  # Rate limiting
    urls = fetch_sitemap(sitemap['url'])
    all_urls[sitemap['name']] = urls
    print(f"{sitemap['name']}: {len(urls)} URLs")

total_urls = sum(len(urls) for urls in all_urls.values())
print(f"\nTotal URLs to crawl: {total_urls}")

## Part 3: Detecting Quantitative Data

Not all pages have useful quantitative data. We need to be selective!

### Detection Criteria:
1. **Numbers:** Percentages, currency, statistics
2. **Tables:** Structured data
3. **Keywords:** "statistics", "data", "analysis", "report", "survey"
4. **Charts/Graphs:** Visual data representations
5. **Downloadable Files:** PDFs, Excel files, CSV files
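
Of these criteria, tables are the easiest to act on once detected, because their rows map directly onto records. The sketch below pulls rows out of a hypothetical HTML fragment using only the standard library's `html.parser`; the `TableExtractor` class and the figures in `SAMPLE_HTML` are made up for illustration.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the cell text of every table row as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical page fragment with made-up figures
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Reported incidents</th></tr>
  <tr><td>2022</td><td>1,250</td></tr>
  <tr><td>2023</td><td>1,480</td></tr>
</table>
"""

extractor = TableExtractor()
extractor.feed(SAMPLE_HTML)
header, *records = extractor.rows
print(header)
print(records)
```

With the notebook's imports, `soup.find_all('table')` locates the same elements, and `pandas.read_html` is a common alternative when an HTML parser backend is installed.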

In [None]:
def detect_quantitative_data(html_content, url):
    """
    Analyze HTML content to detect quantitative data
    Returns: (has_data, metadata)
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    text = soup.get_text()
    
    metadata = {
        'url': url,
        'has_numbers': False,
        'has_tables': False,
        'has_charts': False,
        'number_count': 0,
        'table_count': 0,
        'keywords': []
    }
    
    # 1. Count meaningful numbers with one combined pattern, so that a value
    #    such as "HK$1,200" is counted once rather than by several patterns
    number_pattern = re.compile(
        r'(?:HK)?\$\d+(?:,\d{3})*'  # Currency, with optional HK prefix
        r'|\d+(?:\.\d+)?%'          # Percentages
        r'|\d{1,3}(?:,\d{3})+'      # Numbers with commas
    )
    metadata['number_count'] = len(number_pattern.findall(text))
    
    if metadata['number_count'] > 5:
        metadata['has_numbers'] = True
    
    # 2. Check for tables
    tables = soup.find_all('table')
    metadata['table_count'] = len(tables)
    metadata['has_tables'] = len(tables) > 0
    
    # 3. Check for keywords (whole words, optional plural "s", so that
    #    "rate" no longer matches inside "separate" or "generate")
    keywords = ['statistics', 'data', 'analysis', 'report', 'survey',
                'percentage', 'total', 'average', 'rate']
    text_lower = text.lower()
    
    for keyword in keywords:
        if re.search(rf'\b{keyword}s?\b', text_lower):
            metadata['keywords'].append(keyword)
    
    # 4. Check for charts (text mentions only; images are not inspected)
    chart_indicators = ['chart', 'graph', 'diagram']
    for indicator in chart_indicators:
        if re.search(rf'\b{indicator}s?\b', text_lower):
            metadata['has_charts'] = True
            break
    
    # Determine if page has quantitative data
    has_data = (
        metadata['has_numbers'] or 
        metadata['has_tables'] or
        (metadata['has_charts'] and len(metadata['keywords']) > 0)
    )
    
    return has_data, metadata

print("✓ Quantitative data detection function defined")

### Example: Test the Detection Function

In [None]:
# Test on a sample page
# Fall back to the home page if the posts sitemap is missing or empty
test_url = (all_urls.get('post-sitemap.xml') or [BASE_URL])[0]

print(f"Testing URL: {test_url}\n")
response = requests.get(test_url, timeout=10)
has_data, metadata = detect_quantitative_data(response.content, test_url)

print(f"Has quantitative data: {has_data}")
print(f"\nDetection details:")
print(f"  Numbers found: {metadata['number_count']}")
print(f"  Tables found: {metadata['table_count']}")
print(f"  Has charts: {metadata['has_charts']}")
print(f"  Keywords: {', '.join(metadata['keywords'][:5])}")

## Part 4: Implementing Ethical Crawling

### Best Practices:
1. **Rate Limiting:** Wait between requests (1-2 seconds)
2. **User-Agent:** Identify your crawler
3. **Error Handling:** Handle 404s, timeouts gracefully
4. **Logging:** Track what you're doing
5. **Respect robots.txt:** Always follow the rules
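
Rate limiting can be done with a fixed `time.sleep` after every request, but that also waits after the final page and ignores time already spent processing. A small limiter that sleeps only for the remaining part of the interval is one alternative. The `RateLimiter` class below is a hypothetical helper, and the 0.2-second interval is chosen only to keep the demonstration short.

```python
import time

class RateLimiter:
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        """Sleep only for the part of the interval that has not yet elapsed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

limiter = RateLimiter(min_interval=0.2)
start = time.monotonic()
for page in ["page-1", "page-2", "page-3"]:
    limiter.wait()  # call this right before each request
    print(f"fetching {page}")
elapsed = time.monotonic() - start
print(f"elapsed: {elapsed:.1f}s")
```

In a real crawl the interval would be 1 to 2 seconds, as recommended above, with `limiter.wait()` placed immediately before each `session.get(...)`.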

In [None]:
def crawl_with_ethics(urls, max_pages=10):
    """
    Ethically crawl a list of URLs
    """
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'GCAP3226-Student-Crawler/1.0 (Educational Purpose)'
    })
    
    results = []
    
    for i, url in enumerate(urls[:max_pages], 1):
        print(f"[{i}/{min(len(urls), max_pages)}] Crawling: {url}")
        
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            
            has_data, metadata = detect_quantitative_data(response.content, url)
            results.append({
                'url': url,
                'has_data': has_data,
                'numbers': metadata['number_count'],
                'tables': metadata['table_count'],
                'keywords': ', '.join(metadata['keywords'][:3])
            })
            
            if has_data:
                print(f"  ✓ Found quantitative data!")
            
        except Exception as e:
            print(f"  ✗ Error: {str(e)}")
            results.append({
                'url': url,
                'has_data': False,
                'numbers': 0,
                'tables': 0,
                'keywords': f'Error: {str(e)}'
            })
        
        # Rate limiting - wait 2 seconds between requests
        time.sleep(2)
    
    return pd.DataFrame(results)

print("✓ Ethical crawler function defined")

### Exercise 4.1: Crawl Sample Pages

In [None]:
# Crawl first 5 pages from the posts sitemap
sample_urls = all_urls.get('post-sitemap.xml', [])[:5]

print(f"Crawling {len(sample_urls)} sample pages...\n")
results_df = crawl_with_ethics(sample_urls, max_pages=5)

print("\n" + "="*80)
print("RESULTS")
print("="*80)
print(results_df.to_string(index=False))

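# Summary statistics (guarded: results_df is empty if no URLs were found)
if not results_df.empty:
    pages_with_data = results_df['has_data'].sum()
    print(f"\nPages with data: {pages_with_data} / {len(results_df)}")
    print(f"Share of pages with data: {pages_with_data / len(results_df) * 100:.1f}%")
else:
    print("\nNo pages were crawled.")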

## Part 5: Comparing with Google Site-Specific Search

Let's compare our crawler with Google's site-specific search.

### Google Site Search Syntax:
```
site:cyberdefender.hk "statistics" OR "data" OR "report"
```

### Advantages of Crawler:
1. **Systematic:** Covers all pages via sitemap
2. **Programmable:** Automated data extraction
3. **Customizable:** Your own detection criteria
4. **Downloadable:** Save pages and files locally
5. **Reproducible:** Can rerun with same parameters

### Advantages of Google Search:
1. **Fast:** Instant results
2. **Smart:** Better keyword matching
3. **No coding:** User-friendly interface
4. **No load on the target site:** Queries run against Google's index, so you send no requests to the site itself

**Conclusion:** Use both! Google for quick discovery, crawlers for systematic data collection.

## Part 6: Your Assignment - Crawl a Government Website

Choose one of these Hong Kong government websites and apply what you've learned:

1. **Census and Statistics Department:** https://www.censtatd.gov.hk
   - Rich in quantitative data
   - Population, economy, social indicators

2. **Environmental Protection Department:** https://www.epd.gov.hk
   - Air quality data
   - Waste statistics
   - Environmental reports

3. **Transport Department:** https://www.td.gov.hk
   - Traffic statistics
   - Public transport data
   - Road safety reports

4. **Health Bureau (formerly the Food and Health Bureau):** https://www.healthbureau.gov.hk
   - Healthcare statistics
   - Disease surveillance data
   - Hospital data

### Assignment Tasks:

1. **Check robots.txt** - Are you allowed to crawl?
2. **Find sitemap** - Does the site have a sitemap?
3. **Identify target pages** - What pages likely have quantitative data?
4. **Crawl ethically** - Implement rate limiting and proper User-Agent
5. **Extract data** - Download pages and files with quantitative data
6. **Analyze results** - What did you find? What data is useful for policy analysis?
7. **Document** - Write a brief report on your findings

### Deliverables:
1. Python code (use the functions from this notebook)
2. Data files collected
3. Summary report (500-1000 words)
4. Reflection on ethical considerations

In [None]:
# Your assignment code here:
# 1. Choose a government website
# 2. Implement your crawler
# 3. Collect and analyze data

# Example starter code:
# YOUR_TARGET_URL = "https://www.censtatd.gov.hk"  # Change this
# 
# # Check robots.txt
# response = requests.get(f"{YOUR_TARGET_URL}/robots.txt")
# print(response.text)
# 
# # Your code continues...

## Part 7: Ethical Considerations and Best Practices

### Legal and Ethical Framework:

1. **Copyright:** Respect intellectual property rights
2. **Privacy:** Don't collect personal data
3. **Terms of Service:** Read and follow website terms
4. **Server Load:** Don't overwhelm servers
5. **Purpose:** Use data for legitimate research/analysis

### Rate Limiting Guidelines:
These are rules of thumb. If a site's robots.txt sets a `Crawl-delay`, wait at least that long.
- Small sites: 2-3 seconds between requests
- Medium sites: 1-2 seconds
- Large sites (e.g., government): 0.5-1 second
- **Never** faster than 0.5 seconds
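
These guidelines can be combined with a site's own `Crawl-delay` request by always taking the slower of the two. A minimal sketch: the `choose_delay` helper, the size labels, and the choice of each range's upper bound as the default are all assumptions for illustration.

```python
# Defaults taken from the upper bound of each guideline range above
DEFAULT_DELAYS = {"small": 3.0, "medium": 2.0, "large": 1.0}
MINIMUM_DELAY = 0.5  # never go faster than this

def choose_delay(site_size, crawl_delay=None):
    """Pick a per-request delay in seconds.

    crawl_delay is the value from robots.txt, if the site sets one.
    The slower of the guideline default and the site's request wins.
    """
    delay = DEFAULT_DELAYS[site_size]
    if crawl_delay is not None:
        delay = max(delay, float(crawl_delay))
    return max(delay, MINIMUM_DELAY)

print(choose_delay("large"))                 # guideline default only
print(choose_delay("large", crawl_delay=5))  # the site asks for a longer wait
print(choose_delay("small", crawl_delay=1))  # the guideline is already slower
```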

### When NOT to Crawl:
- robots.txt explicitly disallows
- Site requires login
- Terms of service prohibit it
- Data is available via API
- Site is slow or unstable

### Alternative: APIs
Many government sites offer APIs for data access:
- **Hong Kong Open Data:** https://data.gov.hk
- Usually better than crawling!
- Structured data, legal access, official support

## Summary

**What You Learned:**
1. ✓ How to check robots.txt and respect crawling rules
2. ✓ How to discover content using sitemaps
3. ✓ How to detect quantitative data in web pages
4. ✓ How to implement ethical crawling practices
5. ✓ How to compare crawler vs. Google search approaches
6. ✓ How to apply these skills to government websites

**Next Steps:**
- Complete the assignment
- Explore Hong Kong Open Data portal
- Learn about crawling frameworks and browser automation tools (Scrapy, Selenium)
- Study data analysis techniques for policy research

**Resources:**
- [Web Scraping Best Practices](https://www.scrapehero.com/web-scraping-best-practices/)
- [Hong Kong Open Data](https://data.gov.hk)
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/)
- [Robots.txt Specification](https://www.robotstxt.org/)