# Topic 30: Web Scraping for Data Collection

## Overview
Web scraping is the process of automatically extracting data from websites[7][23]. Python offers powerful libraries like Requests and Beautiful Soup that make web scraping accessible and efficient for data collection.

### What You'll Learn:
- Web scraping fundamentals and ethics
- HTTP requests with the Requests library
- HTML parsing with Beautiful Soup
- Handling different data formats (JSON, XML)
- Managing sessions, cookies, and authentication
- Best practices and common challenges

---

## 1. Web Scraping Fundamentals

Understanding the basics of web scraping and HTTP:

In [None]:
# Web Scraping Fundamentals
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time
import re
from urllib.parse import urljoin, urlparse
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

print("Web Scraping Fundamentals:")
print("=" * 26)

print("1. What is Web Scraping?")
web_scraping_steps = [
    "1. Send HTTP request to target website",
    "2. Download HTML content from server", 
    "3. Parse HTML to extract desired information",
    "4. Clean and structure the extracted data",
    "5. Store data in desired format (CSV, JSON, database)"
]

for step in web_scraping_steps:
    print(f"   {step}")

print(f"\n2. Web Scraping Ethics and Legality:")
ethical_guidelines = {
    "✓ Check robots.txt": "Always check website's robots.txt file",
    "✓ Respect rate limits": "Don't overload servers with requests", 
    "✓ Read Terms of Service": "Understand website's usage policies",
    "✓ Public data only": "Only scrape publicly available information",
    "✓ Give attribution": "Credit data sources appropriately",
    "⚠️  User-Agent headers": "Identify your scraper appropriately",
    "⚠️  Handle errors gracefully": "Don't break on failed requests"
}

for guideline, description in ethical_guidelines.items():
    print(f"   {guideline}: {description}")

print(f"\n3. HTTP Basics for Web Scraping:")
http_concepts = {
    "GET Request": "Retrieve data from server",
    "POST Request": "Send data to server", 
    "Headers": "Metadata about the request/response",
    "Status Codes": "200 (OK), 404 (Not Found), 403 (Forbidden), 500 (Server Error)",
    "Cookies": "Store session information",
    "User-Agent": "Identifies the client making the request"
}

for concept, description in http_concepts.items():
    print(f"   {concept}: {description}")

# Basic HTTP request example
print(f"\n4. Making basic HTTP requests:")

# Example: Get a simple webpage (using httpbin for testing)
test_url = "https://httpbin.org/html"

try:
    # Always pass a timeout; without one, requests can wait indefinitely
    response = requests.get(test_url, timeout=10)
    print(f"   Request to {test_url}:")
    print(f"   Status Code: {response.status_code}")
    print(f"   Content Type: {response.headers.get('content-type')}")
    print(f"   Response Length: {len(response.text)} characters")
    print(f"   First 200 characters: {response.text[:200]}...")
except requests.RequestException as e:
    print(f"   Error making request: {e}")

# Headers examination
print(f"\n5. HTTP Headers:")
print(f"   Request Headers (what we send):")
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
}

for header, value in headers.items():
    print(f"     {header}: {value[:50]}{'...' if len(value) > 50 else ''}")

try:
    response_with_headers = requests.get(test_url, headers=headers, timeout=10)
    print(f"\n   Response Headers (what we receive):")
    for header, value in list(response_with_headers.headers.items())[:5]:
        print(f"     {header}: {value}")
except requests.RequestException as e:
    print(f"   Could not fetch headers: {e}")

print(f"\n6. Common Web Scraping Libraries:")
libraries = {
    "requests": "HTTP library for making web requests",
    "urllib": "Built-in Python URL handling",
    "BeautifulSoup": "HTML/XML parsing and navigation", 
    "lxml": "Fast XML/HTML parser",
    "selenium": "Browser automation for JavaScript-heavy sites",
    "scrapy": "Framework for large-scale web scraping",
    "pandas": "Data manipulation and CSV/JSON handling"
}

for library, description in libraries.items():
    print(f"   {library}: {description}")

print(f"\n7. Web Scraping Challenges:")
challenges = [
    "JavaScript-rendered content",
    "Anti-bot measures (CAPTCHAs, rate limiting)", 
    "Dynamic content and AJAX calls",
    "Session management and authentication",
    "Handling different encodings",
    "Dealing with malformed HTML",
    "Respecting robots.txt and rate limits",
    "Legal and ethical considerations"
]

for i, challenge in enumerate(challenges, 1):
    print(f"   {i}. {challenge}")

print(f"\n8. When NOT to use web scraping:")
alternatives = [
    "✓ Official APIs are available",
    "✓ Data is available in structured formats (CSV, JSON)", 
    "✓ Website prohibits scraping in Terms of Service",
    "✓ Real-time data feeds exist",
    "✓ Partner or commercial data sources available"
]

for alternative in alternatives:
    print(f"   {alternative}")

print(f"\n   Always prefer official APIs when available!")
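
The "Check robots.txt" guideline above can be automated with the standard library. The following is a minimal sketch, not a complete policy check: the robots.txt content (`SAMPLE_ROBOTS_TXT`), the `example.com` URLs, and the helper name `build_robot_parser` are all hypothetical and chosen so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


robots = build_robot_parser(SAMPLE_ROBOTS_TXT)
user_agent = "Educational Python Scraper 1.0"

# Ask the parser whether each (hypothetical) URL may be fetched by this user agent
for url in ("https://example.com/products", "https://example.com/private/data"):
    verdict = "allowed" if robots.can_fetch(user_agent, url) else "disallowed"
    print(f"{url}: {verdict}")

# Crawl-delay, when the site declares one, is a reasonable lower bound for sleep()
print(f"Crawl delay: {robots.crawl_delay(user_agent)} seconds")
```

Against a live site, the usual pattern is to call `set_url()` with the site's robots.txt address and then `read()` instead of `parse()`. Passing robots.txt does not replace reading the Terms of Service.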

## 2. HTML Parsing with Beautiful Soup

Extracting data from HTML using Beautiful Soup:

In [None]:
# HTML Parsing with Beautiful Soup
print("HTML Parsing with Beautiful Soup:")
print("=" * 33)

# Sample HTML for demonstration
sample_html = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample E-commerce Page</title>
    <meta charset="UTF-8">
</head>
<body>
    <header>
        <h1>Online Store</h1>
        <nav>
            <ul>
                <li><a href="/home">Home</a></li>
                <li><a href="/products">Products</a></li>
                <li><a href="/about">About</a></li>
            </ul>
        </nav>
    </header>
    
    <main>
        <div class="products">
            <div class="product" data-id="1">
                <h2 class="product-title">Laptop Computer</h2>
                <p class="price">$999.99</p>
                <p class="description">High-performance laptop for professionals</p>
                <span class="category">Electronics</span>
                <div class="rating">
                    <span class="stars">★★★★☆</span>
                    <span class="review-count">(245 reviews)</span>
                </div>
                <img src="laptop.jpg" alt="Laptop Computer">
            </div>
            
            <div class="product" data-id="2">
                <h2 class="product-title">Wireless Headphones</h2>
                <p class="price">$149.99</p>
                <p class="description">Premium noise-cancelling headphones</p>
                <span class="category">Electronics</span>
                <div class="rating">
                    <span class="stars">★★★★★</span>
                    <span class="review-count">(89 reviews)</span>
                </div>
                <img src="headphones.jpg" alt="Wireless Headphones">
            </div>
            
            <div class="product" data-id="3">
                <h2 class="product-title">Coffee Maker</h2>
                <p class="price">$79.99</p>
                <p class="description">Automatic drip coffee maker</p>
                <span class="category">Appliances</span>
                <div class="rating">
                    <span class="stars">★★★☆☆</span>
                    <span class="review-count">(156 reviews)</span>
                </div>
                <img src="coffee-maker.jpg" alt="Coffee Maker">
            </div>
        </div>
        
        <aside>
            <div class="contact-info">
                <h3>Contact Us</h3>
                <p>Email: info@store.com</p>
                <p>Phone: (555) 123-4567</p>
                <address>
                    123 Main Street<br>
                    City, State 12345
                </address>
            </div>
        </aside>
    </main>
    
    <footer>
        <p>&copy; 2024 Online Store. All rights reserved.</p>
    </footer>
</body>
</html>
"""

print("1. Creating BeautifulSoup object:")

# Parse HTML with Beautiful Soup
soup = BeautifulSoup(sample_html, 'html.parser')
print(f"   Parsed HTML document")
print(f"   Title: {soup.title.string}")
print(f"   Document type: {type(soup)}")

print(f"\n2. Basic element selection:")

# Find elements by tag
print(f"   Finding by tag:")
print(f"   First h1: {soup.find('h1').string}")
print(f"   All h2 tags: {[h2.string for h2 in soup.find_all('h2')]}")

# Find by class
print(f"\n   Finding by class:")
prices = soup.find_all(class_='price')
print(f"   All prices: {[price.string for price in prices]}")

# Find by a custom attribute (the sample HTML uses data-id rather than id)
product_by_id = soup.find('div', {'data-id': '1'})
print(f"   Product with data-id='1': {product_by_id.find(class_='product-title').string}")

print(f"\n3. CSS Selectors:")

# CSS selector examples
print(f"   Using CSS selectors:")

# Select by class
product_titles = soup.select('.product-title')
print(f"   Product titles: {[title.string for title in product_titles]}")

# Select by descendant
categories = soup.select('.product .category')
print(f"   Categories: {[cat.string for cat in categories]}")

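# Select by pseudo-class: products whose category text contains "Electronics"
# (:-soup-contains replaces the deprecated :contains in recent soupsieve releases)
electronics = soup.select('.product:has(.category:-soup-contains("Electronics"))')
print(f"   Electronics products found: {len(electronics)} (using CSS selectors)")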

# More practical approach for filtering
electronics_products = []
for product in soup.find_all('div', class_='product'):
    category = product.find(class_='category')
    if category and 'Electronics' in category.string:
        title = product.find(class_='product-title').string
        electronics_products.append(title)

print(f"   Electronics products: {electronics_products}")

print(f"\n4. Extracting structured data:")

# Extract all products into a structured format
products_data = []

for product in soup.find_all('div', class_='product'):
    # Extract basic information
    title = product.find(class_='product-title').string
    price_text = product.find(class_='price').string
    price = float(price_text.replace('$', '').replace(',', ''))
    description = product.find(class_='description').string
    category = product.find(class_='category').string
    
    # Extract rating information
    stars_element = product.find(class_='stars')
    stars_count = stars_element.get_text().count('★') if stars_element else 0
    
    review_count_element = product.find(class_='review-count')
    review_count_text = review_count_element.get_text() if review_count_element else ''
    # Guard against text that does not match "(N reviews)" instead of crashing
    review_match = re.search(r'\((\d+)', review_count_text)
    review_count = int(review_match.group(1)) if review_match else 0
    
    # Extract image info
    img = product.find('img')
    image_src = img['src'] if img else None
    image_alt = img['alt'] if img else None
    
    # Get product ID
    product_id = product.get('data-id')
    
    products_data.append({
        'id': product_id,
        'title': title,
        'price': price,
        'description': description,
        'category': category,
        'stars': stars_count,
        'review_count': review_count,
        'image_src': image_src,
        'image_alt': image_alt
    })

# Display extracted data
print(f"   Extracted {len(products_data)} products:")
for product in products_data:
    print(f"   - {product['title']}: ${product['price']} ({product['stars']} stars, {product['review_count']} reviews)")

# Convert to DataFrame
products_df = pd.DataFrame(products_data)
print(f"\n   DataFrame shape: {products_df.shape}")
print(f"{products_df}")

print(f"\n5. Advanced parsing techniques:")

# Handle missing elements gracefully
print(f"   Handling missing elements:")

def safe_extract(element, selector, attribute=None):
    """Safely extract data from HTML elements"""
    try:
        found = element.select_one(selector)
        if found:
            if attribute:
                return found.get(attribute)
            else:
                return found.get_text(strip=True)
        return None
    except Exception:  # avoid a bare except, which also swallows KeyboardInterrupt
        return None

# Example of safe extraction
for product in soup.find_all('div', class_='product')[:1]:  # Just first product
    safe_title = safe_extract(product, '.product-title')
    safe_price = safe_extract(product, '.price')
    safe_missing = safe_extract(product, '.nonexistent-class')
    
    print(f"   Safe extraction example:")
    print(f"     Title: {safe_title}")
    print(f"     Price: {safe_price}")
    print(f"     Missing element: {safe_missing}")

# Extract all links
print(f"\n6. Extracting links and navigation:")

links = soup.find_all('a')
print(f"   Found {len(links)} links:")
for link in links:
    href = link.get('href')
    text = link.get_text(strip=True)
    print(f"     {text}: {href}")

# Extract contact information
print(f"\n7. Extracting contact information:")
contact_section = soup.find('div', class_='contact-info')
if contact_section:
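    # Each <p> holds "Label: value"; keep only the value after the first colon
    paragraphs = contact_section.find_all('p')
    email = paragraphs[0].get_text(strip=True).split(':', 1)[-1].strip()
    phone = paragraphs[1].get_text(strip=True).split(':', 1)[-1].strip()
    # Pass a separator so the text on either side of <br> is not run together
    address = contact_section.find('address').get_text(' ', strip=True)
    
    contact_info = {
        'email': email,
        'phone': phone, 
        'address': address
    }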
    
    print(f"   Contact Information:")
    for key, value in contact_info.items():
        print(f"     {key.title()}: {value}")

print(f"\n8. Saving extracted data:")

# Save to CSV
products_df.to_csv('scraped_products.csv', index=False)
print(f"   ✓ Saved products to 'scraped_products.csv'")

# Save to JSON
with open('scraped_products.json', 'w') as f:
    json.dump(products_data, f, indent=2)
print(f"   ✓ Saved products to 'scraped_products.json'")

print(f"\n9. Beautiful Soup best practices:")
print(f"   ✓ Always specify a parser ('html.parser', 'lxml', 'html5lib')")
print(f"   ✓ Handle missing elements gracefully with try/except")
print(f"   ✓ Use CSS selectors for complex selections")
print(f"   ✓ Extract data into structured formats (dicts, DataFrames)")
print(f"   ✓ Clean and validate extracted data")
print(f"   ✓ Use get_text(strip=True) to clean whitespace")
print(f"   ✓ Check for None values before accessing attributes")
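
The links and image paths extracted above are relative (`/home`, `laptop.jpg`), so they usually need to be resolved against the page's address before they can be requested. The sketch below shows one way to do that with `urljoin`, which the notebook imports but does not use. The base URL `https://store.example.com/catalog/index.html`, the external partner link, and the helper names are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical address the sample page might have been fetched from
BASE_URL = "https://store.example.com/catalog/index.html"

# href/src values like those in the sample HTML, plus one made-up external link
raw_links = ["/home", "/products", "/about", "laptop.jpg",
             "https://partner.example.org/deals"]


def resolve_links(base_url, links):
    """Turn relative links into absolute URLs; absolute links pass through."""
    return [urljoin(base_url, link) for link in links]


def is_same_site(url, base_url):
    """True when url points at the same host as base_url."""
    return urlparse(url).netloc == urlparse(base_url).netloc


absolute_links = resolve_links(BASE_URL, raw_links)
internal_links = [url for url in absolute_links if is_same_site(url, BASE_URL)]

for url in absolute_links:
    scope = "internal" if is_same_site(url, BASE_URL) else "external"
    print(f"{scope}: {url}")
```

Root-relative paths such as `/home` resolve against the host, while bare names such as `laptop.jpg` resolve against the directory of the base URL. Filtering to internal links is a simple way to keep a crawler from wandering off the target site.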

## 3. Real-World Scraping Examples

Practical web scraping scenarios and techniques:

In [None]:
# Real-World Scraping Examples
print("Real-World Web Scraping Examples:")
print("=" * 33)

print("1. Scraping public APIs and JSON data:")

# Example: Using a public API that returns JSON
# JSONPlaceholder - a free fake API for testing
api_url = "https://jsonplaceholder.typicode.com/posts"

try:
    api_response = requests.get(api_url, timeout=10)
    
    if api_response.status_code == 200:
        posts_data = api_response.json()  # Parse JSON directly
        print(f"   Successfully fetched {len(posts_data)} posts from API")
        print(f"   First post: {posts_data[0]['title']}")
        
        # Convert to DataFrame
        posts_df = pd.DataFrame(posts_data)
        print(f"   DataFrame shape: {posts_df.shape}")
        print(f"   Columns: {list(posts_df.columns)}")
        
        # Show sample data
        print(f"\n   Sample posts:")
        for post in posts_data[:3]:
            print(f"     ID {post['id']}: {post['title'][:50]}...")
    else:
        print(f"   API request failed with status: {api_response.status_code}")
        
except requests.RequestException as e:
    print(f"   Error accessing API: {e}")

print(f"\n2. Handling different content types:")

# Example of handling different response types
test_endpoints = {
    "JSON": "https://jsonplaceholder.typicode.com/users/1",
    "XML": "https://httpbin.org/xml",
    "HTML": "https://httpbin.org/html"
}

for content_type, url in test_endpoints.items():
    try:
        response = requests.get(url, timeout=10)
        content_type_header = response.headers.get('content-type', 'unknown')
        
        print(f"   {content_type} Content:")
        print(f"     Status: {response.status_code}")
        print(f"     Content-Type: {content_type_header}")
        print(f"     Length: {len(response.text)} characters")
        
        if content_type == "JSON" and response.status_code == 200:
            try:
                json_data = response.json()
                print(f"     Parsed JSON keys: {list(json_data.keys()) if isinstance(json_data, dict) else 'List data'}")
            except ValueError:  # response.json() raises a ValueError subclass on invalid JSON
                print(f"     Failed to parse JSON")
                
    except requests.RequestException as e:
        print(f"   Error with {content_type}: {e}")

print(f"\n3. Session management and cookies:")

# Using sessions for persistent connections
session = requests.Session()

# Set default headers for the session
session.headers.update({
    'User-Agent': 'Python Web Scraper 1.0',
    'Accept': 'text/html,application/json'
})

print(f"   Created session with default headers")
print(f"   Session headers: {dict(session.headers)}")

# Example of making multiple requests with same session
try:
    # First request: the server sets a cookie, which the session stores
    response1 = session.get("https://httpbin.org/cookies/set/session_id/12345", timeout=10)
    print(f"   First request status: {response1.status_code}")
    
    # Second request - cookies should be maintained
    response2 = session.get("https://httpbin.org/cookies", timeout=10)
    if response2.status_code == 200:
        cookies_data = response2.json()
        print(f"   Cookies maintained: {cookies_data.get('cookies', {})}")
        
except requests.RequestException as e:
    print(f"   Session example error: {e}")

print(f"\n4. Handling forms and POST requests:")

# Example of submitting form data
form_data = {
    'name': 'John Doe',
    'email': 'john@example.com',
    'message': 'Hello from Python web scraper!'
}

try:
    # POST request with form data
    post_response = requests.post("https://httpbin.org/post", data=form_data, timeout=10)
    
    if post_response.status_code == 200:
        response_data = post_response.json()
        submitted_form = response_data.get('form', {})
        print(f"   Form submission successful")
        print(f"   Submitted data: {submitted_form}")
        
except requests.RequestException as e:
    print(f"   Form submission error: {e}")

print(f"\n5. Rate limiting and respectful scraping:")

class RespectfulScraper:
    """A scraper that implements rate limiting and retry logic"""
    
    def __init__(self, delay=1.0, max_retries=3):
        self.delay = delay
        self.max_retries = max_retries
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Educational Python Scraper 1.0'
        })
        
    def get_page(self, url):
        """Get a page with rate limiting and retry logic"""
        for attempt in range(self.max_retries):
            try:
                # Rate limiting
                time.sleep(self.delay)
                
                response = self.session.get(url, timeout=10)
                
                if response.status_code == 200:
                    return response
                elif response.status_code == 429:  # Too Many Requests
                    print(f"     Rate limited, waiting longer...")
                    time.sleep(self.delay * 3)
                    continue
                else:
                    print(f"     HTTP {response.status_code} on attempt {attempt + 1}")
                    
            except requests.RequestException as e:
                print(f"     Request failed on attempt {attempt + 1}: {e}")
                if attempt < self.max_retries - 1:
                    time.sleep(self.delay * 2)
                    
        return None
    
    def scrape_multiple_pages(self, urls):
        """Scrape multiple pages respectfully"""
        results = []
        
        for i, url in enumerate(urls):
            print(f"   Scraping page {i+1}/{len(urls)}: {url}")
            
            response = self.get_page(url)
            if response:
                results.append({
                    'url': url,
                    'status_code': response.status_code,
                    'content_length': len(response.text),
                    'title': self.extract_title(response.text)
                })
            else:
                results.append({
                    'url': url,
                    'status_code': None,
                    'content_length': 0,
                    'title': 'Failed to fetch'
                })
                
        return results
    
    def extract_title(self, html_content):
        """Extract page title from HTML"""
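        try:
            soup = BeautifulSoup(html_content, 'html.parser')
            title_tag = soup.find('title')
            # .string is None for an empty <title>, so check it before stripping
            if title_tag and title_tag.string:
                return title_tag.string.strip()
            return 'No title found'
        except Exception:
            return 'Failed to parse title'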

# Demonstrate respectful scraping
scraper = RespectfulScraper(delay=0.5, max_retries=2)

# Test URLs
test_urls = [
    "https://httpbin.org/html",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/status/200"
]

print(f"   Testing respectful scraper:")
scraping_results = scraper.scrape_multiple_pages(test_urls[:2])  # Limit for demo

for result in scraping_results:
    print(f"     {result['url']}: {result['status_code']} - {result['title']}")

print(f"\n6. Error handling and robustness:")

def robust_scrape(url, max_attempts=3):
    """Demonstrate robust error handling"""
    
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, timeout=5)
            
            # Check status code
            if response.status_code != 200:
                raise requests.HTTPError(f"HTTP {response.status_code}")
            
            # Check content type
            content_type = response.headers.get('content-type', '')
            if 'text/html' not in content_type and 'application/json' not in content_type:
                raise ValueError(f"Unexpected content type: {content_type}")
            
            # Try to parse
            soup = BeautifulSoup(response.text, 'html.parser')
            
            return {
                'success': True,
                'title': soup.title.string if soup.title else 'No title',
                'text_length': len(response.text),
                'attempt': attempt + 1
            }
            
        except requests.Timeout:
            print(f"     Attempt {attempt + 1}: Timeout")
        except requests.ConnectionError:
            print(f"     Attempt {attempt + 1}: Connection error")
        except requests.HTTPError as e:
            print(f"     Attempt {attempt + 1}: HTTP error - {e}")
        except ValueError as e:
            print(f"     Attempt {attempt + 1}: Content error - {e}")
        except Exception as e:
            print(f"     Attempt {attempt + 1}: Unexpected error - {e}")
        
        if attempt < max_attempts - 1:
            time.sleep(1)
    
    return {
        'success': False,
        'error': 'All attempts failed',
        'attempts': max_attempts
    }

# Test robust scraping
print(f"   Testing robust error handling:")
test_result = robust_scrape("https://httpbin.org/html")
print(f"   Result: {test_result}")

print(f"\n7. Data export and storage:")

# Create sample scraped data
scraped_data = {
    'pages_scraped': len(scraping_results),
    'timestamp': pd.Timestamp.now().isoformat(),
    'results': scraping_results
}

# Save in multiple formats
print(f"   Saving scraped data in multiple formats:")

# JSON export
with open('scraping_results.json', 'w') as f:
    json.dump(scraped_data, f, indent=2, default=str)
print(f"     ✓ Saved to scraping_results.json")

# CSV export (flattened data)
if scraping_results:
    results_df = pd.DataFrame(scraping_results)
    results_df.to_csv('scraping_results.csv', index=False)
    print(f"     ✓ Saved to scraping_results.csv")

print(f"\n8. Web scraping best practices summary:")
best_practices = [
    "✓ Always check robots.txt before scraping",
    "✓ Use appropriate delays between requests", 
    "✓ Handle errors gracefully with try/except blocks",
    "✓ Use sessions for multiple requests to same site",
    "✓ Set proper User-Agent headers",
    "✓ Respect rate limits and avoid overwhelming servers",
    "✓ Cache responses when possible to reduce requests",
    "✓ Validate and clean extracted data",
    "✓ Store data in structured formats",
    "✓ Monitor for website changes that break scrapers"
]

for practice in best_practices:
    print(f"   {practice}")
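
The content-type example above fetches an XML response but does not parse it. As a hedged sketch of that missing step, the block below parses a small XML document with the standard library's `xml.etree.ElementTree`. The `<catalog>` structure and the `parse_products` helper are invented for illustration and are not what `httpbin.org/xml` returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload, inlined so the example runs without a network call
SAMPLE_XML = """\
<catalog>
  <product id="1">
    <title>Laptop Computer</title>
    <price currency="USD">999.99</price>
  </product>
  <product id="2">
    <title>Wireless Headphones</title>
    <price currency="USD">149.99</price>
  </product>
</catalog>
"""


def parse_products(xml_text):
    """Convert <product> elements into a list of dicts (illustrative helper)."""
    root = ET.fromstring(xml_text)
    products = []
    for node in root.findall("product"):
        products.append({
            "id": node.get("id"),  # attribute access, like tag.get() in Beautiful Soup
            "title": node.findtext("title", default="").strip(),
            "price": float(node.findtext("price", default="0")),
        })
    return products


xml_products = parse_products(SAMPLE_XML)
for item in xml_products:
    print(item)
```

With a real response, the same function could be given `response.text`. The Python documentation warns that the standard XML modules are not hardened against maliciously constructed input, so treat XML from untrusted sites with care.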

## Summary

In this notebook, you learned about:

✅ **Web Scraping Fundamentals**: HTTP requests, HTML structure, ethics[7][23]  
✅ **Requests Library**: Making HTTP requests, handling responses, sessions  
✅ **Beautiful Soup**: HTML parsing, element selection, data extraction[24][27]  
✅ **Data Collection**: Structured data extraction and storage formats  
✅ **Best Practices**: Rate limiting, error handling, respectful scraping[7][10]  
✅ **Real-World Examples**: APIs, forms, multiple content types  

### Key Takeaways:
1. Web scraping is powerful but must be done ethically and legally[7][23]
2. Always prefer official APIs when available over scraping
3. Requests and Beautiful Soup provide excellent tools for most scraping needs[24]
4. Rate limiting and error handling are crucial for robust scrapers[10]
5. Structure extracted data for easy analysis and storage
6. Test scrapers thoroughly and handle edge cases gracefully

## 🎉 **Complete Python Journey Achieved!**

You've now mastered **30 comprehensive topics** covering:
- **Python Fundamentals** (1-19): Core language, OOP, file handling
- **Advanced Python** (20-24): Comprehensions, decorators, generators, regex  
- **Data Science Stack** (25-27): NumPy, Pandas, Matplotlib
- **Specialized Skills** (28-30): Seaborn, Machine Learning, Web Scraping

You're now equipped with production-ready Python skills for data science, web development, automation, and beyond! 🐍🚀