# Web Scraping with Multithreading

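Web scraping is a good fit for multithreading because it is I/O-bound: most of the time is spent waiting for network responses, and Python threads release the GIL while they wait on blocking I/O. Running requests in threads can therefore significantly reduce the total time needed to scrape multiple URLs.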

## What We'll Learn

1. Sequential vs Concurrent Web Scraping
2. Using ThreadPoolExecutor for Scraping
3. Practical Example with Multiple URLs
4. Error Handling in Concurrent Scraping
5. Best Practices

---

## 1. Sequential Web Scraping (Slow)

The baseline approach fetches URLs one at a time, so the total time is the sum of every request's latency.

```python
import time

# Note: This example doesn't require actual web requests
# We simulate HTTP requests to demonstrate the concept

def fetch_url_sequential(url):
    """Simulate fetching a URL (replace with actual requests.get(url))"""
    print(f"Fetching: {url}")
    time.sleep(1)  # Simulate network delay
    return f"Content from {url}"

# List of URLs to scrape
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
    "https://example.com/page4",
    "https://example.com/page5",
]

# Sequential scraping
print("=== Sequential Scraping ===")
start_time = time.time()

results = []
for url in urls:
    result = fetch_url_sequential(url)
    results.append(result)

elapsed = time.time() - start_time
print(f"\nTime taken: {elapsed:.2f} seconds")
print(f"URLs scraped: {len(results)}")
```

---

## 2. Concurrent Web Scraping with ThreadPoolExecutor (Fast)

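A thread pool lets several requests wait on the network at the same time, so the total time approaches that of the slowest request rather than the sum of all of them.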

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fetch_url_concurrent(url):
    """Simulate fetching a URL with threading"""
    print(f"Fetching: {url}")
    time.sleep(1)  # Simulate network delay
    return f"Content from {url}"

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
    "https://example.com/page4",
    "https://example.com/page5",
]

# Concurrent scraping with ThreadPoolExecutor
print("=== Concurrent Scraping with ThreadPoolExecutor ===")
start_time = time.time()

with ThreadPoolExecutor(max_workers=5) as executor:
    # Submit all tasks and remember which future belongs to which URL
    future_to_url = {executor.submit(fetch_url_concurrent, url): url for url in urls}

    # Collect results as they complete (completion order, not submission order)
    results = []
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            result = future.result()
            results.append(result)
            print(f"✓ Completed: {url}")
        except Exception as e:
            print(f"✗ Error fetching {url}: {e}")

elapsed = time.time() - start_time
print(f"\nTime taken: {elapsed:.2f} seconds")
print(f"URLs scraped: {len(results)}")

# Sequential baseline: each simulated fetch sleeps for 1 second
sequential_estimate = len(urls) * 1.0
print(f"Speed improvement: ~{sequential_estimate / elapsed:.1f}x faster than sequential")
```
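The `try`/`except` around `future.result()` only matters once a fetch can actually fail. The following is a minimal sketch of that case, assuming a simulated failure: `fetch_or_fail`, `scrape_all`, and the `/broken` URL are hypothetical names introduced for illustration and are not part of the example above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def fetch_or_fail(url):
    """Simulated fetch (hypothetical helper): raises for URLs ending in '/broken'."""
    time.sleep(0.1)  # Simulate network delay
    if url.endswith("/broken"):
        raise ConnectionError(f"simulated failure for {url}")
    return f"Content from {url}"


def scrape_all(urls, max_workers=5):
    """Fetch every URL concurrently; return (successes, failures) dicts keyed by URL."""
    successes, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch_or_fail, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                successes[url] = future.result()
            except Exception as exc:  # One bad URL must not abort the rest
                failures[url] = exc
    return successes, failures


demo_urls = [
    "https://example.com/page1",
    "https://example.com/broken",
    "https://example.com/page2",
]
ok, failed = scrape_all(demo_urls)
print(f"Succeeded: {len(ok)}, failed: {len(failed)}")
for url, exc in failed.items():
    print(f"Failed: {url}: {exc}")
```

Keeping failures in their own dictionary makes it straightforward to retry or log them after the batch finishes, instead of losing them in the console output.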

---

## Summary

**Key Takeaways:**

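1. **Web Scraping is I/O-Bound**: Threads spend most of their time waiting on the network, which makes scraping well suited to multithreading
2. **Speed Improvement**: With N workers, up to roughly N times faster when network latency dominates (about 5x in the five-URL example above)
3. **ThreadPoolExecutor**: Simplest approach for concurrent scraping
4. **Error Handling**: Catch exceptions per URL so that one failure does not abort the whole batch
5. **Respectful Scraping**:
   - Don't overwhelm servers (keep max_workers modest)
   - Add delays between requests if needed
   - Respect robots.txt
   - Send a descriptive User-Agent header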
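
The respectful-scraping points can be sketched with the standard library alone. This is one possible arrangement under stated assumptions, not a definitive implementation: `USER_AGENT`, `RateLimiter`, `polite_fetch`, the 0.05 second interval, and the inline robots.txt rules are all hypothetical, and the fetch is simulated so the sketch runs offline.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper/0.1"  # Hypothetical identifier; use your own

# Rules are parsed from an inline string so the sketch runs offline.
# In real use, call set_url(".../robots.txt") followed by read() instead.
robots = RobotFileParser()
robots.parse("""
User-agent: *
Disallow: /private/
""".splitlines())


class RateLimiter:
    """Spaces out request starts across threads by at least min_interval seconds."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self):
        # Reserve the next free start time under the lock, then sleep outside it
        with self._lock:
            now = time.monotonic()
            start_at = max(now, self._next_slot)
            self._next_slot = start_at + self.min_interval
        time.sleep(start_at - now)


limiter = RateLimiter(min_interval=0.05)  # Assumed gap between request starts


def polite_fetch(url):
    """Simulated fetch that honours robots.txt and the shared rate limit."""
    if not robots.can_fetch(USER_AGENT, url):
        return None  # Skipped: disallowed by robots.txt
    limiter.wait()
    return f"Content from {url}"  # Replace with a real request sending USER_AGENT


candidate_urls = [
    "https://example.com/page1",
    "https://example.com/private/data",
    "https://example.com/page2",
]
with ThreadPoolExecutor(max_workers=3) as executor:  # Small pool limits load
    pages = list(executor.map(polite_fetch, candidate_urls))

print(pages)
```

The limiter is shared by all worker threads, so the request rate stays bounded even if `max_workers` is raised later.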

**Real-World Usage:**
```python
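# With the actual requests library (third-party: pip install requests):
import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["https://example.com/page1", "https://example.com/page2"]

def scrape_url(url):
    # A timeout keeps a stalled server from blocking a worker thread forever
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raise on 4xx/5xx responses
    # Parse with BeautifulSoup, extract data, etc.
    return response.text

with ThreadPoolExecutor(max_workers=10) as executor:
    # list() consumes the lazy iterator; a failed request re-raises here
    results = list(executor.map(scrape_url, urls))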
```
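
Unlike `as_completed`, `executor.map` yields results in the order of its inputs, and an exception raised by a call is re-raised when that result is retrieved from the iterator. A small sketch of the ordering behaviour, using a simulated fetch with per-URL delays (`slow_then_fast` and `jobs` are hypothetical names chosen for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_then_fast(item):
    """Simulated fetch (hypothetical): each item carries its own delay."""
    url, delay = item
    time.sleep(delay)
    return f"Content from {url}"


# The first URL is the slowest, so it finishes last
jobs = [
    ("https://example.com/page1", 0.3),
    ("https://example.com/page2", 0.1),
    ("https://example.com/page3", 0.2),
]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in input order, not completion order
    ordered = list(executor.map(slow_then_fast, jobs))

for (url, _), content in zip(jobs, ordered):
    print(f"{url} -> {content}")
```

This makes `map` convenient when results must line up with the input list, while `as_completed` suits cases where each result should be processed as soon as it arrives.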

**Benefits:**
- Much faster for batches of URLs when network latency dominates
- Simple to implement
- Isolates failures per URL when each result is retrieved inside try/except
- Scales well for I/O-bound operations

Multithreading makes web scraping practical and efficient!