### **ThreadPoolExecutor for Web Scraping**

### What is ThreadPoolExecutor?
`ThreadPoolExecutor` is a Python class in the `concurrent.futures` module that allows you to manage a pool of threads efficiently. It simplifies multithreading by allowing you to run multiple tasks concurrently, making it ideal for I/O-bound tasks like web scraping.

---

### Why Use ThreadPoolExecutor in Web Scraping?
When web scraping, most of the time is spent waiting for server responses (I/O). Using `ThreadPoolExecutor` enables you to:
- Scrape multiple pages concurrently.
- Reduce overall execution time.
- Use system resources more efficiently.

---

### Basic Example
Here’s a simple example of using `ThreadPoolExecutor` to scrape multiple URLs:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def scrape_page(url):
    print(f"Scraping: {url}")
    time.sleep(2)  # Simulates a delay for I/O-bound tasks
    return f"Data from {url}"

urls = [f"https://example.com/page{i}" for i in range(1, 6)]

with ThreadPoolExecutor(max_workers=3) as executor:  # 3 worker threads
    results = list(executor.map(scrape_page, urls))

print("Scraping completed!")
```
- **`max_workers=3`**: Creates 3 threads to scrape URLs concurrently.
- **`executor.map()`**: Maps the `scrape_page` function to each URL in the list.

---

### Benefits of Using ThreadPoolExecutor
- **Concurrency**: Reduces execution time for I/O-bound tasks.
- **Simple API**: Easy to use compared to manually managing threads.
- **Scalability**: Handles many tasks efficiently by reusing threads.

---

In [1]:
""" 
Objective: Compare time execution based on basic loop
"""
import time


# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using basic loop
def main():
    # TODO: Fill main function to run io_bound_scraping 5 times
    # TODO: (Optional) Estimate the time execution
    start_time = time.time()
    
    # Run scraping 5 times sequentially
    for i in range(1, 6):
        io_bound_scraping(i)
    
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"\nTotal execution time: {execution_time:.2f} seconds")
    pass
        
# Call the main function to run the scrapings
if __name__ == "__main__":
    main()


scraping 1 started.
scraping 1 completed.
scraping 2 started.
scraping 2 completed.
scraping 3 started.
scraping 3 completed.
scraping 4 started.
scraping 4 completed.
scraping 5 started.
scraping 5 completed.

Total execution time: 10.00 seconds


In [1]:
""" 
Objective: Compare time execution based on the number of workers
"""
import time
from concurrent.futures import ThreadPoolExecutor

# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using ThreadPoolExecutor with map for faster execution
def main():
    # TODO: Creating a ThreadPoolExecutor with 1 threads
    with ThreadPoolExecutor(max_workers=1) as executor:
        # Use map to run the scrapings concurrently
        executor.map(io_bound_scraping, range(1,5))

# Call the main function to run the scrapings
if __name__ == "__main__":
    main()


scraping 1 started.
scraping 1 completed.
scraping 2 started.
scraping 2 completed.
scraping 3 started.
scraping 3 completed.
scraping 4 started.
scraping 4 completed.


In [None]:
""" 
Objective: Compare time execution based on the number of workers
"""
# TODO: Recreate previous code with the 2 workers


import time
from concurrent.futures import ThreadPoolExecutor

# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using ThreadPoolExecutor with map for faster execution
def main():
    start_time = time.time()
    
    # Creating a ThreadPoolExecutor with 2 workers
    with ThreadPoolExecutor(max_workers=2) as executor:
        # Use map to run the scrapings concurrently
        executor.map(io_bound_scraping, range(1, 6))
    
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"\nTotal execution time: {execution_time:.2f} seconds")

# Call the main function to run the scrapings
if __name__ == "__main__":
    main()

scraping 1 started.
scraping 2 started.
scraping 1 completed.
scraping 3 started.
scraping 2 completed.
scraping 4 started.
scraping 3 completed.
scraping 5 started.
scraping 4 completed.
scraping 5 completed.

Total execution time: 6.01 seconds


In [None]:
""" 
Objective: Compare time execution based on the number of workers
"""
# TODO: Recreate previous code with the 4 workers

import time
from concurrent.futures import ThreadPoolExecutor

# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using ThreadPoolExecutor with map for faster execution
def main():
    start_time = time.time()
    
    # Creating a ThreadPoolExecutor with 4 workers
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Use map to run the scrapings concurrently
        executor.map(io_bound_scraping, range(1, 6))
    
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"\nTotal execution time: {execution_time:.2f} seconds")

# Call the main function to run the scrapings
if __name__ == "__main__":
    main()

scraping 1 started.
scraping 2 started.
scraping 3 started.
scraping 4 started.
scraping 1 completed.
scraping 5 started.
scraping 2 completed.
scraping 3 completed.
scraping 4 completed.
scraping 5 completed.

Total execution time: 4.01 seconds


In [None]:
""" 
Objective: Compare time execution based on the number of workers
"""
# TODO: Recreate previous code with the 500 workers for 1000
# TODO: Analyze how your program manage to execute 500 workers at once
import time
from concurrent.futures import ThreadPoolExecutor
import psutil
import os

# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using ThreadPoolExecutor with map for faster execution
def main():
    start_time = time.time()
    
    # Get initial CPU and memory usage
    initial_cpu = psutil.cpu_percent()
    initial_memory = psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024  # MB
    
    # Creating a ThreadPoolExecutor with 500 workers
    with ThreadPoolExecutor(max_workers=500) as executor:
        # Use map to run 1000 scraping tasks concurrently
        executor.map(io_bound_scraping, range(1, 1001))
    
    # Get final CPU and memory usage
    final_cpu = psutil.cpu_percent()
    final_memory = psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024  # MB
    
    end_time = time.time()
    execution_time = end_time - start_time
    
    print(f"\nPerformance Analysis:")
    print(f"Total execution time: {execution_time:.2f} seconds")
    print(f"CPU Usage: {final_cpu - initial_cpu:.1f}% increase")
    print(f"Memory Usage: {final_memory - initial_memory:.1f}MB increase")
    print(f"\nThread Pool Analysis:")
    print(f"- Number of workers: 500")
    print(f"- Number of tasks: 1000")
    print(f"- Average time per task: {execution_time/1000:.3f} seconds")

# Call the main function to run the scrapings
if __name__ == "__main__":
    main()

scraping 1 started.
scraping 2 started.
scraping 3 started.
scraping 4 started.
scraping 5 started.
scraping 6 started.
scraping 7 started.
scraping 8 started.
scraping 9 started.
scraping 10 started.
scraping 11 started.
scraping 12 started.
scraping 13 started.
scraping 14 started.
scraping 15 started.
scraping 16 started.
scraping 17 started.
scraping 18 started.
scraping 19 started.
scraping 20 started.
scraping 21 started.
scraping 22 started.
scraping 23 started.
scraping 24 started.
scraping 25 started.
scraping 26 started.
scraping 27 started.
scraping 28 started.
scraping 29 started.
scraping 30 started.
scraping 31 started.
scraping 32 started.
scraping 33 started.
scraping 34 started.
scraping 35 started.
scraping 36 started.
scraping 37 started.
scraping 38 started.
scraping 39 started.
scraping 40 started.
scraping 41 started.
scraping 42 started.
scraping 43 started.
scraping 44 started.
scraping 45 started.
scraping 46 started.
scraping 47 started.
scraping 48 started.
s

In [None]:
""" 
Objective: Concurrently run function with 2 parameters or more
"""
# TODO: Import necessary package
# TODO: Create a function to simulate I/O bound task with 2 parameters: task_id and delay time
# TODO: Create list of task_id and delay time
# TODO: Run your function with multi-threading by mapping your function with all the parameters

import time
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Function with multiple parameters
def io_bound_task(task_id, delay):
    print(f"Task {task_id} started with {delay}s delay")
    time.sleep(delay)  # Simulate I/O operation
    print(f"Task {task_id} completed")
    return f"Result from task {task_id}"

def main():
    start_time = time.time()
    
    # Create lists of parameters
    task_ids = range(1, 6)  # Tasks 1-5
    delay_times = [1, 2, 3]  # Different delay times
    
    # Create all combinations of parameters
    task_params = list(product(task_ids, delay_times))
    
    # Run tasks concurrently with ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=3) as executor:
        # Use starmap to pass multiple parameters
        results = list(executor.map(lambda p: io_bound_task(*p), task_params))
    
    end_time = time.time()
    print(f"\nTotal execution time: {end_time - start_time:.2f} seconds")

if __name__ == "__main__":
    main()

Task 1 started with 1s delay
Task 1 started with 2s delay
Task 1 started with 3s delay
Task 1 completed
Task 2 started with 1s delay
Task 2 completed
Task 2 started with 2s delay
Task 1 completed
Task 2 started with 3s delay
Task 1 completed
Task 3 started with 1s delay
Task 2 completed
Task 3 started with 2s delay
Task 3 completed
Task 3 started with 3s delay
Task 2 completed
Task 4 started with 1s delay
Task 3 completed
Task 4 started with 2s delay
Task 4 completed
Task 4 started with 3s delay
Task 3 completed
Task 5 started with 1s delay
Task 4 completed
Task 5 started with 2s delay
Task 5 completed
Task 5 started with 3s delay
Task 4 completed
Task 5 completed
Task 5 completed

Total execution time: 11.06 seconds


In [6]:
"""
Homework Assignment: Improve previous code. 
Instead of creating a list of delay time, combine the list of task_id with a constant value of delay time
using lambda
"""

import time
from concurrent.futures import ThreadPoolExecutor

# Function with multiple parameters
def io_bound_task(task_id, delay):
    print(f"Task {task_id} started with {delay}s delay")
    time.sleep(delay)  # Simulate I/O operation
    print(f"Task {task_id} completed")
    return f"Result from task {task_id}"

def main():
    start_time = time.time()
    
    # Create list of task IDs and constant delay
    task_ids = range(1, 6)  # Tasks 1-5
    DELAY_TIME = 2  # Constant delay time
    
    # Run tasks concurrently with ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=3) as executor:
        # Use lambda to combine task_id with constant delay
        results = list(executor.map(
            lambda x: io_bound_task(x, DELAY_TIME), 
            task_ids
        ))
    
    end_time = time.time()
    print(f"\nTotal execution time: {end_time - start_time:.2f} seconds")

if __name__ == "__main__":
    main()

Task 1 started with 2s delay
Task 2 started with 2s delay
Task 3 started with 2s delay
Task 1 completed
Task 4 started with 2s delay
Task 3 completed
Task 5 started with 2s delay
Task 2 completed
Task 4 completed
Task 5 completed

Total execution time: 4.01 seconds


In [7]:
""" 
Objective: Implement multi-threading in web scraping
"""
# TODO: Implement multi-threading on your bookstoscrape project inside new branch
# TODO: Put github url here for grading

import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
from urllib.parse import urljoin

def scrape_book(url):
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        title = soup.find('h1').text
        price = soup.find('p', class_='price_color').text
        availability = soup.find('p', class_='availability').text.strip()
        rating = soup.find('p', class_='star-rating')['class'][1]
        
        return {
            'title': title,
            'price': price,
            'availability': availability,
            'rating': rating,
            'url': url
        }
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

def main():
    start_time = time.time()
    base_url = 'http://books.toscrape.com/catalogue/'
    
    # Get all book URLs from the main page
    response = requests.get('http://books.toscrape.com')
    soup = BeautifulSoup(response.text, 'html.parser')
    book_links = soup.select('h3 a')
    book_urls = [urljoin(base_url, link['href']) for link in book_links]
    
    # Use ThreadPoolExecutor to scrape books concurrently
    with ThreadPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(scrape_book, book_urls))
    
    # Filter out None results and create DataFrame
    results = [r for r in results if r is not None]
    df = pd.DataFrame(results)
    
    # Save to CSV
    df.to_csv('books_data.csv', index=False)
    
    end_time = time.time()
    print(f"\nTotal execution time: {end_time - start_time:.2f} seconds")
    print(f"Successfully scraped {len(results)} books")

if __name__ == "__main__":
    main()

# GitHub Repository URL: [Your GitHub URL here]

Error scraping http://books.toscrape.com/catalogue/catalogue/sapiens-a-brief-history-of-humankind_996/index.html: 'NoneType' object has no attribute 'text'Error scraping http://books.toscrape.com/catalogue/catalogue/the-black-maria_991/index.html: 'NoneType' object has no attribute 'text'
Error scraping http://books.toscrape.com/catalogue/catalogue/tipping-the-velvet_999/index.html: 'NoneType' object has no attribute 'text'
Error scraping http://books.toscrape.com/catalogue/catalogue/the-requiem-red_995/index.html: 'NoneType' object has no attribute 'text'
Error scraping http://books.toscrape.com/catalogue/catalogue/soumission_998/index.html: 'NoneType' object has no attribute 'text'

Error scraping http://books.toscrape.com/catalogue/catalogue/sharp-objects_997/index.html: 'NoneType' object has no attribute 'text'
Error scraping http://books.toscrape.com/catalogue/catalogue/a-light-in-the-attic_1000/index.html: 'NoneType' object has no attribute 'text'
Error scraping http://books.tosc

In [None]:
""" 
Objective: Implement multi-threading in web scraping
"""
# TODO: Find any news site that you like: Tribun, Detik, BBC, nytimes, etc
# TODO: Extract data from the site in CSV
# TODO: Push on github and put the link here

import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
from datetime import datetime

def scrape_article(url):
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # BBC News specific selectors
        title = soup.find('h1').text.strip()
        timestamp = soup.find('time')['datetime']
        content = ' '.join([p.text.strip() for p in soup.select('article p')])
        
        return {
            'title': title,
            'timestamp': timestamp,
            'content': content,
            'url': url
        }
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

def get_article_urls(base_url, num_pages=5):
    article_urls = []
    try:
        response = requests.get(base_url)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # BBC News specific selector for article links
        links = soup.select('a[href*="/news/"]')
        for link in links:
            url = link.get('href')
            if url and url.startswith('/'):
                full_url = 'https://www.bbc.com' + url
                if full_url not in article_urls:
                    article_urls.append(full_url)
                    if len(article_urls) >= num_pages:
                        break
    except Exception as e:
        print(f"Error getting article URLs: {e}")
    
    return article_urls

def main():
    start_time = time.time()
    
    # Get article URLs from BBC News
    base_url = 'https://www.bbc.com/news'
    article_urls = get_article_urls(base_url)
    
    print(f"Found {len(article_urls)} articles to scrape")
    
    # Use ThreadPoolExecutor to scrape articles concurrently
    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(scrape_article, article_urls))
    
    # Filter out None results and create DataFrame
    results = [r for r in results if r is not None]
    df = pd.DataFrame(results)
    
    # Save to CSV with timestamp
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    csv_filename = f'bbc_news_{timestamp}.csv'
    df.to_csv(csv_filename, index=False, encoding='utf-8')
    
    end_time = time.time()
    print(f"\nTotal execution time: {end_time - start_time:.2f} seconds")
    print(f"Successfully scraped {len(results)} articles")
    print(f"Data saved to {csv_filename}")

if __name__ == "__main__":
    main()

# GitHub Repository URL: [Your GitHub URL here]

Found 5 articles to scrape
Error scraping https://www.bbc.com/news/topics/c2vdnvdg6xxt: 'NoneType' object is not subscriptable
Error scraping https://www.bbc.com/news/war-in-ukraine: 'NoneType' object is not subscriptable
Error scraping https://www.bbc.com/news/us-canada: 'NoneType' object is not subscriptable
Error scraping https://www.bbc.com/news/uk: 'NoneType' object is not subscriptable
Error scraping https://www.bbc.com/news/world/africa: 'NoneType' object is not subscriptable

Total execution time: 3.42 seconds
Successfully scraped 0 articles
Data saved to bbc_news_20250313_073718.csv


### **Reflection**
Monitor your resources usage while executing multi-threading, what do you think?

(answer here)

In [9]:
""" 
Objective: Implement multi-threading in web scraping
"""
import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
from datetime import datetime

def scrape_article(url):
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # BBC News specific selectors
        title = soup.find('h1').text.strip()
        timestamp = soup.find('time')['datetime']
        content = ' '.join([p.text.strip() for p in soup.select('article p')])
        
        return {
            'title': title,
            'timestamp': timestamp,
            'content': content,
            'url': url
        }
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

def get_article_urls(base_url, num_pages=5):
    article_urls = []
    try:
        response = requests.get(base_url)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # BBC News specific selector for article links
        links = soup.select('a[href*="/news/"]')
        for link in links:
            url = link.get('href')
            if url and url.startswith('/'):
                full_url = 'https://www.bbc.com' + url
                if full_url not in article_urls:
                    article_urls.append(full_url)
                    if len(article_urls) >= num_pages:
                        break
    except Exception as e:
        print(f"Error getting article URLs: {e}")
    
    return article_urls

def main():
    start_time = time.time()
    
    # Get article URLs from BBC News
    base_url = 'https://www.bbc.com/news'
    article_urls = get_article_urls(base_url)
    
    print(f"Found {len(article_urls)} articles to scrape")
    
    # Use ThreadPoolExecutor to scrape articles concurrently
    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(scrape_article, article_urls))
    
    # Filter out None results and create DataFrame
    results = [r for r in results if r is not None]
    df = pd.DataFrame(results)
    
    # Save to CSV with timestamp
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    csv_filename = f'bbc_news_{timestamp}.csv'
    df.to_csv(csv_filename, index=False, encoding='utf-8')
    
    end_time = time.time()
    print(f"\nTotal execution time: {end_time - start_time:.2f} seconds")
    print(f"Successfully scraped {len(results)} articles")
    print(f"Data saved to {csv_filename}")

if __name__ == "__main__":
    main()

# GitHub Repository URL: [Your GitHub URL here]

Found 5 articles to scrape
Error scraping https://www.bbc.com/news/war-in-ukraine: 'NoneType' object is not subscriptable
Error scraping https://www.bbc.com/news/topics/c2vdnvdg6xxt: 'NoneType' object is not subscriptable
Error scraping https://www.bbc.com/news/uk: 'NoneType' object is not subscriptable
Error scraping https://www.bbc.com/news/us-canada: 'NoneType' object is not subscriptable
Error scraping https://www.bbc.com/news/world/africa: 'NoneType' object is not subscriptable

Total execution time: 1.02 seconds
Successfully scraped 0 articles
Data saved to bbc_news_20250313_075135.csv


### **Exploration**
While Multi-threading is like adding "more engine", there is a better approach for improve scraping time. Find out about asynchronous concept and be prepared for the next class.

In [11]:
"""
Asynchronous Programming Example
"""
import asyncio
import aiohttp
import time
import nest_asyncio

# Enable nested event loops in Jupyter
nest_asyncio.apply()

async def fetch_url(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    start_time = time.time()
    urls = [f'http://example.com/page{i}' for i in range(1, 6)]
    
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        await asyncio.gather(*tasks)
    
    print(f"Total execution time: {time.time() - start_time:.2f} seconds")

# Run the async code
asyncio.run(main())

Total execution time: 1.44 seconds
