# 🕸️ Web Scraping with Multithreading in Python

---

This notebook demonstrates a real-world use case of **multithreading** for I/O-bound tasks, specifically **web scraping**.

When scraping multiple web pages, most of the time is spent waiting on the network. Instead of downloading each page sequentially, we can use **threads** to fetch them concurrently, so the waiting periods overlap and the total run time drops.


In [2]:
!pip install requests
!pip install beautifulsoup4



In [3]:
import threading
import requests
from bs4 import BeautifulSoup
import time

# List of URLs to scrape
urls = [
    "https://python.langchain.com/v0.2/docs/introduction/",
    "https://python.langchain.com/v0.2/docs/concepts/",
    "https://python.langchain.com/v0.2/docs/tutorials/",
]

def fetch_content(url):
    """Fetch and parse content from a single URL."""
    try:
        print(f"Starting download: {url}")
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise error for bad responses
        soup = BeautifulSoup(response.content, "html.parser")
        print(f"✅ Fetched {len(soup.text)} characters from {url}")
    except requests.exceptions.RequestException as e:
        print(f"❌ Error fetching {url}: {e}")

def main():
    start_time = time.time()

    # Create and start threads
    threads = []
    for url in urls:
        thread = threading.Thread(target=fetch_content, args=(url,))
        threads.append(thread)
        thread.start()

    # Wait for all threads to complete
    for thread in threads:
        thread.join()

    print(f"\nAll web pages fetched in {time.time() - start_time:.2f} seconds")

if __name__ == "__main__":
    main()


Starting download: https://python.langchain.com/v0.2/docs/introduction/
Starting download: https://python.langchain.com/v0.2/docs/concepts/
Starting download: https://python.langchain.com/v0.2/docs/tutorials/
✅ Fetched 3814 characters from https://python.langchain.com/v0.2/docs/tutorials/
✅ Fetched 3814 characters from https://python.langchain.com/v0.2/docs/concepts/
✅ Fetched 3814 characters from https://python.langchain.com/v0.2/docs/introduction/

All web pages fetched in 1.99 seconds


### ✅ Output

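After running the cell above, you'll see all three downloads start back to back and then finish in whatever order the responses arrive, followed by the total elapsed time. Because the threads wait on the network at the same time, that total is roughly the duration of the slowest request rather than the sum of all three.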
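The notebook only times the threaded run, so here is a minimal sketch of a side-by-side comparison. It assumes a hypothetical `fake_fetch` helper that uses `time.sleep` as a stand-in for network latency, and placeholder `example.com` URLs. Neither is part of the original cells. They let the comparison run without touching a real server.

```python
import threading
import time

def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for requests.get: sleeps to mimic network latency."""
    time.sleep(delay)
    return len(url)

def run_sequential(urls):
    """Fetch each URL one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for url in urls:
        fake_fetch(url)
    return time.perf_counter() - start

def run_threaded(urls):
    """Fetch every URL in its own thread and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_fetch, args=(url,)) for url in urls]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start

# Placeholder URLs (assumption); nothing is actually downloaded
demo_urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

sequential = run_sequential(demo_urls)
threaded = run_threaded(demo_urls)
print(f"Sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

With three simulated requests of 0.2 seconds each, the sequential run takes at least the sum of the delays, while the threaded run should land near a single delay. Swapping `fake_fetch` for the real `fetch_content` would give the same comparison against live pages, with timings that depend on the network.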


### ⚡ ThreadPoolExecutor Alternative

Instead of creating, starting, and joining threads by hand, Python's `concurrent.futures.ThreadPoolExecutor` provides a simpler API that manages a pool of worker threads for you:


In [4]:
from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup
import time

urls = [
    "https://python.langchain.com/v0.2/docs/introduction/",
    "https://python.langchain.com/v0.2/docs/concepts/",
    "https://python.langchain.com/v0.2/docs/tutorials/",
]

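def fetch_content(url):
    """Fetch a page and report how many characters of text it contains."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # treat HTTP 4xx/5xx responses as errors
        soup = BeautifulSoup(response.content, "html.parser")
        print(f"Fetched {len(soup.text)} characters from {url}")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")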

start = time.time()
with ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(fetch_content, urls)

print(f"\nAll pages fetched in {time.time() - start:.2f} seconds")


Fetched 3814 characters from https://python.langchain.com/v0.2/docs/concepts/
Fetched 3814 characters from https://python.langchain.com/v0.2/docs/introduction/
Fetched 3814 characters from https://python.langchain.com/v0.2/docs/tutorials/

All pages fetched in 2.01 seconds
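The cell above only prints from inside the workers. When the results are needed afterwards, the executor can hand them back. The sketch below is illustrative rather than part of the original notebook: `fake_fetch` is a hypothetical stand-in that sleeps instead of calling `requests.get`, and the `example.com` URLs are placeholders. It shows two standard-library options: `executor.map`, which yields results in input order, and `as_completed`, which yields futures as they finish.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fake_fetch(url):
    """Hypothetical stand-in for a real download; returns (url, length)."""
    time.sleep(0.05)  # simulated network latency
    return url, len(url)

# Placeholder URLs (assumption); nothing is actually downloaded
demo_urls = [
    "https://example.com/a",
    "https://example.com/bb",
    "https://example.com/ccc",
]

# Option 1: map() returns results in the same order as the input URLs
with ThreadPoolExecutor(max_workers=5) as executor:
    ordered = list(executor.map(fake_fetch, demo_urls))

# Option 2: submit() + as_completed() handles each result as soon as it is ready
completed = {}
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fake_fetch, url) for url in demo_urls]
    for future in as_completed(futures):
        url, length = future.result()  # re-raises any exception from the worker
        completed[url] = length

for url, length in ordered:
    print(f"{url}: {length}")
```

`map` is the shorter choice when the order of results should match the order of the URLs. `as_completed` fits better when each page should be processed the moment it arrives, or when failures need to be handled per URL through `future.result()`.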
