This project demonstrates the use of parallel programming techniques to optimize web scraping in Python. The web scraper fetches titles from web pages using three different methods:
- Basic Scraper: A serial (non-parallel) approach.
- Threading Scraper: Uses Python's `threading` module to perform parallel requests.
- Async Scraper: Utilizes `asyncio` and `aiohttp` for asynchronous web scraping.
The goal is to compare the performance of these methods and understand how threading and asynchronous programming can improve efficiency in I/O-bound tasks.
- Python: Make sure Python 3.8 or later is installed on your system. You can check this by running `python --version`. If Python is not installed, download it from python.org and follow the installation instructions.
- Pip: Ensure that `pip` (the Python package installer) is installed and updated. You can check this by running `pip --version`. If `pip` is not installed, refer to the pip installation guide.
- Clone the repository: `git clone https://github.com/rohaankhalid/parallel-web-scraper`, then `cd parallel_web_scraper`.
- Set up a virtual environment: `python -m venv venv`
- Activate the virtual environment:
  - On Windows: `source venv/Scripts/activate`
  - On macOS/Linux: `source venv/bin/activate`
- Install dependencies: `pip install -r requirements.txt`
- Run the main script: `python main.py`
- The script will:
  - Load URLs from `urls_10.txt`, `urls_30.txt`, `urls_50.txt`, and `urls_100.txt`.
  - Perform scraping using the Basic, Threading, and Async methods.
  - Save the results in the `results/` directory and print execution times to the terminal (a simplified sketch of this flow appears after this list).
- Modifying `num_threads` and `max_connections`:
  - To test with different numbers of threads and async connections, manually update these values in `main.py` and rerun the script.
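The exact structure of `main.py` is not reproduced here; the sketch below shows one way the flow described above could look, with a URL loader and a timing helper. The names `load_urls` and `time_scraper` are illustrative, not names from the repository.

```python
import time

def load_urls(path):
    """Read one URL per line from a urls_*.txt file, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def time_scraper(label, scrape_fn, urls):
    """Run one scraper over the URL list and report its wall-clock time."""
    start = time.perf_counter()
    results = scrape_fn(urls)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s for {len(urls)} URLs")
    return results

# Example usage with one of the scraper functions sketched later in this README:
# urls = load_urls("urls_10.txt")
# time_scraper("basic", scrape_serial, urls)
```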
- File: `basic_scraper.py`
- Description: Performs web scraping serially, fetching each URL one by one using the `requests` library (see the sketch below).
- Use Case: Useful as a baseline for performance comparison.
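The repository's exact code is not shown above, so the following is only a minimal sketch of the serial approach, assuming the `<title>` tag is extracted with a simple regular expression; the helper names are illustrative.

```python
import re
import requests

def fetch_title(url):
    """Fetch a page with requests and pull out its <title> text via a simple regex."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    match = re.search(r"<title[^>]*>(.*?)</title>", response.text, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

def scrape_serial(urls):
    """Fetch each URL one at a time; total time is roughly the sum of all request times."""
    return [(url, fetch_title(url)) for url in urls]
```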
- File: `threading_scraper.py`
- Description: Uses Python's `threading` module to perform concurrent web requests. Suitable for I/O-bound tasks like web scraping (see the sketch below).
- Benefit: Reduces total execution time by parallelizing requests.
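A hedged sketch of the threaded approach follows, reusing the `fetch_title` helper from the serial sketch above. The queue-and-workers structure and the `num_threads` default are assumptions for illustration, not the repository's exact code.

```python
import threading
from queue import Queue, Empty

def scrape_threaded(urls, num_threads=5):
    """Fan URL fetches out across worker threads so I/O waits overlap instead of adding up."""
    url_queue = Queue()
    for url in urls:
        url_queue.put(url)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = url_queue.get_nowait()
            except Empty:
                return  # no work left for this thread
            title = fetch_title(url)  # helper from the serial sketch above
            with results_lock:
                results.append((url, title))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```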
- File: `async_scraper.py`
- Description: Uses `asyncio` and `aiohttp` to handle asynchronous requests efficiently (see the sketch below).
- Benefit: Further optimizes performance by allowing non-blocking I/O operations.
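A hedged sketch of the asynchronous approach follows; capping concurrency with `aiohttp.TCPConnector(limit=...)` mirrors the `max_connections` setting described in this README, but the details are assumptions rather than the repository's exact code.

```python
import asyncio
import re
import aiohttp

async def fetch_title_async(session, url):
    """Fetch one page without blocking the event loop and extract its <title> text."""
    async with session.get(url) as response:
        html = await response.text()
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return url, (match.group(1).strip() if match else None)

async def scrape_async(urls, max_connections=5):
    """Issue all requests concurrently while capping simultaneous connections."""
    connector = aiohttp.TCPConnector(limit=max_connections)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch_title_async(session, url) for url in urls))

# Example: results = asyncio.run(scrape_async(urls, max_connections=5))
```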
Below are the time ranges (in seconds) recorded for each configuration:
| URL Set | Basic Time | Threading Time (2, 5, 10, 15 threads) | Async Time (2, 5, 10, 15 connections) |
|---|---|---|---|
| urls_10.txt | 2.98 - 3.64 | 1.31 - 1.50 | 1.53 - 2.04 |
| urls_30.txt | 8.91 - 9.91 | 2.95 - 3.29 | 4.06 - 5.56 |
| urls_50.txt | 13.35 - 14.21 | 6.10 - 6.58 | 5.44 - 8.63 |
| urls_100.txt | 26.70 - 30.83 | 11.15 - 12.45 | 9.78 - 13.10 |
- Threading Performance: Generally, threading improved performance compared to the basic scraper, but the benefit diminished as the number of threads increased, likely due to system limitations and overhead.
- Async Performance: The async scraper showed significant improvements over the basic scraper but faced diminishing returns and potential issues with connection limits, as evidenced by the `ConnectionResetError`.
- Optimal Configuration: The best performance was often seen around `num_threads=5` and `max_connections=5`, beyond which the gains were marginal or inconsistent.
- `ConnectionResetError`: During the test with 15 connections, an exception occurred, suggesting that some servers might be rejecting too many simultaneous requests.
- `requests`: For HTTP requests in the basic scraper.
- `aiohttp`: For asynchronous requests in the async scraper.
- `asyncio`: To handle asynchronous programming.
- `time`: To measure execution times.
- `os`: For file operations.
This project adheres to ethical web scraping practices by respecting the rules specified in each website's `robots.txt` file. The `robots.txt` file is a publicly accessible document at the root of a site (for example, https://www.example.com/robots.txt) that outlines which parts of the website can and cannot be accessed by automated scripts.
- Implementation: Before adding any new URLs to the URL list files, I checked the `robots.txt` file of each website to ensure that the pages I intended to scrape are not disallowed (a sketch of automating this check appears below). This helps ensure compliance with each website's policies and avoids potential legal or ethical issues.
- Best Practice: The scraper also avoids sending too many requests in rapid succession to minimize the impact on the website's performance and prevent overloading the server.
By following these guidelines, the project demonstrates responsible web scraping and encourages developers to always respect website rules.
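The `robots.txt` checks for this project were done manually; purely as an illustration, the standard library's `urllib.robotparser` can automate the same check before a URL is added to the lists.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="*"):
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # downloads and parses the site's robots.txt
    return parser.can_fetch(user_agent, url)

# Example: is_allowed("https://www.example.com/some/page")
```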
- More Robust Error Handling: Implement retries and better exception handling to deal with connection issues (a sketch appears after this list).
- Dynamic Load Balancing: Explore adaptive techniques to optimize the number of threads or connections based on system resources.
- Extending Functionality: Scrape additional data fields or implement support for JavaScript-heavy websites using tools like `selenium`.
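As a starting point for the error-handling improvement above, here is a hedged sketch of a retry wrapper with exponential backoff; the function name and parameters are illustrative and not part of the current codebase.

```python
import time

def fetch_with_retries(fetch_fn, url, max_attempts=3, backoff=1.0):
    """Call fetch_fn(url), retrying on connection-related errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_fn(url)
        except (ConnectionError, OSError):
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(backoff * 2 ** (attempt - 1))
```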