
Web Scraper

Project Details

  • ID: 33
  • Title: Web Scraper
  • Category: Python Projects
  • Description: Data extraction tool with BeautifulSoup and requests.
  • Difficulty: Intermediate
  • Source Link: ./web-scraper/web-scraper.zip
  • Demo Link: ./web-scraper/
  • Icon: fas fa-spider
  • Icon Color: text-primary
  • Project Image: ./web-scraper/web-scraper.png
  • Project Image Alt: Web Scraper - rskworld.in

Full Description

Build a powerful web scraper in Python using the BeautifulSoup and requests libraries. It features data extraction, CSV export, and customizable scraping rules, making it perfect for learning Python web scraping, data processing, and automation techniques.

Technologies

  • Python
  • BeautifulSoup
  • Requests
  • Web Scraping
  • Data Processing
  • CSV Export
  • JSON
  • Logging
  • tqdm (Progress Bars)
  • Proxy Management

Features

  • ✅ Data extraction from websites
  • ✅ CSV export functionality
  • ✅ JSON export functionality
  • ✅ Customizable scraping rules
  • ✅ Web automation
  • ✅ Data processing capabilities
  • ✅ Image downloading support
  • ✅ Proxy support for anonymized requests
  • ✅ Error handling with retry mechanism
  • ✅ Progress bar for better UX
  • ✅ Comprehensive logging system
  • ✅ Exponential backoff for retries
  • ✅ Session management
  • ✅ Custom headers support

Created By

  • Founder: Molla Samser
  • Designer & Tester: Rima Khatun
  • RSK World: Free Programming Resources & Source Code
  • Website: https://rskworld.in
  • Year: 2026

Installation

  1. Clone or download the project files
  2. Install the required dependencies:
pip install -r requirements.txt
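
The shipped requirements.txt is authoritative; based on the technologies listed above, its core entries would look roughly like this (an illustrative sketch, not the actual file):

# requirements.txt (illustrative; install the shipped file instead)
requests
beautifulsoup4
tqdm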

Usage

Basic Usage

Run the main script to see a demo of the web scraper:

python web_scraper.py

This will:

  1. Scrape quotes from quotes.toscrape.com (2 pages)
  2. Display sample quotes in the console
  3. Export all quotes to quotes.csv
  4. Demonstrate custom scraping with CSS selectors
  5. Export custom scraped data to custom_quotes.csv

Custom Usage

You can modify the web_scraper.py file to create your own custom scraping logic:

from web_scraper import WebScraper

# Create a scraper instance with advanced options
scraper = WebScraper(
    'https://example.com',
    delay=2,  # Add delay between requests
    retries=5,  # Retry failed requests
    log_level='INFO',  # Enable logging
    log_file='scraper.log'  # Log to file
)

# Define custom CSS selectors
custom_selectors = {
    'item': 'div.product',
    'title': 'h2.product-title',
    'price': 'span.price',
    'description': 'div.description'
}

# Scrape custom data
data = scraper.scrape_custom(custom_selectors, pages=5)

# Export to CSV
scraper.export_to_csv(data, 'products.csv')

# Export to JSON
scraper.export_to_json(data, 'products.json')

Additional Examples

Image Downloading

# Download images from a website
scraper.download_images(
    'https://example.com/gallery',
    selector='img.gallery-image',  # Custom CSS selector
    output_dir='gallery_images',  # Output directory
    limit=10  # Limit to 10 images
)

Proxy Support

# Set up proxies for anonymized requests
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'https://proxy.example.com:8080'
}

scraper.set_proxies(proxies)
# Continue scraping with proxies...

Advanced Logging

import logging

# Create scraper with debug logging
scraper = WebScraper(
    'https://example.com',
    log_level=logging.DEBUG,
    log_file='detailed_scraper.log'
)

Project Structure

web-scraper/
├── web_scraper.py         # Main web scraper script
├── requirements.txt       # Required dependencies
├── README.md              # Project documentation
├── scraper.log            # Log file (generated)
├── quotes.csv             # Sample output file (generated)
├── quotes.json            # Sample JSON output (generated)
├── custom_quotes.csv      # Sample custom output (generated)
├── custom_quotes.json     # Sample custom JSON output (generated)
└── images/                # Downloaded images directory (generated)
    └── *.jpg/png/gif      # Downloaded images

Features in Detail

Data Extraction

  • Extract text, links, and other data from HTML elements
  • Support for CSS selectors
  • Handle pagination automatically
  • Scrape multiple pages with ease
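
As background, the sketch below shows roughly how this kind of extraction works with requests and BeautifulSoup directly. It is a standalone illustration against the bundled demo site, not the project's internal code:

import requests
from bs4 import BeautifulSoup

# Fetch one page of the demo site used by the bundled example
response = requests.get('https://quotes.toscrape.com/page/1/', timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')

# CSS selectors pick out each quote block, then its text and author
for quote in soup.select('div.quote'):
    text = quote.select_one('span.text').get_text(strip=True)
    author = quote.select_one('small.author').get_text(strip=True)
    print(f'{author}: {text}')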

CSV Export

  • Export scraped data to CSV format
  • Customizable delimiters
  • UTF-8 encoding support
  • Automatic header detection from data
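
The project's own exporter is scraper.export_to_csv (see Usage). As a minimal sketch of the underlying technique with Python's standard csv module, with headers detected from the first row's keys:

import csv

def export_to_csv(data, filename, delimiter=','):
    """Write a list of dicts to CSV; headers come from the first row's keys."""
    if not data:
        return
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(data[0].keys()),
                                delimiter=delimiter)
        writer.writeheader()
        writer.writerows(data)

# Example: one scraped quote becomes a one-row CSV with detected headers
export_to_csv([{'author': 'Albert Einstein', 'text': 'Imagination...'}],
              'quotes.csv')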

JSON Export

  • Export scraped data to JSON format
  • Pretty-printed output with customizable indentation
  • Support for complex nested data structures
  • UTF-8 encoding for international characters
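
A minimal sketch of the same idea with Python's standard json module (the project's own method is scraper.export_to_json, shown in Usage):

import json

def export_to_json(data, filename, indent=2):
    """Write scraped data as pretty-printed JSON with customizable indentation."""
    with open(filename, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps international characters readable in the file
        json.dump(data, f, indent=indent, ensure_ascii=False)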

Image Downloading

  • Download images from web pages
  • Customizable CSS selectors for targeting images
  • Automatic directory creation
  • Limit the number of images to download
  • Support for various image formats (JPG, PNG, GIF, etc.)
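
The project exposes scraper.download_images (see Additional Examples). A self-contained sketch of the general approach, with hypothetical defaults, might look like this:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def download_images(page_url, selector='img', output_dir='images', limit=10):
    """Download up to `limit` images matched by a CSS selector on one page."""
    os.makedirs(output_dir, exist_ok=True)  # automatic directory creation
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    for i, img in enumerate(soup.select(selector)[:limit]):
        src = img.get('src')
        if not src:
            continue
        url = urljoin(page_url, src)  # resolve relative image URLs
        name = os.path.basename(url.split('?')[0]) or f'image_{i}.jpg'
        with open(os.path.join(output_dir, name), 'wb') as f:
            f.write(requests.get(url, timeout=10).content)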

Customizable Scraping Rules

  • Define your own CSS selectors
  • Flexible data extraction logic
  • Support for different website structures
  • Easy to adapt to new websites

Web Automation

  • Session management for persistent connections
  • Custom headers and user agents
  • Configurable delay between requests to avoid rate limiting
  • Proxy support for anonymized browsing
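
A minimal sketch of these ideas with a plain requests.Session (the URLs are the demo site from Usage; the user agent string is a placeholder):

import time
import requests

# A persistent session reuses connections and carries headers across requests
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)',  # custom user agent
    'Accept-Language': 'en-US,en;q=0.9',
})

urls = ['https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/']
for url in urls:
    response = session.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(2)  # configurable delay to avoid rate limiting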

Error Handling & Retry Mechanism

  • Automatic retry of failed requests
  • Exponential backoff strategy for retries
  • Configurable number of retry attempts
  • Detailed error logging
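
The general pattern, sketched independently of the project's internals:

import time
import requests

def fetch_with_retries(url, retries=5, backoff=1.0):
    """GET a URL, retrying on failure with exponentially increasing waits."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            if attempt == retries - 1:
                raise  # out of attempts; propagate the last error
            wait = backoff * (2 ** attempt)  # 1s, 2s, 4s, 8s, ...
            print(f'Attempt {attempt + 1} failed ({exc}); retrying in {wait:.0f}s')
            time.sleep(wait)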

Progress Bars

  • Visual progress indicators for long-running tasks
  • Separate progress bars for different operations
  • Estimated time remaining display
  • Clean console output
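
tqdm, listed under Technologies, provides this with a one-line wrapper around any iterable:

import time
from tqdm import tqdm

# tqdm renders a progress bar with rate and estimated time remaining
for page in tqdm(range(1, 6), desc='Scraping pages', unit='page'):
    time.sleep(0.5)  # stand-in for fetching and parsing one page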

Logging System

  • Configurable logging levels (DEBUG, INFO, WARNING, ERROR)
  • Both console and file logging options
  • Detailed timestamps and log levels
  • Easy debugging with comprehensive logs
  • Support for rotating log files (see the sketch below)
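
A minimal sketch of such a setup using Python's standard logging module with a rotating file handler (an assumption about how this can be wired up, not the project's exact code):

import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger('scraper')
logger.setLevel(logging.DEBUG)

formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')

# Console output plus a rotating log file capped at ~1 MB with 3 backups
console = logging.StreamHandler()
console.setFormatter(formatter)
file_handler = RotatingFileHandler('scraper.log', maxBytes=1_000_000,
                                   backupCount=3)
file_handler.setFormatter(formatter)

logger.addHandler(console)
logger.addHandler(file_handler)

logger.info('Scraper started')
logger.debug('Fetching page 1')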

Proxy Support

  • Configure HTTP and HTTPS proxies
  • Dynamic proxy updates during scraping
  • Support for proxy rotation
  • Anonymized requests for privacy
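
set_proxies covers static configuration (see Additional Examples); rotation itself can be as simple as cycling through a pool, sketched here with placeholder addresses:

from itertools import cycle
import requests

# Rotate through a pool of proxies, one per request (addresses are placeholders)
proxy_pool = cycle([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
])

for url in ['https://example.com/a', 'https://example.com/b']:
    proxy = next(proxy_pool)
    response = requests.get(url, proxies={'http': proxy, 'https': proxy},
                            timeout=10)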

Data Processing

  • Clean and process scraped data
  • Structured data output in multiple formats
  • Easy to integrate with other tools and pipelines
  • Ready-to-use data for analysis or visualization

Learning Opportunities

This project is perfect for learning:

  • Python programming
  • Web scraping concepts
  • HTML parsing with BeautifulSoup
  • HTTP requests with requests library
  • Data processing and manipulation
  • File I/O operations for various formats
  • CSV file handling
  • JSON file handling
  • OOP (Object-Oriented Programming) concepts
  • Error handling and exception management
  • Retry mechanisms with exponential backoff
  • Logging systems in Python
  • Progress bar implementation
  • Proxy usage for web requests
  • Image downloading and processing
  • Session management for web scraping
  • Configuration management
  • Best practices for web scraping ethics

License

This project is open source and available for educational purposes.

Disclaimer

Please use this web scraper responsibly and respect website terms of service. Always check a website's robots.txt file before scraping and avoid overloading servers with too many requests.
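
Python's built-in urllib.robotparser can automate that robots.txt check; a minimal sketch against the demo site:

from urllib.robotparser import RobotFileParser

# Fetch and parse robots.txt, then ask whether a path may be scraped
parser = RobotFileParser('https://quotes.toscrape.com/robots.txt')
parser.read()

if parser.can_fetch('MyScraper/1.0', 'https://quotes.toscrape.com/page/1/'):
    print('Allowed to scrape this page')
else:
    print('Disallowed by robots.txt')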

Contact

For more information, visit https://rskworld.in
