tadeasf/pcpartsscraper

PC Parts Scraper

A Spring Boot web application that monitors and aggregates PC parts listings from major Czech marketplaces. It features automated scheduling, hash-based deduplication, and optional Tor proxy support to reduce rate limiting and IP blocking while scraping.

Core Features

  • 🖥️ Multi-Marketplace Monitoring: Tracks major Czech PC parts websites (Bazos.cz, SBazar.cz, and more)
  • 📊 Interactive Dashboard: Modern web interface with real-time filtering and search capabilities
  • 🔄 Automated Scheduling: Background scraping jobs with configurable intervals using Quartz Scheduler
  • 🛡️ Tor Proxy Support: Built-in Tor proxy rotation to avoid rate limiting and IP blocking
  • 📈 Smart Deduplication: SHA-256 hash-based checks prevent duplicate entries and reduce storage
  • 🗂️ PC Build Baskets: Create and manage custom PC build configurations
  • 🎯 Advanced Filtering: Filter by part type, price range, marketplace, and search terms
  • 📱 Responsive Design: Modern UI with dark mode support using Tailwind CSS + HTMX

Technology Stack

Java 24 Spring Boot PostgreSQL Quartz Thymeleaf HTMX Tailwind CSS JSoup Tor

Architecture

graph TD
    subgraph "User Interface"
        A["Web Browser<br/>(Thymeleaf + HTMX + Tailwind)"]
    end

    subgraph "Backend Services"
        B["Spring Boot Application<br/>(REST API & MVC)"]
        G["Quartz Scheduler<br/>(Background Jobs)"]
        T["TorProxyService<br/>(Proxy Rotation)"]
    end

    subgraph "Scraping Layer"
        MS["MarketplaceService Interface"]
        BS["BazosScrapingService"]
        SS["SBazarScrapingService"]
        NS["...NewMarketplaceService"]
    end

    subgraph "Data Layer"
        C["PostgreSQL<br/>(Parts & Baskets)"]
        QDB["Quartz Tables<br/>(Job Persistence)"]
    end

    subgraph "External Services"
        BZ["Bazos.cz<br/>(PC Parts)"]
        SB["SBazar.cz<br/>(PC Parts)"]
        TR["Tor Network<br/>(Proxy Rotation)"]
    end

    A -- "HTTPS/Web Requests" --> B
    B -- "Data Persistence" --> C
    B -- "Job Scheduling" --> G
    B -- "Proxy Management" --> T
    G -- "Scheduled Scraping" --> MS
    MS -- "Interface Implementation" --> BS
    MS -- "Interface Implementation" --> SS
    MS -- "Interface Implementation" --> NS
    BS -- "HTTP Scraping" --> BZ
    SS -- "HTTP Scraping" --> SB
    T -- "SOCKS Proxy" --> TR
    BS -- "Via Tor Proxy" --> T
    SS -- "Via Tor Proxy" --> T
    G -- "Job State" --> QDB

Quick Start

Prerequisites

  • Java 24 or higher
  • Docker and Docker Compose
  • Tor (optional, for proxy features)

Installation

  1. Clone the repository

    git clone https://github.com/yourusername/pcpartsscraper.git
    cd pcpartsscraper
  2. Start the database

    docker-compose up -d postgres
  3. Run the application

    ./gradlew bootRun
  4. Access the application

Docker Deployment

# Build and run with Docker Compose
docker-compose up --build

# Run in background
docker-compose up -d

Configuration

Database Configuration

The application uses PostgreSQL with a HikariCP connection pool:

# Database
spring.datasource.url=jdbc:postgresql://localhost:5432/pcpartsdb
spring.datasource.username=pcparts_user
spring.datasource.password=pcparts_password

# Connection Pool (HikariCP)
spring.datasource.hikari.maximum-pool-size=25
spring.datasource.hikari.minimum-idle=10

Scraping Configuration

# Scraping Settings
app.scraping.enabled=true
app.scraping.bazos.interval-hours=3
app.scraping.bazos.max-concurrent-categories=5
app.scraping.bazos.duplicate-stop-threshold=0.8
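
The settings above do not say how `duplicate-stop-threshold` is applied. One plausible reading, shown here purely as an assumption, is that pagination of a category stops once the share of already-stored listings on a page reaches the threshold. The class and method names are hypothetical and not taken from the project.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper illustrating one reading of duplicate-stop-threshold. */
public class DuplicateStopCheck {

    /**
     * Returns true when the fraction of listing IDs on the page that are
     * already known is at or above the threshold (for example 0.8).
     * An empty page also stops pagination.
     */
    public static boolean shouldStop(List<String> pageIds, Set<String> knownIds, double threshold) {
        if (pageIds.isEmpty()) {
            return true;
        }
        long duplicates = pageIds.stream().filter(knownIds::contains).count();
        return (double) duplicates / pageIds.size() >= threshold;
    }

    public static void main(String[] args) {
        Set<String> known = Set.of("a", "b", "c", "d");
        // 4 of 5 listings are already stored, which meets a 0.8 threshold.
        System.out.println(shouldStop(List.of("a", "b", "c", "d", "x"), known, 0.8));
        // 1 of 5 listings is already stored, so scraping would continue.
        System.out.println(shouldStop(List.of("a", "w", "x", "y", "z"), known, 0.8));
    }
}
```

Under this reading, a higher threshold makes the scraper walk further into older pages before giving up, at the cost of more requests.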

Tor Proxy Configuration

# Tor Proxy (Optional)
app.tor.enabled=false
app.tor.host=127.0.0.1
app.tor.socks-port=9050
app.tor.control-port=9051
app.tor.rotation-interval=10

Features

🔍 Advanced Scraping Engine

  • Multi-threaded Processing: Concurrent scraping of multiple categories
  • Intelligent Pagination: Automatic page detection and traversal
  • Duplicate Prevention: SHA-256 hash-based deduplication
  • Error Recovery: Robust error handling with retry mechanisms
  • Rate Limiting: Configurable delays to respect target sites
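
As a rough illustration of the SHA-256 hash-based deduplication listed above, the sketch below fingerprints a listing and remembers which fingerprints have been seen. Which fields the project actually hashes is not stated, so the field choice (marketplace, URL, normalized title), the in-memory set, and all names here are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.HexFormat;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; the real service likely checks hashes against the database. */
public class ListingDeduplicator {
    private final Set<String> seenHashes = new HashSet<>();

    /** Hex-encoded SHA-256 over the assumed identifying fields of a listing. */
    public static String fingerprint(String marketplace, String url, String title) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // Newline separators keep ("ab", "c") distinct from ("a", "bc").
            String key = marketplace + "\n" + url + "\n" + title.trim().toLowerCase(Locale.ROOT);
            return HexFormat.of().formatHex(digest.digest(key.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is unavailable on this JVM", e);
        }
    }

    /** Returns true if the listing has not been seen before and records it. */
    public boolean markIfNew(String marketplace, String url, String title) {
        return seenHashes.add(fingerprint(marketplace, url, title));
    }
}
```

Normalizing the title before hashing means that re-posted listings differing only in case or surrounding whitespace collapse to one entry. Whether that is desirable depends on how the marketplaces reuse URLs.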

🛡️ Tor Proxy Integration

  • Automatic Proxy Rotation: Rotate IP addresses every N requests
  • Circuit Management: Request new Tor circuits for enhanced anonymity
  • Fallback Support: Graceful fallback to direct connections
  • Configurable Settings: Customizable proxy settings and rotation intervals
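
The "rotate every N requests" behaviour can be pictured as a request counter next to a SOCKS proxy definition. The sketch below is an assumption about the general shape, not the project's `TorProxyService`. The class name and methods are hypothetical, and the defaults mirror `app.tor.host`, `app.tor.socks-port`, and `app.tor.rotation-interval`.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of request counting for Tor circuit rotation. */
public class TorRotationSketch {
    private final Proxy proxy;
    private final int rotationInterval;
    private final AtomicInteger requestCount = new AtomicInteger();

    public TorRotationSketch(String host, int socksPort, int rotationInterval) {
        if (rotationInterval <= 0) {
            throw new IllegalArgumentException("rotationInterval must be positive");
        }
        // A standard java.net SOCKS proxy pointing at the local Tor daemon.
        this.proxy = new Proxy(Proxy.Type.SOCKS, new InetSocketAddress(host, socksPort));
        this.rotationInterval = rotationInterval;
    }

    /** Proxy to hand to the HTTP client for scraping requests. */
    public Proxy proxy() {
        return proxy;
    }

    /** Counts one request and returns true on every rotationInterval-th call. */
    public boolean recordRequestAndCheckRotation() {
        return requestCount.incrementAndGet() % rotationInterval == 0;
    }

    public static void main(String[] args) {
        TorRotationSketch tor = new TorRotationSketch("127.0.0.1", 9050, 10);
        for (int i = 1; i <= 20; i++) {
            if (tor.recordRequestAndCheckRotation()) {
                System.out.println("Request " + i + ": would request a new circuit here");
            }
        }
    }
}
```

When the counter fires, a new circuit is typically requested by authenticating on Tor's control port and sending `SIGNAL NEWNYM`. That step is omitted here because the project's authentication setup is not described.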

📊 Data Management

  • Comprehensive Part Data: Title, description, price, location, seller info
  • Price Tracking: Support for both fixed and negotiable pricing
  • Marketplace Attribution: Track source marketplace and specific site
  • Temporal Data: Scraping timestamps and update tracking

🎯 User Interface

  • Modern Design: Clean, responsive interface with dark mode
  • Real-time Filtering: HTMX-powered dynamic filtering without page reloads
  • Advanced Search: Full-text search across titles and descriptions
  • Pagination: Efficient pagination for large datasets
  • Part Categories: Organized by CPU, GPU, RAM, Storage, and more
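
To make the filtering criteria concrete, here is a minimal in-memory sketch that combines part type, price range, and a search term. The real application presumably filters in the database and searches descriptions as well. The `PartView` record is a hypothetical, simplified stand-in for the `Part` entity, and the search is a plain case-insensitive substring match on the title only.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Locale;

/** Hypothetical in-memory illustration of the filter criteria. */
public class PartFilterSketch {

    /** Simplified stand-in for the Part entity; field names are assumptions. */
    public record PartView(String title, String partType, String marketplace, BigDecimal price) {}

    /** Applies each criterion only when it is non-null. */
    public static List<PartView> filter(List<PartView> parts, String partType,
                                        BigDecimal minPrice, BigDecimal maxPrice, String search) {
        String needle = search == null ? null : search.toLowerCase(Locale.ROOT);
        return parts.stream()
                .filter(p -> partType == null || p.partType().equalsIgnoreCase(partType))
                .filter(p -> minPrice == null || p.price().compareTo(minPrice) >= 0)
                .filter(p -> maxPrice == null || p.price().compareTo(maxPrice) <= 0)
                .filter(p -> needle == null || p.title().toLowerCase(Locale.ROOT).contains(needle))
                .toList();
    }
}
```

Treating null as "no constraint" keeps one method usable for any combination of filters, which maps naturally onto optional query parameters.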

🗂️ PC Build Management

  • Build Baskets: Create custom PC build configurations
  • Price Tracking: Monitor total build costs
  • Part Management: Add/remove parts from builds
  • Build History: Track build modifications over time
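
Monitoring the total cost of a build comes down to summing price times quantity over the basket items. The sketch below shows that arithmetic with `BigDecimal`. The `Item` record is a hypothetical simplification of `BasketItem`, whose actual fields are not shown in this README.

```java
import java.math.BigDecimal;
import java.util.List;

/** Hypothetical sketch of a basket total calculation. */
public class BasketTotalSketch {

    /** Simplified stand-in for a basket item; field names are assumptions. */
    public record Item(String title, BigDecimal price, int quantity) {}

    /** Sums price * quantity over all items; an empty basket totals zero. */
    public static BigDecimal total(List<Item> items) {
        return items.stream()
                .map(item -> item.price().multiply(BigDecimal.valueOf(item.quantity())))
                .reduce(BigDecimal.ZERO, BigDecimal::add);
    }
}
```

`BigDecimal` avoids the rounding drift that `double` arithmetic can introduce when many prices are added together.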

API Endpoints

Web Interface

  • GET / - Dashboard with statistics and recent parts
  • GET /parts - Browse all parts with filtering
  • GET /baskets - Manage PC build baskets

Scraping API

  • GET /scraping/bazos - Trigger full Bazos scraping
  • GET /scraping/bazos/{partType} - Scrape specific category

Monitoring

  • GET /actuator/health - Application health status
  • GET /actuator/scheduledtasks - View scheduled jobs
  • GET /actuator/quartz - Quartz scheduler information

Development

Project Structure

src/main/java/com/tadeasfort/pcpartsscraper/
├── config/                 # Configuration classes
│   ├── QuartzConfig.java   # Quartz scheduler configuration
│   └── WebConfig.java      # Web MVC configuration
├── controller/             # REST controllers
│   ├── MainController.java # Main web interface
│   └── ScrapingController.java # Scraping API
├── model/                  # JPA entities
│   ├── Part.java          # PC part entity
│   ├── PCBasket.java      # Build basket entity
│   └── BasketItem.java    # Basket item entity
├── repository/             # Data access layer
│   └── PartRepository.java # Part repository
├── service/                # Business logic
│   ├── TorProxyService.java # Tor proxy management
│   └── scraping/          # Scraping services
│       ├── MarketplaceService.java # Service interface
│       ├── BazosScrapingService.java # Bazos implementation
│       ├── SBazar.java    # SBazar implementation
│       └── CategoryScrapingJob.java # Quartz job
└── PCPartsScraperApplication.java # Main application

Adding New Marketplaces

  1. Create Service Class

    @Service
    public class NewMarketplaceService implements MarketplaceService {
        // Implement interface methods
    }
  2. Add Configuration

    app.scraping.newmarketplace.enabled=true
    app.scraping.newmarketplace.interval-hours=4
  3. Update Quartz Jobs

    • Add category mappings to QuartzConfig
    • Configure job scheduling parameters
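
The methods of `MarketplaceService` are not listed in this README, so the following is only a guess at what implementing it might involve. The two interface methods and their signatures are hypothetical. Check them against the real interface before copying anything.

```java
import java.util.List;

/** Hypothetical sketch; the real MarketplaceService interface may differ. */
public class NewMarketplaceSketch {

    /** Assumed shape of the interface, for illustration only. */
    public interface MarketplaceService {
        String getMarketplaceName();

        /** Returns the listing titles found for a part type such as "GPU". */
        List<String> scrapeCategory(String partType);
    }

    /** In the project this class would also carry Spring's @Service annotation. */
    public static class NewMarketplaceService implements MarketplaceService {
        @Override
        public String getMarketplaceName() {
            // Matches the key used in app.scraping.newmarketplace.* properties.
            return "newmarketplace";
        }

        @Override
        public List<String> scrapeCategory(String partType) {
            // Placeholder: a real implementation would fetch and parse listing pages.
            return List.of();
        }
    }
}
```

Keeping marketplace-specific fetching and parsing behind one interface is what lets the Quartz jobs schedule any marketplace the same way.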

Testing

# Run all tests
./gradlew test

# Run with coverage
./gradlew test jacocoTestReport

# Integration tests
./gradlew integrationTest

Deployment

Production Configuration

  1. Environment Variables

    export SPRING_PROFILES_ACTIVE=production
    export SPRING_DATASOURCE_URL=jdbc:postgresql://prod-db:5432/pcpartsdb
    export APP_TOR_ENABLED=true
  2. Docker Production

    docker build -t pcpartsscraper:latest .
    docker run -d -p 8080:8080 pcpartsscraper:latest
  3. Tor Setup (Optional)

    # Install Tor
    sudo apt-get install tor
    
    # Configure torrc
    echo "SocksPort 9050" | sudo tee -a /etc/tor/torrc
    echo "ControlPort 9051" | sudo tee -a /etc/tor/torrc
    
    # Restart Tor so the new settings take effect
    sudo systemctl restart tor

Monitoring & Maintenance

Health Checks

  • Application health via Spring Actuator
  • Database connection monitoring
  • Quartz job execution tracking
  • Tor proxy status monitoring

Logging

  • Structured logging with SLF4J + Logback
  • Configurable log levels per package
  • Scraping statistics and error reporting
  • Performance metrics and timing data

Maintenance Tasks

  • Regular database cleanup of old parts
  • Quartz job history maintenance
  • Tor circuit rotation monitoring
  • Performance optimization reviews

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support


Built with ❤️ for the PC building community
