A comprehensive Spring Boot web application for monitoring and aggregating PC parts listings from major Czech marketplaces, featuring automated scheduling, intelligent data deduplication, and Tor proxy support for enhanced scraping capabilities.
- Multi-Marketplace Monitoring: Tracks major Czech PC parts websites (Bazos.cz, SBazar.cz, and more)
- Interactive Dashboard: Modern web interface with real-time filtering and search capabilities
- Automated Scheduling: Background scraping jobs with configurable intervals using Quartz Scheduler
- Tor Proxy Support: Built-in Tor proxy rotation to avoid rate limiting and IP blocking
- Smart Deduplication: SHA-256 hash-based deduplication that prevents duplicate entries and keeps storage lean
- PC Build Baskets: Create and manage custom PC build configurations
- Advanced Filtering: Filter by part type, price range, marketplace, and search terms
- Responsive Design: Modern UI with dark mode support using Tailwind CSS + HTMX
```mermaid
graph TD
subgraph "User Interface"
A["Web Browser<br/>(Thymeleaf + HTMX + Tailwind)"]
end
subgraph "Backend Services"
B["Spring Boot Application<br/>(REST API & MVC)"]
G["Quartz Scheduler<br/>(Background Jobs)"]
T["TorProxyService<br/>(Proxy Rotation)"]
end
subgraph "Scraping Layer"
MS["MarketplaceService Interface"]
BS["BazosScrapingService"]
SS["SBazarScrapingService"]
NS["...NewMarketplaceService"]
end
subgraph "Data Layer"
C["PostgreSQL<br/>(Parts & Baskets)"]
QDB["Quartz Tables<br/>(Job Persistence)"]
end
subgraph "External Services"
BZ["Bazos.cz<br/>(PC Parts)"]
SB["SBazar.cz<br/>(PC Parts)"]
TR["Tor Network<br/>(Proxy Rotation)"]
end
A -- "HTTPS/Web Requests" --> B
B -- "Data Persistence" --> C
B -- "Job Scheduling" --> G
B -- "Proxy Management" --> T
G -- "Scheduled Scraping" --> MS
MS -- "Interface Implementation" --> BS
MS -- "Interface Implementation" --> SS
MS -- "Interface Implementation" --> NS
BS -- "HTTP Scraping" --> BZ
SS -- "HTTP Scraping" --> SB
T -- "SOCKS Proxy" --> TR
BS -- "Via Tor Proxy" --> T
SS -- "Via Tor Proxy" --> T
G -- "Job State" --> QDB
- Java 24 or higher
- Docker and Docker Compose
- Tor (optional, for proxy features)
- Clone the repository

  ```bash
  git clone https://github.com/yourusername/pcpartsscraper.git
  cd pcpartsscraper
  ```

- Start the database

  ```bash
  docker-compose up -d postgres
  ```

- Run the application

  ```bash
  ./gradlew bootRun
  ```

- Access the application
  - Web Interface: http://localhost:8080
  - API Documentation: http://localhost:8080/swagger-ui.html
  - Actuator Health: http://localhost:8080/actuator/health
```bash
# Build and run with Docker Compose
docker-compose up --build

# Run in background
docker-compose up -d
```
The application uses PostgreSQL with optimized connection pooling:
```properties
# Database
spring.datasource.url=jdbc:postgresql://localhost:5432/pcpartsdb
spring.datasource.username=pcparts_user
spring.datasource.password=pcparts_password

# Connection Pool (HikariCP)
spring.datasource.hikari.maximum-pool-size=25
spring.datasource.hikari.minimum-idle=10

# Scraping Settings
app.scraping.enabled=true
app.scraping.bazos.interval-hours=3
app.scraping.bazos.max-concurrent-categories=5
app.scraping.bazos.duplicate-stop-threshold=0.8

# Tor Proxy (Optional)
app.tor.enabled=false
app.tor.host=127.0.0.1
app.tor.socks-port=9050
app.tor.control-port=9051
app.tor.rotation-interval=10
```
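One way to consume the `app.tor.*` keys above in code is a type-safe configuration properties class. The sketch below is an assumption about how the binding could look, not the project's actual TorProxyService wiring:

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

// Sketch of binding the app.tor.* keys shown above; the real project may wire this differently.
// Register it with @EnableConfigurationProperties(TorProperties.class) or @ConfigurationPropertiesScan.
@ConfigurationProperties(prefix = "app.tor")
public record TorProperties(
        boolean enabled,          // app.tor.enabled
        String host,              // app.tor.host
        int socksPort,            // app.tor.socks-port (relaxed binding maps kebab-case to camelCase)
        int controlPort,          // app.tor.control-port
        int rotationInterval) {   // app.tor.rotation-interval, requests between circuit rotations
}
```

Injecting a single record like this keeps the SOCKS and control-port settings in one place instead of scattering `@Value` lookups across services.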
- Multi-threaded Processing: Concurrent scraping of multiple categories
- Intelligent Pagination: Automatic page detection and traversal
- Duplicate Prevention: SHA-256 hash-based deduplication (see the sketch after this list)
- Error Recovery: Robust error handling with retry mechanisms
- Rate Limiting: Configurable delays to respect target sites
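The SHA-256 deduplication above boils down to deriving a stable fingerprint for each listing and skipping inserts whose fingerprint is already stored. A rough sketch, where the choice of fields feeding the hash is an assumption:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch only: which listing fields identify a duplicate is an assumption.
public final class ListingFingerprint {

    private ListingFingerprint() {
    }

    // Derives a SHA-256 hex fingerprint from the fields that identify a listing.
    public static String of(String marketplace, String listingUrl, String title) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            String key = marketplace + "|" + listingUrl + "|" + title;
            byte[] hash = digest.digest(key.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Keeping the fingerprint in a unique, indexed column turns the duplicate check into a single lookup, which also keeps settings such as the duplicate-stop-threshold above cheap to evaluate.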
- Automatic Proxy Rotation: Rotate IP addresses every N requests
- Circuit Management: Request new Tor circuits for enhanced anonymity
- Fallback Support: Graceful fallback to direct connections
- Configurable Settings: Customizable proxy settings and rotation intervals
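Circuit rotation goes through the Tor control port configured above: authenticate, then send SIGNAL NEWNYM so subsequent requests leave over a fresh circuit. A minimal sketch that assumes the control port requires no password:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch only: asks the local Tor daemon for new circuits via SIGNAL NEWNYM.
public final class TorCircuitRotator {

    public static void requestNewCircuit(String host, int controlPort) throws Exception {
        try (Socket socket = new Socket(host, controlPort);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {

            send(out, in, "AUTHENTICATE \"\""); // assumes no control-port password is configured
            send(out, in, "SIGNAL NEWNYM");     // tells Tor to switch to clean circuits
        }
    }

    private static void send(Writer out, BufferedReader in, String command) throws Exception {
        out.write(command + "\r\n");            // control-port commands are CRLF-terminated
        out.flush();
        String reply = in.readLine();
        if (reply == null || !reply.startsWith("250")) {
            throw new IllegalStateException("Tor rejected '" + command + "': " + reply);
        }
    }
}
```

Note that Tor throttles NEWNYM internally, so very aggressive rotation intervals are silently rate-limited by the daemon itself.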
- Comprehensive Part Data: Title, description, price, location, seller info
- Price Tracking: Support for both fixed and negotiable pricing
- Marketplace Attribution: Track source marketplace and specific site
- Temporal Data: Scraping timestamps and update tracking
- Modern Design: Clean, responsive interface with dark mode
- Real-time Filtering: HTMX-powered dynamic filtering without page reloads (see the controller sketch after this list)
- Advanced Search: Full-text search across titles and descriptions
- Pagination: Efficient pagination for large datasets
- Part Categories: Organized by CPU, GPU, RAM, Storage, and more
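The HTMX-powered filtering mentioned above usually comes down to a controller endpoint that renders only a table fragment, which HTMX swaps into the page in place. A hypothetical sketch; the mapping, request parameters, fragment name, and repository method are assumptions rather than the project's actual code:

```java
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;

// Hypothetical fragment endpoint for HTMX-driven filtering.
@Controller
public class PartsFragmentController {

    private final PartRepository partRepository;

    public PartsFragmentController(PartRepository partRepository) {
        this.partRepository = partRepository;
    }

    // HTMX sends the current filter values; only the fragment is rendered and swapped into the page.
    @GetMapping("/parts/fragment")
    public String filterParts(@RequestParam(required = false) String search,
                              @RequestParam(required = false) String partType,
                              Model model) {
        model.addAttribute("parts", partRepository.search(search, partType)); // hypothetical query method
        return "parts :: partsTable"; // Thymeleaf fragment selector; name is an assumption
    }
}
```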
- Build Baskets: Create custom PC build configurations
- Price Tracking: Monitor total build costs
- Part Management: Add/remove parts from builds
- Build History: Track build modifications over time
- `GET /` - Dashboard with statistics and recent parts
- `GET /parts` - Browse all parts with filtering
- `GET /baskets` - Manage PC build baskets
- `GET /scraping/bazos` - Trigger full Bazos scraping
- `GET /scraping/bazos/{partType}` - Scrape a specific category
- `GET /actuator/health` - Application health status
- `GET /actuator/scheduledtasks` - View scheduled jobs
- `GET /actuator/quartz` - Quartz scheduler information
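The scraping endpoints above can be as thin as a controller that hands the request off to the marketplace service. A hedged sketch; the project's actual ScrapingController may differ, and scrapeCategory here is a hypothetical service method:

```java
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical sketch of the scraping trigger endpoints.
@RestController
@RequestMapping("/scraping")
public class ScrapingTriggerController {

    private final BazosScrapingService bazosScrapingService;

    public ScrapingTriggerController(BazosScrapingService bazosScrapingService) {
        this.bazosScrapingService = bazosScrapingService;
    }

    // e.g. GET /scraping/bazos/GPU starts a scrape of a single category.
    @GetMapping("/bazos/{partType}")
    public ResponseEntity<String> scrapeCategory(@PathVariable String partType) {
        bazosScrapingService.scrapeCategory(partType); // hypothetical service method
        return ResponseEntity.accepted().body("Scraping started for " + partType);
    }
}
```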
```
src/main/java/com/tadeasfort/pcpartsscraper/
├── config/                            # Configuration classes
│   ├── QuartzConfig.java              # Quartz scheduler configuration
│   └── WebConfig.java                 # Web MVC configuration
├── controller/                        # REST controllers
│   ├── MainController.java            # Main web interface
│   └── ScrapingController.java        # Scraping API
├── model/                             # JPA entities
│   ├── Part.java                      # PC part entity
│   ├── PCBasket.java                  # Build basket entity
│   └── BasketItem.java                # Basket item entity
├── repository/                        # Data access layer
│   └── PartRepository.java            # Part repository
├── service/                           # Business logic
│   ├── TorProxyService.java           # Tor proxy management
│   └── scraping/                      # Scraping services
│       ├── MarketplaceService.java        # Service interface
│       ├── BazosScrapingService.java      # Bazos implementation
│       ├── SBazarScrapingService.java     # SBazar implementation
│       └── CategoryScrapingJob.java       # Quartz job
└── PCPartsScraperApplication.java     # Main application
```
- Create Service Class

  ```java
  @Service
  public class NewMarketplaceService implements MarketplaceService {
      // Implement interface methods
  }
  ```

- Add Configuration

  ```properties
  app.scraping.newmarketplace.enabled=true
  app.scraping.newmarketplace.interval-hours=4
  ```

- Update Quartz Jobs
  - Add category mappings to QuartzConfig
  - Configure job scheduling parameters
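For the Quartz side, registering the new marketplace typically means adding a durable JobDetail and a recurring trigger next to the existing category mappings. A hedged sketch of how that could look; the bean names, job data key, and the reuse of CategoryScrapingJob are assumptions:

```java
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class NewMarketplaceQuartzConfig {

    // Durable job definition so Quartz persists it even without an active trigger.
    @Bean
    public JobDetail newMarketplaceJobDetail() {
        return JobBuilder.newJob(CategoryScrapingJob.class)
                .withIdentity("newMarketplaceScrapingJob")
                .usingJobData("marketplace", "newmarketplace") // hypothetical job data key
                .storeDurably()
                .build();
    }

    // Fires every 4 hours, mirroring app.scraping.newmarketplace.interval-hours above.
    @Bean
    public Trigger newMarketplaceTrigger(JobDetail newMarketplaceJobDetail) {
        return TriggerBuilder.newTrigger()
                .forJob(newMarketplaceJobDetail)
                .withIdentity("newMarketplaceTrigger")
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInHours(4)
                        .repeatForever())
                .build();
    }
}
```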
```bash
# Run all tests
./gradlew test

# Run with coverage
./gradlew test jacocoTestReport

# Integration tests
./gradlew integrationTest
```
- Environment Variables

  ```bash
  export SPRING_PROFILES_ACTIVE=production
  export SPRING_DATASOURCE_URL=jdbc:postgresql://prod-db:5432/pcpartsdb
  export APP_TOR_ENABLED=true
  ```

- Docker Production

  ```bash
  docker build -t pcpartsscraper:latest .
  docker run -d -p 8080:8080 pcpartsscraper:latest
  ```

- Tor Setup (Optional)

  ```bash
  # Install Tor
  sudo apt-get install tor

  # Configure torrc
  echo "SocksPort 9050" | sudo tee -a /etc/tor/torrc
  echo "ControlPort 9051" | sudo tee -a /etc/tor/torrc

  # Start Tor
  sudo systemctl start tor
  ```
- Application health via Spring Actuator
- Database connection monitoring
- Quartz job execution tracking
- Tor proxy status monitoring
- Structured logging with SLF4J + Logback
- Configurable log levels per package
- Scraping statistics and error reporting
- Performance metrics and timing data
- Regular database cleanup of old parts (see the sketch after this list)
- Quartz job history maintenance
- Tor circuit rotation monitoring
- Performance optimization reviews
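The stale-part cleanup called out in this list fits naturally into a small scheduled component. A sketch under two assumptions not confirmed by the source: parts carry a scrapedAt timestamp, and PartRepository exposes a derived deleteByScrapedAtBefore query:

```java
import java.time.LocalDateTime;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical nightly cleanup of listings older than the retention window.
@Component
public class PartCleanupTask {

    private static final int RETENTION_DAYS = 30; // retention window is an assumption

    private final PartRepository partRepository;

    public PartCleanupTask(PartRepository partRepository) {
        this.partRepository = partRepository;
    }

    // Runs every night at 03:00; requires @EnableScheduling on a configuration class.
    @Scheduled(cron = "0 0 3 * * *")
    @Transactional
    public void removeStaleParts() {
        LocalDateTime cutoff = LocalDateTime.now().minusDays(RETENTION_DAYS);
        partRepository.deleteByScrapedAtBefore(cutoff); // hypothetical derived delete query
    }
}
```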
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Email: business@tadeasfort.com
- Issues: GitHub Issues
- Documentation: Wiki
Built with ❤️ for the PC building community