My attempt to build a framework for rapid and effortless development (but also maintenance) of web scrapers using knowledge I gained in the last 2 years.
The framework currently includes:
- HTTP Client
- Easy and enjoyable request building using fluent interface - build, perform and parse in a single line!
- Elastic configuration - manage client parameters easily
- Built-in interceptors for productivity:
- Add authorization headers
- Retry failed requests automatically
- Random fail rate - make sure you can handle connection problems
- Web Robot base class
- Built-in support for concurrency
- Introduction of scopes - keep track of your scraper's execution:
- Know how much time each data scope takes to scrap
- Be notified which code segments create most errors
- Logs will be enhanced with current execution scope
- Data streaming
- Support for different actions (entrypoints)
- Introduction of features - separable functionality modules which avoid tight coupled code. They make it easier to work with scope variables and allow you to define feature-related initial parameters and callbacks.
- Progress tracking - add progress bars and statistics to your robot with minimal effort
- Checkpoints - pause scraping, save checkpoint to a JSON file and resume it later. All this is a part of concurrency module and works with progress trackers out-of-box, no additional code needed!
Features I am planning to add:
- Web UI interface - view execution logs in a tree-like structure using scopes described above
- Multi-threading using Web Worker API
- Distribution of scrapers among multiple computers
- ✔
Saving and restoring scraping progress - Data processing pipeline, including helpers:
- Archive data extraction
- Easy-to-add data persistence
- ✔
Scraping progress tracking - Captcha solver (using external APIs and ML)
- HTTP Client interceptors:
- Cookie persistence (in-memory and filesystem)
- Response persistence (cache small requests)
- Rate limiter
- Automatic proxy switcher
- HTML parsing helpers:
- Auto login form recognition
- Table parser
- Pagination recognizer
- ✔
Binary search for page count
- ✔
- Element anchoring (be resistant to CSS/HTML changes)
- ✔
Conditions - only perform actions when a certain condition is met, if it's not, satisfy it
- ✔
Per-run setTimeout/setInterval feature