DiffScraper is a data extraction framework that aims to reduce the time and complexity of writing a crawling script.
- Infers a template quickly from a set of similar documents generated by a server-side script
- Automatically synthesizes a crawling script by choosing a proper selector
- Compresses the similar documents without data loss
- Decompresses quickly: reconstruction is just a string-join operation
- Stores data efficiently by clustering the similar documents
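The template, compression, and string-join ideas in the list above can be illustrated with a small sketch. This is not DiffScraper's actual algorithm or API: it swaps in the standard library's `difflib.SequenceMatcher` to diff two documents character by character, and the names `infer_template` and `decompress` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def infer_template(doc_a, doc_b):
    """Split two similar documents into shared and document-specific parts.

    Hypothetical helper, not part of DiffScraper. Returns a tuple
    (template, data_a, data_b) where `template` lists the text both
    documents share, in order, and each data list holds the text that
    surrounds those shared segments in one document.
    """
    matcher = SequenceMatcher(None, doc_a, doc_b, autojunk=False)
    template, data_a, data_b = [], [], []
    pos_a = pos_b = 0
    for i, j, n in matcher.get_matching_blocks():
        # Text before each shared block is document-specific data.
        data_a.append(doc_a[pos_a:i])
        data_b.append(doc_b[pos_b:j])
        if n:  # the last block is a zero-length sentinel
            template.append(doc_a[i:i + n])
        pos_a, pos_b = i + n, j + n
    return template, data_a, data_b


def decompress(template, data):
    """Rebuild a document by interleaving its data with the template."""
    parts = []
    # `data` has one more entry than `template`, so pad the template.
    for gap, shared in zip_longest(data, template, fillvalue=""):
        parts.append(gap)
        parts.append(shared)
    return "".join(parts)


page_a = "<li>Name: Alice</li><li>Age: 30</li>"
page_b = "<li>Name: Bob</li><li>Age: 41</li>"
template, data_a, data_b = infer_template(page_a, page_b)

# The shared template is stored once; each page keeps only its own data.
assert decompress(template, data_a) == page_a
assert decompress(template, data_b) == page_b
```

In this sketch the template is stored once and each document keeps only its variable parts, which is where the space saving comes from. Reconstruction is a single `"".join(...)`, so it is lossless by construction. A character-level diff is quadratic in the worst case, so a real implementation would likely work on larger tokens.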
Showcases (to be updated)
- nose -- Unit testing
- colorlogs -- Advanced logging library
- prompt_toolkit -- Interactive CUI (Character User Interface)
How to run?
python3 -m diffscraper.main
Recipes (to be updated)
- Currently, input documents must be encoded in UTF-8.
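If an input document uses another encoding, one option is to convert it to UTF-8 before passing it to DiffScraper. The following is a minimal sketch using only the Python standard library. The function name `reencode_to_utf8` is hypothetical and not part of DiffScraper, and it assumes you already know the source encoding.

```python
from pathlib import Path


def reencode_to_utf8(src_path, dst_path, source_encoding):
    """Write a UTF-8 copy of a file stored in `source_encoding`.

    Hypothetical helper, not part of DiffScraper. Working on bytes
    avoids any newline translation during the conversion.
    """
    text = Path(src_path).read_bytes().decode(source_encoding)
    Path(dst_path).write_bytes(text.encode("utf-8"))
```

For example, `reencode_to_utf8("page.html", "page.utf8.html", "latin-1")` would write a UTF-8 copy of a Latin-1 page; the file names here are placeholders.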