diffscraper
examples
tests
.gitignore
CMakeLists.txt
LICENSE
README.md
build.sh
diffscraper.sh
install_dependencies.sh
requirements.txt
unittest.sh

DiffScraper

DiffScraper is a data extraction framework that aims to reduce the time and complexity of writing a crawling script.

Features

  • Quickly infers a template from a set of similar documents generated by a server-side script
  • Automatically synthesizes a crawling script by choosing a suitable selector
  • Compresses similar documents without data loss
    • Decompression is fast: it is just a string-join operation
    • Similar documents are clustered for efficient storage
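
The idea behind template inference and lossless compression can be illustrated with a minimal sketch. This is not DiffScraper's actual implementation: it uses the standard library's `difflib` as a stand-in diff algorithm, and the names `tokenize`, `infer_template`, `compress`, and `decompress` are hypothetical. Two similar pages are diffed, the parts they share become the template, and each page is then stored as only the values that fill the template's slots.

```python
import difflib
import re


def tokenize(doc):
    # Split on tags but keep them, so "".join(tokens) restores the document.
    return [tok for tok in re.split(r"(<[^>]*>)", doc) if tok]


def infer_template(doc_a, doc_b):
    """Return the shared tokens of two documents, with None marking each slot."""
    a, b = tokenize(doc_a), tokenize(doc_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    template = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            template.extend(a[i1:i2])  # invariant part shared by both pages
        else:
            template.append(None)      # variable part: one slot
    return template


def compress(template, doc):
    """Extract the slot values of a document that fits the template."""
    pattern = "".join(
        "(.*?)" if part is None else re.escape(part) for part in template
    )
    match = re.fullmatch(pattern, doc, flags=re.DOTALL)
    if match is None:
        raise ValueError("document does not fit the template")
    return list(match.groups())


def decompress(template, values):
    """Rebuild the document: fill the slots in order and join the parts."""
    slots = iter(values)
    return "".join(next(slots) if part is None else part for part in template)


# Hypothetical sample pages, as a server-side script might generate them.
page_a = "<html><h1>Alice</h1><p>30</p></html>"
page_b = "<html><h1>Bob</h1><p>25</p></html>"
page_c = "<html><h1>Carol</h1><p>41</p></html>"

template = infer_template(page_a, page_b)
values = compress(template, page_c)
restored = decompress(template, values)
```

In this sketch the template is stored once, each page is reduced to its slot values, and decompression is a single string join, which is why it is cheap.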

Showcases (to be updated)

Requirements

  • nose -- Unit testing
  • colorlogs -- Advanced logging library
  • prompt_toolkit -- Interactive CUI (Character User Interface)
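
A quick way to check whether these dependencies are available is to ask Python whether each module can be found. The sketch below assumes that the import names match the package names listed above, which may not hold for every package; `missing_modules` is a hypothetical helper, not part of DiffScraper.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: import names are the same as the package names in the list above.
REQUIRED = ["nose", "colorlogs", "prompt_toolkit"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```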

How to run?

python3 -m diffscraper.main

Recipes (to be updated)

Troubleshooting

  • Input documents must currently be encoded in UTF-8.
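
If some input documents use another encoding, they can be re-encoded before being passed to DiffScraper. The following is a minimal sketch using only the Python standard library; the helper names `is_utf8` and `convert_to_utf8` are hypothetical, and the source encoding must be known in advance because this sketch does not detect it.

```python
from pathlib import Path


def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src, dst, source_encoding):
    """Re-encode src from source_encoding to UTF-8 and write it to dst.

    Working on bytes avoids any newline translation by the platform.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

For example, `convert_to_utf8("page.html", "page.utf8.html", "latin-1")` would write a UTF-8 copy of a Latin-1 page; the file names here are placeholders.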