An easy-to-use web crawler for collecting text.
- Crawls indefinitely until the end condition is met
- Configurable concurrency
- SQLite-based implementation for easy manual stats
- Support for resuming an interrupted run
- Configurable regex-based URL validation: decide which URLs are added to the queue with a custom regex (see the Go sketch after this list)
- Configurable selector-based page validation: decide which part of the page is scraped with a CSS selector
- Configurable regex-based URL sanitizer: for example, remove everything after #
- Batch URL processing
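As a rough, standalone illustration (not the project's code), the Go sketch below shows how regex-based URL validation and sanitization of the kind listed above can behave; the patterns and URLs are invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Hypothetical validation regex: only queue URLs under example.com/docs.
	validate := regexp.MustCompile(`^https?://www\.example\.com/docs/`)
	// Hypothetical sanitizer regex: drop everything after '#' so duplicate
	// fragment URLs collapse into one queue entry.
	sanitize := regexp.MustCompile(`#.*$`)

	urls := []string{
		"https://www.example.com/docs/page1#section-2",
		"https://www.example.com/blog/post",
	}
	for _, u := range urls {
		clean := sanitize.ReplaceAllString(u, "")
		fmt.Printf("%-45s queue=%v\n", clean, validate.MatchString(clean))
	}
}
```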
- Download the latest binaries from the release page
- Copy the binary to a location in your executable PATH, or run it directly from the terminal with `./spiders`
- Alternatively, install with `go install github.com/therahulprasad/spiders` and run `spiders` from the terminal
- Upgrade to Linux
For help, use `./spiders -h`
Create a `config.yaml` file and run `./spiders`
To use a config that is not in the current directory, run `./spiders -config /path/to/config.yaml`
Resume a previous project by running `./spiders --resume`
Use the self-explanatory `config.yaml` file to configure the project.
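The config keys themselves are not listed in this README. As a minimal sketch only, assuming hypothetical key names (only `content_holder` and `content_tag_attr` appear in the changelog below), such a `config.yaml` could be modelled and loaded in Go with `gopkg.in/yaml.v2` roughly like this:

```go
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v2"
)

// Rough sketch only: these field and key names are assumptions for
// illustration, not the project's actual schema. Only content_holder and
// content_tag_attr are documented elsewhere in this README.
type Config struct {
	ProjectName     string `yaml:"project_name"`     // hypothetical
	SeedURL         string `yaml:"seed_url"`         // hypothetical
	Concurrency     int    `yaml:"concurrency"`      // hypothetical
	URLPattern      string `yaml:"url_pattern"`      // hypothetical validation regex
	SanitizePattern string `yaml:"sanitize_pattern"` // hypothetical sanitizer regex
	Selector        string `yaml:"selector"`         // hypothetical CSS selector
	ContentHolder   string `yaml:"content_holder"`   // documented: "text" or "attr"
	ContentTagAttr  string `yaml:"content_tag_attr"` // documented: attribute to grab
}

func main() {
	sample := []byte(`
project_name: demo
seed_url: https://www.example.com/
concurrency: 4
url_pattern: "^https?://www\\.example\\.com/"
sanitize_pattern: "#.*$"
selector: "div.article"
content_holder: text
content_tag_attr: ""
`)
	var cfg Config
	if err := yaml.Unmarshal(sample, &cfg); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```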
- Do not download non-HTML resources
- Support links that start with `//www.example.com` (protocol-relative URLs)
- Create a UI
- Save config data in SQLite and implement --resume with a DB path instead of a config path; let the user override parameters using CLI arguments
- Add a new project type for fetching paginated API data
- Handle the case when crawling is complete
- Add support for parsing a set of specified tags and collecting data in JSON format
- Automated release on tag using CircleCI
`Ctrl + C` does not work when the number of workers is low
v0.1
Initial Release
v0.2
v0.3
Batch Processing
v0.3.1
Config is now a mandatory flag
Added two config parameters that decide how text is extracted: `content_holder` (text/attr) and `content_tag_attr`
v0.4
Batch support
Attribute grabbing support
Configuration format updated from JSON to YAML
`encode.php` is no longer needed