Spiders


An easy-to-use web crawler for collecting text.

Features

  1. Crawls indefinitely until the end condition is met
  2. Configurable concurrency
  3. SQLite-based storage for easy manual stats
  4. Support for resuming an interrupted crawl
  5. Configurable regex-based URL validation
    You can decide which URLs are added to the queue based on a custom regex
  6. Configurable selector-based page validation
    You can decide which part of the page is scraped based on a CSS selector
  7. Configurable regex-based URL sanitizer
    For example, you can remove everything after # (see the sketch after this list)
  8. Batch URL processing
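
The crawler drives these rules from config.yaml, but as a rough, standalone illustration of what regex-based URL sanitization/validation and CSS-selector extraction look like in Go, here is a minimal sketch. The patterns, the example URL, and the use of the goquery library are assumptions made for this example, not code taken from this repository.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"

	"github.com/PuerkitoBio/goquery" // assumption: any CSS-selector library would do
)

func main() {
	// Sanitize: strip everything after '#', as in the README example.
	sanitize := regexp.MustCompile(`#.*$`)
	raw := "https://www.example.com/articles/42#comments"
	clean := sanitize.ReplaceAllString(raw, "")

	// Validate: only queue URLs that match a custom pattern.
	valid := regexp.MustCompile(`^https://www\.example\.com/articles/`)
	if valid.MatchString(clean) {
		fmt.Println("queue:", clean)
	}

	// Selector-based extraction: keep only the part of the page you care about.
	page := `<html><body><div class="content">Hello, crawler!</div></body></html>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(page))
	if err != nil {
		panic(err)
	}
	fmt.Println("text:", doc.Find("div.content").Text())
}
```

In spiders itself the regex patterns and the selector come from config.yaml rather than being hard-coded.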

Installation

If you do not have Go installed (Recommended)

  • Download the latest binaries from the releases page
  • Copy the binary to a directory on your PATH, or run it directly from the terminal using ./spiders

If you have Go installed

go install github.com/therahulprasad/spiders
and run spiders from the terminal

If you are a Windows user

  • Upgrade to Linux

Usage

For help use ./spiders -h

Create a config.yaml file and run ./spiders

To use a config file that is not in the current directory, run
./spiders -config /path/to/config.yaml

Resume a previous project by running
./spiders --resume

Customization

Use the self-explanatory config.yaml to configure the project.
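
As a rough sketch of the kinds of settings described above (concurrency, URL regex, sanitizer, CSS selector, and the content_holder / content_tag_attr parameters mentioned in the change log), something like the following could appear in config.yaml. All key names except content_holder and content_tag_attr are assumptions for illustration; the sample config.yaml shipped with the repository is the authoritative reference.

```yaml
# Illustrative sketch only -- key names below (except content_holder and
# content_tag_attr) are assumptions, not the tool's actual schema.
concurrency: 4                       # assumed: number of parallel workers
url_regex: '^https://www\.example\.com/articles/'   # assumed: URLs matching this are queued
sanitizer_regex: '#.*$'              # assumed: part of the URL to strip, e.g. fragments
selector: 'div.content'              # assumed: CSS selector for the part of the page to scrape
content_holder: text                 # from the change log: text or attr
content_tag_attr: href               # from the change log: attribute to grab when content_holder is attr
```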

What next?

  • Do not download non-HTML resources
  • Support protocol-relative links which start with //www.example.com
  • Create a UI
  • Save config data in SQLite and implement --resume with a DB path instead of a config path; let users override parameters using CLI arguments
  • Add a new project type for fetching paginated API data
  • Handle the case when crawling is complete
  • Add support for parsing a set of specified tags and collecting data in JSON format
  • Automate releases on tag using CircleCI

Bugs

Ctrl+C does not work when the worker count is low
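
For background, a common way Go worker pools handle Ctrl+C is to cancel a context from a signal handler and have every worker select on that context as well as on the job channel; a worker that only blocks on the job channel will appear to ignore the signal. The sketch below is a generic illustration of that pattern, not code taken from this repository.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	// ctx is cancelled when the process receives Ctrl+C (SIGINT) or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < 2; i++ { // a small worker count, as in the reported bug
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done(): // without this branch, Ctrl+C would be ignored
					return
				case url, ok := <-jobs:
					if !ok {
						return
					}
					fmt.Println("worker", id, "crawling", url)
					time.Sleep(100 * time.Millisecond)
				}
			}
		}(i)
	}

	// Feed a few URLs, stopping early if the context is cancelled.
	for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
		select {
		case <-ctx.Done():
		case jobs <- u:
			continue
		}
		break
	}
	close(jobs)
	wg.Wait()
}
```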

Change Log

v0.1
Initial Release
v0.2
v0.3
Batch processing
v0.3.1
Config is made a mandatory flag. Added two parameters in config to decide how to extract text: content_holder (text/attr) and content_tag_attr.
v0.4
Batch support
Attribute grabbing support
Configuration format updated from JSON to YAML

  • encode.php is no longer needed
