Spiders


An easy-to-use web crawler for collecting text.

Features

  1. Crawls indefinitely until the end condition is met
  2. Configurable concurrency
  3. SQLite-based storage for easy manual stats
  4. Support for resuming an interrupted crawl
  5. Configurable regex-based URL validation
    You can decide which URLs are added to the queue based on a custom regex
  6. Configurable selector-based page validation
    You can decide which part of the page is scraped based on a CSS selector
  7. Configurable regex-based URL sanitizer
    For example, you can remove everything after # (see the sketch after this list)
  8. Batch URL processing
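
The crawler drives these rules from config.yaml, but as a rough, standalone illustration of what regex-based URL sanitization/validation and CSS-selector extraction look like in Go, here is a minimal sketch. The patterns, the example URL, and the use of the goquery library are assumptions made for this example, not code taken from this repository.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"

	"github.com/PuerkitoBio/goquery" // assumption: any CSS-selector library would do
)

func main() {
	// Sanitize: strip everything after '#', as in the README example.
	sanitize := regexp.MustCompile(`#.*$`)
	raw := "https://www.example.com/articles/42#comments"
	clean := sanitize.ReplaceAllString(raw, "")

	// Validate: only queue URLs that match a custom pattern.
	valid := regexp.MustCompile(`^https://www\.example\.com/articles/`)
	if valid.MatchString(clean) {
		fmt.Println("queue:", clean)
	}

	// Selector-based extraction: keep only the part of the page you care about.
	page := `<html><body><div class="content">Hello, crawler!</div></body></html>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(page))
	if err != nil {
		panic(err)
	}
	fmt.Println("text:", doc.Find("div.content").Text())
}
```

In spiders itself the regex patterns and the selector come from config.yaml rather than being hard-coded.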

Installation

If you do not have Go installed (Recommended)

  • Download the latest binaries from the releases page
  • Copy the binary to a directory on your PATH, or run it directly from the terminal using ./spiders

If you have Go installed

go install github.com/therahulprasad/spiders
and run spiders from the terminal

If you are a Windows user

  • Upgrade to Linux

Usage

For help use ./spiders -h

Create a config.yaml file and run ./spiders

To use a config file that is not in the current directory, run
./spiders -config /path/to/config.yaml

Resume a previous project by running
./spiders --resume

Customization

Use the self-explanatory config.yaml to configure the project.
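
As a rough sketch of the kinds of settings described above (concurrency, URL regex, sanitizer, CSS selector, and the content_holder / content_tag_attr parameters mentioned in the change log), something like the following could appear in config.yaml. All key names except content_holder and content_tag_attr are assumptions for illustration; the sample config.yaml shipped with the repository is the authoritative reference.

```yaml
# Illustrative sketch only -- key names below (except content_holder and
# content_tag_attr) are assumptions, not the tool's actual schema.
concurrency: 4                       # assumed: number of parallel workers
url_regex: '^https://www\.example\.com/articles/'   # assumed: URLs matching this are queued
sanitizer_regex: '#.*$'              # assumed: part of the URL to strip, e.g. fragments
selector: 'div.content'              # assumed: CSS selector for the part of the page to scrape
content_holder: text                 # from the change log: text or attr
content_tag_attr: href               # from the change log: attribute to grab when content_holder is attr
```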

What next?

  • Do not download non-HTML resources
  • Support protocol-relative links which start with //www.example.com
  • Create a UI
  • Save config data in SQLite and implement --resume with a DB path instead of a config path; let users override parameters using CLI arguments
  • Add a new project type for fetching paginated API data
  • Handle the case when crawling is complete
  • Add support for parsing a set of specified tags and collecting data in JSON format
  • Automate releases on tag using CircleCI

Bugs

Ctrl+C does not work when the worker count is low
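
For background, a common way Go worker pools handle Ctrl+C is to cancel a context from a signal handler and have every worker select on that context as well as on the job channel; a worker that only blocks on the job channel will appear to ignore the signal. The sketch below is a generic illustration of that pattern, not code taken from this repository.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	// ctx is cancelled when the process receives Ctrl+C (SIGINT) or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < 2; i++ { // a small worker count, as in the reported bug
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done(): // without this branch, Ctrl+C would be ignored
					return
				case url, ok := <-jobs:
					if !ok {
						return
					}
					fmt.Println("worker", id, "crawling", url)
					time.Sleep(100 * time.Millisecond)
				}
			}
		}(i)
	}

	// Feed a few URLs, stopping early if the context is cancelled.
	for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
		select {
		case <-ctx.Done():
		case jobs <- u:
			continue
		}
		break
	}
	close(jobs)
	wg.Wait()
}
```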

Change Log

v0.1
Initial Release
v0.2
v0.3
Batch processing
v0.3.1
Config is made a mandatory flag. Added two parameters in config to decide how to extract text: content_holder (text/attr) and content_tag_attr.
v0.4
Batch support
Attribute grabbing support
Configuration format updated from JSON to YAML

  • encode.php is no longer needed
