Crawlgo

Crawlgo is a crawler written in Go. It aims to be an extensible, scalable, and high-performance distributed crawler system.

Using PhantomJS, crawlgo can crawl web pages rendered with JavaScript.

Prerequisites

Linux OS
phantomjs: the phantomjs executable must be reachable through the PATH environment variable. It can be downloaded here.

Install

go get github.com/tossmilestone/crawlgo
cd ${GOPATH}/src/github.com/tossmilestone/crawlgo
sudo make install

The above commands will install crawlgo in ${GOPATH}/go/bin.

Usage

crawlgo [flags]

Flags:
      --download-selector string   The DOM selector to query the links that will be downloaded from the site
      --enable-profile             enable profiling the program to start a pprof HTTP server on localhost:6360
  -h, --help                       help for crawlgo
      --save-dir string            The directory to save downloaded files. (default "./crawlgo")
      --site string                The site to crawl
      --version                    version for crawlgo
      --workers int                The number of workers to run the crawl tasks. If no set, will be 'runtime.NumCPU()'

Crawlgo identifies downloaded links by file name. If a link's file already exists in the save directory, crawlgo assumes the link has already been downloaded.
