
linkcrawler

Cross-platform persistent and distributed web crawler

linkcrawler is persistent because the queue is stored in a remote database that is automatically re-initialized if interrupted. linkcrawler is distributed because multiple instances of linkcrawler work from the same remotely stored queue, so you can start as many crawlers as you want on separate machines to speed up the process. linkcrawler is also fast because it is threaded and uses connection pools.
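
In rough terms, every crawler instance repeatedly asks the shared server for the next unclaimed URL, fetches it with a pooled HTTP client, and pushes any newly discovered links back to the shared queue. The sketch below only illustrates that loop conceptually; nextURL and pushLinks are hypothetical placeholders for calls to the boltdb-server, not linkcrawler's actual client code:

package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// nextURL and pushLinks are hypothetical stand-ins for HTTP calls to the
// shared boltdb-server; they are not linkcrawler's real API.
func nextURL(server string) (string, bool)    { return "", false }
func pushLinks(server string, links []string) {}

func main() {
	server := "http://X.Y.Z.W:8050"

	// One pooled HTTP client is shared by all workers, so connections
	// to the same hosts get reused.
	client := &http.Client{Timeout: 10 * time.Second}

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // e.g. 8 workers on this machine
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				u, ok := nextURL(server) // claim the next queued URL
				if !ok {
					return // nothing left in the shared queue
				}
				resp, err := client.Get(u)
				if err != nil {
					continue
				}
				resp.Body.Close()
				// ...parse the page for new links here...
				pushLinks(server, nil) // hand discovered links back to the queue
				fmt.Println("crawled", u)
			}
		}()
	}
	wg.Wait()
}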

Crawl responsibly.

This repo has been superseded by schollz/goredis-crawler

Getting Started

Install

If you have Go installed, just do

go get github.com/schollz/linkcrawler/...
go get github.com/schollz/boltdb-server/...

Otherwise, download linkcrawler and the boltdb-server from their releases pages.

Run

Crawl a site

First, run the database server, which will act as the hub on your LAN:

$ ./boltdb-server
boltdb-server running on http://X.Y.Z.W:8050

Then, to capture all the links on a website:

$ linkcrawler --server http://X.Y.Z.W:8050 crawl http://rpiai.com

Make sure to replace http://X.Y.Z.W:8050 with the address printed by the boltdb-server.

You can run this last command on as many machines as you want; each instance will help crawl the same website and add the links it collects to the shared queue on the server.

The current state of the crawler is saved. If the crawler is interrupted, you can simply run the command again and it will resume from where it left off.

See the help (-help) for more options, such as exclusions/inclusions and tuning the worker pool and connection pools.

Download a site

You can also use linkcrawler to download webpages from a newline-delimited list of websites. As before, first start up a boltdb-server. Then you can run:

$ linkcrawler --server http://X.Y.Z.W:8050 download links.txt
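
Here, links.txt is just a plain-text file with one URL per line, for example:

http://rpiai.com
http://example.com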

Downloads are saved into a folder named downloaded; each file is named with the Base32-encoded URL of the link, and its contents are compressed with gzip.
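
If you want to inspect a downloaded page programmatically, the original URL can be recovered from the file name and the body decompressed with Go's standard library. A minimal sketch follows; the exact file extension (and whether one is used at all) is an assumption here, so adjust the path handling to match what you actually see in your downloaded folder:

package main

import (
	"compress/gzip"
	"encoding/base32"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// e.g. go run decode.go downloaded/NB2HI4B2F4XXE4DJMFUS4Y3PNU======.gz
	// (the .gz extension is assumed; use whatever linkcrawler actually wrote)
	path := os.Args[1]

	// The file name is the standard Base32 encoding of the page's URL.
	name := strings.TrimSuffix(filepath.Base(path), filepath.Ext(path))
	urlBytes, err := base32.StdEncoding.DecodeString(name)
	if err != nil {
		panic(err)
	}
	fmt.Println("URL:", string(urlBytes))

	// The contents are gzip-compressed; decompress to get the raw page.
	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	gz, err := gzip.NewReader(f)
	if err != nil {
		panic(err)
	}
	defer gz.Close()
	body, err := io.ReadAll(gz)
	if err != nil {
		panic(err)
	}
	fmt.Printf("decompressed %d bytes\n", len(body))
}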

Dump the current list of links

To dump the current database, just use

$ linkcrawler --server http://X.Y.Z.W:8050 dump http://rpiai.com
Wrote 32 links to NB2HI4B2F4XXE4DJMFUS4Y3PNU======.txt

License

MIT
