Heroshi documentation

Heroshi is a web crawler.

The goal of the project is to build a very fast, distributed web spider.

Current status:

  • low level HTTP client – alpha, usable for test runs
  • URL queue – not started

The project is under heavy development, so expect big changes.

Download

The Heroshi source code is hosted on GitHub, so you may use any of the following:

  • go get/install:

    go install github.com/temoto/heroshi/heroshi-worker
    
  • clone repository to hack around:

    git clone git://github.com/temoto/heroshi.git
    
  • or download the latest Heroshi master tarball.

Identity

Heroshi identifies itself with:

User-Agent: HeroshiBot/version (+http://temoto.github.com/heroshi/; temotor@gmail.com)
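
As an illustration only, here is a minimal sketch of how a Go HTTP client could attach this identity to a request using the standard net/http package. The helper names userAgent and newRequest and the version value "0.1" are assumptions made for the example; they are not taken from the Heroshi source.

```go
package main

import (
	"fmt"
	"net/http"
)

// userAgent builds the identity string shown above.
// The version argument is a placeholder, not a real Heroshi release number.
func userAgent(version string) string {
	return "HeroshiBot/" + version + " (+http://temoto.github.com/heroshi/; temotor@gmail.com)"
}

// newRequest is a hypothetical helper that creates a GET request
// carrying the crawler identity in the User-Agent header.
func newRequest(rawURL, version string) (*http.Request, error) {
	req, err := http.NewRequest("GET", rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent(version))
	return req, nil
}

func main() {
	req, err := newRequest("http://example.com/", "0.1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// No network traffic happens here; the request is only constructed.
	fmt.Println(req.Header.Get("User-Agent"))
}
```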

Load problems

A Heroshi worker opens no more than one concurrent connection to each domain:port. This is a very light load for a properly configured website, but the world is not perfect, and it may still hurt legacy installations.

Heroshi is not meant to be a tool for harm: it will not hit your servers again and again continuously. Instead, it waits for some time before visiting the same pages again.

So far, I believe I am the only one who runs Heroshi. If it loads your website too much, there is no need to ban the User-Agent or IP; just contact me, and I will set a limit for your website/domain/IP as low as you find acceptable.

Robots.txt support

Heroshi obeys standard robots.txt rules, as implemented by the Go robots.txt library.

To completely disallow Heroshi from crawling your site, place the following lines in a file accessible as /robots.txt on your site:

User-agent: HeroshiBot
Disallow: /

Contact information

Use this email address (it also works for XMPP/Jabber) for questions, demands, or reports about Heroshi: temotor@gmail.com

License

Heroshi is made available under the terms of the open source MIT license.
