[Crawler/Scraper for Golang] Make a Golang spider in 3 lines

Goribot

A Golang spider framework.

Chinese documentation

Features

  • Clean API
  • Caching
  • Extensions
  • Pipeline-style handle logic
  • Robots.txt support (via the RobotsTxt extension)
  • Request deduplication (via the ReqDeduplicate extension)

Example

A basic example:

package main

import (
    "fmt"
    "github.com/zhshch2002/goribot"
)

func main() {
    s := goribot.NewSpider()
    s.NewTask(
        goribot.MustNewGetReq("https://httpbin.org/get?Goribot%20test=hello%20world"),
        func(ctx *goribot.Context) {
            fmt.Println("got resp data", ctx.Text)
        })
    s.Run()
}

A complete bilibili.com video spider example is given under Another Example.

Getting started

Install

go get -u github.com/zhshch2002/goribot

Basic usage

Create a spider

s := goribot.NewSpider()

You can also initialize the spider with extensions, such as the RandomUserAgent extension:

s := goribot.NewSpider(goribot.RandomUserAgent())

New task

Create a request:

req := goribot.MustNewGetReq("https://httpbin.org/get?Goribot%20test=hello")
// or: req, err := goribot.NewGetReq("https://httpbin.org/get?Goribot%20test=hello")

// configure the request (http.Cookie comes from the net/http package)
req.Header.Set("test", "test")
req.Cookie = append(req.Cookie, &http.Cookie{
    Name:  "test",
    Value: "test",
})
req.Proxy = "http://127.0.0.1:1080"
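
For readers who know net/http, the settings above can be compared with their standard-library counterparts. The sketch below is only an analogy and makes no claim about how goribot applies these fields internally; buildRequest and proxiedClient are hypothetical helper names, and no request is actually sent.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// buildRequest mirrors the header and cookie settings using only net/http.
// Hypothetical helper for illustration, not a goribot function.
func buildRequest(rawurl string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawurl, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("test", "test")
	req.AddCookie(&http.Cookie{Name: "test", Value: "test"})
	return req, nil
}

// proxiedClient returns a client whose transport routes every request
// through the given proxy address.
func proxiedClient(proxy string) (*http.Client, error) {
	u, err := url.Parse(proxy)
	if err != nil {
		return nil, err
	}
	return &http.Client{Transport: &http.Transport{Proxy: http.ProxyURL(u)}}, nil
}

func main() {
	req, err := buildRequest("https://httpbin.org/get?Goribot%20test=hello")
	if err != nil {
		panic(err)
	}
	if _, err := proxiedClient("http://127.0.0.1:1080"); err != nil {
		panic(err)
	}
	// Nothing is sent here; the request is only inspected.
	fmt.Println(req.Header.Get("test"), req.Header.Get("Cookie"))
}
```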

Add the request to the spider's task queue:

var thirdHandler func(*goribot.Context)
thirdHandler = func(ctx *goribot.Context) {
    // do something with the response
}

s.NewTask(
    req, // the request you have created
    func(ctx *goribot.Context) {
        // first handler
        fmt.Println("got resp data", ctx.Text)
    },
    func(ctx *goribot.Context) {
        // second handler: handlers can be chained, and the same handler
        // can be reused for different request tasks
        fmt.Println("got resp data", ctx.Text)
    },
    thirdHandler,
)

Context

Context is the only parameter a handler receives. From it you can read the HTTP response or the original request, and you can use ctx to send new request tasks to the spider.

type Context struct {
    Text string                 // the response text
    Html *goquery.Document      // the spider tries to parse the response as HTML
    Json map[string]interface{} // the spider tries to parse the response as JSON

    Request  *Request  // the original request
    Response *Response // the response object

    Tasks []*Task                // new request tasks to send to the spider
    Items []interface{}          // new result items to send to the spider, used for storage
    Meta  map[string]interface{} // key-value pairs attached to a task created with NewTaskWithMeta

    drop bool // in a handler chain, call ctx.Drop() to break the chain and stop handling
}
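
The interaction between a handler chain and the drop flag can be sketched as follows. This is a minimal stand-alone model, not goribot's actual scheduler: miniContext, handler and runChain are hypothetical names, and the assumption is simply that handlers run in order until one of them calls Drop.

```go
package main

import "fmt"

// miniContext is a stand-in for goribot.Context holding only the fields
// this sketch needs.
type miniContext struct {
	Text string
	Meta map[string]interface{}
	drop bool
}

// Drop marks the context so that no later handler in the chain runs.
func (c *miniContext) Drop() { c.drop = true }

type handler func(*miniContext)

// runChain calls the handlers in order and stops once one of them has
// called Drop. It returns how many handlers actually ran.
func runChain(ctx *miniContext, handlers ...handler) int {
	ran := 0
	for _, h := range handlers {
		if ctx.drop {
			break
		}
		h(ctx)
		ran++
	}
	return ran
}

func main() {
	ctx := &miniContext{Text: "", Meta: map[string]interface{}{"test": 1}}
	ran := runChain(ctx,
		func(c *miniContext) {
			if c.Text == "" {
				c.Drop() // empty body: skip the remaining handlers
			}
		},
		func(c *miniContext) { fmt.Println("not reached for an empty body") },
	)
	fmt.Println("handlers run:", ran)
}
```

Keeping the flag on the context rather than returning a value from each handler lets every handler keep the same single-parameter signature.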

Create a new task inside a handler function, or attach metadata to a task:

s.NewTaskWithMeta(goribot.MustNewGetReq("https://httpbin.org/get"), map[string]interface{}{
    "test": 1,
}, func(ctx *goribot.Context) {
    fmt.Println(ctx.Meta["test"]) // read the metadata

    // warning: this is ctx.NewTaskWithMeta, not s.NewTaskWithMeta!
    ctx.NewTaskWithMeta(goribot.MustNewGetReq("https://httpbin.org/get"), map[string]interface{}{
        "test": 2,
    }, func(ctx *goribot.Context) {
        fmt.Println(ctx.Meta["test"]) // read the metadata
    })
})

Tip: s.NewTaskWithMeta and ctx.NewTaskWithMeta behave differently when you use extensions or spider hook functions.

Run it!

Call s.Run() to run the spider.

Using hook functions and writing extensions

To be written.

Another Example

A bilibili video spider:

package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/zhshch2002/goribot"
    "log"
    "strings"
)

type BiliVideoItem struct {
    Title, Url string
}

func main() {
    s := goribot.NewSpider(goribot.HostFilter("www.bilibili.com"), goribot.ReqDeduplicate(), goribot.RandomUserAgent())
    s.DepthFirst = false
    s.ThreadPoolSize = 1

    var biliVideoHandler, getNewLinkHandler func(ctx *goribot.Context)

    getNewLinkHandler = func(ctx *goribot.Context) {
        ctx.Html.Find("a[href]").Each(func(i int, selection *goquery.Selection) {
            rawurl, _ := selection.Attr("href")
            if !strings.HasPrefix(rawurl, "/video/av") {
                return
            }
            u, err := ctx.Request.Url.Parse(rawurl)
            if err != nil {
                return
            }
            u.RawQuery = ""
            if strings.HasSuffix(u.Path, "/") {
                u.Path = u.Path[0 : len(u.Path)-1]
            }
            //log.Println(u.String())
            if r, err := goribot.NewGetReq(u.String()); err == nil {
                ctx.NewTask(r, getNewLinkHandler, biliVideoHandler)
            }
        })
    }

    biliVideoHandler = func(ctx *goribot.Context) {
        ctx.AddItem(BiliVideoItem{
            Title: ctx.Html.Find("title").Text(),
            Url:   ctx.Request.Url.String(),
        })
    }

    s.NewTask(goribot.MustNewGetReq("https://www.bilibili.com/video/av66703342"), getNewLinkHandler, biliVideoHandler)
    

    s.OnItem(func(ctx *goribot.Context, i interface{}) interface{} {
        log.Println(i) // you can store the data here
        return i
    })

    s.Run()
}
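
The link handling in getNewLinkHandler matters because deduplication can only recognise two links as the same page if they end up as the same string. The sketch below isolates that step using only the standard library. normalize and uniqueLinks are hypothetical helpers, and the in-memory set is a simplification that does not reflect how the ReqDeduplicate extension is implemented.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalize resolves href against base, then drops the query string and
// any trailing slash so that equivalent links compare equal.
func normalize(base *url.URL, href string) (string, error) {
	u, err := base.Parse(href)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

// uniqueLinks returns the normalized video links found in hrefs,
// in first-seen order, skipping non-video links and duplicates.
func uniqueLinks(pageURL string, hrefs []string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}
	seen := make(map[string]struct{})
	var out []string
	for _, href := range hrefs {
		if !strings.HasPrefix(href, "/video/av") {
			continue
		}
		link, err := normalize(base, href)
		if err != nil {
			continue
		}
		if _, ok := seen[link]; ok {
			continue
		}
		seen[link] = struct{}{}
		out = append(out, link)
	}
	return out, nil
}

func main() {
	hrefs := []string{"/video/av1/?from=search", "/video/av1", "/read/cv2", "/video/av3?spm=x"}
	links, err := uniqueLinks("https://www.bilibili.com/video/av66703342", hrefs)
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```

With the sample input, the first two hrefs collapse into one link and the non-video link is skipped, so two links are printed.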

License

FOSSA Status
