# scraper

A dual-interface Go module for building simple web scrapers

### Features
- Go struct-tag interface
- Command-line interface
- HTML⇒JSON API server
- Single binary
- Simple configuration
- Zero-downtime config reload with `kill -s SIGHUP <scraper-pid>`
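For example, a minimal sketch of a reload, assuming `pgrep` is available and the process is named `scraper` (both assumptions for illustration):

```sh
# edit the config file, then reload it with zero downtime
kill -s SIGHUP $(pgrep scraper)
```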
### Install

**Binaries**

See the latest release or download it with this one-liner:

```sh
curl https://i.jpillora.com/scraper | bash
```

**Source**

```sh
$ go get -v github.com/jpillora/scraper
```
### Go struct-tag interface

```go
package main

import (
	"fmt"
	"log"

	"github.com/jpillora/scraper/scraper"
)

func main() {
	// result maps each search result's fields to extractors.
	type result struct {
		Title string `scraper:"h3 span"`
		URL   string `scraper:"a[href] | @href"`
	}
	// google defines the endpoint: the URL template, the list
	// selector producing one result per match, and the "query"
	// input parameter.
	type google struct {
		URL    string   `scraper:"https://www.google.com/search?q={{query}}"`
		Result []result `scraper:"#rso div[class=g]"`
		Query  string   `scraper:"query"`
	}
	g := google{Query: "hello world"}
	if err := scraper.Execute(&g); err != nil {
		log.Fatal(err)
	}
	for i, r := range g.Result {
		fmt.Printf("#%d: '%s' => %s\n", i+1, r.Title, r.URL)
	}
}
```
which prints:

```
#1: 'Helloworld Travel – Deals on Accommodation, Flights ...' => https://www.helloworld.com.au/
#2: '"Hello, World!" program - Wikipedia' => https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
#3: 'Helloworld Travel - Wikipedia' => https://en.wikipedia.org/wiki/Helloworld_Travel
#4: 'Helloworld Travel Limited' => https://www.helloworldlimited.com.au/
#5: 'Total immersion, Serious fun! with Hello-World!' => https://www.hello-world.com/
#6: 'Helloworld Travel - Home | Facebook' => https://www.facebook.com/helloworldau/
```
### Command-line interface

Given `google.json`:

```json
{
  "/search": {
    "url": "https://www.google.com/search?q={{query}}",
    "list": "#rso div[class=g]",
    "result": {
      "title": "h3 span",
      "url": ["a[href]", "@href"]
    }
  }
}
```
start the API server:

```sh
$ scraper google.json
2015/05/16 20:10:46 listening on 3000...
```
then query it:

```sh
$ curl "localhost:3000/search?query=hellokitty"
[
  {
    "title": "Official Home of Hello Kitty \u0026 Friends | Hello Kitty Shop",
    "url": "http://www.sanrio.com/"
  },
  {
    "title": "Hello Kitty - Wikipedia, the free encyclopedia",
    "url": "http://en.wikipedia.org/wiki/Hello_Kitty"
  },
  ...
```
### Configuration

Each scraper endpoint is configured with the following structure:

```
{
  <path>: {
    "method": <method>,
    "url": <url>,
    "list": <selector>,
    "result": {
      <field>: <extractor>,
      <field>: [<extractor>, <extractor>, ...],
      ...
    }
  }
}
```
`<path>`

- **Required** The path of the scraper
- Accessible at `http://<host>:<port>/<path>`
- You may define path variables like `my/path/:var`; when the path is set to `/my/path/foo` then `:var = "foo"` (see the example after this list)

`<url>`

- **Required** The URL of the remote server to scrape
- It may contain template variables in the form `{{ var }}`; scraper will look for a `var` path variable and, if not found, it will then look for a `var` query parameter

`result`

- **Required** Represents the resulting JSON object, built by executing each `<extractor>` on the current DOM context. A field may use a sequence of `<extractor>`s to perform more complex queries.

`<method>`

- The HTTP request method (defaults to `GET`)

`<extractor>`

- A string which must be one of:
  - a regex in the form `/abc/` - searches the text of the current DOM context (extracts the first group when provided)
  - a regex in the form `s/abc/xyz/` - searches the text of the current DOM context and replaces matches with the provided text (sed-like syntax)
  - an attribute in the form `@abc` - gets the attribute `abc` from the DOM context
  - a function in the form `html()` - gets the DOM context as a string
  - a function in the form `trim()` - trims space from the beginning and the end of the string
  - a query param in the form `query-param(abc)` - parses the current context as a URL and extracts the provided param
  - a css selector `abc` (if not in the forms above) - alters the DOM context

`list`

- **Optional** A css selector used to split the root DOM context into a set of DOM contexts. Useful for capturing search results.
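For instance, a hedged sketch reusing the `google.json` selectors from above: the `:query` path variable fills the `{{query}}` template, and the `url` field chains a selector, an attribute extractor and `trim()` (the trailing `trim()` is purely illustrative):

```json
{
  "/search/:query": {
    "url": "https://www.google.com/search?q={{query}}",
    "list": "#rso div[class=g]",
    "result": {
      "title": "h3 span",
      "url": ["a[href]", "@href", "trim()"]
    }
  }
}
```

With this config, `curl localhost:3000/search/hellokitty` would resolve `{{query}}` from the path variable rather than from a query parameter.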
### Programmatic configuration

Replace `<variable>` with your configuration, documented above.
- Define your endpoint struct:

  ```go
  type endpoint struct {
  	Method  string   `scraper:"<method>"`
  	URL     string   `scraper:"<url>"`
  	Result  []result `scraper:"<list>"`
  	<param> string   `scraper:"<param>"`
  }
  ```
  `Method`, `URL`, `Result` and `Debug` are special fields; the remaining string fields are treated as input parameters. Input parameters use the field name with the first character lowercased by default (see the sketch after these steps).
- Define your result struct:

  ```go
  type result struct {
  	<field> string `scraper:"<extractor>"`
  	<field> string `scraper:"<extractor> | <extractor>"`
  }
  ```
  The result struct defines field-to-extractor mappings. All fields must be `string`s. Struct tags cannot contain arrays, so multiple `<extractor>`s are instead joined with `|`.
- Execute it:

  ```go
  e := endpoint{MyParam: "hello world"}
  if err := scraper.Execute(&e); err != nil {
  	...
  }
  // e.Result is now set
  ```
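As a concrete sketch of these steps, with a hypothetical site and selectors, and assuming the default parameter naming applies when the `scraper` tag is omitted (per the note above):

```go
// item maps each list entry's fields to extractors; the Link field
// chains two extractors joined with "|".
type item struct {
	Title string `scraper:"h2 a"`
	Link  string `scraper:"h2 a | @href"`
}

// search is a hypothetical endpoint. Query has no tag, so it is
// assumed to become the input parameter "query" (field name with
// the first character lowercased).
type search struct {
	URL    string `scraper:"https://example.com/search?q={{query}}"`
	Result []item `scraper:"div.result"`
	Query  string
}

s := search{Query: "hello"}
if err := scraper.Execute(&s); err != nil {
	log.Fatal(err)
}
// s.Result now holds one item per matched div.result
```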
### Similar projects

- https://github.com/ernesto-jimenez/scraperboard