caustic lets you build scrapers in JSON and run them on an Android device.

caustic

portable scraper templates for mobile apps

getting started


The easiest way to try out caustic is the precompiled utility. Run

$ ./caustic '{"load":"http://www.google.com","then":{"find":"Feeling\\s[\\w]*","name":"Feeling?"}}'

in the terminal of your choice. This executes the JSON instruction

{
  "load"  : "http://www.google.com",
  "then" : {
    "find" : "Feeling\\s[\\w]*",
    "name" : "Feeling?"
  }
}

and sends the results to stdout

scope source name value
1 0 Feeling? Feeling Lucky
2 0 Feeling? Feeling Lucky

First, caustic loads the URL in load. Then it looks for the regular expression in find, and saves all matches.
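As a rough sketch of the find step (written in Python rather than taken from caustic itself, with a made-up page body standing in for the real response), saving all matches amounts to collecting every regex hit. Python's `re` syntax is not identical to the regex engine caustic uses, but this particular pattern behaves the same way in both.

```python
import json
import re

# The instruction from above. JSON unescaping turns "\\s" into the regex "\s".
instruction = json.loads(
    r'{"load":"http://www.google.com","then":{"find":"Feeling\\s[\\w]*","name":"Feeling?"}}'
)

# Hypothetical page body; the real response from the URL in "load" will differ.
page = "I'm Feeling Lucky | I'm Feeling Lucky"

# Collect every non-overlapping match of the "find" pattern.
pattern = re.compile(instruction["then"]["find"])
matches = [m.group(0) for m in pattern.finditer(page)]

for value in matches:
    print(instruction["then"]["name"], value)
```

Each match becomes one row under the name given in name.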

the instruction format


Caustic's instructions are logic-free JSON objects that act as dynamic templates for scraping data. By default, substitutions are done for text inside double curlies {{}}, much like mustache.

All caustic instructions are built from finds and loads.

Here's a simple instruction, which is one of the demos:

{
 "load" : "http://www.google.com/search?q={{query}}",
 "then"  : {
   "find"    : "{{query}}\\s+(\\w+)",
   "replace" : "I say '$1'!",
   "name"    : "what do you say after '{{query}}'?"
 }
}

For caustic to execute this instruction, it needs a value to substitute for {{query}}. Run the following

$ ./caustic demos/simple-google.json --input="query=hello"

to replace {{query}} with hello. We get the following

scope source name value
0 query hello
1 0 what do you say after 'hello'? I say 'kitty'!
2 0 what do you say after 'hello'? I say 'lyrics'!
3 0 what do you say after 'hello'? I say 'lionel'!
4 0 what do you say after 'hello'? I say 'kitty'!
5 0 what do you say after 'hello'? I say 'beyonce'!
6 0 what do you say after 'hello'? I say 'beyonce'!
7 0 what do you say after 'hello'? I say 'glee'!
8 0 what do you say after 'hello'? I say 'movie'!

Not only is Google queried for hello, but the substitution also affects the name and replace of find.

We can also see that find can match multiple times.

We can use backreferences from $0 to $9 in replace.
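Here is a minimal Python sketch of how substitution and backreferences could fit together. The helper names (`substitute`, `to_python_replacement`, `find_all`) and the sample text are invented for illustration and are not part of caustic. The sketch also assumes substituted values are dropped into the pattern unescaped, which may not match what caustic does.

```python
import re

def substitute(template, values):
    """Replace each {{name}} with values[name]; leave unknown names untouched."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: values.get(m.group(1), m.group(0)), template)

def to_python_replacement(replace):
    """Translate $0..$9 backreferences into Python's \\g<n> syntax."""
    escaped = replace.replace("\\", "\\\\")  # keep literal backslashes literal
    return re.sub(r"\$(\d)", lambda m: "\\g<" + m.group(1) + ">", escaped)

def find_all(find, replace, text, values):
    """Apply a substituted find/replace pair to text, one result per match."""
    pattern = re.compile(substitute(find, values))
    template = to_python_replacement(substitute(replace, values))
    return [m.expand(template) for m in pattern.finditer(text)]

values = {"query": "hello"}
text = "hello kitty, hello lyrics"  # hypothetical page body
results = find_all(r"{{query}}\s+(\w+)", "I say '$1'!", text, values)
print(substitute("what do you say after '{{query}}'?", values), results)
```

The same `substitute` call is applied to find, replace and name alike, which is why one input value shows up in all three.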

advanced substitutions


Substitutions are a powerful tool because they develop over the course of execution. Any name that appears in curlies will be substituted once a value has been found for it.

This demo

{
  "load" : "http://www.google.com/search?q={{query}}",
  "then"  : {
    "find"    : "{{query}}\\s+(\\w+)",
    "replace" : "$1",
    "name"    : "after",
    "then" : {
      "load" : "http://www.google.com/search?q={{after}}",
      "then" : {
        "find"    : "{{after}}\\s+(\\w+)",
        "replace" : "I say '$1'!",
        "name"    : "what do you say after '{{after}}'?"
      }
    }
  }
}

takes advantage of dynamic substitution, along with the ability to place any number of load or find instructions inside then. It launches a whole new series of queries!
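To make the nesting concrete, here is a toy Python interpreter for instructions of this shape. It is only a sketch under stated assumptions: `run` and `substitute` are hypothetical names, `pages` is a dict standing in for real HTTP requests, replace defaults to `$0` when omitted, and a find nested under another find is applied to the parent's match. None of this is confirmed caustic behavior.

```python
import re

def substitute(template, values):
    """Fill in {{name}} where a value is known; leave other placeholders alone."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: values.get(m.group(1), m.group(0)), template)

def run(instruction, values, pages, text=""):
    """Execute a nested load/find instruction and return (name, value) rows."""
    rows, scopes = [], []
    if "load" in instruction:
        # `pages` is a hypothetical url -> body dict standing in for HTTP.
        body = pages.get(substitute(instruction["load"], values), "")
        scopes.append((values, body))
    else:
        name = substitute(instruction["name"], values)
        replace = substitute(instruction.get("replace", "$0"), values)
        # Translate $0..$9 into Python's \g<n> group syntax.
        replace = re.sub(r"\$(\d)", lambda m: "\\g<" + m.group(1) + ">", replace)
        for match in re.finditer(substitute(instruction["find"], values), text):
            value = match.expand(replace)
            rows.append((name, value))
            # Each match opens a child scope where `name` now has a value.
            scopes.append(({**values, name: value}, value))
    then = instruction.get("then", [])
    children = then if isinstance(then, list) else [then]
    for child_values, child_text in scopes:
        for child in children:
            rows += run(child, child_values, pages, child_text)
    return rows

demo = {
    "load": "http://www.google.com/search?q={{query}}",
    "then": {
        "find": r"{{query}}\s+(\w+)", "replace": "$1", "name": "after",
        "then": {
            "load": "http://www.google.com/search?q={{after}}",
            "then": {"find": r"{{after}}\s+(\w+)", "replace": "I say '$1'!",
                     "name": "what do you say after '{{after}}'?"},
        },
    },
}
pages = {  # invented page bodies
    "http://www.google.com/search?q=hello": "hello kitty hello lyrics",
    "http://www.google.com/search?q=kitty": "kitty cafe kitty island",
}
rows = run(demo, {"query": "hello"}, pages)
for row in rows:
    print(*row)
```

Every match of the outer find defines after in its own child scope, so the inner load runs once per match. That fan-out is where the extra queries come from.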

Try it with

$ ./caustic demos/complex-google.json --input="query=hello"

You'll see that this results in several dozen rows, but here are some highlights:

scope source name value
48 14 what do you say after 'beyonce'? I say 'wedding'!
49 14 what do you say after 'beyonce'? I say 'songs'!
50 14 what do you say after 'beyonce'? I say 'youtube'!
51 14 what do you say after 'beyonce'? I say 'jay'!
52 14 what do you say after 'beyonce'? I say 'diet'!
53 14 what do you say after 'beyonce'? I say 'albums'!
54 14 what do you say after 'beyonce'? I say 'biography'!
55 14 what do you say after 'beyonce'? I say 'lyrics'!
56 15 what do you say after 'glee'? I say 'episodes'!
57 15 what do you say after 'glee'? I say 'tv'!
58 15 what do you say after 'glee'? I say 'spoilers'!
59 15 what do you say after 'glee'? I say 'songs'!
60 15 what do you say after 'glee'? I say 'soundtrack'!
61 15 what do you say after 'glee'? I say 'cast'!
62 15 what do you say after 'glee'? I say 'wiki'!
63 16 what do you say after 'movie'? I say 'download'!

Note that the source column links each find result back to the scope it inherits from.
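One way to picture the scope and source columns is as a parent pointer: to resolve a name, start at a scope and walk up through source until some row defines it. The rows and the `lookup` helper below are invented for illustration and do not come from a real run.

```python
# Hypothetical rows in (scope, source, name, value) layout.
rows = [
    (0, None, "query", "hello"),
    (1, 0, "after", "kitty"),
    (2, 1, "what do you say after 'kitty'?", "I say 'cafe'!"),
]

def lookup(rows, scope, name):
    """Walk from `scope` up through its sources until `name` is defined."""
    by_scope = {row[0]: row for row in rows}
    while scope is not None:
        _, source, row_name, value = by_scope[scope]
        if row_name == name:
            return value
        scope = source  # move to the scope this one inherits from
    return None

print(lookup(rows, 2, "query"))  # resolved two levels up
print(lookup(rows, 2, "after"))  # resolved one level up
```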

references


You probably noticed that the interior portion of the last demo was basically copy-and-pasted from the demo before it. Wouldn't it be nice if we could reuse instruction components?

This demo does just that

{
  "load" : "http://www.google.com/search?q={{query}}",
  "then"  : {
    "find"    : "{{query}}\\s+(\\w+)",
    "replace" : "$1",
    "name"    : "after",
    "then"    : "simple-google.json"
  }
}

Running

$ ./caustic demos/complex-google.json --input="query=hello"

should give you the same results as before. Any string appearing inside then will be evaluated as a reference.
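A reference can be thought of as a placeholder that gets swapped for the parsed contents of the file it names. The Python sketch below uses a hypothetical in-memory `templates` dict instead of the file system, and the `expand` helper is not part of caustic. It expands eagerly, so it would loop forever on a template that refers to itself.

```python
import json

# Hypothetical stand-in for the demos directory.
templates = {
    "simple-google.json": json.dumps({
        "load": "http://www.google.com/search?q={{query}}",
        "then": {"find": "{{query}}\\s+(\\w+)", "replace": "I say '$1'!",
                 "name": "what do you say after '{{query}}'?"},
    }),
}

def expand(instruction, templates):
    """Return a copy of the instruction with string references replaced."""
    if isinstance(instruction, str):
        # A bare string is treated as a reference to another template.
        return expand(json.loads(templates[instruction]), templates)
    if isinstance(instruction, list):
        return [expand(item, templates) for item in instruction]
    result = dict(instruction)
    if "then" in result:
        result["then"] = expand(result["then"], templates)
    return result

outer = {
    "load": "http://www.google.com/search?q={{query}}",
    "then": {"find": "{{query}}\\s+(\\w+)", "replace": "$1",
             "name": "after", "then": "simple-google.json"},
}
expanded = expand(outer, templates)
print(json.dumps(expanded, indent=2))
```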

remote templates


Templates can be accessed remotely. Running

$ ./caustic https://github.com/talos/caustic/blob/master/demos/simple-google.json --input="query=hello"

will run the simple-google demo. References can be remote too, even when the referencing file is local. The prior demo works the same if you change its then to read https://github.com/talos/caustic/blob/master/demos/simple-google.json
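If references are resolved the way relative URIs usually are (an assumption, not something this README states), the lookup could be sketched with Python's standard `urljoin`. The `resolve` wrapper is a hypothetical name.

```python
from urllib.parse import urljoin

def resolve(base, reference):
    """Resolve a reference against the location of the referencing template."""
    return urljoin(base, reference)

remote_base = "https://github.com/talos/caustic/blob/master/demos/complex-google.json"
remote_ref = "https://github.com/talos/caustic/blob/master/demos/simple-google.json"

# A relative reference next to a local file stays local.
local = resolve("demos/complex-google.json", "simple-google.json")
# A relative reference next to a remote template becomes remote.
remote = resolve(remote_base, "simple-google.json")
# An absolute reference wins regardless of where the referencing file lives.
mixed = resolve("demos/complex-google.json", remote_ref)

print(local, remote, mixed, sep="\n")
```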

recursion


What if you want a scraper to run itself? No problem:

{
  "load"  : "http://www.google.com/search?q={{query}}",
  "then" : {
    "find"     : "{{query}}\\s+(\\w+)",
    "replace" : "$1",
    "name"   : "query",
    "then"   : "$this"
  }
}

When it appears inside then, $this evaluates to the entire object. It is evaluated only when then runs.

Remember that

$ ./caustic demos/recursive-google.json --input="query=hello"

will not stop on its own!
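To see why, here is the recursive template unrolled by hand in Python. The `crawl` function, the `pages` dict and the `max_depth` guard are all hypothetical. The guard is not a caustic feature; it is there only so the sketch terminates.

```python
import re

def crawl(query, pages, max_depth, depth=0):
    """Follow query -> next word -> new query, as the recursive template does."""
    if depth >= max_depth:
        return []  # hypothetical guard; without it the cycle below never ends
    body = pages.get(query, "")  # stands in for loading the search page
    rows = []
    for word in re.findall(query + r"\s+(\w+)", body):
        rows.append((depth, word))
        # Same effect as "then": "$this": run the whole thing again
        # with the match stored under the name "query".
        rows += crawl(word, pages, max_depth, depth + 1)
    return rows

# Invented page bodies that form a cycle: hello -> kitty -> hello -> ...
pages = {"hello": "hello kitty", "kitty": "kitty hello"}
rows = crawl("hello", pages, max_depth=4)
print(rows)
```

As long as each page yields at least one new match, there is always another query to run.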

Why?


Caustic is designed to give wider access to obscure public data. The caustic format makes it easy to quickly design and test a scraper that extracts a few pieces of information from behind several layers of obfuscation.
