diff --git a/docs/guide.markdown b/docs/guide.markdown new file mode 100644 index 0000000..4f80e9d --- /dev/null +++ b/docs/guide.markdown @@ -0,0 +1,563 @@ +A Guide to the Parsatron +======================== + +The Parsatron is a library for building parsers for languages. For an overview +of how it works internally you can watch [this talk][talk]. + +This document will show you the basics of how to use the Parsatron as an end +user. + +[talk]: http://www.infoq.com/presentations/Parser-Combinators + +Importing +--------- + +Assuming you have the library installed, you can grab all the things you'll +need by using it: + + (ns myparser.core + (:refer-clojure :exclude [char]) + (:use [the.parsatron])) + +Notice the exclusion of `clojure.core/char`, which would otherwise collide with +the `char` imported from the parsatron. + +You can, of course, use `:only` if you want, though that can get tedious very +quickly. + +Running +------- + +Let's see how to run a basic parser. It won't do much, but it will get +something on the screen so we can try things as we go. Assuming you've got +everything imported: + + (run (char \H) "Hello, world!") + ; \H + +The `run` function takes a parser and some input, runs the parser on that +input, and returns the result. + +The parser we passed here was `(char \H)`. We'll talk more about parsers in a +second, but for now just know that it's a parser that will parse a single "H" +character. + +Notice that it only parsed the first character, and even though there was more +left it still successfully returned. We'll talk about how to make sure that +there's no remaining input later. + +Input +----- + +We passed a string as the input to `run` in our first example, but the input +doesn't necessarily have to be a string. It can be any sequence. For example, +this works: + + (run (token #{1 2}) [1 "cats" :dogs]) + ; 1 + +The `(token #{1 2})` is a parser that matches the *integer* 1 or the *integer* +2, and we've passed it a vector of things. + +Errors +------ + +If the parser you give to `run` can't parse the input successfully, a +RuntimeException will be thrown: + + (run (char \q) "Hello, world!") + ; RuntimeException Unexpected token 'H' at line: 1 column: 1 ... + +The exception will tell you the line and column of the error, which is usually +quite helpful. + +Parsers +------- + +Now that we've got the basics, it's time to talk about how to create new +parsers. + +A "parser" is, technically, a function that takes 5 arguments and returns +a special value, but you don't need to worry about that yet. What you *do* need +to worry about is how to create them and combine them. + +When we ran `(char \H)` in the first example, it returned a parser. `char` +itself is a *function* that, when given a character, creates a parser that +parses that character. + +Read that again and make sure you understand it before moving on. `char` is +not a parser. It's a function that creates parsers. Character goes in, parser +comes out: + + (def h-parser (char \h)) + (run h-parser "hi") + ; \h + +Basic Built-In Parsers +---------------------- + +There are a few other basic parser-creating functions that you'll probably find +useful, which we'll talk about now. + +### token + +`token` creates parsers that match single items from the input stream (which +are characters if the input stream happens to be a string). You give it a +predicate, and it returns a parser that parses and returns items that match the +predicate. For example: + + (defn less-than-five [i] + (< i 5)) + + (run (token less-than-five) + [3]) + ; 3 + +The predicate can be any function, so things like anonymous functions and sets +work well. + +### char + +We've already seen `char`, which creates parsers that parse and return a +single, specific character. + + (run (char \H) "Hello, world!") + ; \H + +### any-char + +`any-char` creates parsers that will parse and return any character. Remember +that we can use the parsatron to parse more than just strings: + + (run (any-char) "Cats") + ; \C + + (run (any-char) [\C \a \t \s]) + ; \C + + (run (any-char) [1 2 3]) + ; RuntimeException... + +### letter and digit + +`letter` and `digits` create parsers that parse and return letter characters +(a-z and A-Z) and digit characters (0-9) respectively. + + (run (letter) "Dogs") + ; \D + + (run (digit) "100") + ; \1 + +Note that digit works with *character* objects. It won't work with actual +integers: + + (run (digit) [10 20 30]) + ; RuntimeException... + +If you want a parser that matches numbers in a non-string input sequence, use +`token` and the Clojure builtin function `number?` to make it: + + (run (token number?) [10 20 30]) + ; 10 + +### string + +`string` creates parsers that parse and return a sequence of characters given +as a string: + + (run (string "Hello") "Hello, world!") + ; "Hello" + +Note that this is the first time we've seen a parser that consumes more than +one item in the input sequence. + +### eof + +`eof` creates parsers that ensure the input stream doesn't contain anything else: + + (run (eof) "") + ; nil + + (run (eof) "a") + ; RuntimeException... + +On its own it's not very useful, but we'll need it once we learn how to combine +parsers. + +Combining Parsers +----------------- + +The Parsatron wouldn't be very useful if we could only ever parse one thing at +a time. There are a number of ways you can combine parsers to build up complex +ones from basic parts. + +### >> + +The `>>` macro is the simplest way to combine parsers. It takes any number of +parsers and creates a new parser. This new parser runs them in order and +returns the value of the last one. + +Again, `>>` takes *parsers* and returns a new *parser*. We'll see this many +times in this section. + +Here's an example: + + (def my-parser (>> (char \a) + (digit))) + + (run my-parser "a5") + ; \5 + + (run my-parser "5a") + ; RuntimeException... + + (run my-parser "b5") + ; RuntimeException... + + (run my-parser "aq") + ; RuntimeException... + +We create a parser from two other parsers with `>>` and run it on some input. +`>>` runs its constituent parsers in order, and they all have to match for it +to parse successfully. + +Now that we can combine parsers, we can also ensure that there's no garbage +after the stuff we parse by using `eof`: + + (run (>> (digit) (eof)) "1") + ; nil + + (run (>> (digit) (eof)) "1 cat") + ; RuntimeException... + +### times + +The next way to combine parsers (or, really, a parser with itself) is the +`times` function. + +`times` is a function that takes a count and a parser, and returns a parser that +repeats the one you gave it the specified number of times and returns the +results concatenated into a sequence. + +For example: + + (run (times 5 (letter)) "Hello, world!") + ; (\H \e \l \l \o) + +This is different than `(>> (letter) (letter) (letter) (letter) (letter))` +because it returns *all* of the parsers' results, not just the last one. + +### many + +`many` is the first creator of "open-ended" parsers we've seen. It's a function +that takes a parser and returns a new parser that will parse zero or more of the +one you gave it, and return the results concatenated into a sequence. + +For example: + + (run (many (digit)) "100 cats") + ; (\1 \0 \0) + +Now we can start to build much more powerful parsers: + + (def number-parser (many (digit))) + (def whitespace-parser (many (token #{\space \newline \tab}))) + + (run (>> number-parser whitespace-parser number-parser) "100 400") + ; (\4 \0 \0) + +We still need to talk about how to get more than just the last return value, but +that will come later. + +### many1 + +`many1` is just like `many`, except that the parsers it creates require at least +one item. It's like `+` in a regular expression instead of `*`. + + (def number-parser (many (digit))) + (def number-parser1 (many1 (digit))) + + (run number-parser "") + ; [] + + (run number-parser "100") + ; (\1 \0 \0) + + (run number-parser1 "") + ; RuntimeException... + + (run number-parser1 "100") + ; (\1 \0 \0) + +### choice + +`choice` takes one or more parsers and creates a parser that will try each of +them in order until one parses successfully, and return its result. For example: + + (def number (many1 (digit))) + (def word (many1 (letter))) + + (def number-or-word (choice number word)) + + (run number-or-word "dog") + ; (\d \o \g) + + (run number-or-word "42") + ; (\4 \2) + +Notice that we used `many1` when defining the parsers `number` and `word`. If +we had used `many` then this would always parse as a number because if there +were no digits it would successfully return an empty sequence. + +### between + +`between` is a function that takes three parsers, call them left, right, and +center. It creates a parser that parses them in left - center - right order and +returns the result of center. + +This is a convenient way to handle things like parentheses: + + (def whitespace-char (token #{\space \newline \tab})) + (def optional-whitespace (many whitespace-char)) + + (def open-paren (char \()) + (def close-paren (char \))) + + (def number (many1 (digit))) + + (run (between (>> open-paren optional-whitespace) + (>> optional-whitespace close-paren) + number) + "(123 )") + ; (\1 \2 \3) + +This example is a bit more complicated than we've seen so far, so slow down and +make sure you know what's going on. + +The three parsers we're giving to `between` are: + +1. `(>> open-paren optional-whitespace)` +2. `(>> optional-whitespace close-paren)` +3. `number` + +Once you're comfortable with this example, it's time to move on to the next +stage of parsing: building and returning values. + +Returning Values +---------------- + +So far we've looked at many ways to parse input. If you just need to validate +that input is in the correct format, but not *do* anything with it, you're all +set. But usually the goal of parsing something is to do things with it, so +let's look at how that works now. + +We've been using the word "returns" in a fast-and-loose fashion so far, but now +it's time to look a bit more closely at what it means in the Parsatron. + +### defparser and always + +When we looked at parsers created with `char` (like `(char \H)`) we said that +these parsers *returned* that character they parsed. That's not quite true. +They actually return a specially-wrapped value. + +If you want to know exactly what that special wrapping is, watch the [talk][]. +But you don't really need to understand the guts to use the Parsatron. You just +need to know how to create them. + +This is the first time we're going to be creating parsers that are more than +just simple combinations of existing ones. To do that we need to use a special +macro that handles setting them up properly: `defparser`. Look at the following +example (don't worry about what `always` is yet): + + (defparser sample [] + (string "Hello") + (always 42)) + +First of all, `defparser` doesn't define parsers. It defines functions that +*create* parsers, just like all of the ones we've seen so far. Yes, I know how +ridiculous that sounds. In practice it's only *slightly* confusing. + +So now we've got a function `sample` that we can use to create a parser by +calling it: + + (def my-sample-parser (sample)) + +Okay, now lets run it on some input: + + (run my-sample-parser "Hello, world!") + ; 42 + +There's a bunch of interesting things going on here, so let's slow down and take +a look. + +First, the parsers created by the functions `defparser` defines implicitely wrap +their bodies in `>>`, which as we've seen runs its argument parsers in order and +returns the last result. So our `(sample)` parser will run the "Hello" string +parser, and then the always parser (which it uses as the result). + +So what is this `always` thing? Well, remember at the beginning of this section +we said that parsers return a specially-wrapped value? `always` is a way to +simply stick a piece of data in this special wrapper so it can be the result of +a parser. + +Here's a little drawing that might help: + + raw input --> (run ...) --> raw output + | ^ + | | + | wrapped output + v | + (some parser) + +`run` takes the wrapped output from the parser and unwraps it for us before +returning it, which is why our `run` calls always gave us vanilla Clojure data +structures before. + +We're almost to the point where we can create full-featured parsers. The final +piece of the puzzle is a way to intercept results and make decisions inside of +our parsers. + +### let->> + +The `let->>` macro is the magic glue that's going to make creating your parsers +fun. In a nutshell, it lets you bind (unwrapped) parser results to names, which +you can then use normally. Let's just take a look at how it works: + + (defparser word [] + (many1 (letter))) + + (defparser greeting [] + (let->> [prefix (string "Hello, ") + name (word) + punctuation (choice (char \.) + (char \!))] + (if (= punctuation \!) + (always [(apply str name) :excited]) + (always [(apply str name) :not-excited])))) + + (run (greeting) "Hello, Cat!") + ; ["Cat" :excited] + + (run (greeting) "Hello, Dog.") + ; ["Dog" :not-excited] + +There's a lot happening here so let's look at it piece-by-piece. + +First we use `defparser` to make a `word` function for creating word parsers. +We could have done this with `(def word (many1 (letter)))` and then used it as +`word` later, but I find it's easier to just use `defparser` for everything. +That way we always get parsers the same way: by calling a function. + +Next we have our `greeting` parser (technically a function that makes a parser, +but you get the idea by now). Inside we have a `let->>` that runs three parsers +and binds their (unwrapped) results to names: + +1. `(string "Hello, ")` parses a literal string. `prefix` gets bound to the + string `"Hello, "`. +2. `(word)` parses one or more letters. `name` gets bound to the result, which + is a sequence of chars like `(\C \a \t)`. +3. `(choice (char \.) (char \!))` parses a period or exclamation point. + `punctuation` gets bound to the character that was parsed, like `\.` or `\!`. + +That's it for the binding section. Next we have the body of the `let->>`. This +needs to return a *wrapped* value, but we can do anything we like with our bound +variables to determine what to return. In this case we return different things +depending on whether the greeting ended with an exclamation point or not. + +Notice how the return values are wrapped in `(always ...)`. Also notice how all +the bound values have been unwrapped for us by `let->>`. `name` really is just +a sequence of characters which can be used with `(apply str ...)` as usual. + +You might wonder whether you can move the `(apply str ...)` into the `let->>` +binding form, so we don't have to do it twice. Unfortunately you can't. +**Every right hand side in a `let->>` binding form has to evaluate to a parser**. + +If you tried to do something like `(let->> [name (apply str (word))] ...)` it +wouldn't work for two reasons. First, `let->>` evaluates the right hand side +and expects the result to be a parser, which it then runs. So it would call +`(apply str some-word-parser)` and get a string back, which isn't a parser. + +Second, `let->>` unwraps the return value of `(word)` right before it binds it, +so even if the first problem weren't true, `(apply str ...)` would get a wrapped +value as its argument, which is not going to work. + +Of course, you can do anything you want in the *body* of a `let->>`, so this is +fine: + + (let->> [name (word)] + (let [name (apply str name)] + (always name))) + +`let` in this example is a vanilla Clojure `let`. + +Binding forms in a `let->>` are executed in order, and importantly, later forms +can refer to earlier ones. Look at this example: + + (defparser sample [] + (let->> [sign (choice (char \+) + (char \-)) + word (if (= sign \+) + (string "plus") + (string "minus"))] + (always [sign word]))) + + (run (sample) "+plus") + ; [\+ "plus"] + + (run (sample) "-minus") + ; [\- "minus"] + + (run (sample) "+minus") + ; RuntimeException... + +In this example, `sign` gets bound to the unwrapped result of the `choice` +parser, which is a character. Then we use that character to determine which +parser to use in the next binding. If the sign was a `\+`, we parse the string +`"plus"`. Likewise for minus. + +Notice how mixing the two in the last example produced an error. We saw the +`\+` and decided that we'd used the `(string "plus")` parser for the next input, +but it turned out to be `"minus"`. + +Tips and Tricks +--------------- + +That's about it for the basics! You now know enough to parse a wide variety of +things by building up complex parsers from very simple ones. + +Before you go, here's a few tips and tricks that you might find helpful. + +### You can parse more than just strings + +Remember that the Parsatron operates on sequences of input. These don't +necessarily have to be strings. + +Maybe you've got a big JSON response that you want to split apart. Don't try to +write a JSON parser from scratch, just use an existing one like [Cheshire][] and +then use the Parsatron to parse the Clojure datastructure(s) it sends back! + +[Cheshire]: https://github.com/dakrone/cheshire + +### You can throw away `let->>` bindings + +Sometimes you're writing a `let->>` form and encounter a value that you don't +really need to bind to a name. Instead of stopping the `let->>` and nesting +a `>>` inside it, just bind the value to a disposable name, like `_`: + + (defparser float [] + (let->> [integral (many1 (digit)) + _ (char \.) + fractional (many1 (digit))] + (let [integral (apply str integral) + fractional (apply str fractional)] + (always (Double/parseDouble (str integral "." fractional)))))) + + (run (float) "1.4") + ; 1.4 + + (run (float) "1.04") + ; 1.04 + + (run (float) "1.0400000") + ; 1.04