Add a basic guide to using the Parsatron.

commit a4410cee2e64cb76858e9234ad2f9bd7651367fa 1 parent fe6dd4b
Steve Losh authored

Showing 1 changed file with 563 additions and 0 deletions.

  1. +563 0 docs/guide.markdown
563 docs/guide.markdown
... ... @@ -0,0 +1,563 @@
  1 +A Guide to the Parsatron
  2 +========================
  3 +
  4 +The Parsatron is a library for building parsers for languages. For an overview
  5 +of how it works internally you can watch [this talk][talk].
  6 +
  7 +This document will show you the basics of how to use the Parsatron as an end
  8 +user.
  9 +
  10 +[talk]: http://www.infoq.com/presentations/Parser-Combinators
  11 +
  12 +Importing
  13 +---------
  14 +
  15 +Assuming you have the library installed, you can grab all the things you'll
  16 +need by using it:
  17 +
  18 + (ns myparser.core
  19 +   (:refer-clojure :exclude [char])
  20 +   (:use [the.parsatron]))
  21 +
  22 +Notice the exclusion of `clojure.core/char`, which would otherwise collide with
  23 +the `char` imported from the parsatron.
  24 +
  25 +You can, of course, use `:only` if you want, though that can get tedious very
  26 +quickly.
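
As a rough sketch of the `:only` form, assuming you only need the four names listed here (the selection is illustrative, not a recommendation), the namespace declaration might look like this:

```clojure
(ns myparser.core
  ;; `char` is still excluded because we refer the Parsatron's `char` below.
  (:refer-clojure :exclude [char])
  ;; Only these four names are pulled in; add more as your parser grows.
  (:use [the.parsatron :only [run char digit >>]]))

(run (>> (char \a) (digit)) "a5")
; \5
```

Every extra function or macro you use has to be added to the `:only` vector by hand, which is where the tedium comes from.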
  27 +
  28 +Running
  29 +-------
  30 +
  31 +Let's see how to run a basic parser. It won't do much, but it will get
  32 +something on the screen so we can try things as we go. Assuming you've got
  33 +everything imported:
  34 +
  35 + (run (char \H) "Hello, world!")
  36 + ; \H
  37 +
  38 +The `run` function takes a parser and some input, runs the parser on that
  39 +input, and returns the result.
  40 +
  41 +The parser we passed here was `(char \H)`. We'll talk more about parsers in a
  42 +second, but for now just know that it's a parser that will parse a single "H"
  43 +character.
  44 +
  45 +Notice that it only parsed the first character, and even though there was more
  46 +left it still successfully returned. We'll talk about how to make sure that
  47 +there's no remaining input later.
  48 +
  49 +Input
  50 +-----
  51 +
  52 +We passed a string as the input to `run` in our first example, but the input
  53 +doesn't necessarily have to be a string. It can be any sequence. For example,
  54 +this works:
  55 +
  56 + (run (token #{1 2}) [1 "cats" :dogs])
  57 + ; 1
  58 +
  59 +The `(token #{1 2})` is a parser that matches the *integer* 1 or the *integer*
  60 +2, and we've passed it a vector of things.
  61 +
  62 +Errors
  63 +------
  64 +
  65 +If the parser you give to `run` can't parse the input successfully, a
  66 +RuntimeException will be thrown:
  67 +
  68 + (run (char \q) "Hello, world!")
  69 + ; RuntimeException Unexpected token 'H' at line: 1 column: 1 ...
  70 +
  71 +The exception will tell you the line and column of the error, which is usually
  72 +quite helpful.
  73 +
  74 +Parsers
  75 +-------
  76 +
  77 +Now that we've got the basics, it's time to talk about how to create new
  78 +parsers.
  79 +
  80 +A "parser" is, technically, a function that takes 5 arguments and returns
  81 +a special value, but you don't need to worry about that yet. What you *do* need
  82 +to worry about is how to create them and combine them.
  83 +
  84 +When we ran `(char \H)` in the first example, it returned a parser. `char`
  85 +itself is a *function* that, when given a character, creates a parser that
  86 +parses that character.
  87 +
  88 +Read that again and make sure you understand it before moving on. `char` is
  89 +not a parser. It's a function that creates parsers. Character goes in, parser
  90 +comes out:
  91 +
  92 + (def h-parser (char \h))
  93 + (run h-parser "hi")
  94 + ; \h
  95 +
  96 +Basic Built-In Parsers
  97 +----------------------
  98 +
  99 +There are a few other basic parser-creating functions that you'll probably find
  100 +useful, which we'll talk about now.
  101 +
  102 +### token
  103 +
  104 +`token` creates parsers that match single items from the input stream (which
  105 +are characters if the input stream happens to be a string). You give it a
  106 +predicate, and it returns a parser that parses and returns items that match the
  107 +predicate. For example:
  108 +
  109 + (defn less-than-five [i]
  110 +   (< i 5))
  111 +
  112 + (run (token less-than-five)
  113 +      [3])
  114 + ; 3
  115 +
  116 +The predicate can be any function, so things like anonymous functions and sets
  117 +work well.
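
A small sketch of both kinds of predicate, assuming the same imports as earlier in this guide (the namespace name `myparser.tokens` is made up for the example):

```clojure
(ns myparser.tokens
  (:refer-clojure :exclude [char])
  (:use [the.parsatron]))

;; An anonymous function as the predicate.
(run (token #(> % 5)) [10 20])
; 10

;; A set as the predicate. Calling a set on an item returns the item when it
;; is a member and nil otherwise, so it behaves like a membership test.
(run (token #{\a \b}) "banana")
; \b
```

One caveat with sets: a set containing `nil` or `false` would return a falsey value for those members, so a function predicate is the safer choice there.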
  118 +
  119 +### char
  120 +
  121 +We've already seen `char`, which creates parsers that parse and return a
  122 +single, specific character.
  123 +
  124 + (run (char \H) "Hello, world!")
  125 + ; \H
  126 +
  127 +### any-char
  128 +
  129 +`any-char` creates parsers that will parse and return any character. Remember
  130 +that we can use the parsatron to parse more than just strings:
  131 +
  132 + (run (any-char) "Cats")
  133 + ; \C
  134 +
  135 + (run (any-char) [\C \a \t \s])
  136 + ; \C
  137 +
  138 + (run (any-char) [1 2 3])
  139 + ; RuntimeException...
  140 +
  141 +### letter and digit
  142 +
  143 +`letter` and `digit` create parsers that parse and return letter characters
  144 +(a-z and A-Z) and digit characters (0-9) respectively.
  145 +
  146 + (run (letter) "Dogs")
  147 + ; \D
  148 +
  149 + (run (digit) "100")
  150 + ; \1
  151 +
  152 +Note that digit works with *character* objects. It won't work with actual
  153 +integers:
  154 +
  155 + (run (digit) [10 20 30])
  156 + ; RuntimeException...
  157 +
  158 +If you want a parser that matches numbers in a non-string input sequence, use
  159 +`token` and the Clojure builtin function `number?` to make it:
  160 +
  161 + (run (token number?) [10 20 30])
  162 + ; 10
  163 +
  164 +### string
  165 +
  166 +`string` creates parsers that parse and return a sequence of characters given
  167 +as a string:
  168 +
  169 + (run (string "Hello") "Hello, world!")
  170 + ; "Hello"
  171 +
  172 +Note that this is the first time we've seen a parser that consumes more than
  173 +one item in the input sequence.
  174 +
  175 +### eof
  176 +
  177 +`eof` creates parsers that ensure the input stream doesn't contain anything else:
  178 +
  179 + (run (eof) "")
  180 + ; nil
  181 +
  182 + (run (eof) "a")
  183 + ; RuntimeException...
  184 +
  185 +On its own it's not very useful, but we'll need it once we learn how to combine
  186 +parsers.
  187 +
  188 +Combining Parsers
  189 +-----------------
  190 +
  191 +The Parsatron wouldn't be very useful if we could only ever parse one thing at
  192 +a time. There are a number of ways you can combine parsers to build up complex
  193 +ones from basic parts.
  194 +
  195 +### >>
  196 +
  197 +The `>>` macro is the simplest way to combine parsers. It takes any number of
  198 +parsers and creates a new parser. This new parser runs them in order and
  199 +returns the value of the last one.
  200 +
  201 +Again, `>>` takes *parsers* and returns a new *parser*. We'll see this many
  202 +times in this section.
  203 +
  204 +Here's an example:
  205 +
  206 + (def my-parser (>> (char \a)
  207 +                    (digit)))
  208 +
  209 + (run my-parser "a5")
  210 + ; \5
  211 +
  212 + (run my-parser "5a")
  213 + ; RuntimeException...
  214 +
  215 + (run my-parser "b5")
  216 + ; RuntimeException...
  217 +
  218 + (run my-parser "aq")
  219 + ; RuntimeException...
  220 +
  221 +We create a parser from two other parsers with `>>` and run it on some input.
  222 +`>>` runs its constituent parsers in order, and they all have to match for it
  223 +to parse successfully.
  224 +
  225 +Now that we can combine parsers, we can also ensure that there's no garbage
  226 +after the stuff we parse by using `eof`:
  227 +
  228 + (run (>> (digit) (eof)) "1")
  229 + ; nil
  230 +
  231 + (run (>> (digit) (eof)) "1 cat")
  232 + ; RuntimeException...
  233 +
  234 +### times
  235 +
  236 +The next way to combine parsers (or, really, a parser with itself) is the
  237 +`times` function.
  238 +
  239 +`times` is a function that takes a count and a parser, and returns a parser that
  240 +repeats the one you gave it the specified number of times and returns the
  241 +results concatenated into a sequence.
  242 +
  243 +For example:
  244 +
  245 + (run (times 5 (letter)) "Hello, world!")
  246 + ; (\H \e \l \l \o)
  247 +
  248 +This is different than `(>> (letter) (letter) (letter) (letter) (letter))`
  249 +because it returns *all* of the parsers' results, not just the last one.
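
To make the contrast concrete, here is a sketch that runs both forms on the same input (the namespace name `myparser.times-demo` is hypothetical):

```clojure
(ns myparser.times-demo
  (:refer-clojure :exclude [char])
  (:use [the.parsatron]))

;; `>>` keeps only the result of the last parser.
(run (>> (letter) (letter) (letter) (letter) (letter)) "Hello, world!")
; \o

;; `times` collects every result into a sequence.
(run (times 5 (letter)) "Hello, world!")
; (\H \e \l \l \o)
```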
  250 +
  251 +### many
  252 +
  253 +`many` is the first creator of "open-ended" parsers we've seen. It's a function
  254 +that takes a parser and returns a new parser that will parse zero or more of the
  255 +one you gave it, and return the results concatenated into a sequence.
  256 +
  257 +For example:
  258 +
  259 + (run (many (digit)) "100 cats")
  260 + ; (\1 \0 \0)
  261 +
  262 +Now we can start to build much more powerful parsers:
  263 +
  264 + (def number-parser (many (digit)))
  265 + (def whitespace-parser (many (token #{\space \newline \tab})))
  266 +
  267 + (run (>> number-parser whitespace-parser number-parser) "100 400")
  268 + ; (\4 \0 \0)
  269 +
  270 +We still need to talk about how to get more than just the last return value, but
  271 +that will come later.
  272 +
  273 +### many1
  274 +
  275 +`many1` is just like `many`, except that the parsers it creates require at least
  276 +one item. It's like `+` in a regular expression instead of `*`.
  277 +
  278 + (def number-parser (many (digit)))
  279 + (def number-parser1 (many1 (digit)))
  280 +
  281 + (run number-parser "")
  282 + ; []
  283 +
  284 + (run number-parser "100")
  285 + ; (\1 \0 \0)
  286 +
  287 + (run number-parser1 "")
  288 + ; RuntimeException...
  289 +
  290 + (run number-parser1 "100")
  291 + ; (\1 \0 \0)
  292 +
  293 +### choice
  294 +
  295 +`choice` takes one or more parsers and creates a parser that will try each of
  296 +them in order until one parses successfully, and return its result. For example:
  297 +
  298 + (def number (many1 (digit)))
  299 + (def word (many1 (letter)))
  300 +
  301 + (def number-or-word (choice number word))
  302 +
  303 + (run number-or-word "dog")
  304 + ; (\d \o \g)
  305 +
  306 + (run number-or-word "42")
  307 + ; (\4 \2)
  308 +
  309 +Notice that we used `many1` when defining the parsers `number` and `word`. If
  310 +we had used `many` then this would always parse as a number because if there
  311 +were no digits it would successfully return an empty sequence.
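
A sketch of that pitfall, using a hypothetical `loose-number-or-word` parser built with `many` instead of `many1` for the number branch:

```clojure
(ns myparser.choice-demo
  (:refer-clojure :exclude [char])
  (:use [the.parsatron]))

;; Hypothetical name: same idea as number-or-word, but the first branch
;; is allowed to match zero digits.
(def loose-number-or-word
  (choice (many (digit))
          (many1 (letter))))

(run loose-number-or-word "dog")
; an empty sequence: `(many (digit))` succeeded without consuming anything,
; so `choice` never tried the word parser
```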
  312 +
  313 +### between
  314 +
  315 +`between` is a function that takes three parsers, call them left, right, and
  316 +center. It creates a parser that parses them in left - center - right order and
  317 +returns the result of center.
  318 +
  319 +This is a convenient way to handle things like parentheses:
  320 +
  321 + (def whitespace-char (token #{\space \newline \tab}))
  322 + (def optional-whitespace (many whitespace-char))
  323 +
  324 + (def open-paren (char \())
  325 + (def close-paren (char \)))
  326 +
  327 + (def number (many1 (digit)))
  328 +
  329 + (run (between (>> open-paren optional-whitespace)
  330 +               (>> optional-whitespace close-paren)
  331 +               number)
  332 +      "(123 )")
  333 + ; (\1 \2 \3)
  334 +
  335 +This example is a bit more complicated than we've seen so far, so slow down and
  336 +make sure you know what's going on.
  337 +
  338 +The three parsers we're giving to `between` are:
  339 +
  340 +1. `(>> open-paren optional-whitespace)`
  341 +2. `(>> optional-whitespace close-paren)`
  342 +3. `number`
  343 +
  344 +Once you're comfortable with this example, it's time to move on to the next
  345 +stage of parsing: building and returning values.
  346 +
  347 +Returning Values
  348 +----------------
  349 +
  350 +So far we've looked at many ways to parse input. If you just need to validate
  351 +that input is in the correct format, but not *do* anything with it, you're all
  352 +set. But usually the goal of parsing something is to do things with it, so
  353 +let's look at how that works now.
  354 +
  355 +We've been using the word "returns" in a fast-and-loose fashion so far, but now
  356 +it's time to look a bit more closely at what it means in the Parsatron.
  357 +
  358 +### defparser and always
  359 +
  360 +When we looked at parsers created with `char` (like `(char \H)`) we said that
  361 +these parsers *returned* that character they parsed. That's not quite true.
  362 +They actually return a specially-wrapped value.
  363 +
  364 +If you want to know exactly what that special wrapping is, watch the [talk][].
  365 +But you don't really need to understand the guts to use the Parsatron. You just
  366 +need to know how to create these wrapped values.
  367 +
  368 +This is the first time we're going to be creating parsers that are more than
  369 +just simple combinations of existing ones. To do that we need to use a special
  370 +macro that handles setting them up properly: `defparser`. Look at the following
  371 +example (don't worry about what `always` is yet):
  372 +
  373 + (defparser sample []
  374 +   (string "Hello")
  375 +   (always 42))
  376 +
  377 +First of all, `defparser` doesn't define parsers. It defines functions that
  378 +*create* parsers, just like all of the ones we've seen so far. Yes, I know how
  379 +ridiculous that sounds. In practice it's only *slightly* confusing.
  380 +
  381 +So now we've got a function `sample` that we can use to create a parser by
  382 +calling it:
  383 +
  384 + (def my-sample-parser (sample))
  385 +
  386 +Okay, now let's run it on some input:
  387 +
  388 + (run my-sample-parser "Hello, world!")
  389 + ; 42
  390 +
  391 +There's a bunch of interesting things going on here, so let's slow down and take
  392 +a look.
  393 +
  394 +First, the parsers created by the functions `defparser` defines implicitly wrap
  395 +their bodies in `>>`, which as we've seen runs its argument parsers in order and
  396 +returns the last result. So our `(sample)` parser will run the "Hello" string
  397 +parser, and then the always parser (which it uses as the result).
  398 +
  399 +So what is this `always` thing? Well, remember at the beginning of this section
  400 +we said that parsers return a specially-wrapped value? `always` is a way to
  401 +simply stick a piece of data in this special wrapper so it can be the result of
  402 +a parser.
  403 +
  404 +Here's a little drawing that might help:
  405 +
  406 + raw input --> (run ...) --> raw output
  407 +                  |    ^
  408 +                  |    |
  409 +                  |  wrapped output
  410 +                  v    |
  411 +              (some parser)
  412 +
  413 +`run` takes the wrapped output from the parser and unwraps it for us before
  414 +returning it, which is why our `run` calls always gave us vanilla Clojure data
  415 +structures before.
  416 +
  417 +We're almost to the point where we can create full-featured parsers. The final
  418 +piece of the puzzle is a way to intercept results and make decisions inside of
  419 +our parsers.
  420 +
  421 +### let->>
  422 +
  423 +The `let->>` macro is the magic glue that's going to make creating your parsers
  424 +fun. In a nutshell, it lets you bind (unwrapped) parser results to names, which
  425 +you can then use normally. Let's just take a look at how it works:
  426 +
  427 + (defparser word []
  428 +   (many1 (letter)))
  429 +
  430 + (defparser greeting []
  431 +   (let->> [prefix (string "Hello, ")
  432 +            name (word)
  433 +            punctuation (choice (char \.)
  434 +                                (char \!))]
  435 +     (if (= punctuation \!)
  436 +       (always [(apply str name) :excited])
  437 +       (always [(apply str name) :not-excited]))))
  438 +
  439 + (run (greeting) "Hello, Cat!")
  440 + ; ["Cat" :excited]
  441 +
  442 + (run (greeting) "Hello, Dog.")
  443 + ; ["Dog" :not-excited]
  444 +
  445 +There's a lot happening here so let's look at it piece-by-piece.
  446 +
  447 +First we use `defparser` to make a `word` function for creating word parsers.
  448 +We could have done this with `(def word (many1 (letter)))` and then used it as
  449 +`word` later, but I find it's easier to just use `defparser` for everything.
  450 +That way we always get parsers the same way: by calling a function.
  451 +
  452 +Next we have our `greeting` parser (technically a function that makes a parser,
  453 +but you get the idea by now). Inside we have a `let->>` that runs three parsers
  454 +and binds their (unwrapped) results to names:
  455 +
  456 +1. `(string "Hello, ")` parses a literal string. `prefix` gets bound to the
  457 + string `"Hello, "`.
  458 +2. `(word)` parses one or more letters. `name` gets bound to the result, which
  459 + is a sequence of chars like `(\C \a \t)`.
  460 +3. `(choice (char \.) (char \!))` parses a period or exclamation point.
  461 + `punctuation` gets bound to the character that was parsed, like `\.` or `\!`.
  462 +
  463 +That's it for the binding section. Next we have the body of the `let->>`. This
  464 +needs to return a *wrapped* value, but we can do anything we like with our bound
  465 +variables to determine what to return. In this case we return different things
  466 +depending on whether the greeting ended with an exclamation point or not.
  467 +
  468 +Notice how the return values are wrapped in `(always ...)`. Also notice how all
  469 +the bound values have been unwrapped for us by `let->>`. `name` really is just
  470 +a sequence of characters which can be used with `(apply str ...)` as usual.
  471 +
  472 +You might wonder whether you can move the `(apply str ...)` into the `let->>`
  473 +binding form, so we don't have to do it twice. Unfortunately you can't.
  474 +**Every right hand side in a `let->>` binding form has to evaluate to a parser**.
  475 +
  476 +If you tried to do something like `(let->> [name (apply str (word))] ...)` it
  477 +wouldn't work for two reasons. First, `let->>` evaluates the right hand side
  478 +and expects the result to be a parser, which it then runs. So it would call
  479 +`(apply str some-word-parser)` and get a string back, which isn't a parser.
  480 +
  481 +Second, `let->>` unwraps the return value of `(word)` right before it binds it,
  482 +so even if the first problem weren't true, `(apply str ...)` would get a wrapped
  483 +value as its argument, which is not going to work.
  484 +
  485 +Of course, you can do anything you want in the *body* of a `let->>`, so this is
  486 +fine:
  487 +
  488 + (let->> [name (word)]
  489 +   (let [name (apply str name)]
  490 +     (always name)))
  491 +
  492 +`let` in this example is a vanilla Clojure `let`.
  493 +
  494 +Binding forms in a `let->>` are executed in order, and importantly, later forms
  495 +can refer to earlier ones. Look at this example:
  496 +
  497 + (defparser sample []
  498 +   (let->> [sign (choice (char \+)
  499 +                         (char \-))
  500 +            word (if (= sign \+)
  501 +                   (string "plus")
  502 +                   (string "minus"))]
  503 +     (always [sign word])))
  504 +
  505 + (run (sample) "+plus")
  506 + ; [\+ "plus"]
  507 +
  508 + (run (sample) "-minus")
  509 + ; [\- "minus"]
  510 +
  511 + (run (sample) "+minus")
  512 + ; RuntimeException...
  513 +
  514 +In this example, `sign` gets bound to the unwrapped result of the `choice`
  515 +parser, which is a character. Then we use that character to determine which
  516 +parser to use in the next binding. If the sign was a `\+`, we parse the string
  517 +`"plus"`. Likewise for minus.
  518 +
  519 +Notice how mixing the two in the last example produced an error. We saw the
  520 +`\+` and decided that we'd use the `(string "plus")` parser for the next input,
  521 +but it turned out to be `"minus"`.
  522 +
  523 +Tips and Tricks
  524 +---------------
  525 +
  526 +That's about it for the basics! You now know enough to parse a wide variety of
  527 +things by building up complex parsers from very simple ones.
  528 +
  529 +Before you go, here are a few tips and tricks that you might find helpful.
  530 +
  531 +### You can parse more than just strings
  532 +
  533 +Remember that the Parsatron operates on sequences of input. These don't
  534 +necessarily have to be strings.
  535 +
  536 +Maybe you've got a big JSON response that you want to split apart. Don't try to
  537 +write a JSON parser from scratch. Just use an existing one like [Cheshire][] and
  538 +then use the Parsatron to parse the Clojure data structure(s) it sends back!
  539 +
  540 +[Cheshire]: https://github.com/dakrone/cheshire
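
As a sketch of what parsing non-string input might look like, assume the data has already been decoded into a flat vector of keywords and numbers (that shape, the `pair` parser, and the namespace name are all made up for illustration, and no JSON library is involved here):

```clojure
(ns myparser.data
  (:refer-clojure :exclude [char])
  (:use [the.parsatron]))

;; Hypothetical parser: a keyword followed by a number becomes a one-entry map.
(defparser pair []
  (let->> [k (token keyword?)
           v (token number?)]
    (always {k v})))

(run (pair) [:cats 2])
; {:cats 2}

;; `many` works on these items just as it does on characters, returning a
;; sequence with one map per keyword/number pair.
(run (many (pair)) [:cats 2 :dogs 3])
```

The parsers here are built only from `token`, `let->>`, `always`, and `many`, all covered above; nothing about them is specific to characters.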
  541 +
  542 +### You can throw away `let->>` bindings
  543 +
  544 +Sometimes you're writing a `let->>` form and encounter a value that you don't
  545 +really need to bind to a name. Instead of stopping the `let->>` and nesting
  546 +a `>>` inside it, just bind the value to a disposable name, like `_`:
  547 +
  548 + (defparser float []
  549 +   (let->> [integral (many1 (digit))
  550 +            _ (char \.)
  551 +            fractional (many1 (digit))]
  552 +     (let [integral (apply str integral)
  553 +           fractional (apply str fractional)]
  554 +       (always (Double/parseDouble (str integral "." fractional))))))
  555 +
  556 + (run (float) "1.4")
  557 + ; 1.4
  558 +
  559 + (run (float) "1.04")
  560 + ; 1.04
  561 +
  562 + (run (float) "1.0400000")
  563 + ; 1.04
