-
Notifications
You must be signed in to change notification settings - Fork 113
/
nom-intro.md
546 lines (432 loc) · 17.8 KB
/
nom-intro.md
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
## Parsing Text with Nom
[Nom](https://github.com/Geal/nom), [(documented here)](https://docs.rs/nom) is a parser library
for Rust which is well worth the initial time investment.
If you have to parse a known data format, like CSV or JSON, then
it's best to use a specialized library like [Rust CSV](https://github.com/BurntSushi/rust-csv) or
the JSON libraries discussed in [Section 4](4-modules.html#cargo).
Likewise, for configuration files
use dedicated parsers like [ini](https://docs.rs/rust-ini/0.10.0/ini/) or
[toml](http://alexcrichton.com/toml-rs/toml/index.html). (The last one is particularly cool since
it integrates with the Serde framework, just as we saw with [serde_json](https://docs.rs/serde_json).
But if the text is not regular, or some made-up format, then you need to scan that text without
writing a lot of tedious string-processing code. The suggested go-to is often [regex](https://github.com/rust-lang/regex),
but regexes can be frustratingly opaque when sufficiently involved. Nom provides a way to parse
text which is just as powerful and can be built up by combining simpler parsers. And regexes have
their limits, for instance, don't [use regexes for parsing HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)
but you _could_ use Nom to parse HTML. If you ever had the itch to write your own programming
language, Nom is a good place for you start on that hard road to obscurity.
There are some excellent tutorials for learning Nom, but I want to start at the hello-world
level to build some initial familiarity. The basic things you need to know - first, Nom is macros all the
way down, and second, Nom prefers to work with byte slices, not strings. The first means that you have to
be especially careful to get Nom expressions right, because the error messages are not going to be
friendly. And the second means that Nom can be used for _any_ data format, not just text. People
have used Nom to decode binary protocols and file headers. It can also work with 'text'
in encodings other than UTF-8.
Recent versions of Nom work fine with string slices, although you need to use the macros that
end with `_s`.
```rust
#[macro_use]
extern crate nom;
named!(get_greeting<&str,&str>,
tag_s!("hi")
);
fn main() {
let res = get_greeting("hi there");
println!("{:?}",res);
}
// Done(" there", "hi")
```
The `named!` macro creates functions which take some input type (`&[u8]` by default)
and return the second type in angle brackets.
`tag_s!` matches a literal string in the stream of characters, and its value is
a string slice representing that literal. (If you wanted to work with `&[u8]` then
use the `tag!` macro.)
We call the defined parser `get_greeting` with a `&str` and
get back an [IResult](http://rust.unhandledexpression.com/nom/enum.IResult.html).
And indeed we get back the matching value.
Look at " there" - This is the string slice left over after matching..
We want to ignore whitespace. By just wrapping the `tag!` in `ws!` we can match "hi" anywhere
among spaces, tabs or newlines:
```rust
named!(get_greeting<&str,&str>,
ws!(tag_s!("hi"))
);
fn main() {
let res = get_greeting("hi there");
println!("{:?}",res);
}
// Done("there", "hi")
```
The result is "hi" as before, and the remaining string is "there"! The spaces have been skipped.
"hi" is matching nicely, although this isn't very useful yet.
Let's match _either_ "hi" or "bye". The `alt!` macro ("alternate") takes parser expressions
separated by `|` and matches _any_ of them. Note that you can use whitespace here to make
the parser function easier to read:
```rust
named!(get_greeting<&str>,
ws!(alt!(tag_s!("hi") | tag_s!("bye")))
);
println!("{:?}", get_greeting(" hi "));
println!("{:?}", get_greeting(" bye "));
println!("{:?}", get_greeting(" hola "));
// Done("", "hi")
// Done("", "bye")
// Error(Alt)
```
The last match failed because there is no alternative that matches "hola".
Clearly we need to understand this `IResult` type to go further, but first let's compare this
with the regex solution:
```rust
let greetings = Regex::new(r"\s*(hi|bye)\s*").expect("bad regex");
let caps = greetings.captures(" hi ").expect("match failed");
println!("{:?}",caps);
// Captures({0: Some(" hi "), 1: Some("hi")})
```
Regular expressions are certainly more _compact_!.
We needed to put '()' around the two possibilities
separated by '|' so that we will _capture_ the greeting and nothing else. The first result is the
whole string, the second is the matched capture. ('|' is the so-called 'alternation' operator in
regexes, which is the motivation for the `alt!` macro syntax.)
But this is a very simple regex, and they get complicated very quickly. Being a text mini-language, you
have to escape significant characters like `*` and `(`. If I wanted to match "(hi)" or
"(bye)" the regex becomes "\s*\((hi|bye)\)\s*" but the Nom parser simply becomes
`alt!(tag_s!("(hi)") | tag_s!("(bye)"))`.
It's also a heavy-weight dependency. On this fairly feeble i5 laptop, Nom examples take about 0.55
seconds to compile, which is not much more than "Hello world". But the regex examples take about
0.90s. And the stripped release build executable of the Nom example is about 0.3Mb (which is about
as small as statically linked Rust programs go) versus 0.8Mb for the regex example.
## What a Nom Parser returns
[IResult](http://rust.unhandledexpression.com/nom/enum.IResult.html) has an interesting difference
from the standard `Result` type - there are three possibilities:
- `Done` - success - you get both the result and the remaining bytes
- `Error` - failed to parse - you get an error
- `Imcomplete` - more data needed
We can write a generic `dump` function that handles any return value that can be debug-printed.
This also demonstrates the `to_result` method which returns a regular `Result` - this is probably
the method you will use for most cases since it returns either the returned value or an error.
```rust
#[macro_use]
extern crate nom;
use nom::IResult;
use std::str::from_utf8;
use std::fmt::Debug;
fn dump<T: Debug>(res: IResult<&str,T>) {
match res {
IResult::Done(rest, value) => {println!("Done {:?} {:?}",rest,value)},
IResult::Error(err) => {println!("Err {:?}",err)},
IResult::Incomplete(needed) => {println!("Needed {:?}",needed)}
}
}
fn main() {
named!(get_greeting<&str,&str>,
ws!(
alt!( tag_s!("hi") | tag_s!("bye"))
)
);
dump(get_greeting(" hi "));
dump(get_greeting(" bye hi"));
dump(get_greeting(" hola "));
println!("result {:?}", get_greeting(" bye ").to_result());
}
// Done "" "hi"
// Done "hi" "bye"
// Err Alt
// result Ok("bye")
```
Parsers returning any unparsed text, and being able to indicate that they don't have enough
input characters to decide, is very useful for stream parsing. But usually `to_result` is your friend.
## Combining Parsers
Let's continue the greeting example and imagine that a greeting consists of "hi" or "bye", plus
a name. `nom::alpha` matches a series of alphabetical characters.
The `pair!` macro will collect the result of matching two parsers as a tuple:
```rust
named!(full_greeting<&str,(&str,&str)>,
pair!(
get_greeting,
nom::alpha
)
);
println!("result {:?}", full_greeting(" hi Bob ").to_result());
// result Ok(("hi", "Bob"))
```
Now, further imagine that the greeter is perhaps a little shy or doesn't know anybody's name:
let us make the name optional. Naturally, the second value of the tuple becomes an `Option`.
```rust
named!(full_greeting<&str, (&str,Option<&str>)>,
pair!(
get_greeting,
opt!(nom::alpha)
)
);
println!("result {:?}", full_greeting(" hi Bob ").to_result());
println!("result {:?}", full_greeting(" bye ?").to_result());
// result Ok(("hi", Some("Bob")))
// result Ok(("bye", None))
```
Notice that it was straightforward to combine an existing parser for greetings with a parser
that picks up names, and then it was easy to make that name optional. This is the great power of Nom,
and it's why it's called a "parser combinator library". You can build up your complicated
parsers from simpler parsers, which you can test individually. (At this point, the equivalent
regex is starting to look like a Perl program: regexes do not combine well.)
However, we are not yet home and dry! `full_greeting(" bye ")` will fail with an
`Imcomplete` error. Nom knows that "bye" may be followed by a name and wants us to give it more
data. This is how a _streaming_ parser needs to work, so you can feed it a file chunk by chunk,
but here we need to tell Nom that the input is complete.
```rust
named!(full_greeting<&str,(&str,Option<&str>)>,
pair!(
get_greeting,
opt!(complete!(nom::alpha))
)
);
println!("result {:?}", full_greeting(" bye ").to_result());
// result Ok(("bye", None))
```
## Parsing Numbers
Nom provides a function `digit` which matches a series of numerical digits.
So we use `map!`, to convert the string into an integer,
and return the full `Result` type.
```rust
use nom::digit;
use std::str::FromStr;
use std::num::ParseIntError;
named!(int8 <&str, Result<i8,ParseIntError>>,
map!(digit, FromStr::from_str)
);
named!(int32 <&str, Result<i32,ParseIntError>>,
map!(digit, FromStr::from_str)
);
println!("{:?}", int8("120"));
println!("{:?}", int8("1200"));
println!("{:?}", int8("x120"));
println!("{:?}", int32("1202"));
// Done("", Ok(120))
// Done("", Err(ParseIntError { kind: Overflow }))
// Error(Digit)
// Done("", Ok(1202))
```
So what we get is a parser `IResult` containing a conversion `Result` - and sure enough, there
is more than one way to fail here. Note that the body of our converting function has exactly
the same code; the actual conversion depends on the return type of the function.
Integers may have a sign. We can capture integers as a pair, where the first
value may be a sign, and the second value would be any digits following.
Consider:
```rust
named!(signed_digits<&str, (Option<&str>,&str)>,
pair!(
opt!(alt!(tag_s!("+") | tag_s!("-"))), // maybe sign?
digit
)
);
println!("signed {:?}", signed_digits("4"));
println!("signed {:?}", signed_digits("+12"));
// signed Done("", (None, "4"))
// signed Done("", (Some("+"), "12"))
```
When we aren't interested in the intermediate results, but just want all the matching
input, then `recognize!` is what you need.
```rust
named!(maybe_signed_digits<&str,&str>,
recognize!(signed_digits)
);
println!("signed {:?}", maybe_signed_digits("+12"));
// signed Done("", "+12")
```
With this technique, we can recognize floating-point numbers. Again we map to string slice
from the byte slice over all these matches. `tuple!` is the generalization of `pair!`,
although we aren't interested in the generated tuple here. `complete!` is needed to resolve
the same problem we had with incomplete greetings - "12" is a valid number without the
optional floating-point part.
```rust
named!(floating_point<&str,&str>,
recognize!(
tuple!(
maybe_signed_digits,
opt!(complete!(pair!(
tag_s!("."),
digit
))),
opt!(complete!(pair!(
alt!(tag_s!("e") | tag_s!("E")),
maybe_signed_digits
)))
)
)
);
```
By defining a little helper macro, we get some passing tests. The test
passes if `floating_point` matches all of the string that it is given.
```rust
macro_rules! nom_eq {
($p:expr,$e:expr) => (
assert_eq!($p($e).to_result().unwrap(), $e)
)
}
nom_eq!(floating_point, "+2343");
nom_eq!(floating_point, "-2343");
nom_eq!(floating_point, "2343");
nom_eq!(floating_point, "2343.23");
nom_eq!(floating_point, "2e20");
nom_eq!(floating_point, "2.0e-6");
```
(Although sometimes macros feel a _little_ dirty, making your tests pretty is a fine thing.)
And then we can parse and convert floating point numbers. Here I'll throw caution to the
winds and throw away the error:
```rust
named!(float64<f64>,
map_res!(floating_point, FromStr::from_str)
);
```
Please note how it's possible to build up complicated parsers step by step, testing each
part in isolation first. That's a strong advantage of parser combinators over regexes.
It is very much the classic programming tactic of divide-and-rule.
## Operations over Multiple Matches
We've met `pairs!` and `tuple!` which capture a fixed number of matches as Rust tuples.
There is also `many0` and `many1` - they both capture indefinite numbers of matches as vectors.
The difference is that the first may capture 'zero or many' and the second 'one or many' (like the
difference between the regex `*` versus `+` modifiers.) So `many1!(ws!(float64))` would
parse "1 2 3" into `vec![1.0,2.0,3.0]`, but will fail on the empty string.
`fold_many0` is a _reducing_ operation. The match values are combined into a single value,
using a binary operator.
For instance, this is how Rust people did sums over iterators before `sum` was added; this fold
starts with an initial value (here zero) for the _accumulator_ and keeps adding values to
that accumulator using `+`.
```rust
let res = [1,2,3].iter().fold(0,|acc,v| acc + v);
println!("{}",res);
// 6
```
Here's the Nom equivalent:
```rust
named!(fold_sum<&str,f64>,
fold_many1!(
ws!(float64),
0.0,
|acc, v| acc + v
)
);
println!("fold {}", fold_sum("1 2 3").to_result().unwrap());
//fold 6
```
Up to now, we've had to capture every expression, or just grab all matching bytes with `recognize!`:
```rust
named!(pointf<(f64,&[u8],f64)>,
tuple!(
float64,
tag_s!(","),
float64
)
);
println!("got {:?}", nom_res!(pointf,"20,52.2").unwrap());
//got (20, ",", 52.2)
```
For more complicated expressions, capturing the results of all the parsers leads to
rather untidy types! We can do better.
`do_parse!` lets you extract only the values you're interested in. The matches are separated
with `>>` - the matches of interest are of the form `name: parser`. Finally, there's a code
block in parentheses.
```rust
#[derive(Debug)]
struct Point {
x: f64,
y: f64
}
named!(pointf<Point>,
do_parse!(
first: float64 >>
tag_s!(",") >>
second: float64
>>
(Point{x: first, y: second})
)
);
println!("got {:?}", nom_res!(pointf,"20,52.2").unwrap());
// got Point { x: 20, y: 52.2 }
```
We're not interested in that tag's value (it can only be a comma) but we assign the two float values
to temporary values which are used to build a struct. The code at the end can be any Rust
expression.
## Parsing Arithmetic Expressions
With the necessary background established, we can do simple arithmetic expressions.
This is a good example of something that really can't be done with regexes.
The idea is to build up expressions from the bottom up. Expressions consist of _terms_, which are
added or subtracted. Terms consist of _factors_, which are multiplied or divided. And (for now)
factors are just floating-point numbers:
```rust
named!(factor<f64>,
ws!(float64)
);
named!(term<&str,f64>, do_parse!(
init: factor >>
res: fold_many0!(
tuple!(
alt!(tag_s!("*") | tag_s!("/")),
factor
),
init,
|acc, v:(_,f64)| {
if v.0 == "*" {acc * v.1} else {acc / v.1}
}
)
>> (res)
));
named!(expr<&str,f64>, do_parse!(
init: term >>
res: fold_many0!(
tuple!(
alt!(tag_s!("+") | tag_s!("-")),
term
),
init,
|acc, v:(_,f64)| {
if v.0 == "+" {acc + v.1} else {acc - v.1}
}
)
>> (res)
));
```
This expresses our definitions more precisely - an expression consists of at least one term, and then
zero or many plus-or-minus terms. We don't collect them, but _fold_ them using the appropriate
operator. (It's one of those cases where Rust can't quite work out the type of the expression, so
we need a type hint.) Doing it like this establishes the correct _operator precedence_ - `*` always
wins over `+` and so forth.
We're going to need floating-point asserts here, and [there's a crate for that](http://brendanzab.github.io/approx/approx/index.html).
Add the line 'approx="0.1.1" to your Cargo.toml, and away we go:
```rust
#[macro_use]
extern crate approx;
...
assert_relative_eq!(fold_sum("1 2 3").to_result().unwrap(), 6.0);
```
Let's define a convenient little testing macro. `stringify!` turns the expression into a string
literal which we can feed into `expr` and then compare the result with how Rust would
evaluate it.
```rust
macro_rules! expr_eq {
($e:expr) => (assert_relative_eq!(
expr(stringify!($e).to_result().unwrap(),
$e)
)
}
expr_eq!(2.3);
expr_eq!(2.0 + 3.0 - 4.0);
expr_eq!(2.0*3.0 - 4.0);
```
This is very cool - a few lines to get an expression evaluator! But it gets better.
We add an alternative to numbers in the `factor` parser - expressions contained inside
parentheses:
```rust
named!(factor<&str,f64>,
alt!(
ws!(float64) |
ws!(delimited!( tag_s!("("), expr, tag_s!(")") ))
)
);
expr_eq!(2.2*(1.1 + 4.5)/3.4);
expr_eq!((1.0 + 2.0)*(3.0 + 4.0*(5.0 + 6.0)));
```
The coolness is that expressions are now defined _recursively_ in terms of expressions!
The particular magic of `delimited!` is that parentheses may be nested - Nom makes sure
the brackets match up.
We are now way past the capabilities of regular expressions, and the stripped executable at 0.5Mb
is still half the size of a "hello world" regex program.