Improvements to README files.
richard-lyman committed Jul 27, 2012
1 parent ee60305 commit 7609954
Showing 2 changed files with 328 additions and 59 deletions.
175 changes: 144 additions & 31 deletions README.markdown
@@ -10,8 +10,13 @@
<h2>Table of Contents</h2>

* Name and Pronunciation
* Introduction
* Grammar Definitions
* Introduction to Reading Grammars (or, Don't Be Afraid)
* Introduction to Using Grammars (or, It's Also Easy)
* Relation to clj-peg
* Reading Grammars
* Redefining Amotoen's Functionality
* Amotoen Grammar Definition
* Recent Improvements

<h2>Name and Pronunciation</h2>

@@ -24,14 +29,82 @@ successfully define 'shortest path' as some stream-of-consciousness process.
I tend to pronounce Amotoen with a style that follows: Am-o-toe-n.
I tend to place emphasis on the 'toe'.


<h2>Introduction</h2>
<h2>Introduction to Reading Grammars (or, Don't Be Afraid)</h2>

Amotoen is a Clojure library that supports PEG style definitions of grammars that can produce parsers.
A parser can help you work with different inputs.
While there are academic papers available that rigorously define PEG, I've found
that PEGs, or **P**arsing **E**xpression **G**rammar(s), are best explained by the
[related Wikipedia page](http://en.wikipedia.org/wiki/Parsing_expression_grammar).

There is a grammar provided in Amotoen for working with CSV files.
You should not be afraid of reading it or of working with CSV files.
Let's walk through a basic CSV grammar as provided in the com.lithinos.amotoen.grammars.csv package.

    {
        :Document           [:Line '(* :Line) :$]
        :Line               [:_* :Value '(* [:_* \, :_* :Value]) :_* '(* :EndOfLine)]
        :Value              '(| [\" (* :DoubleQuotedValue) \"]
                                [\' (* :SingleQuotedValue) \']
                                (* :VanillaValue))
        :DoubleQuotedValue  '(| [\\ \"] [\\ \\] (% \"))
        :SingleQuotedValue  '(| [\\ \'] [\\ \\] (% \'))
        :VanillaValue       ['(! :EndOfLine) '(% \,)]
        :_*                 '(* :Whitespace)
        :Whitespace         '(| \space \tab)
        :EndOfLine          '(| \newline \return)
    }

The above grammar reads as follows:
- A Document is a Line followed by zero or more Lines and then the end of the input
- A Line, ignoring the chunks of whitespace, is a Value followed by zero or more pairs of commas and Values and then zero or more end-of-line characters
- A Value is either some Double Quoted Values (wrapped in double quotes), some Single Quoted Values (wrapped in single quotes), or some Vanilla Values
- A Double Quoted Value is either an escaped double quote, or an escaped backslash, or anything else as long as it isn't a double quote
- A Single Quoted Value is either an escaped single quote, or an escaped backslash, or anything else as long as it isn't a single quote
- A Vanilla Value, first making sure that we're not at the end of the line, is anything that isn't a comma

You should now be able to read the three rules that work with whitespace: :Whitespace, :EndOfLine, and :_*, the rule for zero-or-more :Whitespace.

<h2>Introduction to Using Grammars (or, It's Also Easy)</h2>

Using a grammar like the CSV one above can produce one of two types of output.
The more common output is one where the provider of a grammar processes the intermediate format and produces a more relevant one.
The package com.lithinos.amotoen.grammars.csv provides a 'to-clj' function that produces a Clojure data structure representing the CSV file.
An example of this format using the 'to-clj' function and an input of just "a,b,c\nx,y,z" is as follows:
[["a" "b" "c"] ["x" "y" "z"]]
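Because that result is plain Clojure data (a vector of rows, each a vector of strings), ordinary sequence functions apply to it. The sketch below writes the value out by hand instead of calling 'to-clj', so the name `rows` is only an illustration:

```clojure
;; The value shown above, typed in literally rather than produced by to-clj.
(def rows [["a" "b" "c"] ["x" "y" "z"]])

(first rows)          ;=> ["a" "b" "c"]   the first record
(map second rows)     ;=> ("b" "y")       the second column
(get-in rows [1 2])   ;=> "z"             row 1, column 2 (zero-based)
```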

The less common output is an AST, or Abstract Syntax Tree.
You **shouldn't have to care about this kind of output** unless you want to, since it's meant as an intermediate format.
An example of this format using the CSV grammar and an input of just "a" is as follows:
    {
        :Document [
            {
                :Line [
                    {:_* ()}
                    {:Value {:VanillaValue [true \a]}}
                    ()
                    {:_* ()}
                    ()
                ]
            }
            ()
            :$
        ]
    }
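Since the AST is made of ordinary maps, vectors, and lists, it can be explored with core functions. The following sketch copies the AST above as a literal (the name `ast` is an assumption for illustration) and digs out the matched character:

```clojure
;; The AST from above, typed in literally rather than produced by a parse.
(def ast
  {:Document [{:Line [{:_* ()}
                      {:Value {:VanillaValue [true \a]}}
                      ()
                      {:_* ()}
                      ()]}
              ()
              :$]})

;; Keys select map entries and integers select vector positions.
(def vanilla (get-in ast [:Document 0 :Line 1 :Value :VanillaValue]))

vanilla         ;=> [true \a]
(last vanilla)  ;=> \a
```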

If the more common output is not available, or you'd like to work with an AST, you can generate your own AST.
The following is only one of several different paths to producing an AST from a grammar and an input.
One way is to call the Amotoen function 'pegasus' and provide three arguments.
The first argument to pegasus is the key for the root rule in the grammar, the rule that should be run first.
The second argument is the grammar definition.
The third argument is something that fulfils the IAmotoen protocol, and the provided 'wrap-string' function will do just that for Strings.
Putting all these pieces together with the input from above of "a" you get:
(pegasus :Document grammar (wrap-string "a"))

Much more information can be found in the tests for CSV, the com.lithinos.amotoen.test.csv package.

<h2>Relation to clj-peg</h2>

The clj-peg library was a predecessor to Amotoen and as such, Amotoen syntax might be reminiscent
of the syntax in clj-peg. There are, however, significant differences between
using clj-peg and using Amotoen. The most significant of those differences can be
@@ -45,30 +118,73 @@ result in far greater ease of use as well as increased maintainence.

In other words: **Amotoen is better than clj-peg. Amotoen is not AOT'd**.

<h2>Grammar Definitions</h2>
<h2>Reading Grammars</h2>

The grammar for Amotoen grammars is:
The number one goal for Amotoen is to have readable grammars.
Once you learn a few rules you should be able to read grammars.
- Grammars are a Map with Keywords as keys, so it should be easy to look up a rule
- Vectors represent elements that must all match in sequence
- Lists allow a wider range of defining how to treat elements in the list, all based on the first element in the list
- A * (zero-or-more) requires zero or more of the rest of the list to match
- A | (either) requires at least one of the rest of the list to match
- A % (any-not) allows anything to match as long as it isn't matched by the rest of the list
- An ! (not-predicate) checks to ensure that the rest of the list doesn't match
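The reading rules above can be sketched as a tiny recognizer. This is not Amotoen's implementation: the function `match`, its arguments, and its return convention (an index, or nil on failure) are assumptions made for illustration, it builds no AST, and it ignores 'a' and 'f' lists. The grammar is the CSV grammar shown earlier in this README.

```clojure
(defn match
  "Hypothetical recognizer: returns the index just past the text matched by
  `body` in string `s` starting at `pos`, or nil when `body` does not match."
  [g body s pos]
  (cond
    (= body :$)     (when (= pos (count s)) pos)           ; end of input
    (keyword? body) (match g (get g body) s pos)           ; look the rule up
    (char? body)    (when (= body (get s pos)) (inc pos))  ; terminal character
    (vector? body)  (reduce (fn [p b] (or (match g b s p) (reduced nil)))
                            pos body)                      ; all must match, in sequence
    (seq? body)
    (let [[op arg] body]
      (case op
        | (some (fn [b] (match g b s pos)) (rest body))    ; first alternative that matches
        * (loop [p pos]                                    ; zero or more
            (let [p2 (match g arg s p)]
              ;; stop on failure, or on a zero-width match (avoids looping forever)
              (if (and p2 (> p2 p)) (recur p2) p)))
        ! (when-not (match g arg s pos) pos)               ; not-predicate, consumes nothing
        % (when (and (< pos (count s))                     ; any one character that
                     (not (match g arg s pos)))            ; `arg` does not match
            (inc pos))))))

(def csv-grammar
  {:Document          [:Line '(* :Line) :$]
   :Line              [:_* :Value '(* [:_* \, :_* :Value]) :_* '(* :EndOfLine)]
   :Value             '(| [\" (* :DoubleQuotedValue) \"]
                          [\' (* :SingleQuotedValue) \']
                          (* :VanillaValue))
   :DoubleQuotedValue '(| [\\ \"] [\\ \\] (% \"))
   :SingleQuotedValue '(| [\\ \'] [\\ \\] (% \'))
   :VanillaValue      ['(! :EndOfLine) '(% \,)]
   :_*                '(* :Whitespace)
   :Whitespace        '(| \space \tab)
   :EndOfLine         '(| \newline \return)})

(match csv-grammar :Document "a,b,c\nx,y,z" 0)  ;=> 11, the whole input
(match {:S [\a '(* \b) :$]} :S "abbc" 0)        ;=> nil, the trailing c is not allowed
```

Returning an index keeps the sketch short; a real parser would also have to build the nested result structure.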

There are special types of lists that allow you to inject a function call into the process.
Lists that have an 'a' as the first element will contain an 'Aware Function'.
Lists that have an 'f' as the first element will contain a simpler 'Function'.

<h2>Redefining Amotoen's Functionality</h2>

Aware Functions ('a') are slightly more complicated than simpler Functions ('f') but bring a **significantly** greater impact.
Simpler Functions must accept a single parameter, and whatever they return is placed in the output at the point the function is called.
The single parameter given to simpler Functions is the result of Amotoen processing the remaining element in the list.
Aware Functions must accept two parameters and whatever they return is placed in the output at the point the function is called.
The two parameters given to Aware Functions are the grammar and the wrapped input.

Since Aware Functions are provided the grammar and the wrapped input, they can extend the functionality of Amotoen in any way.
If you dislike that the only terminals allowed by default are characters, then inject an Aware Function that allows something else.
Bits or Bytes or Regexes or whatever as terminals would be fairly simple to add.
If you dislike that there is a limited set of list types, then inject an Aware Function that allows something else.
An And-Predicate, a One-Or-More grouping, or a simpler combination of Not-Predicate guards followed by some :Body would be fairly simple to add.

    {
        :Whitespace         '(| \space \newline \tab \,)
        :_*                 '(* :Whitespace)
        :_                  [:Whitespace '(* :Whitespace)]
        :Grammar            [\{ :_* :Rule '(* [:_ :Rule]) :_* \}]
        :Rule               [:Keyword :_ :Body]
        :Keyword            [\: :ValidKeywordChar '(* :ValidKeywordChar)]
        :Body               '(| :Keyword :Char :Grouping)
        :Grouping           '(| :Sequence :Either :ZeroOrMore :ZeroOrOne :AnyNot)
        :Sequence           [\[ :_* :Body '(* [:_* :Body]) :_* \]]
        :Either             [\( \| :_ :Body '(* [:_* :Body]) :_* \)]
        :ZeroOrMore         [\( \* :_ :Body :_* \)]
        :ZeroOrOne          [\( \? :_ :Body :_* \)]
        :AnyNot             [\( \% :_ '(| :Keyword :Char) :_* \)]
        :Char               [\\ '(| :TabChar :SpaceChar :NewlineChar (% \space))]
        :TabChar            (pegs "tab")
        :SpaceChar          (pegs "space")
        :NewlineChar        (pegs "newline")
        :ValidKeywordChar   (lpegs '| "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:/*+!_?-")
    }
An 'f' type list has three elements that matter and they occur in this order:
1. The first element is an \f
2. The second element is the function
3. The third element is a :Body, as defined in the Amotoen grammar

An 'a' type list has only two elements that matter and they occur in this order:
1. The first element is an \a
2. The second element is the function

<h2>Amotoen Grammar Definition</h2>

The grammar for Amotoen grammars is:
    {
        :_*                         '(* :Whitespace)
        :_                          [:Whitespace '(* :Whitespace)]
        :Grammar                    [\{ :_* :Rule '(* [:_ :Rule]) :_* \}]
        :Rule                       [:Keyword :_ :Body]
        :Keyword                    [\: '(| :AmotoenSymbol :ProvidedSymbol)]
        :ProvidedSymbol             '(| :EndOfInput :AcceptAnything)
        :EndOfInput                 \$
        :AcceptAnything             \.
        :Body                       '(| :Keyword :Char :Grouping :NotPredicate :AnyNot :AwareFunction :Function)
        :Grouping                   '(| :Sequence :Either :ZeroOrMore)
        :Sequence                   [\[ :_* :Body '(* [:_* :Body]) :_* \]]
        :Either                     [\( \| :_ :Body '(* [:_* :Body]) :_* \)]
        :NotPredicate               [\( \! :_ :Body :_* \)]
        :ZeroOrMore                 [\( \* :_ :Body :_* \)]
        :AnyNot                     [\( \% :_ :Body :_* \)]
        :AwareFunction              [\( \a :_ :CljReaderFn :_* \)]
        :Function                   [\( \f :_ :CljReaderFn :_ :Body :_* \)]
        :CljReaderFn                [\# \< '(% \>) '(* (% \>)) \>]
        :Whitespace                 '(| \space \newline \return \tab \,)
        :Char                       [\\ (list '| (pegs "tab") (pegs "space") (pegs "newline") (pegs "return") '(% \space))]
        :AmotoenSymbol              [:NonNumericCharacter '(* :AlphanumericCharactersPlus)]
        :NonNumericCharacter        (list '% (lpegs '| "0123456789"))
        :AlphanumericCharactersPlus (lpegs '| "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:/*+!-_?.")
    }

<h2>Recent Improvements</h2>

@@ -80,7 +196,7 @@ As an example:
- Some function named 'custom-collapse' set to
#(apply str %)
- Some grammar 'g' set to
{:S [(list custom-collapse (pegs "abcabc"))]}
{:S [(list 'f custom-collapse (pegs "abcabc"))]}
- Some input 'i' set to
"abcabc"
- An invocation like...
@@ -98,7 +214,7 @@ Another example:
- Some function named 'custom-collapse' set to
#(apply str %)
- Some grammar 'g' set to
{:S [(list custom-collapse '* (lpegs '| "abc"))]}
{:S [(list 'f custom-collapse '(* (lpegs '| "abc")))]}
- Some input 'i' set to
"aabbcc"
- An invocation like...
@@ -112,6 +228,3 @@ Without supplying a custom collapse function:
- Other things alike, the result should be
{:S [(\a \a \b \b \c \c)]}
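To see what a collapse function of this shape does on its own, the sketch below applies the 'custom-collapse' from the examples to the uncollapsed sequence of characters shown above. It runs without Amotoen, and the name `matched` is only an illustration of the value such a function would be handed:

```clojure
;; The collapse function from the examples: joins its argument into a string.
(def custom-collapse #(apply str %))

;; The uncollapsed sequence of characters from the result above.
(def matched '(\a \a \b \b \c \c))

(custom-collapse matched)  ;=> "aabbcc"
```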



