### Section 190.1: Getting Started with PLY

**Note: Do not use pip to install PLY; it will install a broken distribution on your machine.**

### Section 190.2: The "Hello, World!" of PLY - A Simple Calculator

```Python
from ply import lex
import ply.yacc as yacc
tokens = (
    'PLUS',
    'MINUS',
    'TIMES',
    'DIV',
    'LPAREN',
    'RPAREN',
    'NUMBER',
)
t_ignore = ' \t'
t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIV = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'
def t_NUMBER( t ) :
    r'[0-9]+'
    t.value = int( t.value )
    return t
def t_newline( t ):
    r'\n+'
    t.lexer.lineno += len( t.value )
def t_error( t ):
    print("Invalid Token:",t.value[0])
    t.lexer.skip( 1 )

lexer = lex.lex()
precedence = (
    ( 'left', 'PLUS', 'MINUS' ),
    ( 'left', 'TIMES', 'DIV' ),
    ( 'nonassoc', 'UMINUS' )
)
def p_add( p ) :
    'expr : expr PLUS expr'
    p[0] = p[1] + p[3]
def p_sub( p ) :
    'expr : expr MINUS expr'
    p[0] = p[1] - p[3]
def p_expr2uminus( p ) :
    'expr : MINUS expr %prec UMINUS'
    p[0] = - p[2]
def p_mult_div( p ) :
    '''expr : expr TIMES expr
    | expr DIV expr'''
    if p[2] == '*' :
        p[0] = p[1] * p[3]
    else :
        if p[3] == 0 :
            print("Can't divide by 0")
            raise ZeroDivisionError('integer division by 0')
        p[0] = p[1] / p[3]
def p_expr2NUM( p ) :
    'expr : NUMBER'
    p[0] = p[1]
def p_parens( p ) :
    'expr : LPAREN expr RPAREN'
    p[0] = p[2]
def p_error( p ):
    print("Syntax error in input!")
    
parser = yacc.yacc()
res = parser.parse("-4*-(3-5)") # the input
print(res)
```

Saving the code above as `calc.py` and running `python calc.py` prints `-8`. When PLY has to build its parsing tables, the run also reports `Generating LALR tables`.


### Section 190.3: Part 1: Tokenizing Input with Lex

```Python
import ply.lex as lex
# List of token names. This is always required
tokens = [
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
]
# Regular expression rules for simple tokens
t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'
# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t
# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
# A string containing ignored characters (spaces and tabs)
t_ignore = ' \t'
# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)
# Build the lexer
lexer = lex.lex()
# Give the lexer some input (sample string; any input works)
data = "10 + 5"
lexer.input(data)
# Tokenize
while True:
    tok = lexer.token()
    if not tok:
        break # No more input
    print(tok)
```

#### Breakdown

1. Import the module using `import ply.lex`.
2. All lexers must provide a list called tokens that defines all of the possible token names that can be produced by the lexer. This list is always required.
```Python
    tokens = [
        'NUMBER',
        'PLUS',
        'MINUS',
        'TIMES',
        'DIVIDE',
        'LPAREN',
        'RPAREN',
    ]
    # tokens could also be a tuple of strings (rather than a list), where each string denotes a token as before.
```
3. The regex rule for each token may be defined either as a string or as a function. In either case, the variable or function name must be prefixed with `t_` to mark it as a rule for matching tokens.
+ For simple tokens, the regular expression can be specified as strings: `t_PLUS = r'\+'`
+ If some kind of action needs to be performed, a token rule can be specified as a function.
    ```Python
    def t_NUMBER(t):
        r'\d+'
        t.value = int(t.value)
        return t
    ```
    Note that the rule is specified as a docstring within the function. The function accepts one argument, an instance of `LexToken`, performs some action, and then returns that argument.
    If you want to use an external string as the regex rule instead of a docstring, use the `TOKEN` decorator, as in the following example:
    ```Python
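    from ply.lex import TOKEN
    identifier = r'[a-zA-Z_][a-zA-Z0-9_]*' # example pattern; any regex string works
    @TOKEN(identifier) # identifier is a string holding the regex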
    def t_ID(t):
        ... # actions
    ```

An instance of `LexToken` (let's call this object `t`) has the following attributes:
1. `t.type`, which is the token type as a string (e.g. `'NUMBER'`, `'PLUS'`). By default, `t.type` is set to the name following the `t_` prefix.
2. `t.value`, which is the lexeme (the actual text matched).
3. `t.lineno`, which is the current line number. This is not updated automatically, as the lexer knows nothing of line numbers. Update `lineno` using a rule function such as `t_newline`.
```Python
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
```
4. t.lexpos which is the position of the token relative to the beginning of the input text.
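
Because `t.lexpos` counts characters from the start of the whole input, a column number has to be derived from it. The following is a minimal sketch, assuming the full input string is still available. `find_column` is a helper name chosen here, and `SimpleNamespace` only stands in for a real `LexToken` so that the snippet runs without PLY.

```python
from types import SimpleNamespace

def find_column(text, token):
    """Return the 1-based column of token within its line of text."""
    # rfind returns -1 when no newline precedes the token, so line_start becomes 0.
    line_start = text.rfind("\n", 0, token.lexpos) + 1
    return token.lexpos - line_start + 1

text = "1 + 2\n30 * 4"
# Stand-in for the '*' token on the second line (index 9 in text).
star = SimpleNamespace(type="TIMES", value="*", lineno=2, lexpos=9)
print(find_column(text, star))  # prints 4
```

With a real lexer, the same function could be called from a rule or from `t_error` by passing the original input string and the token.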

* If nothing is returned from a regex rule function, the token is discarded. Alternatively, to discard a token without writing a function, define the rule as a string whose name starts with the `t_ignore_` prefix.
```Python
def t_COMMENT(t):
    r'\#.*'
    pass
# No return value. Token discarded
```

The string form of the same rule is shown below, together with `t_ignore`, which lists characters to skip:

```Python
t_ignore_COMMENT = r'\#.*'
t_ignore = ' \t' # ignores spaces and tabs
```

* When building the master regex, lex will add the regexes specified in the file as follows:

    1. Tokens defined by functions are added in the same order as they appear in the file. 
    2. Tokens defined by strings are added in order of decreasing regex length, so longer patterns are tried first.
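
The ordering matters because a regex alternation takes the first alternative that matches. The sketch below is not PLY's implementation. It uses only the standard `re` module, and the names `function_rules`, `string_rules`, `build_master` and `tokenize` are hypothetical, chosen to imitate the two ordering rules above.

```python
import re

# Hypothetical rule tables standing in for the t_ definitions of a lexer module.
function_rules = [("NUMBER", r"\d+")]              # kept in definition order
string_rules = [("ASSIGN", r"="), ("EQ", r"==")]   # sorted before use

def build_master(function_rules, string_rules):
    """Join all rules into one alternation of named groups."""
    ordered = list(function_rules)
    # Longer regex strings go first, so '==' is tried before '='.
    ordered += sorted(string_rules, key=lambda rule: len(rule[1]), reverse=True)
    return re.compile("|".join("(?P<%s>%s)" % rule for rule in ordered))

def tokenize(master, text):
    """Return (type, lexeme) pairs, skipping spaces and tabs."""
    pos, out = 0, []
    while pos < len(text):
        if text[pos] in " \t":
            pos += 1
            continue
        m = master.match(text, pos)
        if m is None:
            raise ValueError("Illegal character %r" % text[pos])
        out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

master = build_master(function_rules, string_rules)
print(tokenize(master, "1 == 2"))
# [('NUMBER', '1'), ('EQ', '=='), ('NUMBER', '2')]

# Without the sort, '=' wins and '==' is split into two ASSIGN tokens.
naive = re.compile(r"(?P<NUMBER>\d+)|(?P<ASSIGN>=)|(?P<EQ>==)")
print(tokenize(naive, "1 == 2"))
```

If two string rules must be tried in a specific order regardless of length, defining them as functions gives explicit control, since function rules keep their file order.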

* Literals are single-character tokens that are returned as they are. Both `t.type` and `t.value` will be set to the character itself. Define a list of literals as such:

`literals = [ '+', '-', '*', '/' ]`

or

`literals = "+-*/"`

It is possible to write token functions that perform additional actions when literals are matched. However, you'll need to set the token type appropriately. For example:

```Python
literals = [ '{', '}' ]
def t_lbrace(t):
    r'\{'
    t.type = '{' # Set token type to the expected literal (ABSOLUTE MUST if this is a literal)
    return t
```

* Handle errors with a `t_error` function.

```Python
# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1) # skip the illegal character and continue lexing
```

**Final preparations:**
    
Build the lexer using `lexer = lex.lex()`.

You can also put everything inside a class and use an instance of that class to define the lexer. For example:

```Python
import ply.lex as lex

class MyLexer(object):
    # Everything relating to token rules and error handling goes here as usual
    # List of token names. This is always required
    tokens = [
        'NUMBER',
        'PLUS',
        'MINUS',
        'TIMES',
        'DIVIDE',
        'LPAREN',
        'RPAREN',
    ]
    # Regular expression rules for simple tokens
    t_PLUS = r'\+'
    t_MINUS = r'-'
    t_TIMES = r'\*'
    t_DIVIDE = r'/'
    t_LPAREN = r'\('
    t_RPAREN = r'\)'
    # A string containing ignored characters (spaces and tabs)
    t_ignore = ' \t'
    # A regular expression rule with some action code
    def t_NUMBER(self, t):
        r'\d+'
        t.value = int(t.value)
        return t
    # Define a rule so we can track line numbers
    def t_newline(self, t):
        r'\n+'
        t.lexer.lineno += len(t.value)
    # Error handling rule
    def t_error(self, t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)
    # Build the lexer
    def build(self, **kwargs):
        self.lexer = lex.lex(module=self, **kwargs)
    # Tokenize the input and print each token
    def test(self, data):
        self.lexer.input(data)
        for token in self.lexer:
            print(token)

# Build the lexer and try it out
m = MyLexer()
m.build() # Build the lexer
m.test("3 + 4")
```

When run from a Jupyter notebook, building the lexer failed with `AttributeError: module '__main__' has no attribute '__file__'`, so run the example as a script.

Provide input with `lexer.input(data)`, where `data` is a string, and fetch tokens one at a time with `lexer.token()`. The lexer is also iterable:

```Python
for i in lexer:
    print(i)
```

### Section 190.4: Part 2: Parsing Tokenized Input with Yacc

```Python
# Yacc example
import ply.yacc as yacc
# Get the token map from the lexer. This is required.
# (Assumes the lexer from Part 1 is saved as calclex.py.)
from calclex import tokens
def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]
def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]
def p_expression_term(p):
    'expression : term'
    p[0] = p[1]
def p_term_times(p):
    'term : term TIMES factor'
    p[0] = p[1] * p[3]
def p_term_div(p):
    'term : term DIVIDE factor'
    p[0] = p[1] / p[3]
def p_term_factor(p):
    'term : factor'
    p[0] = p[1]
def p_factor_num(p):
    'factor : NUMBER'
    p[0] = p[1]
def p_factor_expr(p):
    'factor : LPAREN expression RPAREN'
    p[0] = p[2]
# Error rule for syntax errors
def p_error(p):
    print("Syntax error in input!")
# Build the parser
parser = yacc.yacc()
while True:
    try:
        s = input('calc > ') # use raw_input on Python 2
    except EOFError:
        break
    if not s: continue
    result = parser.parse(s)
    print(result)
```

Importing `calclex` executes that module's tokenizing loop, so a run of this script first printed the lexer's tokens: `LexToken(NUMBER,10,1,0)`, `LexToken(PLUS,'+',1,3)` and `LexToken(NUMBER,5,1,5)`. In a Jupyter notebook, `yacc.yacc()` then failed with `KeyError: '__file__'`, so run this example as a script.

**Breakdown**

+ Each grammar rule is defined by a function whose docstring contains the appropriate context-free grammar specification. The statements that make up the function body implement the semantic actions of the rule. Each function accepts a single argument `p`, a sequence containing the values of each grammar symbol in the corresponding rule. The values of `p[i]` are mapped to grammar symbols as shown here:
```Python
def p_expression_plus(p):
    'expression : expression PLUS term'
    #    ^            ^       ^    ^
    #   p[0]         p[1]    p[2] p[3]
    p[0] = p[1] + p[3]
```

+ For tokens, the "value" of the corresponding `p[i]` is the same as the `t.value` attribute assigned in the lexer module. So, `PLUS` will have the value `+`.

+ For non-terminals, the value is determined by whatever is placed in `p[0]`. If nothing is placed, the value is `None`. Also, `p[-1]` is not the same as `p[3]`, since `p` is not a simple list (negative indices such as `p[-1]` are used with embedded actions, not discussed here).

+ The `p_error(p)` rule is defined to catch syntax errors (same as `yyerror` in yacc/bison).

+ Multiple grammar rules can be combined into a single function, which is a good idea if productions have a
similar structure.

```Python
def p_binary_operators(p):
    '''expression : expression PLUS term
    | expression MINUS term
    term : term TIMES factor
    | term DIVIDE factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]
```
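
To see the indexing at work without building a parser, the combined function can be called by hand. This is only an illustration: a plain list stands in for the object PLY would pass (which, as noted above, is not a simple list), and the variable names `rules` and `p` are chosen here.

```python
def p_binary_operators(p):
    '''expression : expression PLUS term
                  | expression MINUS term
       term       : term TIMES factor
                  | term DIVIDE factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]

# The grammar is ordinary docstring text that a parser generator can read.
rules = [line.strip() for line in p_binary_operators.__doc__.splitlines()]
print(rules[0])  # expression : expression PLUS term

# Slot 0 receives the result; slots 1..3 hold the values of the right-hand side.
p = [None, 6, '*', 7]
p_binary_operators(p)
print(p[0])  # 42
```

Dispatching on `p[2]` works here because every alternative has the operator in the same position, which is what makes these productions good candidates for sharing one function.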

+ Character literals can be used instead of tokens.

```Python
def p_binary_operators(p):
    '''expression : expression '+' term
    | expression '-' term
    term : term '*' factor
    | term '/' factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]
```
Of course, the literals must be specified in the lexer module.

+ Empty productions have the form `'''symbol : '''`.
+ To explicitly set the start symbol, use `start = 'foo'`, where `foo` is some non-terminal.
+ Setting precedence and associativity can be done using the precedence variable.

```Python
precedence = (
    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'), # Nonassociative operators
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'), # Unary minus operator
)
```
Tokens are ordered from lowest to highest precedence. `nonassoc` means that those tokens do not associate, so something like `a < b < c` is illegal whereas `a < b` is still legal.
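
The effect of such a table can be tried out with a small hand-written precedence-climbing routine. This is a different technique from the LALR tables PLY generates, used here only to show what `left`, `right` and `nonassoc` mean. The table below uses operator characters instead of token names, and the right-associative `^` operator and the function name `group` are assumptions made for the illustration.

```python
# Same shape as a PLY precedence table: lowest precedence first.
precedence = (
    ("nonassoc", "<", ">"),
    ("left", "+", "-"),
    ("left", "*", "/"),
    ("right", "^"),   # hypothetical right-associative operator
)

# Map each operator to (level, associativity); higher level binds tighter.
LEVELS = {op: (level, assoc)
          for level, (assoc, *ops) in enumerate(precedence, start=1)
          for op in ops}

def group(tokens):
    """Return a fully parenthesized string showing how tokens associate."""
    tokens = list(tokens)
    pos = 0

    def parse(min_level):
        nonlocal pos
        left = tokens[pos]  # operands are single tokens in this sketch
        pos += 1
        while pos < len(tokens):
            op = tokens[pos]
            level, assoc = LEVELS[op]
            if level < min_level:
                break
            pos += 1
            # left/nonassoc: the right side may only contain tighter operators.
            # right: the right side may contain the same level again.
            right = parse(level if assoc == "right" else level + 1)
            left = "(%s %s %s)" % (left, op, right)
            same_level_next = (pos < len(tokens)
                               and LEVELS[tokens[pos]][0] == level)
            if assoc == "nonassoc" and same_level_next:
                raise SyntaxError("%r does not associate" % op)
        return left

    return parse(1)

print(group("1 - 2 - 3".split()))  # ((1 - 2) - 3)
print(group("2 ^ 3 ^ 4".split()))  # (2 ^ (3 ^ 4))
print(group("1 + 2 * 3".split()))  # (1 + (2 * 3))
```

Calling `group("a < b < c".split())` raises `SyntaxError`, while `group("a < b".split())` returns `(a < b)`, mirroring the `nonassoc` behaviour described above.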

+ `parser.out` is a debugging file that is created when the yacc program is executed for the first time. Whenever a shift/reduce conflict occurs that the precedence rules do not settle, the parser shifts by default.