# Syntax

## Introduction

**Syntax:** the form or structure of the expressions, statements, and
program units

**Semantics:** the meaning of the expressions, statements, and program
units

Syntax and semantics provide a language's definition

Users of a language definition

-   Other language designers
-   Implementers
-   Programmers (the users of the language)

## Terminology

-   A *sentence* is a string of characters over some alphabet
-   A *language* is a set of sentences
-   A *lexeme* is the lowest level syntactic unit of a language (e.g., \*, sum, begin)
-   A *token* is a category of lexemes (e.g., identifier)
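
The lexeme/token distinction can be sketched in F#. This is a minimal illustration, not part of the original notes: the `Token` type, the `classify` function, and the three categories are hypothetical names chosen for the example, and lexemes are assumed to be separated by single spaces.

```fsharp
open System.Text.RegularExpressions

// Hypothetical token categories for a tiny language (an assumption for this sketch).
type Token =
    | Identifier of string
    | IntLiteral of string
    | Operator of string

// Map one lexeme (the concrete text) to its token (the category).
let classify (lexeme: string) : Token =
    if Regex.IsMatch(lexeme, @"^[A-Za-z_]\w*$") then Identifier lexeme
    elif Regex.IsMatch(lexeme, @"^\d+$") then IntLiteral lexeme
    else Operator lexeme

// The lexemes sum, =, sum, *, 42 fall into three token categories.
let tokens = "sum = sum * 42".Split(' ') |> Array.map classify |> Array.toList
```

Both occurrences of `sum` are distinct lexemes in the input but share the single token category `Identifier`.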

## Formal Definition of Languages

**Recognizers**

-   A recognition device reads input strings over the alphabet of the language and decides whether the input strings belong to the language
-   Example: syntax analysis part of a compiler

**Generators**

-   A device that generates sentences of a language
-   One can determine whether a particular sentence is syntactically correct by comparing it to the structure of the generator
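
The two views can be contrasted on a toy language. The sketch below assumes the language consists of n copies of `a` followed by n copies of `b` (n ≥ 1); the language and the names `sentences` and `recognize` are illustrative choices, not taken from the notes.

```fsharp
// Generator: lazily produces the sentences ab, aabb, aaabbb, ...
let sentences : seq<string> =
    Seq.initInfinite (fun i -> String.replicate (i + 1) "a" + String.replicate (i + 1) "b")

// Recognizer: decides whether a given string belongs to the same language.
let recognize (s: string) : bool =
    let n = s.Length / 2
    s.Length > 0 && s = String.replicate n "a" + String.replicate n "b"

let firstThree = sentences |> Seq.take 3 |> Seq.toList
let accepted = recognize "aabb"
let rejected = recognize "abab"
```

The generator enumerates members of the language, while the recognizer answers a yes/no membership question for one input string.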

## Specifying Syntax: Regular Expressions

Formal specification of syntax requires a set of rules.

Tokens can be constructed from individual characters using just three kinds of formal rules:
- concatenation
- alternation (choice among a finite set of alternatives)
- Kleene closure
    - repetition an arbitrary number of times

**Regular Expressions (RE)**

- Any set of strings that can be defined in terms of the above rules is called a **regular set**.
- Regular sets are generated by **regular expressions** and recognized by **scanners**.
- Regular expressions cannot specify nested constructs (e.g., arbitrarily deep balanced parentheses).
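
Each of the three rules can be tried out with the .NET `Regex` class. In this sketch, `matchesAll` is a hypothetical helper that anchors the pattern so the whole string must match; the sample patterns are chosen only for illustration.

```fsharp
open System.Text.RegularExpressions

// True when the entire string s is matched by the pattern.
let matchesAll (pattern: string) (s: string) : bool =
    Regex.IsMatch(s, "^(" + pattern + ")$")

let concatOk  = matchesAll "ab" "ab"          // concatenation: a followed by b
let altOk     = matchesAll "a|b" "b"          // alternation: a or b
let closureOk = matchesAll "(ab)*" "ababab"   // Kleene closure: repeated ab
let emptyOk   = matchesAll "(ab)*" ""         // zero repetitions also count
let nestedBad = matchesAll "(ab)*" "aabb"     // nesting a inside b is not a repetition of ab
```

The last case hints at the limitation noted above: repetition alone cannot describe strings whose parts must nest.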

## Regular Expressions in F#

    open System.Text.RegularExpressions // Load the regular expression namespace (library)

Alternation: find every `the` or `fox` and report its 1-based position. Matching is case-sensitive, so the leading `The` is skipped.

    let rx = Regex @"the|fox"
    let text = "The the quick brown fox fox jumps over the lazy dog dog."
    seq{ for m in rx.Matches(text) -> (m.Value, m.Index+1)}

Output:

    seq [("the", 5); ("fox", 21); ("fox", 25); ("the", 40)]

Repetition: `(ab.)+` matches one or more repetitions of `ab` followed by any character.

    let rx = Regex @"(ab.)+"
    let text = "bAabCabDa"
    seq{ for m in rx.Matches(text) -> (m.Value, m.Index+1)}

Output:

    seq [("abCabD", 3)]

## Regex Cheat Sheet

Elements
- . - Any character except a newline
- \d - a digit
- \w - a word character (letter, digit, or underscore)
- \s - a whitespace character
- () - Captures the matched subexpression (group)
- | - Matches any one element separated by the vertical bar *(|)* character.
- \* - Matches the previous element zero or more times
- \+ - Matches the previous element one or more times
- ? - Matches the previous element zero or one times

[Full PDF](https://download.microsoft.com/download/D/2/4/D240EBF6-A9BA-4E4F-A63F-AEB6DA0B921C/Regular%20expressions%20quick%20reference.pdf)
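
Several of these elements can be combined in one pattern. The following sketch assumes a simple `name = integer` input format, which is invented for the example; the two groups capture the name and the signed number.

```fsharp
open System.Text.RegularExpressions

// (\w+)   one or more word characters, captured as group 1
// \s*     optional whitespace around the equals sign
// (-?\d+) an optional minus sign and one or more digits, captured as group 2
let rx = Regex @"(\w+)\s*=\s*(-?\d+)"

let m = rx.Match("count = -42")
let name = m.Groups.[1].Value
let value = int m.Groups.[2].Value
```

Group 0 always holds the entire match, so the parenthesized subexpressions are numbered from 1.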

## Specifying Syntax: Context-Free Grammars

Any set of strings that can be defined using the above formal rules plus **recursion** is called a **context-free language (CFL)**.
- Context-free languages are generated by **context-free grammars (CFGs)** and recognized by **parsers**.


**Context-Free Grammars**

-   Developed by Noam Chomsky in the mid-1950s
-   Language generators, meant to describe the syntax of natural languages
-   Define a class of languages called context-free languages

**Backus-Naur Form (BNF)**

-   Invented by John Backus to describe the syntax of Algol 58 (1959)
-   BNF is equivalent to context-free grammars

## BNF Fundamentals

Abstractions are used to represent classes of syntactic structures
- they act like syntactic variables
- also called **nonterminal symbols,** or just **nonterminals**

Example:
- *expr* ⟶ id | number | - *expr* | ( *expr* ) | *expr* *op* *expr*
- *op* ⟶ + | - | * | /

**Terminals** are lexemes or tokens
- A **production rule** has a *left-hand side (LHS)*, which is a **nonterminal**, and a *right-hand side (RHS)*, which is a string of **terminals** and/or **nonterminals**

## BNF Fundamentals (cont.)

Nonterminals are often enclosed in angle brackets or capitalized

Examples of BNF rules:
- <ident\_list\> ⟶ identifier \| identifier, <ident\_list\>
- <if\_stmt\> ⟶ **if** <logic\_expr\> **then** <stmt\>

**Grammar:** a finite non-empty set of rules

A *start symbol* is a special element of the nonterminals of a grammar

## BNF Rules

An abstraction (or nonterminal symbol) can have more than one RHS
-   <stmt\> ⟶ <single\_stmt\> \| begin <stmt\_list\> end

## Describing Lists

Syntactic lists are described using recursion
- <ident\_list\> ⟶ ident \| ident, <ident\_list\>
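
A recursive rule maps naturally onto a recursive function. The sketch below recognizes `<ident_list>` over a list of string tokens; `isIdent` is a simplifying assumption (a token counts as an identifier if it starts with a letter), and the token representation is hypothetical.

```fsharp
// Simplified identifier test (assumption): a non-empty token starting with a letter.
let isIdent (t: string) : bool =
    t.Length > 0 && System.Char.IsLetter(t.[0])

// <ident_list> -> ident | ident , <ident_list>
let rec isIdentList (tokens: string list) : bool =
    match tokens with
    | [ t ] -> isIdent t                                   // ident
    | t :: "," :: rest -> isIdent t && isIdentList rest   // ident , <ident_list>
    | _ -> false

let good = isIdentList [ "a"; ","; "b"; ","; "c" ]
let trailingComma = isIdentList [ "a"; "," ]
```

The second alternative of the rule calls the function on the remainder of the input, mirroring the recursion in the grammar.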

A **derivation** is a repeated application of rules, starting with the **start symbol** and ending with a **sentence** (all terminal symbols)

## Grammar Example

- <program\> ⟶ <stmts\>
- <stmts\> ⟶ <stmt\> \| <stmt\> ; <stmts\>
- <stmt\> ⟶ <var\> = <expr\>
- <var\> ⟶ a \| b \| c \| d
- <expr\> ⟶ <term\> + <term\> \| <term\> - <term\>
- <term\> ⟶ <var\> \| const

## Derivation


- <program\> =\> <stmts\> =\> <stmt\>
-   =\> <var\> = <expr\>
-   =\> a = <expr\>
-   =\> a = <term\> + <term\>
-   =\> a = <var\> + <term\>
-   =\> a = b + <term\>
-   =\> a = b + const
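
The derivation above can be replayed mechanically. In this sketch, the `Symbol` type and the helper names are hypothetical; the code replaces the leftmost nonterminal at each step with a right-hand side supplied by hand, and it does not check that each right-hand side is a legal production of the grammar.

```fsharp
// Nonterminals carry their name, terminals their lexeme.
type Symbol =
    | N of string
    | T of string

// Replace the leftmost nonterminal in a sentential form with the given right-hand side.
let rec expandLeftmost (rhs: Symbol list) (form: Symbol list) : Symbol list =
    match form with
    | [] -> []
    | N _ :: rest -> rhs @ rest
    | t :: rest -> t :: expandLeftmost rhs rest

// Render a sentential form as text.
let show (form: Symbol list) : string =
    form
    |> List.map (function N n -> "<" + n + ">" | T t -> t)
    |> String.concat " "

// Right-hand sides chosen at each step of the derivation of: a = b + const
let steps =
    [ [ N "stmts" ]
      [ N "stmt" ]
      [ N "var"; T "="; N "expr" ]
      [ T "a" ]
      [ N "term"; T "+"; N "term" ]
      [ N "var" ]
      [ T "b" ]
      [ T "const" ] ]

// All sentential forms, starting from the start symbol <program>.
let forms = List.scan (fun form rhs -> expandLeftmost rhs form) [ N "program" ] steps
forms |> List.iter (show >> printfn "%s")
```

Because every step expands the leftmost nonterminal, the sequence of forms is a leftmost derivation.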

## Derivations

- Every string of symbols in a derivation is a **sentential form**
- A **sentence** is a sentential form that has only terminal symbols
- A **leftmost derivation** is one in which the leftmost nonterminal in each sentential form is the one that is expanded
- A **rightmost derivation** expands the rightmost nonterminal at each step; a derivation may be neither leftmost nor rightmost
- The metasymbol "=>*" means "derives after zero or more replacements"
    - <program\> =\>* a = b + const

## Parse Tree

A **parse tree** is a hierarchical representation of a **derivation**.

![parse-tree](img/parse-tree.png)

## Ambiguity in Grammars

A grammar is *ambiguous* if and only if it generates a sentential form that has two or more distinct parse trees
- ambiguity is a problem for parsers

Example:
- <expr\> ⟶ <expr\> <op\> <expr\> \| const
- <op\> ⟶ / \| -    

<img src="img/ambiguous-grammar.png" style="height:400px"/>
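
Why ambiguity matters becomes visible once the parse tree is used to compute a value. The sketch below builds two parse trees for the same sentence by hand and evaluates them; the `Expr` type and the sample constants 8, 4, and 2 are assumptions made for the illustration.

```fsharp
// Parse trees for the ambiguous grammar <expr> -> <expr> <op> <expr> | const.
type Expr =
    | Const of float
    | Bin of Expr * char * Expr

let rec eval (e: Expr) : float =
    match e with
    | Const c -> c
    | Bin (l, '-', r) -> eval l - eval r
    | Bin (l, '/', r) -> eval l / eval r
    | Bin (_, op, _) -> failwithf "unknown operator %c" op

// Two distinct parse trees for the same sentence: 8 - 4 / 2
let divisionLower = Bin (Const 8.0, '-', Bin (Const 4.0, '/', Const 2.0))     // 8 - (4 / 2)
let subtractionLower = Bin (Bin (Const 8.0, '-', Const 4.0), '/', Const 2.0)  // (8 - 4) / 2

printfn "%g %g" (eval divisionLower) (eval subtractionLower)
```

The two trees evaluate to 6 and 2 respectively, so a parser for the ambiguous grammar would have no basis for choosing the intended meaning.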

## Unambiguous Expression Grammar


If we use the parse tree to indicate **precedence levels** of the operators, we cannot have ambiguity.
- Precedence tells us that some operations group more tightly than others

Example:
- <expr\> ⟶ <expr\> - <term\> \| <term\>
- <term\> ⟶ <term\> / const \| const

![Unambiguous](img/Unambiguous.png)

## Associativity of Operators

Operator associativity can also be **indicated** by a grammar
- Associativity determines how operators of equal precedence group; in most languages, most binary operators group left to right

Example:
- <expr\> ⟶ <expr\> + <expr\> \| const (ambiguous)
- <expr\> ⟶ <expr\> + const \| const (unambiguous)

![Associativity](img/Associativity.png)
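
With addition the grouping does not change the result, so the effect is easier to see with subtraction. This sketch uses folds to mimic the grouping produced by a left-recursive and a right-recursive rule; the operand values are arbitrary.

```fsharp
let operands = [ 10; 4; 3 ]

// A left-recursive rule such as <expr> -> <expr> - const groups to the left: (10 - 4) - 3
let leftAssoc = List.reduce (fun acc x -> acc - x) operands

// A right-recursive rule such as <expr> -> const - <expr> groups to the right: 10 - (4 - 3)
let rightAssoc = List.reduceBack (fun x acc -> x - acc) operands

printfn "%d %d" leftAssoc rightAssoc
```

The left grouping yields 3 and the right grouping yields 9, which is why the direction of recursion in the grammar rule matters.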

## Grammar Ambiguity Example: Dangling ELSE

C/C++ syntax specifies a conditional statement as follows:

- **if** <logic_expr\> <stmt\> [**else** <stmt\>]
    
Which can be directly translated to the following rule:

- <cond_stmt\> ⟶ **if** <logic_expr\> <stmt\> | **if** <logic_expr\> <stmt\> **else** <stmt\>

If we also add the rule <stmt\> ⟶ <cond_stmt\> to the grammar, it becomes ambiguous.

A sentential form that exhibits this ambiguity:

-  **if** <logic_expr\> **if** <logic_expr\> <stmt\> **else** <stmt\>

## Grammar Ambiguity Example: Match Dangling ELSE

Statements *must* be divided into those that are **matched** and those that are **unmatched**.
- Unmatched statements are **else**-less **if**s; all other statements are matched.

Unambiguous grammar:

- <cond_stmt\> ⟶ <matched\> | <unmatched\>
- <matched\> ⟶ **if** <logic_expr\> <matched\> **else** <matched\> | <any_non-if_statement\>
- <unmatched\> ⟶ **if** <logic_expr\> <stmt\> | **if** <logic_expr\> <matched\> **else** <unmatched\>
-  <stmt\> ⟶ <cond_stmt\>
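
The effect of the matched/unmatched split can be checked on the two candidate trees for `if c1 if c2 s1 else s2`. In this sketch the `Stmt` type and the functions `isMatched` and `derivable` are hypothetical; `derivable` encodes the constraint that a then-branch followed by an **else** must be a matched statement.

```fsharp
// Parse trees for conditional statements; conditions and plain statements are just names.
type Stmt =
    | Other of string                    // any non-if statement
    | If of string * Stmt * Stmt option  // condition, then-branch, optional else-branch

// <matched>: a non-if statement, or an if whose two branches are both matched.
let rec isMatched (s: Stmt) : bool =
    match s with
    | Other _ -> true
    | If (_, t, Some e) -> isMatched t && isMatched e
    | If (_, _, None) -> false

// True when the tree can be derived from the unambiguous grammar.
let rec derivable (s: Stmt) : bool =
    match s with
    | Other _ -> true
    | If (_, t, None) -> derivable t
    | If (_, t, Some e) -> isMatched t && derivable e

// else attached to the nearest if
let nearest = If ("c1", If ("c2", Other "s1", Some (Other "s2")), None)
// else attached to the outer if
let outer = If ("c1", If ("c2", Other "s1", None), Some (Other "s2"))

printfn "%b %b" (derivable nearest) (derivable outer)
```

Only the tree that attaches the **else** to the nearest **if** satisfies the grammar, which is how the ambiguity is removed.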

## Extended BNF

Optional parts are placed in brackets \[ \]
-  <proc\_call\> ⟶ ident \[(<expr\_list\>)\]

Alternative parts of RHSs are placed inside parentheses and separated via vertical bars
- <term\> ⟶ <term\> (+\|-) const

Repetitions (0 or more) are placed inside braces { }
- <ident\> ⟶ letter {letter\|digit}

## BNF and EBNF

BNF

    <expr> ⟶ <expr> + <term>
           | <expr> - <term>
           | <term>

    <term> ⟶ <term> * <factor>
           | <term> / <factor>
           | <factor>
           
EBNF

    <expr> ⟶ <term> {(+ | -) <term>}
    <term> ⟶ <factor> {(* | /) <factor>}
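
The EBNF form translates almost directly into a recursive-descent evaluator, with each `{ ... }` repetition becoming a loop. This is a minimal sketch under stated assumptions: tokens are pre-split strings, `<factor>` is taken to be an integer literal (the rules above do not define it), and error handling is minimal.

```fsharp
// <expr> -> <term> {(+ | -) <term>}
let rec parseExpr (tokens: string list) : int * string list =
    let first, rest = parseTerm tokens
    let rec loop acc ts =
        match ts with
        | "+" :: more ->
            let v, rest' = parseTerm more
            loop (acc + v) rest'
        | "-" :: more ->
            let v, rest' = parseTerm more
            loop (acc - v) rest'
        | _ -> acc, ts
    loop first rest

// <term> -> <factor> {(* | /) <factor>}
and parseTerm (tokens: string list) : int * string list =
    let first, rest = parseFactor tokens
    let rec loop acc ts =
        match ts with
        | "*" :: more ->
            let v, rest' = parseFactor more
            loop (acc * v) rest'
        | "/" :: more ->
            let v, rest' = parseFactor more
            loop (acc / v) rest'
        | _ -> acc, ts
    loop first rest

// <factor> is assumed to be an integer literal in this sketch.
and parseFactor (tokens: string list) : int * string list =
    match tokens with
    | t :: rest -> int t, rest
    | [] -> failwith "expected a factor"

// 8 - 4 / 2 - 1 is grouped as (8 - (4 / 2)) - 1
let value, leftover = parseExpr [ "8"; "-"; "4"; "/"; "2"; "-"; "1" ]
```

Because `parseExpr` delegates to `parseTerm`, multiplication and division bind more tightly than addition and subtraction, and each loop accumulates its result left to right, giving left associativity.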

## Static Semantics

- Indirectly related to the meaning of programs during execution
    - it has to do with the legal forms of programs
    - e.g. a syntax rule that is difficult to specify
        - In Java, a floating-point value cannot be assigned to an integer type variable, although the opposite is legal.
- Named that way because the analysis required to check these specifications can be done at compile time.
-   Context-free grammars (CFGs) cannot describe all of the syntax of programming languages
-   Categories of constructs that are trouble:
    - Context-free, but cumbersome (e.g., types of operands in expressions)
    - Non-context-free (e.g., variables must be declared before they are used)

## Attribute Grammars

**Attribute grammars** (AGs) have additions to CFGs to carry some semantic info on parse tree nodes

Primary value of AGs:

- Static semantics specification
- Compiler design (static semantics checking)

## Attribute Grammars : Definition

**Definition:** An attribute grammar is a context-free grammar **G** with the following additions:

- For each grammar symbol **x** there is a set **A(x)** of attribute values
- Each rule has a set of functions that define certain attributes of the nonterminals in the rule
- Each rule has a (possibly empty) set of predicates to check for attribute consistency

## Attribute Grammars: Definition (cont.)

- Let *X0 ⟶ X1 ... Xn* be a rule
- Functions of the form *S(X0) = f(A(X1), ... ,* *A(Xn))* define **synthesized attributes**
    - pass semantic information up a parse tree
- Functions of the form *I(Xj) = f(A(X0), ... , A(Xn)), for 1 <= j <= n*, define **inherited attributes**
    - pass semantic information down and across a tree
- Initially, there are *intrinsic attributes* on the leaves

## Attribute Grammars: An Example

Syntax

- <assign\> ⟶ <var\> = <expr\>
- <expr\> ⟶ <var\> + <var\> \| <var\>
- <var\> ⟶ A \| B \| C

Attributes:
- actual\_type: synthesized for <var\> and <expr\>
- expected\_type: inherited for <expr\>

## Attribute Grammars: An Example (cont.)

Production 1:
- Syntax rule: <assign\> ⟶ <var\> = <expr\>
- Semantic rule: <expr\>.expected_type  ⟵ <var\>.actual_type

Production 2:
- Syntax rule: <expr\> ⟶ <var\>\[1\] + <var\>\[2\]
- Semantic rule: <expr\>.actual\_type ⟵ <var\>\[1\].actual\_type
- Predicate:
    - <var\>\[1\].actual\_type == <var\>\[2\].actual\_type
    - <expr\>.expected\_type == <expr\>.actual\_type
    
Production 3:    
- Syntax rule: <var\> ⟶ id
- Semantic rule: <var\>.actual\_type ⟵ lookup (<var\>.string)

## Attribute Grammars (cont.)

How are attribute values computed?

- If all attributes were inherited, the tree could be decorated in top-down order.
- If all attributes were synthesized, the tree could be decorated in bottom-up order.
- In many cases, both kinds of attributes are used, and it is some combination of top-down and bottom-up that must be used.

## Attribute Grammars (cont.)

- <expr\>.expected\_type ⟵ inherited from parent
- <var\>\[1\].actual\_type ⟵ lookup (A)
- <var\>\[2\].actual\_type ⟵ lookup (B)
- <var\>\[1\].actual\_type =? <var\>\[2\].actual\_type
- <expr\>.actual\_type ⟵ <var\>\[1\].actual\_type
- <expr\>.actual\_type =? <expr\>.expected\_type
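
The evaluation order above can be sketched as ordinary functions. This is an illustration under assumptions, not a general attribute evaluator: the `Ty` type, the contents of the `lookup` table (A and B are integers, C is real), and the function names are all hypothetical.

```fsharp
type Ty =
    | IntType
    | RealType

// Hypothetical symbol table standing in for lookup(<var>.string).
let lookup (name: string) : Ty =
    match name with
    | "A" | "B" -> IntType
    | "C" -> RealType
    | _ -> failwithf "undeclared variable %s" name

// <expr> -> <var>[1] + <var>[2]
// expectedType is the inherited attribute; the result carries the synthesized actual_type.
let checkExpr (expectedType: Ty) (v1: string) (v2: string) : Result<Ty, string> =
    let t1 = lookup v1                      // <var>[1].actual_type
    let t2 = lookup v2                      // <var>[2].actual_type
    if t1 <> t2 then Error "operand types differ"
    else
        let actualType = t1                 // <expr>.actual_type
        if actualType <> expectedType then Error "expression type differs from expected type"
        else Ok actualType

// <assign> -> <var> = <expr>: the target's actual_type becomes <expr>.expected_type.
let checkAssign (target: string) (v1: string) (v2: string) : Result<Ty, string> =
    checkExpr (lookup target) v1 v2

let ok = checkAssign "A" "A" "B"
let bad = checkAssign "A" "A" "C"
```

The inherited attribute flows down as a function argument and the synthesized attribute flows up as the return value, while the two predicates become the `if` checks.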