# First And Follow
If you plan to implement your own parser for a [context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar), constructing the FIRST and FOLLOW sets will be the first algorithm you have to spend your time on.

Since the formal definition of languages and grammars is much more involved than we need at the moment, I’ll stick to an intuitive explanation.

$$S \rightarrow A$$
$$A \rightarrow A1$$
$$A \rightarrow 1$$

A context-free grammar is a set of production rules built from terminals and non-terminals. A terminal is a “physical” character that may occur in the parsed text. A non-terminal is a “meta” character that never occurs in the text itself; the left side of every production rule is a single non-terminal.

In the grammar above, $S$ and $A$ are non-terminals, while $1$ is a terminal.

$$ S \rightarrow A \rightarrow A1 \rightarrow A11 \rightarrow 111$$

We start with the start symbol $S$ and repeatedly rewrite non-terminals using the production rules. Once the sequence contains only terminals, we have a word that is accepted by the grammar.
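The rewriting process can be sketched in a few lines of Python. This is only an illustration: the `RULES` list and the helpers `rewrite` and `derive` are names made up for this example, and each symbol is assumed to be a single character.

```python
# Hypothetical encoding of the grammar above: (non-terminal, replacement) pairs.
RULES = [("S", "A"), ("A", "A1"), ("A", "1")]


def rewrite(form, rule):
    """Replace the leftmost occurrence of the rule's non-terminal in form."""
    nonterminal, replacement = rule
    if nonterminal not in form:
        raise ValueError(f"{nonterminal!r} does not occur in {form!r}")
    return form.replace(nonterminal, replacement, 1)


def derive(start, rule_indices):
    """Apply the rules with the given indices in order; return every step."""
    steps = [start]
    for index in rule_indices:
        steps.append(rewrite(steps[-1], RULES[index]))
    return steps


steps = derive("S", [0, 1, 1, 2])
print(" -> ".join(steps))  # S -> A -> A1 -> A11 -> 111
```

The list `[0, 1, 1, 2]` says which rule to apply at each step; a different choice of rules yields a different word.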

$$S \rightarrow A$$
$$A \rightarrow 1A$$
$$A \rightarrow 1$$

Notice that this grammar is different from the first one, yet it produces the same language.

$$ S \rightarrow A \rightarrow 1A \rightarrow 11A \rightarrow 111$$

Due to the second rule in each grammar, the first one is called left-recursive and the second one is called right-recursive.
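The claim that both grammars produce the same language can be checked by brute force for short words. The sketch below is an assumption-laden illustration, not part of the algorithm: `words` is a hypothetical helper, symbols are single characters, and no rule may have an empty right side (otherwise the length-based cut-off would be wrong).

```python
def words(rules, start="S", max_len=5):
    """Return every terminal word of length <= max_len derivable from start.

    Assumes no rule has an empty right-hand side, so a sentential form never
    shrinks and anything longer than max_len can be discarded.
    """
    nonterminals = {lhs for lhs, _ in rules}
    seen, frontier, result = {start}, [start], set()
    while frontier:
        form = frontier.pop()
        # position of the leftmost non-terminal, if there is one
        position = next((i for i, s in enumerate(form) if s in nonterminals), None)
        if position is None:
            result.add(form)  # only terminals left: this is a word
            continue
        for lhs, rhs in rules:
            if lhs != form[position]:
                continue
            new_form = form[:position] + rhs + form[position + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                frontier.append(new_form)
    return result


left_recursive = [("S", "A"), ("A", "A1"), ("A", "1")]
right_recursive = [("S", "A"), ("A", "1A"), ("A", "1")]

print(sorted(words(left_recursive), key=len))
print(words(left_recursive) == words(right_recursive))  # True
```

Agreement up to a fixed length is of course not a proof, but it is a cheap sanity check when experimenting with grammars.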

$$S \rightarrow A$$
$$A \rightarrow 1A$$
$$A \rightarrow A1$$
$$A \rightarrow 1$$

The third grammar also describes the same language. It is both left-recursive and right-recursive (which is not a good thing). What’s worse, it is ambiguous, since it can produce the same word through different derivations.

$$S \rightarrow A \rightarrow A1 \rightarrow A11 \rightarrow 111$$
$$S \rightarrow A \rightarrow 1A \rightarrow 11A \rightarrow 111$$
$$S \rightarrow A \rightarrow A1 \rightarrow 1A1 \rightarrow 111$$

Ambiguity is very bad, and different types of parsers deal with this kind of problem in different ways.
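One way to make ambiguity measurable is to count leftmost derivations: a word with more than one of them has more than one parse tree. The following sketch does this by exhaustive search. `count_leftmost_derivations` is a hypothetical helper written for this illustration; it assumes single-character symbols, no empty right sides and no cycles of unit rules, so that the search terminates.

```python
def count_leftmost_derivations(rules, word, start="S"):
    """Count the leftmost derivations of word (one per parse tree)."""
    nonterminals = {lhs for lhs, _ in rules}

    def count(form):
        # position of the leftmost non-terminal, if there is one
        position = next((i for i, s in enumerate(form) if s in nonterminals), None)
        if position is None:
            return 1 if form == word else 0
        # prune: forms never shrink, and the terminal prefix is already fixed
        if len(form) > len(word) or not word.startswith(form[:position]):
            return 0
        return sum(
            count(form[:position] + rhs + form[position + 1:])
            for lhs, rhs in rules
            if lhs == form[position]
        )

    return count(start)


left_recursive = [("S", "A"), ("A", "A1"), ("A", "1")]
ambiguous = [("S", "A"), ("A", "1A"), ("A", "A1"), ("A", "1")]

print(count_leftmost_derivations(left_recursive, "111"))  # 1
print(count_leftmost_derivations(ambiguous, "111"))       # 4
```

For the third grammar the count doubles with every additional $1$, because each step can choose freely between $A \rightarrow 1A$ and $A \rightarrow A1$.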

$$ S \rightarrow A $$
$$ A \rightarrow (A) $$
$$ A \rightarrow () $$
$$ S \rightarrow A \rightarrow (A) \rightarrow ((A)) \rightarrow (((...)))$$

This last grammar generates every finite word of properly nested parentheses, such as $()$, $(())$ and $((()))$.

If you know regular expressions, you are probably aware that it is not possible to write a regular expression that matches parentheses nested to an arbitrary depth. This is what makes context-free grammars so important for compilers: they are the simplest class of grammars that is strong enough to describe the syntax of a programming language.
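To make the contrast concrete, here is a minimal recursive-descent recognizer for the parentheses grammar. It is a sketch written for this illustration (the name `is_nested` is made up); the recursion in `parse_a` mirrors the recursion in the rule $A \rightarrow (A)$, which is exactly the part a regular expression cannot express.

```python
def is_nested(word):
    """Recognise the language of A -> (A) | (), e.g. '()', '(())', '((()))'."""

    def parse_a(position):
        # Try to read one A starting at position; return the position just
        # after it, or None if no A starts here.
        if position < len(word) and word[position] == "(":
            position += 1
            if position < len(word) and word[position] == "(":
                position = parse_a(position)  # rule A -> ( A )
                if position is None:
                    return None
            if position < len(word) and word[position] == ")":
                return position + 1
        return None

    # the whole word must be consumed by a single A
    return parse_a(0) == len(word)


print(is_nested("((()))"))  # True
print(is_nested("(()"))     # False
print(is_nested("()()"))    # False
```

Note that `()()` is rejected: it is balanced, but this particular grammar has no rule that places two groups side by side.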

Now, back to the algorithm.

The FIRST set enumerates possible terminals that a non-terminal may begin with.

The FOLLOW set enumerates possible terminals that a non-terminal may be followed by.
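
As a small worked illustration, consider the following grammar. It is a hypothetical example chosen for this explanation, not one of the grammars used elsewhere in the article, and $\varepsilon$ denotes the empty word.

$$S \rightarrow Ab$$
$$A \rightarrow aA$$
$$A \rightarrow \varepsilon$$

Every non-empty word derived from $A$ starts with $a$. A word derived from $S$ starts with $a$ as well, or with $b$ when $A$ disappears through the $\varepsilon$ rule. The only terminal that can appear directly after $A$ is the $b$ from the first rule; the second rule has $A$ at its very end, so it contributes nothing new.

$$FIRST(A) = \{a\}$$
$$FIRST(S) = \{a, b\}$$
$$FOLLOW(A) = \{b\}$$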

Check the two examples I have provided at the end of this article.

When you build your parser, whether it is SLR, LALR, LR(k) or LL(k), you will need to construct the FIRST and FOLLOW sets. These sets are used to build the parsing table that drives the automaton processing the language.
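To give an idea of how the sets end up in a parsing table, here is a sketch of the textbook LL(1) table construction. Everything in it is an assumption made for illustration: the tiny grammar, the hand-written `first`, `follow` and `epsilon` values, and the helper names `first_of_sequence` and `ll1_table`. LR-style parsers use the sets differently (SLR, for instance, consults FOLLOW when deciding whether to reduce).

```python
def first_of_sequence(expression, first, epsilon):
    """Return FIRST of a symbol sequence and whether the sequence can vanish."""
    result = set()
    for symbol in expression:
        result |= first[symbol]
        if symbol not in epsilon:
            return result, False
    return result, True  # every symbol was nullable (or the sequence is empty)


def ll1_table(rules, first, follow, epsilon):
    """Map (non-terminal, lookahead terminal) to the rule to expand."""
    table = {}
    for rule in rules:
        nt, expression = rule
        lookahead, nullable = first_of_sequence(expression, first, epsilon)
        if nullable:
            lookahead |= follow[nt]  # the rule may produce nothing at all
        for terminal in lookahead:
            if (nt, terminal) in table:
                raise ValueError(f"conflict at {(nt, terminal)}: not LL(1)")
            table[nt, terminal] = rule
    return table


# Hypothetical grammar: ^ ::= L $, L ::= x L, L ::= (empty)
rules = [("^", "L$"), ("L", "xL"), ("L", "")]
first = {"^": {"x", "$"}, "L": {"x"}, "x": {"x"}, "$": {"$"}}
follow = {"^": set(), "L": {"$"}}
epsilon = {"L"}

table = ll1_table(rules, first, follow, epsilon)
print(table["L", "x"])  # ('L', 'xL')
print(table["L", "$"])  # ('L', '')
```

The entry for `("L", "$")` exists only because of FOLLOW: the empty rule is chosen when the next input character is something that may legally come after `L`.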

## algorithm

In [1]:
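def first_and_follow(grammar):
    """Compute the FIRST and FOLLOW sets and the nullable non-terminals.

    Works by fixed-point iteration: the rules are scanned again and again
    until no set grows any more.
    """
    # first & follow sets, epsilon-productions
    first = {i: set() for i in grammar.nonterminals}
    # FIRST of a terminal is the terminal itself
    first.update((i, {i}) for i in grammar.terminals)
    follow = {i: set() for i in grammar.nonterminals}
    # non-terminals that can derive the empty word
    epsilon = set()

    while True:
        updated = False

        for nt, expression in grammar.rules:
            # FIRST set w.r.t. epsilon-productions: keep scanning the right
            # side for as long as the symbols seen so far are nullable
            for symbol in expression:
                updated |= union(first[nt], first[symbol])
                if symbol not in epsilon:
                    break
            else:
                # no break: every symbol is nullable (or the rule is empty)
                updated |= union(epsilon, {nt})

            # FOLLOW set w.r.t. epsilon-productions: scan right to left;
            # aux holds the terminals that may follow the current symbol
            aux = follow[nt]
            for symbol in reversed(expression):
                if symbol in follow:  # only non-terminals have a FOLLOW set
                    updated |= union(follow[symbol], aux)
                if symbol in epsilon:
                    aux = aux.union(first[symbol])
                else:
                    aux = first[symbol]

        if not updated:
            return first, follow, epsilon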

In [2]:
def union(target, source):
    """Add all items of source to target in place; return True if target grew."""
    n = len(target)
    target |= source
    return len(target) != n
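
The return value of `union` is what drives the "repeat until nothing changes" loop. The sketch below shows the same pattern on an unrelated toy problem, reachability in a small graph; the `edges` data is hypothetical, and the helper is repeated (with renamed parameters) only so that the snippet runs on its own.

```python
def union(target, source):
    """Add all items of source to target in place; return True if target grew."""
    size = len(target)
    target |= source
    return len(target) != size


# Hypothetical graph: a -> b -> c
edges = {"a": {"b"}, "b": {"c"}, "c": set()}
reach = {node: set(successors) for node, successors in edges.items()}

while True:
    updated = False
    for node in reach:
        # iterate over a snapshot, because reach[node] may grow in the loop
        for successor in list(reach[node]):
            updated |= union(reach[node], reach[successor])
    if not updated:
        break  # fixed point: a full pass changed nothing

print(sorted(reach["a"]))  # ['b', 'c']
```

Because sets only ever grow and there are finitely many items to add, a loop of this shape always terminates.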

In [3]:
class Grammar:
    """Grammar given as 'X ::= ...' strings; every symbol is one character."""

    def __init__(self, *rules):
        self.rules = tuple(self._parse(rule) for rule in rules)

    def _parse(self, rule):
        # 'E ::= E + T' -> ('E', 'E+T'); an empty right side is an epsilon-production
        return tuple(rule.replace(' ', '').split('::='))

    def __getitem__(self, nonterminal):
        # all rules whose left side is the given non-terminal
        return [rule for rule in self.rules if rule[0] == nonterminal]

    @staticmethod
    def is_nonterminal(symbol):
        # on a right side, uppercase letters are non-terminals
        return symbol.isalpha() and symbol.isupper()

    @property
    def nonterminals(self):
        return {nt for nt, _ in self.rules}

    @property
    def terminals(self):
        return {
            symbol
            for _, expression in self.rules
            for symbol in expression
            if not self.is_nonterminal(symbol)
        }
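
The rule format is easiest to understand by looking at what the parsing step produces. The standalone sketch below mirrors the logic of `_parse` and `is_nonterminal` in two hypothetical free functions, `parse_rule` and `split_symbols`, so that it runs without the class.

```python
def parse_rule(rule):
    """Split 'X ::= ...' into (left side, right side) with spaces removed."""
    nonterminal, expression = rule.replace(" ", "").split("::=")
    return nonterminal, expression


def split_symbols(expression):
    """Classify each character of a right side as non-terminal or terminal."""
    nonterminals = [s for s in expression if s.isalpha() and s.isupper()]
    terminals = [s for s in expression if not (s.isalpha() and s.isupper())]
    return nonterminals, terminals


print(parse_rule("E ::= E + T"))  # ('E', 'E+T')
print(parse_rule("C ::= "))       # ('C', '')
print(split_symbols("(E)+x"))     # (['E'], ['(', ')', '+', 'x'])
```

Since every character counts as one symbol, multi-character names such as `Expr` or `id` cannot be used with this representation.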

## left-recursive grammar w/ epsilon-production

In [4]:
first, follow, epsilon = first_and_follow(Grammar(
    '^ ::= A $',
    'A ::= ABBC',
    'A ::= B',
    'A ::= 1',
    'B ::= C',
    'B ::= 2',
    'C ::= 3',
    'C ::= ',
))

In [5]:
first

{'C': {'3'},
 'B': {'2', '3'},
 '^': {'$', '1', '2', '3'},
 'A': {'1', '2', '3'},
 '2': {'2'},
 '1': {'1'},
 '3': {'3'},
 '$': {'$'}}

In [6]:
follow

{'C': {'$', '2', '3'}, 'B': {'$', '2', '3'}, '^': set(), 'A': {'$', '2', '3'}}

In [7]:
epsilon

{'A', 'B', 'C'}

## arithmetic expressions

In [8]:
first, follow, epsilon = first_and_follow(Grammar(
    '^ ::= E $',
    'E ::= E + T',
    'E ::= T',
    'T ::= T * F',
    'T ::= F',
    'F ::= ( E )',
    'F ::= x',
))

In [9]:
first

{'^': {'(', 'x'},
 'T': {'(', 'x'},
 'E': {'(', 'x'},
 'F': {'(', 'x'},
 '*': {'*'},
 '(': {'('},
 'x': {'x'},
 ')': {')'},
 '+': {'+'},
 '$': {'$'}}

In [10]:
follow

{'^': set(),
 'T': {'$', ')', '*', '+'},
 'E': {'$', ')', '+'},
 'F': {'$', ')', '*', '+'}}

In [11]:
epsilon

set()