#### Constituency/Context-Free/Phrase-Structure Grammars for English

A `constituency grammar` (also called a `context-free grammar` or `phrase-structure grammar`) lets us describe the `syntactic structure` of sentences in a `language` in a systematic, formal way. We define a language as the set of all grammatical sentences, and a grammar as a set of `rules`, also called `productions`, that can (recursively) `generate` every sentence in the language.

The key idea is that a group of words can combine to form a single unit, called a "constituent" or "phrase". For example, in the sentence "A large brown bear caught a fish.", one of the constituents is "A large brown bear", which is a `noun phrase (NP)`. We can substitute this phrase with "He" and still get a grammatically valid sentence: "He caught a fish." In the same way, constituents can combine to form larger constituents, so the process is recursive.

More formally, we can describe a sentence as a hierarchical structure of constituents, called a `parse tree`, using a context-free grammar (CFG): a set of productions together with a `lexicon` (the set of words/symbols in the language).
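To make the idea of a parse tree and of constituent substitution concrete, here is a minimal sketch in plain Python (no NLTK needed). The nested-tuple representation, the helper `leaves`, and the Penn-Treebank-style labels (`DT`, `JJ`, `NN`, `VBD`, `PRP`) are assumptions made for illustration; they are not defined elsewhere in this notebook.

```python
# A parse tree as nested tuples: (label, child, child, ...); leaves are plain strings.
# The labels are illustrative Penn-Treebank-style tags chosen for this sketch.
tree = (
    "S",
    ("NP", ("DT", "A"), ("JJ", "large"), ("JJ", "brown"), ("NN", "bear")),
    ("VP", ("VBD", "caught"), ("NP", ("DT", "a"), ("NN", "fish"))),
)

def leaves(node):
    """Return the words at the leaves of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    return [word for child in children for word in leaves(child)]

# Swap the subject NP for a one-word NP; the VP subtree is reused unchanged.
substituted = (tree[0], ("NP", ("PRP", "He")), tree[2])

original_words = leaves(tree)
substituted_words = leaves(substituted)
print(" ".join(original_words))
print(" ".join(substituted_words))
```

Because the whole noun phrase is a single subtree, replacing it is a local operation that leaves the rest of the sentence structure intact.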

A context-free grammar $G$ is defined as the 4-tuple ($N$, $\Sigma$, $R$, $S$) where 

$N$ is a set of `non-terminal symbols` (such as verb-phrases, noun-phrases, part-of-speech tags), 

$\Sigma$ is a set of `terminal symbols` (such as words and punctuation symbols), 

$R$ is a set of productions of the form $A \to \beta$, where $A \in N$ and $\beta$ is a string of symbols from $N \cup \Sigma$,

and $S \in N$ is a designated `start symbol`.
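The 4-tuple maps directly onto a small data structure. The sketch below is one possible plain-Python encoding, not NLTK's representation: the `Grammar` class, the `is_well_formed` helper, and the three-rule example grammar are all hypothetical names introduced here to check the conditions stated in the definition.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    """Hypothetical container mirroring the 4-tuple (N, Sigma, R, S)."""
    nonterminals: frozenset   # N
    terminals: frozenset      # Sigma
    productions: tuple        # R, as (lhs, rhs) pairs with rhs a tuple of symbols
    start: str                # S

def is_well_formed(g):
    """Check the conditions from the definition of a CFG."""
    symbols = g.nonterminals | g.terminals
    return (
        g.start in g.nonterminals                    # S is in N
        and not (g.nonterminals & g.terminals)       # N and Sigma are disjoint
        and all(
            lhs in g.nonterminals and all(s in symbols for s in rhs)
            for lhs, rhs in g.productions            # A in N, rhs over N and Sigma
        )
    )

# A tiny illustrative grammar: S -> NP V, NP -> "birds", V -> "sing"
tiny = Grammar(
    nonterminals=frozenset({"S", "NP", "V"}),
    terminals=frozenset({"birds", "sing"}),
    productions=(("S", ("NP", "V")), ("NP", ("birds",)), ("V", ("sing",))),
    start="S",
)

# Same grammar, but with a terminal on the left-hand side of a rule
broken = tiny._replace(productions=tiny.productions + (("birds", ("V",)),))

print(is_well_formed(tiny), is_well_formed(broken))
```

The second grammar fails the check because a production's left-hand side must be a single non-terminal, which is exactly what makes the grammar context-free.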

In [1]:
import nltk
from nltk.parse.generate import generate

NLTK provides a nice module for defining and working with CFGs. We will use it to demonstrate a toy example. 

In [11]:
# define a simple toy grammar (lexicon of 4 words and only 3 different parts of speech)
toy_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> V NP 
    NP -> DT NN 
    V -> "eats"
    NN -> "cow" | "grass"
    DT -> "the"
    """)

# now we can use this grammar to generate the language described by it
language = []
for sentence in generate(toy_grammar):
    language.append(sentence)
    print(sentence)     

['the', 'cow', 'eats', 'the', 'cow']
['the', 'cow', 'eats', 'the', 'grass']
['the', 'grass', 'eats', 'the', 'cow']
['the', 'grass', 'eats', 'the', 'grass']
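To see where these four sentences come from, the expansion can be sketched by hand in plain Python. This is an illustration of the idea under stated assumptions, not NLTK's implementation: the `RULES` dictionary and the `expand` function are hypothetical, and the sketch only terminates for non-recursive grammars such as this one.

```python
from itertools import product

# Plain-dict encoding of the toy grammar: each non-terminal maps to a list of
# alternative right-hand sides (tuples of symbols).
RULES = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP")],
    "NP": [("DT", "NN")],
    "V": [("eats",)],
    "NN": [("cow",), ("grass",)],
    "DT": [("the",)],
}

def expand(symbol):
    """Yield every word sequence derivable from `symbol`."""
    if symbol not in RULES:          # anything without a rule is a terminal
        yield [symbol]
        return
    for rhs in RULES[symbol]:
        # Expand each right-hand-side symbol, then combine the alternatives.
        options = [list(expand(s)) for s in rhs]
        for parts in product(*options):
            yield [word for part in parts for word in part]

sentences = list(expand("S"))
for sentence in sentences:
    print(sentence)
```

The only choice point is `NN`, which has two alternatives and is reached once in the subject and once in the object, so the language has 2 × 2 = 4 sentences.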


For our toy grammar, we have:

$N = \{ S, NP, VP, V, NN, DT\}$

$\Sigma = \{\text{cow}, \text{eats}, \text{grass}, \text{the}\}$

In [10]:
# let's take a look at each production, and see what the left- and right-hand sides are
terminals = set()
non_terminals = set()
for production in toy_grammar.productions():
    print(production)
    print(f"\tLHS: {production.lhs()}")
    print(f"\tRHS: {production.rhs()}")
    non_terminals.add(production.lhs())
    for symbol in production.rhs():
        if nltk.grammar.is_terminal(symbol):
            terminals.add(symbol)
        else:
            non_terminals.add(symbol)

print(f"Non-terminals: {non_terminals}")
print(f"Terminals: {terminals}")


S -> NP VP
	LHS: S
	RHS: (NP, VP)
VP -> V NP
	LHS: VP
	RHS: (V, NP)
NP -> DT NN
	LHS: NP
	RHS: (DT, NN)
V -> 'eats'
	LHS: V
	RHS: ('eats',)
NN -> 'cow'
	LHS: NN
	RHS: ('cow',)
NN -> 'grass'
	LHS: NN
	RHS: ('grass',)
DT -> 'the'
	LHS: DT
	RHS: ('the',)
Non-terminals: {V, NN, DT, S, NP, VP}
Terminals: {'grass', 'eats', 'the', 'cow'}


Note that this toy grammar is in `Chomsky Normal Form (CNF)`, i.e. each production is either of the form $A \to B C$ or $A \to a$, where $A, B, C \in N$ and $a \in \Sigma$.
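The CNF condition is easy to check mechanically. Below is a minimal sketch in plain Python; the `is_cnf` helper and the list-of-pairs encoding of the rules are assumptions made for this illustration, and the three-symbol rule added at the end is a hypothetical counterexample, not part of the toy grammar.

```python
def is_cnf(rules, nonterminals, terminals):
    """Return True if every production is A -> B C or A -> a."""
    for lhs, rhs in rules:
        binary = len(rhs) == 2 and all(s in nonterminals for s in rhs)
        lexical = len(rhs) == 1 and rhs[0] in terminals
        if lhs not in nonterminals or not (binary or lexical):
            return False
    return True

N = {"S", "NP", "VP", "V", "NN", "DT"}
SIGMA = {"cow", "eats", "grass", "the"}
TOY_RULES = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("NP", ("DT", "NN")),
    ("V", ("eats",)),
    ("NN", ("cow",)),
    ("NN", ("grass",)),
    ("DT", ("the",)),
]

toy_is_cnf = is_cnf(TOY_RULES, N, SIGMA)

# Hypothetical extra rule with three symbols on the right-hand side breaks CNF.
extended_is_cnf = is_cnf(TOY_RULES + [("NP", ("DT", "NN", "NN"))], N, SIGMA)

print(toy_is_cnf, extended_is_cnf)
```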