## Grammar
A grammar is a set of rules that describes how syntactic units (sentences, phrases, etc.) in a given language can be broken down into their constituent units.

| Symbol | Syntactic Category |
|--------|--------------------|
| S | Sentence |
| NP | Noun Phrase |
| VP | Verb Phrase |
| PP | Prepositional Phrase |
| DT | Determiner |
| N | Noun |
| NNP | Proper Noun |
| V | Verb |
| ADJ | Adjective |
| P | Preposition |
| TV | Transitive Verb |
| IV | Intransitive Verb |

### Context-Free Grammars
A context-free grammar is a set of rules for combining syntactic components to form well-formed strings.

For instance:
* The noun phrase (`NP`) "the castle" has a determiner (`DT`) and a noun (`N`).
* The prepositional phrase (`PP`) "in the castle" has a preposition (`P`) and a noun phrase (`NP`).
* The verb phrase (`VP`) "looks in the castle" has a verb (`V`) and a prepositional phrase (`PP`).
* The sentence (`S`) "Gwen looks in the castle" has a proper noun (`NNP`) and a verb phrase (`VP`).
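
The decomposition in the list above can be sketched in plain Python before bringing in a grammar library. This is a minimal illustration rather than part of the original notebook: the nested-tuple layout and the helper names `leaves` and `rules` are assumptions chosen for this example.

```python
# Hand-built constituency tree for "Gwen looks in the castle".
# Each node is (label, child, child, ...); a leaf child is a plain string.
TREE = (
    "S",
    ("NNP", "Gwen"),
    ("VP",
        ("V", "looks"),
        ("PP",
            ("P", "in"),
            ("NP", ("DT", "the"), ("N", "castle")))),
)


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def rules(node):
    """Yield the rewrite rule used at each node, top down."""
    label, children = node[0], node[1:]
    # Quote words (terminals); show only the label of nested constituents.
    rhs = " ".join(repr(c) if isinstance(c, str) else c[0] for c in children)
    yield f"{label} -> {rhs}"
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)


print(" ".join(leaves(TREE)))
for rule in rules(TREE):
    print(rule)
```

Reading the rules off the tree gives one rewrite rule per constituent, which is the same information a context-free grammar records.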

Using these tags, we can define a context-free grammar:

In [1]:
GRAMMAR = """
    S -> NNP VP
    VP -> V PP
    PP -> P NP
    NP -> DT N
    NNP -> 'Gwen' | 'George'
    V -> 'looks' | 'burns'
    P -> 'in' | 'for'
    DT -> 'the'
    N -> 'castle' | 'ocean'
"""

In NLTK, `nltk.grammar.CFG` is a class that represents a context-free grammar, specifying how different syntactic components can be related. We can use `CFG.fromstring` to build a grammar object from our string.

In [3]:
from nltk import CFG
cfg = CFG.fromstring(GRAMMAR)

print(cfg)
print(cfg.start())  # start() is a method; call it to get the start symbol
print(cfg.productions())

Grammar with 13 productions (start state = S)
    S -> NNP VP
    VP -> V PP
    PP -> P NP
    NP -> DT N
    NNP -> 'Gwen'
    NNP -> 'George'
    V -> 'looks'
    V -> 'burns'
    P -> 'in'
    P -> 'for'
    DT -> 'the'
    N -> 'castle'
    N -> 'ocean'
S
[S -> NNP VP, VP -> V PP, PP -> P NP, NP -> DT N, NNP -> 'Gwen', NNP -> 'George', V -> 'looks', V -> 'burns', P -> 'in', P -> 'for', DT -> 'the', N -> 'castle', N -> 'ocean']
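
A grammar object becomes useful once it is handed to a parser. The following is a sketch, not part of the original notebook, that uses NLTK's `ChartParser` to parse the example sentence. The grammar is restated here, including the `NP -> DT N` rule described in the list of examples, so that the block runs on its own; the choice of `ChartParser` over NLTK's other parsers is an assumption made for illustration.

```python
from nltk import CFG
from nltk.parse import ChartParser

# Restated so this block is self-contained.
GRAMMAR = """
    S -> NNP VP
    VP -> V PP
    PP -> P NP
    NP -> DT N
    NNP -> 'Gwen' | 'George'
    V -> 'looks' | 'burns'
    P -> 'in' | 'for'
    DT -> 'the'
    N -> 'castle' | 'ocean'
"""

cfg = CFG.fromstring(GRAMMAR)
parser = ChartParser(cfg)

# The parser expects a list of tokens, not a raw string.
tokens = "Gwen looks in the castle".split()
trees = list(parser.parse(tokens))

for tree in trees:
    print(tree)
```

Because this grammar is unambiguous for the sentence, `trees` should hold a single tree rooted at `S`; an ambiguous grammar would yield one tree per reading.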


### Syntactic Parsers
Once we have defined a grammar, we need a mechanism to systematically search our corpus for meaningful syntactic structures; this is the role of the *parser*.

If a grammar defines the search criterion for "meaningfulness" in the context of our language, the parser executes the search. A *syntactic parser* is a program that deconstructs sentences into a parse tree, which consists of hierarchical constituents, or syntactic categories.

When a parser encounters a sentence, it checks whether the structure of that sentence conforms to a known grammar. If so, it parses the sentence according to the rules of that grammar, producing a parse tree. Parsers are often used to identify important structures, like the subject and object of verbs in a sentence, or to determine which sequences of words in a sentence should be grouped together within each syntactic category.
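
The conformity check described above can be sketched with NLTK's `ChartParser` and a small toy grammar. This is an illustration under assumptions, not the notebook's own code: the helper name `conforms` is hypothetical, and the grammar is a restatement of the earlier example with an `NP -> DT N` rule included.

```python
from nltk import CFG
from nltk.parse import ChartParser

cfg = CFG.fromstring("""
    S -> NNP VP
    VP -> V PP
    PP -> P NP
    NP -> DT N
    NNP -> 'Gwen' | 'George'
    V -> 'looks' | 'burns'
    P -> 'in' | 'for'
    DT -> 'the'
    N -> 'castle' | 'ocean'
""")
parser = ChartParser(cfg)


def conforms(sentence):
    """Return True if the grammar yields at least one parse tree."""
    tokens = sentence.split()
    try:
        return any(True for _ in parser.parse(tokens))
    except ValueError:
        # Raised when a token is not among the grammar's terminals.
        return False


print(conforms("George burns for the ocean"))  # matches S -> NNP VP
print(conforms("Gwen burns the castle"))       # no preposition, so no VP
print(conforms("Gwen looks in the tower"))     # 'tower' is not in the grammar
```

The second and third sentences fail for different reasons: one uses known words in a structure the grammar does not allow, the other contains a word the grammar has never seen.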

In the example below, we first define a `GRAMMAR` that identifies sequences of tokens matching a part-of-speech pattern, and then instantiate an NLTK `RegexpParser` that uses this grammar to chunk the text into subsections:

In [21]:
import nltk

In [12]:
from nltk.chunk.regexp import RegexpParser

GRAMMAR = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
chunker = RegexpParser(GRAMMAR)

In [24]:
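def ie_preprocess(document):
    """Split a document into sentences of (token, part-of-speech tag) pairs."""
    sentences = nltk.sent_tokenize(document)                      # split into sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split into tokens
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # tag each token
    return sentences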

In [13]:
text = "Dusty Baker proposed a simple solution to the Washington National's \
early-season bullpen troubles Monday afternoon and it had nothing to do with \
his maligned group of relievers."

In [26]:
tagged_sentences = ie_preprocess(text)
result = chunker.parse(tagged_sentences[0])

In [32]:
print(result)

(S
  (KT Dusty/NNP Baker/NNP)
  proposed/VBD
  a/DT
  (KT simple/JJ solution/NN)
  to/TO
  the/DT
  (KT Washington/NNP National/NNP)
  's/POS
  (KT
    early-season/JJ
    bullpen/NN
    troubles/NNS
    Monday/NNP
    afternoon/NN)
  and/CC
  it/PRP
  had/VBD
  (KT nothing/NN)
  to/TO
  do/VB
  with/IN
  his/PRP$
  maligned/VBN
  (KT group/NN of/IN relievers/NNS)
  ./.)
