### Shallow Parsing or Chunking
<i>Shallow parsing</i>, also known as <i>light parsing</i> or <i>chunking</i>, analyzes the structure of a sentence by breaking it down into its smallest constituents (tokens such as words) and grouping them into higher-level phrases. The focus is on identifying these phrases, or chunks, rather than on the internal syntax and relations inside each chunk.

Based on the hierarchy we depicted earlier, groups of words make up phrases. There are five major categories of phrases:
- Noun phrase (NP)
- Verb phrase (VP)
- Adjective phrase (ADJP)
- Adverb phrase (ADVP)
- Prepositional phrase (PP)

#### Building Shallow Parsers
We use several techniques, such as regular expressions and tagging-based learners, to build our own shallow parsers. The treebank corpus is available in NLTK with chunk annotations (as `treebank_chunk`). We load it and then prepare our training and testing datasets.

With <i>chunking</i>, we specify patterns that identify what we want to chunk or segment in a sentence, such as phrases based on specific metadata like POS tags. <i>Chinking</i> is the reverse of chunking: we specify which tokens should not be part of any chunk and then form the chunks from the tokens that remain.

In [17]:
from nltk.corpus import treebank_chunk
data = treebank_chunk.chunked_sents()

train_data = data[:3500]
test_data = data[3500:]
print(test_data[5])

(S
  (NP Upjohn/NNP)
  ,/,
  (NP a/DT rumored/VBN target/NN)
  within/IN
  (NP the/DT drug/NN industry/NN)
  ,/,
  advanced/VBD
  (NP 7\/8/CD)
  to/TO
  (NP 38/CD 7\/8/CD)
  ./.)


In [22]:
import nltk
from nltk.chunk import RegexpParser

sample_sentence = 'The brown fox is quick and he is jumping over the lazy dog'

tagged_simple_sent = nltk.pos_tag(nltk.word_tokenize(sample_sentence))

# Chunking
chunk_grammar = """
                NP: {<DT>?<JJ>*<NN.*>}
                """
rc = RegexpParser(chunk_grammar)
c = rc.parse(tagged_simple_sent)
print(c)
c.pretty_print()

(S
  (NP The/DT brown/JJ fox/NN)
  is/VBZ
  quick/JJ
  and/CC
  he/PRP
  is/VBZ
  jumping/VBG
  over/IN
  (NP the/DT lazy/JJ dog/NN))
                                                    S                                                
   _________________________________________________|_____________________________________            
  |       |       |      |      |         |         |              NP                     NP         
  |       |       |      |      |         |         |       _______|_______        _______|______     
is/VBZ quick/JJ and/CC he/PRP is/VBZ jumping/VBG over/IN The/DT brown/JJ fox/NN the/DT lazy/JJ dog/NN
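Once a sentence is chunked, the phrases can be pulled out of the resulting tree. The following is a minimal sketch, assuming the POS tags shown in the output above. They are hard-coded here so the snippet does not need the tokenizer or tagger models, and the variable names are illustrative.

```python
from nltk.chunk import RegexpParser

# POS tags copied from the tagged output shown above (hard-coded assumption)
tagged = [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'),
          ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'),
          ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'),
          ('lazy', 'JJ'), ('dog', 'NN')]

# same NP pattern as above: optional determiner, any adjectives, then a noun
parser = RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
tree = parser.parse(tagged)

# collect the words of every NP subtree
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]
print(noun_phrases)  # ['The brown fox', 'the lazy dog']
```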



In [19]:
# Chinking based on explicit chink patterns
chink_grammar = """
NP:
    {<.*>+}             # Chunk everything as NP
    }<VBZ|VBD|JJ|IN>+{  # Chink sequences of VBD/VBZ/JJ/IN
"""
rc = RegexpParser(chink_grammar)
c = rc.parse(tagged_simple_sent)
print(c)
c.pretty_print()

(S
  (NP The/DT)
  brown/JJ
  (NP fox/NN)
  is/VBZ
  quick/JJ
  (NP and/CC he/PRP)
  is/VBZ
  (NP jumping/VBG)
  over/IN
  (NP the/DT)
  lazy/JJ
  (NP dog/NN))
                                                  S                                                      
    ______________________________________________|__________________________________________________     
   |       |       |       |       |       |      NP     NP           NP             NP       NP     NP  
   |       |       |       |       |       |      |      |       _____|____          |        |      |    
brown/JJ is/VBZ quick/JJ is/VBZ over/IN lazy/JJ The/DT fox/NN and/CC     he/PRP jumping/VBG the/DT dog/NN



In [20]:
# a more generic shallow parser
grammar = """
NP: {<DT>?<JJ>?<NN.*>}  
ADJP: {<JJ>}
ADVP: {<RB.*>}
PP: {<IN>}      
VP: {<MD>?<VB.*>+}
"""
rc = RegexpParser(grammar)
c = rc.parse(tagged_simple_sent)
print(c)
c.pretty_print()

(S
  (NP The/DT brown/JJ fox/NN)
  (VP is/VBZ)
  (ADJP quick/JJ)
  and/CC
  he/PRP
  (VP is/VBZ jumping/VBG)
  (PP over/IN)
  (NP the/DT lazy/JJ dog/NN))
                                               S                                                         
   ____________________________________________|______________________________________________            
  |      |              NP             VP     ADJP           VP                PP             NP         
  |      |       _______|_______       |       |        _____|_______          |       _______|______     
and/CC he/PRP The/DT brown/JJ fox/NN is/VBZ quick/JJ is/VBZ     jumping/VBG over/IN the/DT lazy/JJ dog/NN



In [5]:
# Evaluating the parser
print(rc.evaluate(test_data))

ChunkParse score:
    IOB Accuracy:  46.1%%
    Precision:     19.9%%
    Recall:        43.3%%
    F-Measure:     27.3%%
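The F-measure in this report can be checked by hand. The sketch below assumes it is the balanced harmonic mean of precision and recall (the usual F1 definition). `f_measure` is a helper name introduced here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# precision and recall copied from the ChunkParse score above
score = f_measure(0.199, 0.433)
print(f"{score:.1%}")  # 27.3%
```

Because the harmonic mean is pulled toward the smaller value, the low precision keeps the F-measure well below the recall.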


We leverage two chunking utility functions: `tree2conlltags`, which returns (word, POS tag, chunk tag) triples for each token, and `conlltags2tree`, which generates a parse tree from these triples. The chunk tags use the popular IOB format, which stands for Inside, Outside, and Beginning. The B- prefix before a tag indicates the beginning of a chunk, the I- prefix indicates that the token is inside a chunk, and the O tag indicates that the token does not belong to any chunk.

In [6]:
from nltk.chunk.util import tree2conlltags, conlltags2tree

# look at a sample training tagged sentence
train_sent = train_data[7]
print(train_sent)
print('-------------------')
# get the (word, POS tag, Chunk tag) triples for each token
wtc = tree2conlltags(train_sent)
print(wtc)
print('-------------------')
# get shallow parsed tree back from the WTC triples
tree = conlltags2tree(wtc)
print(tree)
print('-------------------')

(S
  (NP A/DT Lorillard/NNP spokewoman/NN)
  said/VBD
  ,/,
  ``/``
  (NP This/DT)
  is/VBZ
  (NP an/DT old/JJ story/NN)
  ./.)
-------------------
[('A', 'DT', 'B-NP'), ('Lorillard', 'NNP', 'I-NP'), ('spokewoman', 'NN', 'I-NP'), ('said', 'VBD', 'O'), (',', ',', 'O'), ('``', '``', 'O'), ('This', 'DT', 'B-NP'), ('is', 'VBZ', 'O'), ('an', 'DT', 'B-NP'), ('old', 'JJ', 'I-NP'), ('story', 'NN', 'I-NP'), ('.', '.', 'O')]
-------------------
(S
  (NP A/DT Lorillard/NNP spokewoman/NN)
  said/VBD
  ,/,
  ``/``
  (NP This/DT)
  is/VBZ
  (NP an/DT old/JJ story/NN)
  ./.)
-------------------
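To make the IOB encoding concrete, here is a minimal pure-Python sketch that converts labeled spans to IOB tags and back, without NLTK. The function names `spans_to_iob` and `iob_to_spans` are hypothetical helpers, and the span positions are read off the sample sentence above.

```python
def spans_to_iob(tokens, spans):
    """Encode (start, end, label) spans as IOB tags; end is exclusive."""
    tags = ['O'] * len(tokens)
    for start, end, label in spans:
        tags[start] = 'B-' + label
        for i in range(start + 1, end):
            tags[i] = 'I-' + label
    return tags


def iob_to_spans(tags):
    """Decode IOB tags back into (start, end, label) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ['O']):  # sentinel closes a trailing chunk
        continues = tag.startswith('I-') and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith('B-'):
            start, label = i, tag[2:]
        # an I- tag with no open chunk is ignored in this sketch
    return spans


tokens = ['A', 'Lorillard', 'spokewoman', 'said', ',', '``',
          'This', 'is', 'an', 'old', 'story', '.']
spans = [(0, 3, 'NP'), (6, 7, 'NP'), (8, 11, 'NP')]

tags = spans_to_iob(tokens, spans)
print(list(zip(tokens, tags)))
print(iob_to_spans(tags) == spans)  # True
```

The round trip mirrors what `tree2conlltags` and `conlltags2tree` do for full trees, with the POS tags left out for brevity.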


Now that we know how these functions work, we define a function called `conll_tag_chunks()` to extract POS and chunk tags from sentences with chunk annotations. We also reuse our `combined_tagger()` function from POS tagging to train multiple taggers with backoff taggers, as depicted in the following code snippet.

In [7]:
def conll_tag_chunks(chunk_sents):
    tagged_sents = [tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in sent] for sent in tagged_sents]

def combined_tagger(train_data, taggers, backoff=None):
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff
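The loop in `combined_tagger()` builds a chain: each tagger is constructed with the previously built tagger as its backoff, so the last class in the list ends up on top. The sketch below illustrates the idea with two toy taggers. `ToyUnigramTagger` and `ToyBigramTagger` are simplified stand-ins written for this example, not NLTK's implementation, and the one-sentence training set is made up.

```python
from collections import Counter, defaultdict


class ToyUnigramTagger:
    """Predicts the most frequent tag seen for each token."""

    def __init__(self, train_sents, backoff=None):
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            prev_tag = None
            for token, tag in sent:
                counts[self.context(token, prev_tag)][tag] += 1
                prev_tag = tag
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def context(self, token, prev_tag):
        return token

    def choose(self, token, prev_tag):
        tag = self.table.get(self.context(token, prev_tag))
        if tag is None and self.backoff is not None:
            # unseen context: defer to the simpler tagger
            tag = self.backoff.choose(token, prev_tag)
        return tag

    def tag(self, tokens):
        tagged, prev_tag = [], None
        for token in tokens:
            prev_tag = self.choose(token, prev_tag)
            tagged.append((token, prev_tag))
        return tagged


class ToyBigramTagger(ToyUnigramTagger):
    """Same idea, but also conditions on the previous predicted tag."""

    def context(self, token, prev_tag):
        return (prev_tag, token)


def combined_tagger(train_data, taggers, backoff=None):
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff


# toy training data: (POS tag, chunk tag) pairs for one sentence
train = [[('DT', 'B-NP'), ('NN', 'I-NP'), ('VBZ', 'O')]]
chunker = combined_tagger(train, [ToyUnigramTagger, ToyBigramTagger])

# the bigram tagger never saw NN at the start of a sentence,
# so the first prediction comes from the unigram backoff
result = chunker.tag(['NN', 'VBZ'])
print(result)  # [('NN', 'I-NP'), ('VBZ', 'O')]
```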

We now define an `NGramTagChunker` class, which takes chunked sentences as training input, extracts (POS tag, chunk tag) pairs from their WTC triples (word, POS tag, chunk tag), and trains a `BigramTagger` with a `UnigramTagger` as the backoff tagger. We also define a `parse()` method to perform shallow parsing on new sentences.

In [8]:
from nltk.tag import UnigramTagger, BigramTagger
from nltk.chunk import ChunkParserI

class NGramTagChunker(ChunkParserI):
    def __init__(self, train_sentences, tagger_classes=(UnigramTagger, BigramTagger)):
        # convert chunk trees into (POS tag, chunk tag) training pairs
        train_sent_tags = conll_tag_chunks(train_sentences)
        # chain the taggers so that each one backs off to the previous one
        self.chunk_tagger = combined_tagger(train_sent_tags, tagger_classes)

    def parse(self, tagged_sentence):
        if not tagged_sentence:
            return None
        # predict a chunk tag for each POS tag in the sentence
        pos_tags = [tag for word, tag in tagged_sentence]
        chunk_pos_tags = self.chunk_tagger.tag(pos_tags)
        chunk_tags = [chunk_tag for (pos_tag, chunk_tag) in chunk_pos_tags]
        # rebuild (word, POS tag, chunk tag) triples and convert them to a tree
        wpc_tags = [(word, pos_tag, chunk_tag)
                    for ((word, pos_tag), chunk_tag)
                    in zip(tagged_sentence, chunk_tags)]
        return conlltags2tree(wpc_tags)

In [9]:

# train the shallow parser
ntc = NGramTagChunker(train_data)

# test parser performance on test data
print(ntc.evaluate(test_data))

ChunkParse score:
    IOB Accuracy:  97.2%%
    Precision:     91.4%%
    Recall:        94.3%%
    F-Measure:     92.8%%


In [10]:
# Let’s now train and evaluate our parser on the conll2000 corpus

from nltk.corpus import conll2000
wsj_data = conll2000.chunked_sents()
train_wsj_data = wsj_data[:10000]
test_wsj_data = wsj_data[10000:]

print(train_wsj_data[10])

# train the shallow parser
tc = NGramTagChunker(train_wsj_data)
# test performance on the test data
print(tc.evaluate(test_wsj_data))

(S
  (NP He/PRP)
  (VP reckons/VBZ)
  (NP the/DT current/JJ account/NN deficit/NN)
  (VP will/MD narrow/VB)
  (PP to/TO)
  (NP only/RB #/# 1.8/CD billion/CD)
  (PP in/IN)
  (NP September/NNP)
  ./.)
ChunkParse score:
    IOB Accuracy:  89.1%%
    Precision:     80.3%%
    Recall:        86.1%%
    F-Measure:     83.1%%


In [11]:
# parse our sample sentence

import spacy

sample_sentence = 'The brown fox is quick and he is jumping over the lazy dog'

nlp = spacy.load('en_core_web_md')
doc = nlp(sample_sentence)

tagged_sentence = [(word.text, word.tag_) for word in doc]
tree = ntc.parse(tagged_sentence)
print(tree)
tree.draw()

(S
  (NP The/DT brown/JJ fox/NN)
  is/VBZ
  (NP quick/JJ)
  and/CC
  (NP he/PRP)
  is/VBZ
  jumping/VBG
  over/IN
  (NP the/DT lazy/JJ dog/NN))


### Dependency Parsing
In dependency-based parsing, we try to use dependency-based grammars to analyze and infer both structure and semantic dependencies and relationships between tokens in a sentence. Dependency grammars help us annotate sentences with dependency tags, which are one-to-one mappings between tokens signifying dependencies between them.

The basic principle behind a dependency grammar is that in any sentence in the language, all words except one have some relationship with, or dependency on, other words in the sentence. The word that has no dependency is called the root of the sentence. The verb is taken as the root of the sentence in most cases. All the other words are directly or indirectly linked to the root verb through links, which are the dependencies.

For reference, see the Universal Dependencies relations for English: https://universaldependencies.org/en/dep/index.html

In [12]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp("The brown fox is quick and he is jumping over the lazy dog")

for token in doc:
    print(token.text, ' <---- ', token.dep_, ' <---- ', token.head.text)
    
spacy.displacy.render(doc, style='dep',
                     jupyter=True,
                    options={'distance': 110,
                        'arrow_stroke': 2,
                        'arrow_width': 8})

The  <----  det  <----  fox
brown  <----  amod  <----  fox
fox  <----  nsubj  <----  is
is  <----  ROOT  <----  is
quick  <----  acomp  <----  is
and  <----  cc  <----  is
he  <----  nsubj  <----  jumping
is  <----  aux  <----  jumping
jumping  <----  conj  <----  is
over  <----  prep  <----  jumping
the  <----  det  <----  dog
lazy  <----  amod  <----  dog
dog  <----  pobj  <----  over
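The claim that every word is directly or indirectly linked to the root can be checked by following head links upward. The sketch below is pure Python and encodes the output above as (text, dependency label, head index) triples. The head indices are an assumption: the printout shows only the head's text, so every head printed as "is" is taken to be the ROOT token at index 3. `path_to_root` is a helper name introduced for this example.

```python
# (text, dependency label, head index) per token, read off the output above;
# head indices are assumed, since the printout shows head text only
tokens = [
    ('The', 'det', 2), ('brown', 'amod', 2), ('fox', 'nsubj', 3),
    ('is', 'ROOT', 3), ('quick', 'acomp', 3), ('and', 'cc', 3),
    ('he', 'nsubj', 8), ('is', 'aux', 8), ('jumping', 'conj', 3),
    ('over', 'prep', 8), ('the', 'det', 12), ('lazy', 'amod', 12),
    ('dog', 'pobj', 9),
]


def path_to_root(index):
    """Follow head links from a token up to the root; return the texts visited."""
    path = [tokens[index][0]]
    while tokens[index][1] != 'ROOT':
        index = tokens[index][2]
        path.append(tokens[index][0])
    return path


print(' -> '.join(path_to_root(12)))  # dog -> over -> jumping -> is
```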


### Constituent Parsing
Constituency-based grammars are used to analyze and determine the constituents that a sentence is composed of. Besides determining the constituents, another important objective is to determine their internal structure and how they link to each other. In general, a constituency-based grammar specifies how we can break a sentence into its various constituents.
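As a small illustration of how such a grammar breaks a sentence into constituents, the sketch below parses a fragment of the sample sentence with a hand-written context-free grammar and NLTK's chart parser. The grammar rules are illustrative assumptions written for this example, not the grammar used by any real parser.

```python
import nltk

# a tiny hand-written grammar (illustrative assumption)
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT JJ NN
VP -> VBZ ADJP
ADJP -> JJ
DT -> 'The'
JJ -> 'brown' | 'quick'
NN -> 'fox'
VBZ -> 'is'
""")

parser = nltk.ChartParser(grammar)
trees = list(parser.parse('The brown fox is quick'.split()))

# each tree shows how the sentence decomposes into nested constituents
for tree in trees:
    print(tree)
```

Unlike the shallow parsers above, the result nests phrases inside one another, for example the ADJP inside the VP.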

In [24]:
from nltk.parse.stanford import StanfordParser

sample_sentence = 'The brown fox is quick and he is jumping over the lazy dog'

scp = StanfordParser(path_to_jar='C:/Users/shres/PythonDoc/text_analytics/stanford_parser/stanford-parser-full-2015-04-20/stanford-parser.jar',
                     path_to_models_jar='C:/Users/shres/PythonDoc/text_analytics/stanford_parser/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar')

# get parse tree                   
result = list(scp.raw_parse(sample_sentence))[0]
print(result)
result.pretty_print() 

Please use nltk.parse.corenlp.CoreNLPParser instead.


(ROOT
  (NP
    (S
      (S
        (NP (DT The) (JJ brown) (NN fox))
        (VP (VBZ is) (ADJP (JJ quick))))
      (CC and)
      (S
        (NP (PRP he))
        (VP
          (VBZ is)
          (VP
            (VBG jumping)
            (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))))))
                                ROOT                                  
                                 |                                     
                                 NP                                   
                                 |                                     
                                 S                                    
            _____________________|____                                 
           |                 |        S                               
           |                 |    ____|___________                     
           |                 |   |                VP                  
           |                 |   |     ___________|____        