# TOC

  __Chapter 4 - Parsing structure in text__

1. [Import](#Import)
1. [Why we need parsing](#Why-we-need-parsing)
1. [Different types of parsers](#Different-types-of-parsers)
1. [A regex parser](#A-regex-parser)
1. [Dependency parsing](#Dependency-parsing)
1. [Chunking](#Chunking)
1. [Information extraction](#Information-extraction)


# Import

<a id = 'Import'></a>

In [2]:
# Standard library and settings
import os
import sys
import importlib
import itertools
import warnings

warnings.simplefilter("ignore")
from IPython.display import display, HTML

display(HTML("<style>.container { width:95% !important; }</style>"))

# Data extensions and settings
import numpy as np

np.set_printoptions(threshold=np.inf, suppress=True)
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.options.display.float_format = "{:,.6f}".format

# Modeling extensions
import nltk

# Visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
sns.set_style("whitegrid")

# Why we need parsing

Parsing lets us define a set of rules that serve as a template for forming sentences and arranging words in the proper order. Humans learn their native language in childhood and instinctively absorb its rules. Text parsing is an attempt to replicate this process in NLP.


<a id = 'Why-we-need-parsing'></a>

In [5]:
# Grammar rules with a very limited vocabulary and generic rules.
# Example sentences the grammar accepts: 'the president eats an apple'
# and 'the president drinks a coke'.
# NLTK's context-free grammar class is CFG (there is no CPG in nltk).
from nltk import CFG

toy_grammar = CFG.fromstring(
    """
S -> NP VP
VP -> V NP
V -> "eats" | "drinks"
NP -> Det N
Det -> "a" | "an" | "the"
N -> "president" | "Obama" | "apple" | "coke"
"""
)

# Different types of parsers

A parser is a procedural interpretation of a grammar: it searches the space of structures the grammar allows and finds one or more that derive a given sentence. Several kinds of parsers are available:

__Recursive descent parser__

This is a straightforward, top-down form of parsing in which the parser attempts to verify that the syntax of the input stream is correct as it is read from left to right. In short, the process reads tokens from the input stream and checks them against the grammar rules. For example, the parser looks ahead one token and advances when it finds a proper match.

__Shift-reduce parser__

This is a simple bottom-up parser. It shifts input tokens onto a stack and, whenever the items on top of the stack match the right-hand side of a grammar rule, reduces them by replacing them with the symbol on the left-hand side of that rule.

__Chart parser__

A chart parser is a dynamic programming method that stores intermediate results and reuses them where appropriate instead of recomputing them.
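
As a rough illustration of the three parser types, the sketch below runs NLTK's `RecursiveDescentParser`, `ShiftReduceParser`, and `ChartParser` on the same toy grammar. The grammar is repeated here so the example is self-contained, and the test sentence is an assumption chosen to fit its vocabulary.

```python
import nltk

# Toy grammar with a tiny vocabulary, repeated here so the example is
# self-contained.
toy_grammar = nltk.CFG.fromstring(
    """
S -> NP VP
VP -> V NP
V -> "eats" | "drinks"
NP -> Det N
Det -> "a" | "an" | "the"
N -> "president" | "Obama" | "apple" | "coke"
"""
)

# Example sentence (an assumption; any sentence the grammar covers would do).
tokens = "the president eats an apple".split()

parsers = {
    "recursive descent": nltk.RecursiveDescentParser(toy_grammar),  # top-down
    "shift-reduce": nltk.ShiftReduceParser(toy_grammar),            # bottom-up
    "chart": nltk.ChartParser(toy_grammar),                         # dynamic programming
}

results = {}
for name, parser in parsers.items():
    # parse() yields every tree the parser finds for the token list
    results[name] = list(parser.parse(tokens))
    print(name, "->", len(results[name]), "parse(s)")
    for tree in results[name]:
        print(tree)
```

On a grammar this small all three strategies should agree. The differences matter more on ambiguous or larger grammars, where the shift-reduce parser can miss a valid parse because it does not backtrack, and the chart parser avoids repeating work.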


<a id = 'Different-types-of-parsers'></a>

# A regex parser
A regex parser combines grammar rules written as regular expressions with POS-tagged strings. The parser uses these rules to parse sentences and generate a tree.

In the example below, regular expressions are matched against the POS tag of each word. The rules describe the tag patterns that are believed to form phrases. For example, any sequence of tags matching `{<DT>? <JJ>* <NN>*}`, that is, an optional determiner followed by any number of adjectives and then any number of nouns, is most likely a noun phrase.
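
To make the idea of matching a regular expression against tags concrete, here is a minimal pure-Python sketch. It is not how NLTK implements `RegexpParser`. It only joins the tags into one string and runs an ordinary regex over it. The helper `find_noun_phrases`, the hand-tagged tokens, and the use of `<NN>+` (at least one noun, to avoid empty matches) are assumptions made for illustration.

```python
import re

# Optional determiner, any number of adjectives, then one or more nouns.
# Each tag is wrapped in angle brackets so a pattern element matches a whole tag.
NP_PATTERN = re.compile(r"(<DT>)?(<JJ>)*(<NN>)+")


def find_noun_phrases(tagged_tokens):
    """Return the word sequences whose tag sequence matches NP_PATTERN."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # Count the tags before and inside the match to recover token indices.
        start = tag_string[: match.start()].count("<")
        length = match.group(0).count("<")
        phrases.append([word for word, _ in tagged_tokens[start : start + length]])
    return phrases


# Hand-tagged tokens (an assumption, so no tagger model is required).
tagged = [
    ("played", "VBD"), ("a", "DT"), ("big", "JJ"), ("role", "NN"),
    ("in", "IN"), ("the", "DT"), ("health", "NN"), ("insurance", "NN"),
    ("bill", "NN"),
]
noun_phrases = find_noun_phrases(tagged)
print(noun_phrases)
```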

<a id = 'A-regex-parser'></a>

In [7]:
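# Regex-based grammar applied to POS-tagged tokens
from nltk.chunk.regexp import ChunkRule, RegexpParser

# Standalone rule shown for illustration; reg_parser below does not use it
chunk_rules = ChunkRule("<.*>+", "chunk everything")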
reg_parser = RegexpParser(
    """
NP: {<DT>? <JJ>* <NN>*}     # NP
P: {<IN>}                   # preposition
V: {<V.*>}                  # verb
PP: {<P> <NP>}              # PP -> P NP
VP: {<V> <NP|PP>*}          # VP -> V (NP | PP)*
"""
)

test_sent = "Mr. Obama played a big role in the health insurance bill"
test_sent_pos = nltk.pos_tag(nltk.word_tokenize(test_sent))
parsed_out = reg_parser.parse(test_sent_pos)
print(parsed_out)

(S
  Mr./NNP
  Obama/NNP
  (VP
    (V played/VBD)
    (NP a/DT big/JJ role/NN)
    (PP (P in/IN) (NP the/DT health/NN insurance/NN bill/NN))))


# Dependency parsing

Dependency parsing rests on the idea that each word in a sentence is connected to other words by direct links, called dependencies. Phrase structure trees capture how words group into phrases and how phrases combine into larger phrases. A dependency tree, by contrast, links the words themselves: given a sentence such as "The big dog runs", it records, among other things, that "big" depends on "dog".
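
One simple way to picture a dependency tree is as a mapping from each word to its head. The sketch below hand-annotates "The big dog runs". The arcs and relation labels (`det`, `amod`, `nsubj`, `root`) are assumptions written by hand, not parser output, and the helper names are hypothetical.

```python
# Hand-annotated dependency arcs for "The big dog runs" (illustrative only).
# Each entry maps a dependent word to (head word, relation label).
dependencies = {
    "The": ("dog", "det"),
    "big": ("dog", "amod"),
    "dog": ("runs", "nsubj"),
    "runs": (None, "root"),  # the root has no head
}


def dependents_of(head, deps):
    """Return the words that depend directly on `head`."""
    return [word for word, (h, _) in deps.items() if h == head]


def path_to_root(word, deps):
    """Follow head links from `word` up to the root of the tree."""
    path = [word]
    while deps[path[-1]][0] is not None:
        path.append(deps[path[-1]][0])
    return path


print(dependents_of("dog", dependencies))
print(path_to_root("big", dependencies))
```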

<a id = 'Dependency-parsing'></a>

In [None]:
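# Stanford parser (requires the Stanford parser JAR files to be available locally)
from nltk.parse.stanford import StanfordParser

english_parser = StanfordParser(
    "stanford-parser.jar", "stanford-parser-3.4-models.jar"
)
# raw_parse_sents expects a list of sentences, not a single string
for parses in english_parser.raw_parse_sents(["this is the english parser test"]):
    for tree in parses:
        print(tree)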

# Chunking

Chunking is a shallow parsing technique that tries to determine combinations of words that together constitute some meaning. A chunk can be thought of as the minimal unit needed to convey a certain message. For example, "The President speaks about the health care reforms" can be separated into two chunks. "The President" is dominated by a noun and is therefore identified as a noun phrase (NP), while the remainder of the sentence is dominated by the verb "speaks", which makes it a verb phrase (VP). Within this second chunk there is a subchunk, "the health care reforms", which is itself an NP. At each tier, the result is a set of non-overlapping groups of words.
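
The chunk structure described above can be written down directly as an `nltk.Tree`. This is a hand-built sketch of the expected result, not the output of a chunker; the bracketing is an assumption that follows the description in the text.

```python
from nltk import Tree

# Hand-built chunk tree for the example sentence (not chunker output).
chunked = Tree.fromstring(
    "(S (NP The President) (VP speaks about (NP the health care reforms)))"
)

# Collect every NP and VP chunk, outer chunks first.
chunks = [
    (subtree.label(), " ".join(subtree.leaves()))
    for subtree in chunked.subtrees(lambda t: t.label() in {"NP", "VP"})
]
for label, words in chunks:
    print(label, "->", words)
```

The two top-level chunks (NP and VP) do not overlap, and the inner NP sits entirely inside the VP, which is the tiered structure the paragraph describes.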


<a id = 'Chunking'></a>

In [14]:
# Basic chunking example: one rule for verb phrases, one for noun phrases
from nltk.chunk.regexp import ChunkRule, RegexpChunkParser

test_sent = "The prime minister announced he had asked the chief government whip, Philip Ruddock, to call a special party room meeting for 9am on Monday to consider the spill motion."
test_sent_pos = nltk.pos_tag(nltk.word_tokenize(test_sent))

rule_vp = ChunkRule(r"(<VB.*>)?(<VB.*>)+(<PRP>)?", "Chunk VPs")
parser_vp = RegexpChunkParser([rule_vp], chunk_label="VP")
print(parser_vp.parse(test_sent_pos))

rule_np = ChunkRule(r"(<DT>?<RB>?)?<JJ|CD>*(<JJ|CD><,>)*(<NN.*>)+", "Chunk NPs")
parser_np = RegexpChunkParser([rule_np], chunk_label="NP")
print(parser_np.parse(test_sent_pos))

(S
  The/DT
  prime/JJ
  minister/NN
  (VP announced/VBD he/PRP)
  (VP had/VBD asked/VBN)
  the/DT
  chief/JJ
  government/NN
  whip/NN
  ,/,
  Philip/NNP
  Ruddock/NNP
  ,/,
  to/TO
  (VP call/VB)
  a/DT
  special/JJ
  party/NN
  room/NN
  meeting/NN
  for/IN
  9am/CD
  on/IN
  Monday/NNP
  to/TO
  (VP consider/VB)
  the/DT
  spill/NN
  motion/NN
  ./.)
(S
  (NP The/DT prime/JJ minister/NN)
  announced/VBD
  he/PRP
  had/VBD
  asked/VBN
  (NP the/DT chief/JJ government/NN whip/NN)
  ,/,
  (NP Philip/NNP Ruddock/NNP)
  ,/,
  to/TO
  call/VB
  (NP a/DT special/JJ party/NN room/NN meeting/NN)
  for/IN
  9am/CD
  on/IN
  (NP Monday/NNP)
  to/TO
  consider/VB
  (NP the/DT spill/NN motion/NN)
  ./.)


# Information extraction

A typical NLP pipeline involves these steps:

1. Raw text intake
2. Sentence tokenization (list of strings)
3. Word tokenization (list of lists of strings)
4. Part of speech tagging (tuples)
5. Named entity detection
6. Relationship extraction

The only step not covered so far is relation extraction. As the name suggests, this is the process of extracting the relationships that exist between entities. For example, authorship is a relationship that connects a book with its writer.
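
Before turning to NLTK's tooling, the core idea can be sketched in plain Python: take pairs of already-labeled entities together with the text between them, and keep the pairs whose connecting text matches a pattern. The candidate triples (apart from the "Open Text" example) and the helper `extract_relations` are hypothetical; this is a simplified sketch, not NLTK's `extract_rels` implementation.

```python
import re

# The connecting text must contain the word "in", and that "in" must not be
# followed later by text containing "ing" (this filters gerund constructions
# such as "success in supervising ...").
IN = re.compile(r".*\bin\b(?!\b.+ing)")

# Hypothetical, hand-labeled candidates: (subject, connecting text, object),
# where each entity is a (label, text) pair.
candidates = [
    (("ORG", "Open Text"), ", based in", ("LOC", "Waterloo")),
    (("ORG", "Acme Press"), "published a guide to", ("LOC", "Paris")),
    (("ORG", "Acme Press"), "has success in supervising the transition of", ("LOC", "Paris")),
    (("PER", "Jane Doe"), ", an analyst in", ("LOC", "Berlin")),
]


def extract_relations(subj_label, obj_label, triples, pattern):
    """Keep triples with the requested entity labels and matching filler text."""
    return [
        (subj_text, filler, obj_text)
        for (s_label, subj_text), filler, (o_label, obj_text) in triples
        if s_label == subj_label and o_label == obj_label and pattern.match(filler)
    ]


org_in_loc = extract_relations("ORG", "LOC", candidates, IN)
per_in_loc = extract_relations("PER", "LOC", candidates, IN)
print(org_in_loc)
print(per_in_loc)
```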


<a id = 'Information-extraction'></a>

In [None]:
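# Simple pipeline: raw text -> sentences -> tokens -> POS tags -> named entities
text_path = "sample.txt"  # hypothetical path; point this at your own text file
with open(text_path) as f:
    text = f.read()
sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
for sent in tagged_sentences:
    # ne_chunk expects one POS-tagged sentence at a time
    print(nltk.ne_chunk(sent))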

In [30]:
# relation extraction workflow - organization in location
import re

IN = re.compile(r".*\bin\b(?!\b.+ing)")

for doc in nltk.corpus.ieer.parsed_docs("NYT_19980315"):
    for rel in nltk.sem.extract_rels("ORG", "LOC", doc, corpus="ieer", pattern=IN):
        print(nltk.sem.rtuple(rel))

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']


In [32]:
# relation extraction workflow - people in location
import re

IN = re.compile(r".*\bin\b(?!\b.+ing)")

for doc in nltk.corpus.ieer.parsed_docs("NYT_19980315"):
    for rel in nltk.sem.extract_rels("PERSON", "LOC", doc, corpus="ieer", pattern=IN):
        print(nltk.sem.rtuple(rel))

[PER: 'Miller'] "started talking. ``Fresh Air'' started as a local show in" [LOC: 'Philadelphia']
[PER: 'Drudge'] 'be sued in the' [LOC: 'District of Columbia']
[PER: 'Alan Brody'] ', an independent media analyst in' [LOC: 'Scarsdale']
[PER: 'Jerry Yang'] ', co-founder of the company, which is based in' [LOC: 'Santa Clara']
[PER: 'Frank'] "'s account of growing up in Depression" [LOC: 'Ireland']
[PER: 'Wilson'] 'has always stirred the strongest reactions. Especially in' [LOC: 'Europe']
[PER: 'Dominique Sanda'] ') in' [LOC: 'Milan']
[PER: 'Tania Leon'] 'in' [LOC: 'Geneva']
