# LANGUAGE PARSING TUTORIAL

based on the courses of Codecademy


### This part focuses only on parsing; don't forget the pre-processing steps when working with real-world text. 

Parsing is a stage of NLP concerned with **breaking up / segmenting** text **based on syntax**.

By using Python’s _regular expression_ module **re** and the _Natural Language Toolkit_, known as _NLTK_, you can:    
* find keywords of interest
* discover where and how often they are used 
* discern the parts-of-speech patterns in which they appear, to understand the sometimes hidden meaning in a piece of writing

> ex:  highlight the biases of its author or uncover additional insights that even a deep, rigorous reading of the text might not reveal

### MENU: 


1) Compiling and Matching


2) Searching and Finding

3) Part-of-Speech Tagging

4) Introduction to Chunking
   - 4 - 1) Chunking Noun Phrases
   - 4 - 2) Chunking Verb Phrases
   - 4 - 3) Chunk filtering

5) REVIEW 

# 1) Compiling and Matching

* **.compile( )** :  
This method takes a regular expression pattern as an argument and compiles it into a regular expression object, which you can later use to find matching text.

* **.match( )** :   
This method takes a string of text as an argument and looks for a single match to the regular expression, starting at the beginning of the string.
    * If it finds a match that starts at the beginning of the string, it returns a match object
    * If there is no match, .match() returns None



In [9]:
import re

# characters are defined
character_1 = "Dorothy"
character_2 = "Henry"

# compile a regular expression object that matches 7 consecutive letters; it can be reused later to find matching text
regular_expression = re.compile("[A-Za-z]{7}") #  recognizes upper or lower case characters
print("regular_expression: ", regular_expression)

# check for a match to character_1 here
# (check if regex matches string stored in character_1)
result_1 = regular_expression.match(character_1)
print("result_1: ", result_1)


# store and print the matched text here 
match_1 = result_1.group(0)
print("match_1: ", match_1)

# compile a regular expression that matches 7 consecutive letters
# and check for a match to character_2 here
regular_expression = re.compile("[A-Za-z]{7}") # looks for 7 letters at the start of the string
result_2 = regular_expression.match(character_2) # but "Henry" has only 5 letters
print("result_2: ", result_2) # so nothing is found => None
# compiling and matching can also be written as a single call
result_2 = re.match("[A-Za-z]{7}", character_2)
print("result_2: ", result_2)

regular_expression:  re.compile('[A-Za-z]{7}')
result_1:  <_sre.SRE_Match object; span=(0, 7), match='Dorothy'>
match_1:  Dorothy
result_2:  None
result_2:  None
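Because .match() returns None when nothing matches, calling .group() on the result without a check raises an AttributeError. The following is a minimal sketch of a guarded lookup; the helper name first_seven_letters is hypothetical and only used for illustration.

```python
import re

# Compile once, reuse for several strings
seven_letters = re.compile(r"[A-Za-z]{7}")

def first_seven_letters(name):
    """Return the matched text, or None when the name does not start with 7 letters."""
    result = seven_letters.match(name)
    # .match() returns None on failure, so guard before calling .group()
    return result.group(0) if result is not None else None

print(first_seven_letters("Dorothy"))    # Dorothy
print(first_seven_letters("Henry"))      # None
# .match() is anchored at the start only, so a longer name still matches its first 7 letters
print(first_seven_letters("Scarecrow"))  # Scarecr
# .fullmatch() requires the whole string to match the pattern
print(seven_letters.fullmatch("Scarecrow"))  # None
```

If the whole string has to fit the pattern, .fullmatch() is the stricter choice, as the last line shows.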


# 2) Searching and Finding

* **.search( )** :   
Looks left to right through an entire piece of text and returns a match object for the first match to the regular expression given (unlike **.match( )**, which only finds matches at the start of a string).   
If no match is found, _.search()_ returns None.
>ex: **result = re.search(r"\w{8}","Are you a Munchkin?")** searches for a sequence of 8 word characters in the string "Are you a Munchkin?" (the match is "Munchkin")

* **.findall( )** :  
Returns a list of all non-overlapping matches of the regular expression in the string.   
This is useful for finding all the occurrences of a word or keyword in a piece of text, for example to determine a frequency count.
>ex:   
text  = *"Everything is green here, while in the country of the Munchkins blue was the favorite color. But the people do not seem to be as friendly as the Munchkins, and I'm afraid we shall be unable to find a place to pass the night."*   
list_of_matches = re.findall(r"\w{8}",text)  
returns the list:  *['Everythi', 'Munchkin', 'favorite', 'friendly', 'Munchkin']*
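The difference between the three calls can be checked on the same text. This is a small sketch; the \b word-boundary variant is an addition not covered above, shown because a pattern without boundaries also returns the first 8 characters of longer words.

```python
import re

text = ("Everything is green here, while in the country of the Munchkins "
        "blue was the favorite color. But the people do not seem to be as "
        "friendly as the Munchkins, and I'm afraid we shall be unable to "
        "find a place to pass the night.")

# .match() only looks at the start of the string; .search() scans the whole string
at_start = re.match(r"Munchkin", text)    # None: the text starts with "Everything"
anywhere = re.search(r"Munchkin", text)   # match object for the first occurrence

# without boundaries, longer words are cut down to their first 8 characters
prefixes = re.findall(r"\w{8}", text)

# \b word boundaries keep only words that are exactly 8 characters long
exact_words = re.findall(r"\b\w{8}\b", text)

print(at_start)
print(anywhere.group(0))
print(prefixes)
print(exact_words)
```

The same idea applies to keyword counts: a pattern such as r"\blion\b" counts the word itself, while "lion" also counts it inside longer words.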

In [13]:
import re

# import L. Frank Baum's The Wonderful Wizard of Oz
oz_text = open("the_wizard_of_oz_text.txt",encoding='utf-8').read().lower()

# search oz_text for an occurrence of 'wizard' here
found_wizard = re.search("wizard",oz_text)
print("found_wizard: ", found_wizard)

# find all the occurrences of 'lion' in oz_text here
all_lions = re.findall("lion", oz_text)
print("\n all_lions_list: ", all_lions)

# store and print the length of all_lions here
number_lions =  len(all_lions)
print("\n number_lions: ", number_lions)

found_wizard:  <_sre.SRE_Match object; span=(14, 20), match='wizard'>

 all_lions_list:  ['lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion'

# 3) Part-of-Speech Tagging

>ex:  
"Wow! Ramona and her class are happily studying the new textbook she has on NLP."


* **Noun**: the name of a person (Ramona,class), place, thing (textbook), or idea (NLP) = **NN** 


* **Pronoun**: a word used in place of a noun (her,she) = **PRP** 


* **Determiner**: a word that introduces, or “determines”, a noun (the) = **DT** 


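* **Verb**: expresses action (studying) or being (are,has) = the **VB** family of tags
    * **VBP** is a non-third-person singular present verb (are), **VBZ** a third-person singular present verb (has)
    * **VBG** is a gerund or present participle (studying)
    * **VBD** is a past tense verb
    * **VBN** is a past participle verb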


* **Adjective**: modifies or describes a noun or pronoun (new) = **JJ** 


* **Adverb**: modifies or describes a verb, an adjective, or another adverb (happily) = **RB** 


* **Preposition**: a word placed before a noun or pronoun to form a phrase modifying another word in the sentence (on) = **IN** 


* **Conjunction**: a word that joins words, phrases, or clauses (and) = **CC** 

 
* **Interjection**: a word used to express emotion (Wow) = **UH** 


* **pos_tag()** function:   
Automates the part-of-speech tagging process.  
The function takes one argument, a list of words in the order they appear in a sentence, and returns a list of tuples, where the first entry in the tuple is a word and the second is the part-of-speech tag

In [70]:
import nltk
from nltk import pos_tag # automates the part-of-speech tagging process
from nltk.tokenize import word_tokenize # word tokenization
from nltk.tokenize import sent_tokenize # sentence tokenization
# import L. Frank Baum's The Wonderful Wizard of Oz
oz_text = open("the_wizard_of_oz_text.txt",encoding='utf-8').read().lower()

# tokenization
sentences_tokenized = sent_tokenize(oz_text) # sentence tokenization

# sentence at index 100 of the tokenized text above
sentence_tokenized_100th = sentences_tokenized[100]
print("100th tokenized sentence:", sentence_tokenized_100th)

# the list below will contain the POS-tagged tokens of every sentence
tokenized_words_text = [] # an empty list

# the for loop tokenizes the words of each sentence separately
for sentence in sentences_tokenized:
    # word tokenization of the sentence
    word_sentence_toke = word_tokenize(sentence)
    # POS tagging of every token (word) in the sentence
    words_toke_for_all_sentences = pos_tag(word_sentence_toke)
    # add the POS-tagged sentence to the list
    tokenized_words_text.append(words_toke_for_all_sentences) # tokenized words + POS 

# print the tagged sentences at indices 18 and 19
for n in range(18, 20): 
    print("\nsentence number ", n, ": ", tokenized_words_text[n])

words_toke_100th_sentence = tokenized_words_text[100]
print("\nWords_toke_100th_sentence: ",words_toke_100th_sentence)

100th tokenized sentence: "the house must have fallen on her.

sentence number  18 :  [('the', 'DT'), ('sun', 'NN'), ('and', 'CC'), ('wind', 'NN'), ('had', 'VBD'), ('changed', 'VBN'), ('her', 'PRP'), (',', ','), ('too', 'RB'), ('.', '.')]

sentence number  19 :  [('they', 'PRP'), ('had', 'VBD'), ('taken', 'VBN'), ('the', 'DT'), ('sparkle', 'NN'), ('from', 'IN'), ('her', 'PRP$'), ('eyes', 'NNS'), ('and', 'CC'), ('left', 'VBD'), ('them', 'PRP'), ('a', 'DT'), ('sober', 'JJ'), ('gray', 'NN'), (';', ':'), ('they', 'PRP'), ('had', 'VBD'), ('taken', 'VBN'), ('the', 'DT'), ('red', 'JJ'), ('from', 'IN'), ('her', 'PRP$'), ('cheeks', 'NN'), ('and', 'CC'), ('lips', 'NNS'), (',', ','), ('and', 'CC'), ('they', 'PRP'), ('were', 'VBD'), ('gray', 'JJ'), ('also', 'RB'), ('.', '.')]

Words_toke_100th_sentence:  [('``', '``'), ('the', 'DT'), ('house', 'NN'), ('must', 'MD'), ('have', 'VB'), ('fallen', 'VBN'), ('on', 'IN'), ('her', 'PRP'), ('.', '.')]
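Since pos_tag() returns a plain list of (word, tag) tuples, the result can be explored with ordinary Python. A minimal sketch, using the tagged sentence 18 copied by hand from the output above so that no tagger data is needed:

```python
from collections import Counter

# Tagged output for sentence 18, copied from the cell output
tagged_sentence = [('the', 'DT'), ('sun', 'NN'), ('and', 'CC'), ('wind', 'NN'),
                   ('had', 'VBD'), ('changed', 'VBN'), ('her', 'PRP'),
                   (',', ','), ('too', 'RB'), ('.', '.')]

# count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged_sentence)

# collect the words whose tag starts with a given prefix (NN, NNS, ... / VB, VBD, ...)
nouns = [word for word, tag in tagged_sentence if tag.startswith('NN')]
verbs = [word for word, tag in tagged_sentence if tag.startswith('VB')]

print(tag_counts['NN'])  # 2
print(nouns)             # ['sun', 'wind']
print(verbs)             # ['had', 'changed']
```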


In [68]:
import nltk
from nltk import pos_tag # automates the part-of-speech tagging process
from nltk.tokenize import word_tokenize # word tokenization
from nltk.tokenize import sent_tokenize # sentence tokenization

sentence_to_POS = "Everything was beautiful and nothing hurt."

# the list below will contain the POS-tagged tokens of the sentence
POS_text = [] # an empty list

# no for loop is needed because we only have one sentence to tag
sentence_toke = word_tokenize(sentence_to_POS)
# POS tagging of every token (word) in the sentence 
POS_of_a_word = pos_tag(sentence_toke)
# add the POS-tagged sentence to the list
POS_text.append(POS_of_a_word) # tokenized words + POS 


print(POS_text)

[[('Everything', 'NN'), ('was', 'VBD'), ('beautiful', 'JJ'), ('and', 'CC'), ('nothing', 'NN'), ('hurt', 'NN'), ('.', '.')]]


# 4) Introduction to Chunking

* **chunking** = technique of grouping words by their part-of-speech (POS) tag.  

Given your part-of-speech tagged text, you can now use regular expressions to find patterns in sentence structure that give insight into the meaning of a text.

### _______________

The regular expression we build to find chunks is called chunk grammar. A piece of chunk grammar can be written as follows:

    chunk_grammar = "AN: {<JJ><NN>}"

    - AN is a user-defined name for the kind of chunk you are searching for. 
    We can use whatever name makes sense given our chunk grammar. In this case AN stands for adjective-noun
    - A pair of curly braces {} surrounds the actual chunk grammar
    - <JJ> operates similarly to a regex character class, matching any adjective tagged JJ
    - <NN> matches any noun tagged NN (singular or mass noun); plural nouns are tagged NNS and need a pattern such as <NN.*>


In [14]:
# -------------- this part is just to prepare the tokenized sentences ---------------
from nltk.tokenize import word_tokenize # word tokenization
from nltk.tokenize import sent_tokenize # sentence tokenization
from nltk import pos_tag # automates the part-of-speech tagging process
# import L. Frank Baum's The Wonderful Wizard of Oz
oz_text = open("the_wizard_of_oz_text.txt",encoding='utf-8').read().lower()
# tokenization
sentences_tokenized = sent_tokenize(oz_text) # sentence tokenization
# the list below will contain the POS-tagged tokens of every sentence
tokenized_words_text = [] # an empty list
# the for loop tokenizes the words of each sentence separately
for sentence in sentences_tokenized:
    # word tokenization of the sentence
    word_sentence_toke = word_tokenize(sentence)
    # POS tagging of every token (word) in the sentence 
    words_toke_for_all_sentences = pos_tag(word_sentence_toke)
    # add the POS-tagged sentence to the list
    tokenized_words_text.append(words_toke_for_all_sentences) # tokenized words + POS 
# --------------------------------------------------------------



# the code starts here below

from nltk import RegexpParser, Tree

# -- define adjective-noun chunk grammar -- 
# - AN is a user-defined name for the kind of chunk you are searching for. 
#   You can use whatever name makes sense given your chunk grammar. 
#   In this case AN stands for adjective-noun
# - <JJ> matches any adjective tagged JJ (it operates similarly to a regex character class)
# - <NN> matches any noun tagged NN (plural nouns, tagged NNS, would need <NN.*>)
# - {} surrounds the chunk grammar
chunk_grammar = "AN: {<JJ><NN>}"  # will match any adjective JJ that is followed by a noun NN

# -- create "RegexpParser" object --
# - To use the chunk grammar defined, we must create a nltk "RegexpParser" object and 
#   give it a piece of chunk grammar ("chunk_grammar") as an argument
chunk_parser = RegexpParser(chunk_grammar) # chunk_grammar = argument
print("- chunk_parser looks like: \n", chunk_parser)

# chunk the POS tagged sentence at index 282 in "tokenized_words_text"
print("\n- POS tagged sentence of interest: ", tokenized_words_text[282])
scaredy_cat = chunk_parser.parse(tokenized_words_text[282])
print("\n- Chunking-Parsed sentence ('scaredy_cat'): ", scaredy_cat, "\n")

# pretty_print the chunked sentence
print("- In form of a tree 'scaredy_cat' sentence : \n")
Tree.fromstring(str(scaredy_cat)).pretty_print()

- chunk_parser looks like: 
 chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<JJ><NN>'>

- POS tagged sentence of interest:  [('``', '``'), ('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'), ('emerald', 'JJ'), ('city', 'NN'), ('?', '.'), ("''", "''")]

- Chunking-Parsed sentence ('scaredy_cat'):  (S ``/`` where/WRB is/VBZ the/DT (AN emerald/JJ city/NN) ?/. ''/'') 

- In form of a tree 'scaredy_cat' sentence : 

                         S                                    
   ______________________|__________________________           
  |       |       |      |     |    |               AN        
  |       |       |      |     |    |        _______|_____     
``/`` where/WRB is/VBZ the/DT ?/. ''/'' emerald/JJ     city/NN
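Printing the tree is useful for inspection, but for analysis the chunks usually need to be pulled out of it. A minimal sketch, assuming NLTK is installed; the tagged sentence is copied by hand from the output above, so no tokenizer or tagger data is required:

```python
from nltk import RegexpParser

# POS-tagged sentence copied from the cell output
tagged = [('``', '``'), ('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'),
          ('emerald', 'JJ'), ('city', 'NN'), ('?', '.'), ("''", "''")]

parser = RegexpParser("AN: {<JJ><NN>}")
tree = parser.parse(tagged)

# keep only the subtrees labelled AN; .leaves() gives back their (word, tag) tuples
an_chunks = [subtree.leaves()
             for subtree in tree.subtrees(filter=lambda t: t.label() == 'AN')]

# join the words of each chunk into a readable phrase
an_phrases = [" ".join(word for word, _ in chunk) for chunk in an_chunks]

print(an_chunks)   # [[('emerald', 'JJ'), ('city', 'NN')]]
print(an_phrases)  # ['emerald city']
```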



## 4 - 1) Chunking Noun Phrases


Certain **types of chunking** are linguistically helpful for determining *meaning* and *bias* in a piece of text; noun phrase chunking is one of them.   
A **noun phrase** is a phrase that contains _a noun_ and operates, as a unit, as a noun.



* A popular form of noun phrase begins with a determiner **DT** (which specifies the noun being referenced), followed by any number of adjectives **JJ** (which describe the noun), and ends with a noun **NN**.

> ex : The POS-tagged sentence :  
[('we', 'PRP'), ('are', 'VBP'), ('so', 'RB'), ('grateful', 'JJ'), ('to', 'TO'), ('you', 'PRP'), ('for', 'IN'), ('having', 'VBG'), ('killed', 'VBN'), ('the', '**DT**'), ('wicked', '**JJ**'), ('witch', '**NN**'), ('of', 'IN'), ('the', '**DT**'), ('east', '**NN**'), (',', ','), ('and', 'CC'), ('for', 'IN'), ('setting', 'VBG'), ('our', 'PRP'), ('people', 'NNS'), ('free', 'VBP'), ('from', 'IN'), ('bondage', 'NN'), ('.', '.')]  
contains 2 noun phrases of this form:
 - (('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'))
 - (('the', 'DT'), ('east', 'NN'))

### ______


We can easily find all the non-overlapping noun phrases in a piece of text.  
Just like in normal regular expressions, we can use quantifiers to indicate how many of each POS we want to match.

The chunk grammar for a noun phrase (**NP-chunks**) can be written as follows :

    chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

    - NP is the user-defined name of the chunk we are searching for. In this case NP stands for noun phrase
    - <DT> matches any determiner
    - ? is an optional quantifier, matching either 0 or 1 determiners
    - <JJ> matches any adjective
    - * is the Kleene star quantifier, matching 0 or more occurrences of an adjective
    - <NN> matches any noun tagged NN (singular or mass noun); plural nouns are tagged NNS and need a pattern such as <NN.*>
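To see how the quantifiers behave, the tag sequence can be written out as one string and searched with the standard **re** module. This is only an illustration of the idea under that simplification, not how NLTK's RegexpParser is implemented; the tagged tokens are a hand-copied part of the example sentence above.

```python
import re

tagged = [('having', 'VBG'), ('killed', 'VBN'), ('the', 'DT'), ('wicked', 'JJ'),
          ('witch', 'NN'), ('of', 'IN'), ('the', 'DT'), ('east', 'NN')]

# encode the tag sequence as one string: "<VBG><VBN><DT><JJ><NN><IN><DT><NN>"
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# same shape as the chunk grammar: optional DT, zero or more JJ, exactly one NN
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

noun_phrases = []
for m in np_pattern.finditer(tag_string):
    # convert character offsets back to token indices by counting the '<' characters
    start = tag_string.count("<", 0, m.start())
    end = start + m.group(0).count("<")
    noun_phrases.append(" ".join(word for word, _ in tagged[start:end]))

print(noun_phrases)  # ['the wicked witch', 'the east']
```

The matches are non-overlapping and found left to right, which is also how the chunks are described in this section.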
    

**->** With the **NP-chunks** of a text, we can:

* perform a frequency analysis and identify important, recurring noun phrases
* identify pseudo-topics and tag articles and documents by their highest count NP-chunks
* analyze the adjective choices an author makes for different nouns
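The frequency analysis from the first bullet can be sketched with a Counter over the chunk subtrees. The function count_chunks below is a hypothetical helper written for this illustration (its name, parameters and default of 30 results are assumptions); it is not the np_chunk_counter module used in the notebook, whose implementation is not shown here. The two sentences are tagged by hand so that only NLTK itself is needed.

```python
from collections import Counter
from nltk import RegexpParser

def count_chunks(chunked_sentences, label="NP", top_n=30):
    """Hypothetical helper: return the top_n most common chunks with the given label."""
    counter = Counter()
    for sentence_tree in chunked_sentences:
        for subtree in sentence_tree.subtrees(filter=lambda t: t.label() == label):
            # store each chunk as a tuple of (word, tag) pairs so it can be counted
            counter[tuple(subtree.leaves())] += 1
    return counter.most_common(top_n)

# two hand-tagged sentences, so no tagger data is needed to run the sketch
tagged_sentences = [
    [('the', 'DT'), ('lion', 'NN'), ('roared', 'VBD'), ('.', '.')],
    [('dorothy', 'NN'), ('saw', 'VBD'), ('the', 'DT'), ('lion', 'NN'), ('.', '.')],
]
parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
chunked = [parser.parse(sentence) for sentence in tagged_sentences]

most_common = count_chunks(chunked)
print(most_common)
```

Each entry pairs a chunk with its count, the same shape as the output printed by the notebook cell.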

    

In [46]:
# -------------- this part is just to prepare the tokenized sentences ---------------
from nltk.tokenize import word_tokenize # word tokenization
from nltk.tokenize import sent_tokenize # sentence tokenization
from nltk import pos_tag # automates the part-of-speech tagging process
# import L. Frank Baum's The Wonderful Wizard of Oz
oz_text = open("the_wizard_of_oz_text.txt",encoding='utf-8').read().lower()
# tokenization
sentences_tokenized = sent_tokenize(oz_text) # sentence tokenization
# the list below will contain the POS-tagged tokens of every sentence
tokenized_words_text = [] # an empty list
# the for loop tokenizes the words of each sentence separately
for sentence in sentences_tokenized:
    # word tokenization of the sentence
    word_sentence_toke = word_tokenize(sentence)
    # POS tagging of every token (word) in the sentence 
    words_toke_for_all_sentences = pos_tag(word_sentence_toke)
    # add the POS-tagged sentence to the list
    tokenized_words_text.append(words_toke_for_all_sentences) # tokenized words + POS 
# --------------------------------


# the code starts here
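from nltk import RegexpParser
# np_chunk_counter is not part of NLTK: it is assumed to be a helper module from the course, placed next to this notebook
from np_chunk_counter import np_chunk_counter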

# define noun-phrase chunk grammar 
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

# create RegexpParser object 
chunk_parser = RegexpParser(chunk_grammar)

# create a list to hold noun-phrase chunked sentences
np_chunked_oz = list()

# loop through each POS-tagged sentence in tokenized_words_text
for pos_tagged_sentence in tokenized_words_text:
  # chunk each sentence and append to np_chunked_oz here
  np_chunked_oz.append(chunk_parser.parse(pos_tagged_sentence))

# store and print the 30 most common np-chunks from a list of chunked sentences
most_common_np_chunks = np_chunk_counter(np_chunked_oz)
print("Here below we can see the distribution of words but also for small NP-Chunks sentences: \n")
print(most_common_np_chunks)

'''Words alone : ((('i', 'NN'),), 326),   ((('dorothy', 'NN'),), 222) \
   Couple words : ((('the', 'DT'), ('scarecrow', 'NN')), 213),    ((('the', 'DT'), ('lion', 'NN')), 148),   
   ((('the', 'DT'), ('door', 'NN')), 21) \ 
    3 words coming together : ((('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN')), 21)'''
    
# Analysis :
# ((('the', 'DT'),('lion', 'NN')), 148) = the lion is mentioned more often 
# ((('the', 'DT'), ('tin', 'NN')), 123) = the tin (woodman) is mentioned less often 

Here below we can see the distribution of words but also for small NP-Chunks sentences: 

[((('i', 'NN'),), 326), ((('dorothy', 'NN'),), 222), ((('the', 'DT'), ('scarecrow', 'NN')), 213), ((('the', 'DT'), ('lion', 'NN')), 148), ((('the', 'DT'), ('tin', 'NN')), 123), ((('woodman', 'NN'),), 112), ((('oz', 'NN'),), 86), ((('toto', 'NN'),), 73), ((('head', 'NN'),), 59), ((('the', 'DT'), ('woodman', 'NN')), 59), ((('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN')), 58), ((('the', 'DT'), ('emerald', 'JJ'), ('city', 'NN')), 51), ((('the', 'DT'), ('witch', 'NN')), 49), ((('the', 'DT'), ('girl', 'NN')), 46), ((('the', 'DT'), ('road', 'NN')), 41), ((('room', 'NN'),), 29), ((('nothing', 'NN'),), 29), ((('the', 'DT'), ('air', 'NN')), 29), ((('the', 'DT'), ('country', 'NN')), 26), ((('the', 'DT'), ('land', 'NN')), 24), ((('a', 'DT'), ('heart', 'NN')), 24), ((('the', 'DT'), ('west', 'NN')), 23), ((('axe', 'NN'),), 23), ((('the', 'DT'), ('sun', 'NN')), 22), ((('the', 'DT'), ('little', 'JJ'), ('girl', 

"Words alone : ((('i', 'NN'),), 326),   ((('dorothy', 'NN'),), 222)    Couple words : ((('the', 'DT'), ('scarecrow', 'NN')), 213),    ((('the', 'DT'), ('lion', 'NN')), 148),   \n   ((('the', 'DT'), ('door', 'NN')), 21) \\ \n    3 words coming together : ((('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN')), 21)"

## 4 - 2) Chunking Verb Phrases

Another popular type of chunking is **VP-chunking**, or verb phrase chunking. 
* A verb phrase is a phrase that contains a verb and its complements, objects, or modifiers.

We consider two common verb phrase structures.  
The *first structure* begins with a verb VB of any tense, followed by a noun phrase, and ends with an optional adverb RB of any form.   
The *second structure* switches the order of the verb and the noun phrase, but also ends with an optional adverb.

>ex : consider the part-of-speech tagged verb phrases given below:  
- (('said', 'VBD'), ('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN')) -> *first structure*
- (('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN'), ('said', 'VBD')) -> *second structure*

###  __________

* The chunk grammar to find the *first form* of verb phrase is given below:  

        chunk_grammar = "VP: {<VB.*><DT>?<JJ>*<NN><RB.?>?}"

* The chunk grammar for the *second form* of verb phrase is given below:

        chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

And we have:

    - VP is the user-defined name of the chunk you are searching for. In this case VP stands for verb phrase
    - <VB.*> matches any verb, using the . as a wildcard and the * quantifier to match 0 or more occurrences of any character. This ensures matching verbs of any tense (ex. VB for the base form, VBD for past tense,
    or VBN for past participle)
    - <DT>?<JJ>*<NN> matches any noun phrase (? makes the determiner optional, and * is the Kleene star
    quantifier, matching 0 or more adjectives)
    - <RB.?> matches any adverb, using the . as a wildcard and the optional quantifier to match 0 or
    1 occurrence of any character. This ensures matching any form of adverb (regular RB, comparative RBR, or
    superlative RBS)
    - the final ? is an optional quantifier, matching either 0 or 1 adverbs



* **VP-chunks** :  
Just like with NP-chunks, we can find all of them in a text and perform a frequency analysis to identify **important, recurring verb phrases**.
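The two verb phrase grammars can be compared on a pair of hand-tagged phrases. A minimal sketch, assuming NLTK is installed; the helper vp_chunks and the adverb "sadly" are added here for illustration only.

```python
from nltk import RegexpParser

first_form = RegexpParser("VP: {<VB.*><DT>?<JJ>*<NN><RB.?>?}")   # verb, then noun phrase
second_form = RegexpParser("VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}")  # noun phrase, then verb

verb_first = [('said', 'VBD'), ('the', 'DT'), ('cowardly', 'JJ'),
              ('lion', 'NN'), ('sadly', 'RB')]
noun_first = [('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN'),
              ('said', 'VBD'), ('sadly', 'RB')]

def vp_chunks(parser, tagged):
    """Hypothetical helper: return the VP chunks of a tagged sentence as plain strings."""
    tree = parser.parse(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == 'VP')]

print(vp_chunks(first_form, verb_first))   # ['said the cowardly lion sadly']
print(vp_chunks(second_form, noun_first))  # ['the cowardly lion said sadly']
# each grammar only finds its own word order
print(vp_chunks(first_form, noun_first))   # []
print(vp_chunks(second_form, verb_first))  # []
```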

In [43]:
# -------------- this part is just to prepare the tokenized sentences ---------------
from nltk.tokenize import word_tokenize # word tokenization
from nltk.tokenize import sent_tokenize # sentence tokenization
from nltk import pos_tag # automates the part-of-speech tagging process
# import L. Frank Baum's The Wonderful Wizard of Oz
oz_text = open("the_wizard_of_oz_text.txt",encoding='utf-8').read().lower()
# tokenization
sentences_tokenized = sent_tokenize(oz_text) # sentence tokenization
# the list below will contain the POS-tagged tokens of every sentence
tokenized_words_text = [] # an empty list
# the for loop tokenizes the words of each sentence separately
for sentence in sentences_tokenized:
    # word tokenization of the sentence
    word_sentence_toke = word_tokenize(sentence)
    # POS tagging of every token (word) in the sentence 
    words_toke_for_all_sentences = pos_tag(word_sentence_toke)
    # add the POS-tagged sentence to the list
    tokenized_words_text.append(words_toke_for_all_sentences) # tokenized words + POS 
# --------------------------------


# code starts here
# ------------  first structure  ------------------
print("\nFirst structure : ")
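from nltk import RegexpParser
# vp_chunk_counter is not part of NLTK: it is assumed to be a helper module from the course, placed next to this notebook
from vp_chunk_counter import vp_chunk_counter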

# define verb phrase chunk grammar (first form of verb phrase)
chunk_grammar = "VP: {<VB.*><DT>?<JJ>*<NN><RB.?>?} " # first structure

# create RegexpParser object here
chunk_parser = RegexpParser(chunk_grammar)

# create a list to hold verb-phrase chunked sentences
vp_chunked_oz = list()

# loop through each POS-tagged sentence in tokenized_words_text
for pos_tagged_sentence in tokenized_words_text:
  # chunk each sentence and append to vp_chunked_oz here
  vp_chunked_oz.append(chunk_parser.parse(pos_tagged_sentence))
  
# store and print the most common vp-chunks here
most_common_vp_chunks = vp_chunk_counter(vp_chunked_oz)
print(most_common_vp_chunks)

# Analysis : 
# ((('said', 'VBD'), ('the', 'DT'), ('tin', 'NN')), 19) = the tin (woodman) speaks more often 
# ((('said', 'VBD'), ('the', 'DT'), ('lion', 'NN')), 15) = the lion speaks less often


# ------------  second structure  ------------------
print("\nSecond structure : ")
# update the grammar to find a verb phrase of the following form: noun phrase, followed by a verb VB, followed by an optional adverb RB
chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}" # second structure
# create RegexpParser object here
chunk_parser = RegexpParser(chunk_grammar)

# create a list to hold verb-phrase chunked sentences
vp_chunked_oz = list()

# loop through each POS-tagged sentence in tokenized_words_text
for pos_tagged_sentence in tokenized_words_text:
  # chunk each sentence and append to vp_chunked_oz here
  vp_chunked_oz.append(chunk_parser.parse(pos_tagged_sentence))
  
# store and print the most common vp-chunks here
most_common_vp_chunks = vp_chunk_counter(vp_chunked_oz)
print(most_common_vp_chunks)


First structure : 
[((('said', 'VBD'), ('the', 'DT'), ('scarecrow', 'NN')), 33), ((('said', 'VBD'), ('dorothy', 'NN')), 31), ((('asked', 'VBN'), ('dorothy', 'NN')), 20), ((('said', 'VBD'), ('the', 'DT'), ('tin', 'NN')), 19), ((('said', 'VBD'), ('the', 'DT'), ('lion', 'NN')), 15), ((('said', 'VBD'), ('the', 'DT'), ('girl', 'NN')), 10), ((('asked', 'VBN'), ('the', 'DT'), ('scarecrow', 'NN')), 10), ((('answered', 'VBD'), ('the', 'DT'), ('scarecrow', 'NN')), 8), ((('said', 'VBD'), ('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN')), 8), ((('said', 'VBD'), ('oz', 'NN')), 8), ((('said', 'VBD'), ('the', 'DT'), ('woodman', 'NN')), 7), ((('pass', 'VB'), ('the', 'DT'), ('night', 'NN')), 6), ((('asked', 'VBN'), ('the', 'DT'), ('girl', 'NN')), 6), ((('see', 'VB'), ('the', 'DT'), ('great', 'JJ'), ('oz', 'NN')), 6), ((('answered', 'VBD'), ('oz', 'NN')), 6), ((('replied', 'VBD'), ('oz', 'NN')), 6), ((('cried', 'VBN'), ('dorothy', 'NN')), 5), ((('asked', 'VBN'), ('the', 'DT'), ('tin', 'NN')), 5), ((('

## 4 - 3) Chunk filtering 
Chunk filtering lets you define what parts of speech you do not want in a chunk and remove them.

A popular method for performing chunk filtering: chunk an entire sentence together and then indicate which parts of speech are to be filtered out.  
If the filtered parts of speech are in the middle of a chunk, it will split the chunk into two separate chunks  


The chunk grammar to perform chunk filtering is given just below:

    chunk_grammar = """NP: {<.*>+}
                           }<VB.?|IN>+{"""

    - NP is the user-defined name of the chunk you are searching for.   
    In this case NP stands for noun phrase
    - The curly braces {} indicate what parts of speech we are chunking.   
    <.*>+ matches every part of speech in the sentence
    - The inverted curly braces }{ indicate which parts of speech you want to filter from the chunk.   
    <VB.?|IN>+ will filter out any verbs or prepositions
    

* **Chunk filtering**:  
  provides an alternative way for you to search through a text and find the chunks of information useful for your analysis
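The splitting behaviour can be checked on a single hand-tagged sentence. A minimal sketch, assuming NLTK is installed; the tags are written by hand, so no tokenizer or tagger data is required.

```python
from nltk import RegexpParser

# hand-tagged sentence
tagged = [('then', 'RB'), ('she', 'PRP'), ('sat', 'VBD'), ('upon', 'IN'),
          ('a', 'DT'), ('settee', 'NN'), ('and', 'CC'), ('watched', 'VBD'),
          ('the', 'DT'), ('people', 'NNS'), ('dance', 'NN'), ('.', '.')]

# chunk everything, then filter verbs and prepositions out of the chunk
chunk_grammar = """NP: {<.*>+}
                       }<VB.?|IN>+{"""
tree = RegexpParser(chunk_grammar).parse(tagged)

# the chunks that remain after filtering
kept = [" ".join(word for word, _ in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]

# filtered tokens are left as plain (word, tag) tuples directly under the root
removed = [child[0] for child in tree if isinstance(child, tuple)]

print(kept)     # ['then she', 'a settee and', 'the people dance .']
print(removed)  # ['sat', 'upon', 'watched']
```

The single sentence-wide chunk is split into three pieces because the filtered verbs and prepositions sit in the middle of it.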

In [47]:
# -------------- this part is just to prepare the tokenized sentences ---------------
from nltk.tokenize import word_tokenize # word tokenization
from nltk.tokenize import sent_tokenize # sentence tokenization
from nltk import pos_tag # automates the part-of-speech tagging process
# import L. Frank Baum's The Wonderful Wizard of Oz
oz_text = open("the_wizard_of_oz_text.txt",encoding='utf-8').read().lower()
# tokenization
sentences_tokenized = sent_tokenize(oz_text) # sentence tokenization
# the list below will contain the POS-tagged tokens of every sentence
tokenized_words_text = [] # an empty list
# the for loop tokenizes the words of each sentence separately
for sentence in sentences_tokenized:
    # word tokenization of the sentence
    word_sentence_toke = word_tokenize(sentence)
    # POS tagging of every token (word) in the sentence 
    words_toke_for_all_sentences = pos_tag(word_sentence_toke)
    # add the POS-tagged sentence to the list
    tokenized_words_text.append(words_toke_for_all_sentences) # tokenized words + POS 
# --------------------------------


# code starts here
from nltk import RegexpParser, Tree

# first part: chunk an entire sentence together
print("- chunk an entire sentence together: \n")
# define chunk grammar to chunk an entire sentence together
grammar = "Chunk: {<.*>+}" # <.*>+ matches every POS in the sentence

# create RegexpParser object
parser = RegexpParser(grammar)

# chunk the pos-tagged sentence at index 230 in "tokenized_words_text"
chunked_dancers = parser.parse(tokenized_words_text[230])
print(chunked_dancers)


# ----------------
# second part: chunk a noun phrase using chunk filtering
print("\n- chunk a noun phrase using chunk filtering: \n")
# define noun phrase chunk grammar using chunk filtering here
chunk_grammar = """NP: {<.*>+} 
                       }<VB.?|IN>+{""" 
# <.*>+ matches every POS in the sentence
# }<VB.?|IN>+{ filter out (}{) any verbs or prepositions (<VB.?|IN>+)

# create RegexpParser object 
chunk_parser = RegexpParser(chunk_grammar)

# chunk and filter the pos-tagged sentence at index 230 in "tokenized_words_text" 
filtered_dancers = chunk_parser.parse(tokenized_words_text[230])
print(filtered_dancers)

# pretty_print the chunked and filtered sentence here
print("\nIn form of a tree : \n")
Tree.fromstring(str(filtered_dancers)).pretty_print()

- chunk an entire sentence together: 

(S
  (Chunk
    then/RB
    she/PRP
    sat/VBD
    upon/IN
    a/DT
    settee/NN
    and/CC
    watched/VBD
    the/DT
    people/NNS
    dance/NN
    ./.))

- chunk a noun phrase using chunk filtering: 

(S
  (NP then/RB she/PRP)
  sat/VBD
  upon/IN
  (NP a/DT settee/NN and/CC)
  watched/VBD
  (NP the/DT people/NNS dance/NN ./.))

In form of a tree : 

                                                 S                                                  
    _____________________________________________|_______________________________                    
   |       |         |               NP                  NP                      NP                 
   |       |         |          _____|_____       _______|_______        ________|________________   
sat/VBD upon/IN watched/VBD then/RB     she/PRP a/DT settee/NN and/CC the/DT people/NNS dance/NN ./.



# REVIEW : 

We can perform some cool natural language parsing with regular expressions


* The re module’s **.compile( )** and **.match( )** methods allow you to enter any regex pattern and look for a single match at the beginning of a piece of text


* The re module’s **.search( )** method lets you find a single match to a regex pattern anywhere in a string, while the **.findall( )** method finds all the matches of a regex pattern in a string


* Part-of-speech tagging identifies and labels the part of speech of words in a sentence, and can be performed in nltk using the **pos_tag( )** function


* Chunking groups together patterns of words by their part-of-speech tag. Chunking can be performed in nltk by defining a piece of chunk grammar using regular expression syntax and calling a **RegexpParser**’s **.parse( )** method on a word-tokenized, POS-tagged sentence


* NP-chunking chunks together an optional determiner DT, any number of adjectives JJ, and a noun NN to form a noun phrase. The frequency of different NP-chunks can identify important topics in a text or demonstrate how an author describes different subjects


* VP-chunking chunks together a verb VB, a noun phrase, and an optional adverb RB to form a verb phrase. The frequency of different VP-chunks can give insight into what kind of action different subjects take or how the actions that different subjects take are described by an author, potentially indicating bias


* Chunk filtering provides an alternative means of chunking by specifying what parts of speech you do not want in a chunk and removing them
