Parsing is the stage of NLP concerned with **breaking up (segmenting)** text **based on syntax**.

With Python’s _regular expression_ module, **re**, and the _Natural Language Toolkit_, known as _NLTK_, you can:    
* find keywords of interest
* discover where and how often they are used
* discern the part-of-speech patterns in which they appear, to understand the sometimes hidden meaning in a piece of writing

> ex: parsing can highlight the biases of an author or uncover additional insights that even a deep, rigorous reading of the text might not reveal

# 1) Compiling and Matching

* **.compile( )** :  
Takes a regular expression pattern as an argument and compiles it into a regular expression object, which you can later use to find matching text.

* **.match( )** :   
Takes a string of text as an argument and looks for a single match to the regular expression, starting at the beginning of the string.
    * If the pattern matches at the beginning of the string, .match() returns a match object
    * If there is no match, .match() returns None



In [9]:
import re

# characters are defined
character_1 = "Dorothy"
character_2 = "Henry"

# compile a regular expression object that matches a run of 7 letters; it can be reused later to find matching text
regular_expression = re.compile("[A-Za-z]{7}") # [A-Za-z] matches upper- or lower-case letters
print("regular_expression: ", regular_expression)

# check for a match to character_1 here
# (check if regex matches string stored in character_1)
result_1 = regular_expression.match(character_1)
print("result_1: ", result_1)


# store and print the matched text here 
match_1 = result_1.group(0)
print("match_1: ", match_1)

# apply the same 7-letter pattern to character_2
# (the pattern is recompiled here only for illustration; the object compiled above could be reused)
regular_expression = re.compile("[A-Za-z]{7}") # pattern requires 7 letters
result_2 = regular_expression.match(character_2) # "Henry" has only 5 letters
print("result_2: ", result_2) # so no match is found => None
# the compile and match steps above can also be written as a single call
result_2 = re.match("[A-Za-z]{7}", character_2)
print("result_2: ", result_2)

regular_expression:  re.compile('[A-Za-z]{7}')
result_1:  <_sre.SRE_Match object; span=(0, 7), match='Dorothy'>
match_1:  Dorothy
result_2:  None
result_2:  None
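
Because .match() returns None when nothing matches, calling .group() on the result without checking would raise an error for a name like "Henry". The sketch below shows one way to guard against that; the helper name `first_seven_letters` is hypothetical and exists only for this illustration.

```python
import re

def first_seven_letters(name):
    """Return the 7 letters matched at the start of name, or None if there is no match."""
    pattern = re.compile("[A-Za-z]{7}")
    result = pattern.match(name)
    # .match() returns None when the pattern does not match at the start,
    # so check the result before calling .group()
    if result is None:
        return None
    return result.group(0)

print(first_seven_letters("Dorothy"))  # Dorothy
print(first_seven_letters("Henry"))    # None
```

Note that a longer name still matches, since .match() only needs the pattern to fit at the start of the string: `first_seven_letters("Scarecrow")` returns the first seven letters.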


# 2) Searching and Finding

* **.search( )** :   
Looks left to right through an entire piece of text and returns a match object for the first match to the regular expression given (unlike **.match( )**, which only finds matches at the start of a string).   
If no match is found, _.search()_ returns None.
>ex: **result = re.search("\w{8}","Are you a Munchkin?")** searches for a sequence of 8 word characters in the string "Are you a Munchkin?" (the match is "Munchkin")

* **.findall( )** :  
Returns a list of all non-overlapping matches of the regular expression in the string.  
This is useful for finding every occurrence of a word or keyword in a piece of text, for example to determine a frequency count.
>ex:   
. text  = *"Everything is green here, while in the country of the Munchkins blue was the favorite color. But the people do not seem to be as friendly as the Munchkins, and I'm afraid we shall be unable to find a place to pass the night."*   
. list_of_matches = re.findall("\w{8}",text)  
. returns the list:  *['Everythi', 'Munchkin', 'favorite', 'friendly', 'Munchkin']*
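
To make the difference between the three methods concrete, here is a small sketch that runs them on the "Munchkin" sentence from the example above. Raw strings (`r"..."`) are used so that the backslash in `\w` reaches the regex engine unchanged; the variable names are chosen for illustration only.

```python
import re

text = "Are you a Munchkin?"

# .match() only looks at the start of the string, and "Are" is just 3 word characters
at_start = re.match(r"\w{8}", text)

# .search() scans left to right and stops at the first match
first = re.search(r"\w{8}", text)

# .findall() returns every non-overlapping match as a list of strings
all_matches = re.findall(r"\w{3}", text)

print(at_start)        # None
print(first.group(0))  # Munchkin
print(first.span())    # (10, 18)
print(all_matches)     # ['Are', 'you', 'Mun', 'chk']
```

The last result shows what "non-overlapping" means in practice: "Munchkin" yields 'Mun' and 'chk', and the leftover "in" is too short to match.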

In [13]:
import re

# read L. Frank Baum's The Wonderful Wizard of Oz, lower-cased so that matches ignore case
with open("the_wizard_of_oz_text.txt", encoding='utf-8') as oz_file:
    oz_text = oz_file.read().lower()

# search oz_text for an occurrence of 'wizard' here
found_wizard = re.search("wizard",oz_text)
print("found_wizard: ", found_wizard)

# find all the occurrences of 'lion' in oz_text here (this also counts 'lion' inside longer words such as 'lions')
all_lions = re.findall("lion", oz_text)
print("\n all_lions_list: ", all_lions)

# store and print the length of all_lions here
number_lions =  len(all_lions)
print("\n number_lions: ", number_lions)

found_wizard:  <_sre.SRE_Match object; span=(14, 20), match='wizard'>

 all_lions_list:  ['lion', 'lion', 'lion', ...  (output abridged; every element of the list is 'lion')
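
A plain pattern such as "lion" also matches inside longer words, which can inflate a frequency count. The sketch below contrasts a substring count with a whole-word count using the `\b` word-boundary anchor, and builds a simple frequency table with `collections.Counter`. The `sample` string is made up for this illustration and is not taken from the book.

```python
import re
from collections import Counter

# hypothetical sample text, not a quote from The Wonderful Wizard of Oz
sample = "the lion roared. the lions slept, and the cowardly lion ran to the stallion."

# substring search: also hits "lions" and "stallion"
substring_hits = re.findall(r"lion", sample)

# \b anchors the pattern at word boundaries, so only the standalone word matches
whole_word_hits = re.findall(r"\blion\b", sample)

# frequency count of every word in the sample
word_counts = Counter(re.findall(r"\w+", sample))

print(len(substring_hits))   # 4
print(len(whole_word_hits))  # 2
print(word_counts["the"])    # 4
```

Which count is appropriate depends on the question being asked; the substring version is the one used in the cell above.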

# 3) Part-of-Speech Tagging

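Part-of-speech tagging identifies the part of speech of each word in a sentence. The example sentence below contains every part of speech listed after it.

>ex:  
"Wow! Ramona and her class are happily studying the new textbook she has on NLP."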


* **Noun**: the name of a person (Ramona, class), place, thing (textbook), or idea (NLP)


* **Pronoun**: a word used in place of a noun (her, she)


* **Determiner**: a word that introduces, or “determines”, a noun (the)


* **Verb**: expresses action (studying) or being (are, has)


* **Adjective**: modifies or describes a noun or pronoun (new)


* **Adverb**: modifies or describes a verb, an adjective, or another adverb (happily)


* **Preposition**: a word placed before a noun or pronoun to form a phrase modifying another word in the sentence (on)


* **Conjunction**: a word that joins words, phrases, or clauses (and)


* **Interjection**: a word used to express emotion (Wow) 


* **pos_tag()** function:   
Automates the part-of-speech tagging process.  
It takes one argument, a list of words in the order they appear in a sentence, and returns a list of tuples, where the first entry of each tuple is a word and the second is its part-of-speech tag (for example, **('house', 'NN')**).

In [28]:
import nltk
from nltk import pos_tag # automates the part-of-speech tagging process
from nltk.tokenize import word_tokenize # word tokenization
from nltk.tokenize import sent_tokenize # sentence tokenization
# read L. Frank Baum's The Wonderful Wizard of Oz (lower-cased)
with open("the_wizard_of_oz_text.txt", encoding='utf-8') as oz_file:
    oz_text = oz_file.read().lower()

# tokenization
sentences_tokenized = sent_tokenize(oz_text) # sentence tokenization
sentence_tokenized_100th = sentences_tokenized[100] # index 100 (strictly the 101st sentence, since indexing starts at 0)

print("100th tokenized sentence:", sentence_tokenized_100th)


tokenized_words_text = [] # will hold one list of (word, tag) tuples per sentence

for sentence in sentences_tokenized:
    # split the sentence into word tokens
    word_sentence_toke = word_tokenize(sentence)
    # tag every word token in the sentence with its part of speech
    words_toke_for_all_sentences = pos_tag(word_sentence_toke)
    # collect the tagged sentence; the list ends up with one entry per sentence
    tokenized_words_text.append(words_toke_for_all_sentences) 

words_toke_100th_sentence = tokenized_words_text[100]
print("Words_toke_100th_sentence: ",words_toke_100th_sentence)

100th tokenized sentence: "the house must have fallen on her.
Words_toke_100th_sentence:  [('``', '``'), ('the', 'DT'), ('house', 'NN'), ('must', 'MD'), ('have', 'VB'), ('fallen', 'VBN'), ('on', 'IN'), ('her', 'PRP'), ('.', '.')]
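
Once a sentence has been tagged, the list of (word, tag) tuples can be explored with plain Python. The sketch below copies the tagged sentence from the output above as a literal, so it runs without NLTK; it assumes the Penn Treebank tag set that pos_tag() uses by default, in which noun tags start with 'NN' and verb tags start with 'VB'. The variable names are illustrative.

```python
from collections import Counter

# tagged sentence copied from the output above
tagged_sentence = [('``', '``'), ('the', 'DT'), ('house', 'NN'), ('must', 'MD'),
                   ('have', 'VB'), ('fallen', 'VBN'), ('on', 'IN'), ('her', 'PRP'),
                   ('.', '.')]

# keep only the words whose tag marks a noun or a verb
nouns = [word for word, tag in tagged_sentence if tag.startswith('NN')]
verbs = [word for word, tag in tagged_sentence if tag.startswith('VB')]

# count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged_sentence)

# the sequence of tags, skipping punctuation tags (which do not start with a letter)
tag_pattern = " ".join(tag for _, tag in tagged_sentence if tag[0].isalpha())

print(nouns)        # ['house']
print(verbs)        # ['have', 'fallen']
print(tag_pattern)  # DT NN MD VB VBN IN PRP
```

A tag sequence like the one printed last is the kind of part-of-speech pattern that can then be compared across sentences.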
