### Review 
* http://www.nltk.org/howto/tree.html
* https://www.nltk.org/book/ch07.html


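```
>>> import nltk
>>> def ie_preprocess(document):
...    sentences = nltk.sent_tokenize(document)                      # [1] split the text into sentences
...    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # [2] split each sentence into words
...    sentences = [nltk.pos_tag(sent) for sent in sentences]        # [3] tag each word for part of speech
...    return sentences
```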


<br>

In [164]:

import nltk
    

In [165]:
#  If this call does not show a path, the punkt tokenizer data is missing.
#  nltk.data.find('tokenizers/punkt')

In [166]:

# Check whether the punkt tokenizer data is installed; if it is not, download it.

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
    


In [167]:

nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [168]:

nltk.download('popular')


[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to

True

In [169]:
# nltk.download()

<br>

# Chunking With The NLTK

## An introduction and guide for linguists who aren't programmers, and programmers who aren't linguists.

by Luke Petschauer ([@lukewrites](https://twitter.com/lukewrites) | [linkedin](https://www.linkedin.com/in/lukepetschauer) | [blog](http://lukewrites.com))

### Table of Contents

[What is the NLTK? How do I use this guide?](#What-is-the-NLTK?-How-do-I-use-this-guide?)

[What is 'Chunking' and why should I do it?](#What-is-'Chunking',-and-why-should-I-do-it?)

[Note for Non-Linguists: Noun Phrases](#Note-for-Non-Linguists:-Noun-Phrases)

This guide is intended for two audiences:
  1. **Linguists** who have awesome ideas to realize, but aren't comfortable using Python. Code is broken up into small chunks that are thoroughly explained both in the text and via in-line comments.
  2. **Programmers** who want to analyze language with Python, but aren't familiar enough with linguistics terminology to fully exploit the NLTK. Linguistic terms are explained, examples are provided, and code samples are given.
  
The code for this notebook [is on github](https://github.com/lukewrites/NP_chunking_with_nltk).

### What is the NLTK? How do I use this guide?

The Natural Language Toolkit ([NLTK](http://www.nltk.org/)) is an open source library of tools for natural language processing with [Python](http://python.org). A number of the tools included in the NLTK have direct applications in linguistics, and have the potential to be of great use to linguists.


### What is 'Chunking', and why should I do it?

_Chunking_ breaks a text up into user-defined units ('_chunks_') that contain certain types of words (nouns, adjectives, verbs) or phrases (noun phrases, verb phrases, prepositional phrases). What makes chunking with the NLTK different from using a built-in string method like `split` is the NLTK's ability to analyze the text and tag each word with its part of speech. 

Chunking can be very useful when undertaking analysis of text, though it is more computationally intensive than preparing text for frequency analysis.

### Note for Non-Linguists: Noun Phrases

A noun phrase is "a single word, a part of a word, or a chain of words" ([Wikipedia](http://en.wikipedia.org/wiki/Lexical_item)) that has a noun as its _head_ (main) _word_. "Everton football club's victory" has the noun 'victory' as its head word. "Everton football club" is itself a noun phrase, headed by 'club', since the three words (which in this case are all nouns) together describe a single thing/idea/entity.

Most _word classes_ (nouns, verbs, adjectives, etc.) can form phrases; articles/determiners (_the_, _a_, etc) and pronouns are examples of word classes that do not.

For example:

| Phrase | Head Word | Phrase Type |
| ------ | ------ | ------ |
| super happy | happy (adjective) | adjective phrase (AP) |
| kick the bucket | kick (verb) | verb phrase (VP) |
| an extremely difficult problem | problem (noun) | noun phrase (NP) |
| over the river | over (preposition) | prepositional phrase (PP) |

### Why Chunk?

Chunking is useful for selecting and extracting meaningful information from texts for analysis. Chunking allows us to pull out groups of words with set characteristics rather than selecting text by frequency.

In this tutorial we will perform an analysis of the text of the etiquette book [_Beadle's Dime Book of Practical Etiquette for Ladies and Gentlemen_](http://www.gutenberg.org/ebooks/45591?msg=welcome_stranger). I've chosen this book more or less at random from Project Gutenberg because we can predict that it will use lots of domain-specific vocabulary. We'd hope that our chunker will be able to automatically pull out such language. 

## Chunking vs `split`ing

We could use the Python `split` string method on the text, resulting in a big list of words. Here is the result of splitting just the first sentence:

In [170]:
etiquette_excerpt = """If you wish to make yourself agreeable to a lady, turn the conversation adroitly upon taste, or art, or books, or persons, or events of the day.
"""

etiquette_excerpt.split()
# split() with no arguments splits the string on runs of whitespace and returns the words as a list.

['If',
 'you',
 'wish',
 'to',
 'make',
 'yourself',
 'agreeable',
 'to',
 'a',
 'lady,',
 'turn',
 'the',
 'conversation',
 'adroitly',
 'upon',
 'taste,',
 'or',
 'art,',
 'or',
 'books,',
 'or',
 'persons,',
 'or',
 'events',
 'of',
 'the',
 'day.']

If we did this to the whole text, we could do a frequency analysis to see which words are most common. This is not particularly helpful, since additional processing would be required to remove [stop words](http://en.wikipedia.org/wiki/Stop_words). Also, notice that `lady,` and `day.` (punctuation appended) are assumed to be words; we need to make sure that punctuation is stripped away.
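To make that limitation concrete, here is a minimal sketch of the frequency approach using only the standard library. The tiny `STOP_WORDS` set and the `word_frequencies` helper are illustrative assumptions, not part of the NLTK; a real analysis would use a fuller stop-word list.

```python
from collections import Counter
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"if", "you", "to", "a", "the", "or", "of", "upon"}

def word_frequencies(text):
    """Count words after stripping punctuation and dropping stop words."""
    words = []
    for raw in text.split():
        # Remove leading/trailing punctuation so 'lady,' becomes 'lady'.
        word = raw.strip(string.punctuation).lower()
        if word and word not in STOP_WORDS:
            words.append(word)
    return Counter(words)

excerpt = ("If you wish to make yourself agreeable to a lady, turn the "
           "conversation adroitly upon taste, or art, or books, or persons, "
           "or events of the day.")
frequencies = word_frequencies(excerpt)
print(frequencies.most_common(5))
```

Even after this cleanup, every remaining word in the sentence occurs exactly once, so counting alone tells us little about what the sentence is about.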

## Chunking based on part of speech (POS)

Below we will look at how extracting chunks of text can allow us to gain insight into a text that word frequency does not.

If we were to extract the nouns from our sample sentence, we would get a list of words including the following:

```python
[lady, conversation, taste, art, books, persons, events, day, …]
```

This is a perfectly fine list, and one from which we probably could make sound assumptions about the nature of the text, but we can get a better sense of and learn more about the text if we are able to see the noun phrases (NP) in it.

To illustrate the difference between the two: while extracting nouns alone would return

```python
[…, events, day, …]
```

extracting NPs could return

```python
[…, events of the day, …]
```

This is arguably a more meaningful chunk of language since it gives us a specific concept that the etiquette book mentions, rather than just a list of the topic's constituent nouns. Automatically being able to extract a number of NPs from a text can allow us to make good guesses about what the text is about, among other uses.

We can easily extract NPs using the NLTK. To do so, we need to define what language patterns we want the NLTK to identify as NPs. The NLTK uses regular expressions to set these definitions.
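As a small preview of the difference, the sketch below contrasts a nouns-only list with an NP chunk. The tokens are tagged by hand (an assumption, so that no tagger model is needed), and the single pattern used here is a hypothetical one written just for this fragment. `nltk.RegexpParser` and the tag notation are explained in step 2.

```python
import nltk

# Hand-tagged (word, tag) pairs for the end of our sample sentence.
tagged = [("events", "NNS"), ("of", "IN"), ("the", "DT"), ("day", "NN")]

# Nouns only: keep every word whose tag starts with "NN".
nouns = [word for word, tag in tagged if tag.startswith("NN")]

# Hypothetical pattern: noun + preposition + determiner + noun.
chunker = nltk.RegexpParser("NP: {<NN.*><IN><DT><NN.*>}")
tree = chunker.parse(tagged)

# Join the words of every NP subtree back into a phrase.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]

print(nouns)
print(noun_phrases)
```

The first list holds two unrelated nouns, while the second holds the single phrase they belong to.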

### Workflow

To extract NPs from our `sample_text`, we will need to do the following:
>    1. Set up our environment.
>    2. Define the patterns that we want the NLTK to identify as being NPs.
>    3. Identify and store a text that we want to analyze.
>    4. Prepare the text by _tokenizing_ it and _tagging_ each word. (Descriptions of _tokenizing_ and _tagging_ come below.)
>    5. Have the NLTK identify NPs in the tokenized text and, finally, show us a list of these NPs.

Each of these steps will give us the opportunity to learn more about programming with Python and linguistics.

### 1. Set Up Our Environment

To go through this tutorial you need to have installed the NLTK and numpy. You can find out how to do that by following the links above.

At the beginning of your Python script, you need to import `nltk`, `re` (regular expressions, which will be used in step 2), `pprint` (the data pretty printer, useful for displaying nested structures such as the trees that are an intermediary step in our chunking process), and `Tree` from the `nltk` library.

The beginning of our script will look like this:

In [171]:
import nltk
import re
import pprint
from nltk import Tree

In the above we are importing libraries necessary to make the code run. These libraries include the NLTK (`nltk`), regular expressions (`re`), and data pretty printer (`pprint`).

<br>

### 2. Define our NPs

The NLTK can find NPs, but we have to tell the NLTK what chunks of language it should identify as noun phrases. To do this we need to know two things:

> 1. What notation does the NLTK use for parts of speech?

> 2. How can we write NP definitions that allow for ambiguity?

To answer (1), we will look at how the NLTK tags words for part of speech (POS). To answer (2), we will need to gain a basic understanding of _regular expressions_…the reason that we had to 

```python
import re
```

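at the beginning of our Python script. (Strictly speaking, `nltk.RegexpParser` compiles the patterns for us, so nothing in this notebook calls `re` directly; the import is harmless but optional.)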

#### 2.1 Parts of Speech in the NLTK

The NLTK provides a function, `pos_tag()`, that tags POS using the [Penn Treebank Tag Set](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). Here is a partial list of its tags:

> 1.	CC	Coordinating conjunction
2.	CD	Cardinal number
3.	DT	Determiner
4.	EX	Existential there
5.	FW	Foreign word
6.	IN	Preposition or subordinating conjunction
7.	JJ	Adjective
8.	JJR	Adjective, comparative
9.	JJS	Adjective, superlative
10.	LS	List item marker
11.	MD	Modal
12.	NN	Noun, singular or mass
13.	NNS	Noun, plural
14.	NNP	Proper noun, singular
15.	NNPS	Proper noun, plural

Notice that the Penn Treebank Tag Set differentiates between four different types of nouns: `NN`, `NNS`, `NNP`, `NNPS`. In this case, we will want to consider all types of nouns: proper and common, singular/mass and plural. Rather than writing separate rules for each case, we can use regular expressions to include them all.
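As a quick illustration of the idea, the sketch below uses Python's own `re` module to show that one expression can cover all four noun tags. The variable names are made up for this example. In an NLTK chunk pattern the equivalent wildcard is written `<NN.*>`.

```python
import re

noun_tags = ["NN", "NNS", "NNP", "NNPS"]
other_tags = ["JJ", "VB", "IN", "DT"]

# "NN" followed by anything (including nothing) covers all four noun tags.
noun_pattern = re.compile(r"NN.*")

# fullmatch() requires the whole tag to fit the pattern.
matched = [tag for tag in noun_tags + other_tags if noun_pattern.fullmatch(tag)]
print(matched)
```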


```
Full tag list:

Tag	Description
CC	Coordinating conjunction
CD	Cardinal number
DT	Determiner
EX	Existential there
FW	Foreign word
IN	Preposition or subordinating conjunction
JJ	Adjective
JJR	Adjective, comparative
JJS	Adjective, superlative
LS	List item marker
MD	Modal
NN	Noun, singular or mass
NNS	Noun, plural
NNP	Proper noun, singular
NNPS	Proper noun, plural
PDT	Predeterminer
POS	Possessive ending
PRP	Personal pronoun
PRP$	Possessive pronoun
RB	Adverb
RBR	Adverb, comparative
RBS	Adverb, superlative
RP	Particle
SYM	Symbol
TO	to
UH	Interjection
VB	Verb, base form
VBD	Verb, past tense
VBG	Verb, gerund or present participle
VBN	Verb, past participle
VBP	Verb, non-3rd person singular present
VBZ	Verb, 3rd person singular present
WDT	Wh-determiner
WP	Wh-pronoun
WP$	Possessive wh-pronoun
WRB	Wh-adverb
```



<br>

#### 2.2 Regular Expressions and chunk patterns

We can identify and tag chunks based upon sequences of part-of-speech tags using **_regular expressions_**.

Regular expressions have a reputation for being complicated and difficult. Click on the following to [read more about regular expressions in Python](https://docs.python.org/2/library/re.html).

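We are going to use regex to define NPs to be certain patterns:
* zero or more adjectives (`<JJ>*`) + one or more nouns (`<NN>+`)
* zero or more adjectives + one noun + zero or more coordinating conjunctions (`<CC>*`) + one or more nouns

Note that `<NN>` matches only singular or mass nouns; plural and proper nouns are handled when we revise the patterns later on.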

The way we would write these patterns is:

In [172]:
patterns = """
    NP: {<JJ>*<NN>+}
    {<JJ>*<NN><CC>*<NN>+}
    """

In [173]:
[line for line in patterns.split('\n')]

['', '    NP: {<JJ>*<NN>+}', '    {<JJ>*<NN><CC>*<NN>+}', '    ']

In [174]:
print(patterns)


    NP: {<JJ>*<NN>+}
    {<JJ>*<NN><CC>*<NN>+}
    


Since the patterns span more than one line, we enclose them with triple quotes `"""`.

Notice that the patterns are stored in a variable. We need this variable because the patterns will be passed to the NLTK's regular expression parser:

In [175]:
NPChunker = nltk.RegexpParser(patterns)

In [176]:
print(type(NPChunker))

<class 'nltk.chunk.regexp.RegexpParser'>


Here we create another variable, `NPChunker`, which holds an instance of the `RegexpParser` class built from our `patterns`.
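To see what such a parser does before we feed it real text, here is a minimal sketch that runs the same `patterns` over a few hand-tagged tokens. The tags are assigned by hand (an assumption, so that no tagger model is needed), and the name `chunker` is used only for this example.

```python
import nltk

patterns = """
    NP: {<JJ>*<NN>+}
    {<JJ>*<NN><CC>*<NN>+}
    """
chunker = nltk.RegexpParser(patterns)

# Hand-tagged (word, tag) pairs; note that 'books' is a plural noun (NNS).
tagged = [("public", "JJ"), ("promenade", "NN"), (",", ","),
          ("or", "CC"), ("books", "NNS")]

tree = chunker.parse(tagged)

# Collect the (word, tag) pairs inside every NP chunk.
chunks = [subtree.leaves()
          for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(chunks)
```

Only `public promenade` is chunked. `books` is left out because its tag is `NNS`, and `<NN>` matches the `NN` tag exactly.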

### 3. Create a text sample

We need to save the text as a _variable_. In Python, we _assign variables_ by entering the variable name followed by an `=` sign:

```python
sample_text =  # our variable name is sample_text
```

Whatever data we are assigning to the variable goes immediately to the right of the `=` sign.

Since we are using multiple lines of text, we will surround the text with triple quotes:
```python
"""text
"""
```

Our final variable assignment will look like this:

In [177]:
sample_text = """Good behavior upon the street, or public promenade, marks the gentleman
most effectually; rudeness, incivility, disregard of "what the world
says," marks the person of low breeding. We always know, in walking a
square with a man, if he is a gentleman or not. A real gentility never
does the following things on the street, in presence of observers:--

Never picks the teeth, nor scratches the head.

Never swears or talks uproariously.

Never picks the nose with the finger.

Never smokes, or spits upon the walk, to the exceeding annoyance of
those who are always disgusted with tobacco in any shape.

Never stares at any one, man or woman, in a marked manner.

Never scans a lady's dress impertinently, and makes no rude remarks
about her.

Never crowds before promenaders in a rough or hurried way.

Never jostles a lady or gentleman without an "excuse me."

Never treads upon a lady's dress without begging pardon.

Never loses temper, nor attracts attention by excited conversation.

Never dresses in an odd or singular manner, so as to create remark.

Never fails to raise his hat politely to a lady acquaintance; nor to
a male friend who may be walking with a lady--it is a courtesy to the
lady.
"""

In [178]:
print(sample_text)

Good behavior upon the street, or public promenade, marks the gentleman
most effectually; rudeness, incivility, disregard of "what the world
says," marks the person of low breeding. We always know, in walking a
square with a man, if he is a gentleman or not. A real gentility never
does the following things on the street, in presence of observers:--

Never picks the teeth, nor scratches the head.

Never swears or talks uproariously.

Never picks the nose with the finger.

Never smokes, or spits upon the walk, to the exceeding annoyance of
those who are always disgusted with tobacco in any shape.

Never stares at any one, man or woman, in a marked manner.

Never scans a lady's dress impertinently, and makes no rude remarks
about her.

Never crowds before promenaders in a rough or hurried way.

Never jostles a lady or gentleman without an "excuse me."

Never treads upon a lady's dress without begging pardon.

Never loses temper, nor attracts attention by excited conversation.

Never dress

<br>

### 4. Preparing our text

In order to identify and extract NP, we need to perform four steps:

> 1. _Tokenize_ the text into sentences.

> 2. _Tokenize_ each sentence into words.

> 3. Tag the words in each sentence for POS.

> 4. Go through each sentence and _chunk_ NPs.

The NLTK has corresponding functions and methods that can be used for each of these steps:

> 1. `nltk.sent_tokenize()`

> 2. `nltk.word_tokenize()`

> 3. `nltk.pos_tag()`

> 4. `NPChunker.parse()` (the `RegexpParser` we created above)

Let's play with each in turn to see what they do.

In [179]:
nltk.sent_tokenize(etiquette_excerpt)

['If you wish to make yourself agreeable to a lady, turn the conversation adroitly upon taste, or art, or books, or persons, or events of the day.']

The `nltk.sent_tokenize()` function takes a text and breaks it up into sentences.

Now we can break the sentence into words:

In [180]:
tokenized_sentence = nltk.sent_tokenize(etiquette_excerpt)

[nltk.word_tokenize(sentence) for sentence in tokenized_sentence]

[['If',
  'you',
  'wish',
  'to',
  'make',
  'yourself',
  'agreeable',
  'to',
  'a',
  'lady',
  ',',
  'turn',
  'the',
  'conversation',
  'adroitly',
  'upon',
  'taste',
  ',',
  'or',
  'art',
  ',',
  'or',
  'books',
  ',',
  'or',
  'persons',
  ',',
  'or',
  'events',
  'of',
  'the',
  'day',
  '.']]

This is an improvement upon the `str.split()` method we tried above. Notice that punctuation is separated from the words and returned as tokens of its own, so `lady,` becomes `lady` and `,`. We have written the call as a list comprehension, so each tokenized sentence is a list within another list. This will be more useful when we are tokenizing more than one sentence at a time.
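The difference between the two approaches can be sketched with a one-line regular expression. This is not how `nltk.word_tokenize()` works internally; `simple_tokenize` is a hypothetical stand-in that only imitates the punctuation handling, and it will treat quotation marks and possessives differently from the NLTK.

```python
import re

def simple_tokenize(sentence):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

fragment = "agreeable to a lady, turn the conversation"

split_tokens = fragment.split()            # punctuation stays glued to words
regex_tokens = simple_tokenize(fragment)   # punctuation becomes its own token

print(split_tokens)
print(regex_tokens)
```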

Our next step is to try some part of speech (POS) tagging, using `nltk.pos_tag()`.

In [181]:
# sidebar

In [182]:

second_verbage = """Good behavior upon the street, or public promenade, marks the gentleman
most effectually; rudeness, incivility, disregard of "what the world
says," marks the person of low breeding. We always know, in walking a
square with a man, if he is a gentleman or not. A real gentility never
does the following things on the street, in presence of observers:--"""

tokenized_sentence2 = nltk.sent_tokenize(second_verbage)

[nltk.word_tokenize(sentence2) for sentence2 in tokenized_sentence2]


# Each inner list in the outer list is one sentence:
#   the paragraph is split into sentences, and each sentence is split into
#   word tokens, giving a list of lists.


[['Good',
  'behavior',
  'upon',
  'the',
  'street',
  ',',
  'or',
  'public',
  'promenade',
  ',',
  'marks',
  'the',
  'gentleman',
  'most',
  'effectually',
  ';',
  'rudeness',
  ',',
  'incivility',
  ',',
  'disregard',
  'of',
  '``',
  'what',
  'the',
  'world',
  'says',
  ',',
  "''",
  'marks',
  'the',
  'person',
  'of',
  'low',
  'breeding',
  '.'],
 ['We',
  'always',
  'know',
  ',',
  'in',
  'walking',
  'a',
  'square',
  'with',
  'a',
  'man',
  ',',
  'if',
  'he',
  'is',
  'a',
  'gentleman',
  'or',
  'not',
  '.'],
 ['A',
  'real',
  'gentility',
  'never',
  'does',
  'the',
  'following',
  'things',
  'on',
  'the',
  'street',
  ',',
  'in',
  'presence',
  'of',
  'observers',
  ':',
  '--']]

In [183]:
# end sidebar

In [184]:
tokenized_sentence = nltk.sent_tokenize(etiquette_excerpt)

tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]

[nltk.pos_tag(word) for word in tokenized_words]

# The output is again a list of lists: one inner list per sentence,
# and each item in an inner list is a (word, POS tag) tuple.

[[('If', 'IN'),
  ('you', 'PRP'),
  ('wish', 'VBP'),
  ('to', 'TO'),
  ('make', 'VB'),
  ('yourself', 'PRP'),
  ('agreeable', 'JJ'),
  ('to', 'TO'),
  ('a', 'DT'),
  ('lady', 'NN'),
  (',', ','),
  ('turn', 'VBP'),
  ('the', 'DT'),
  ('conversation', 'NN'),
  ('adroitly', 'RB'),
  ('upon', 'IN'),
  ('taste', 'NN'),
  (',', ','),
  ('or', 'CC'),
  ('art', 'NN'),
  (',', ','),
  ('or', 'CC'),
  ('books', 'NNS'),
  (',', ','),
  ('or', 'CC'),
  ('persons', 'NNS'),
  (',', ','),
  ('or', 'CC'),
  ('events', 'NNS'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('day', 'NN'),
  ('.', '.')]]

In [185]:

tokenized_sentence = nltk.sent_tokenize(etiquette_excerpt)

tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]

temper = [nltk.pos_tag(word) for word in tokenized_words]

# temper[0]           - the first sentence (there is only one sentence here)
# temper[0][0]        - the first item in that sentence, a (word, POS tag) tuple
# type(temper[0][0])  - confirms that it is a tuple

temper[0][2]

# end note 

('wish', 'VBP')

<br>

In [186]:
# looking to see if it actually works...
nltk.pos_tag(['Tom'])

[('Tom', 'NNP')]

In [187]:
nltk.pos_tag(['sleeping'])

[('sleeping', 'VBG')]

<hr>

Now we're cooking!

In three lines of code, we've gone from a complete sentence to a sentence in which each word is tagged for part of speech. (The accuracy of the POS tagging is another matter for another tutorial.)
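Because the tagged output is just a list of `(word, tag)` tuples, plain Python is enough to pull out words by part of speech. The sketch below does this for a few tuples copied from the output above; the helper name `words_with_tag_prefix` is made up for this example.

```python
# Part of one tagged sentence, copied from the output above.
tagged_sentence = [
    ("turn", "VBP"), ("the", "DT"), ("conversation", "NN"),
    ("adroitly", "RB"), ("upon", "IN"), ("taste", "NN"), (",", ","),
    ("or", "CC"), ("art", "NN"), (",", ","), ("or", "CC"), ("books", "NNS"),
]

def words_with_tag_prefix(tagged, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

# "NN" as a prefix covers NN, NNS, NNP and NNPS.
nouns = words_with_tag_prefix(tagged_sentence, "NN")
print(nouns)
```

This gives us the nouns-only list described earlier; the chunker goes one step further by grouping neighbouring words into phrases.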

Let's move on to the last step, looking for NPs in the processed sentence. We will do this by using the `NPChunker` that we defined above.


In [236]:

tokenized_sentence = nltk.sent_tokenize(etiquette_excerpt)
tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]
tagged_words = [nltk.pos_tag(word) for word in tokenized_words]
print(tagged_words)

[NPChunker.parse(word) for word in tagged_words]


[[('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'), ('make', 'VB'), ('yourself', 'PRP'), ('agreeable', 'JJ'), ('to', 'TO'), ('a', 'DT'), ('lady', 'NN'), (',', ','), ('turn', 'VBP'), ('the', 'DT'), ('conversation', 'NN'), ('adroitly', 'RB'), ('upon', 'IN'), ('taste', 'NN'), (',', ','), ('or', 'CC'), ('art', 'NN'), (',', ','), ('or', 'CC'), ('books', 'NNS'), (',', ','), ('or', 'CC'), ('persons', 'NNS'), (',', ','), ('or', 'CC'), ('events', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('day', 'NN'), ('.', '.')]]


[Tree('S', [('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'), ('make', 'VB'), ('yourself', 'PRP'), ('agreeable', 'JJ'), ('to', 'TO'), ('a', 'DT'), Tree('NP', [('lady', 'NN')]), (',', ','), ('turn', 'VBP'), ('the', 'DT'), Tree('NP', [('conversation', 'NN')]), ('adroitly', 'RB'), ('upon', 'IN'), Tree('NP', [('taste', 'NN')]), (',', ','), ('or', 'CC'), Tree('NP', [('art', 'NN')]), (',', ','), ('or', 'CC'), ('books', 'NNS'), (',', ','), ('or', 'CC'), ('persons', 'NNS'), (',', ','), ('or', 'CC'), ('events', 'NNS'), ('of', 'IN'), ('the', 'DT'), Tree('NP', [('day', 'NN')]), ('.', '.')])]



### Sidebar:  Tree

In [190]:

dp1 = Tree('dp', [Tree('d', ['the']), Tree('np', ['dog'])])

dp2 = Tree('dp', [Tree('d', ['the']), Tree('np', ['cat'])])

vp = Tree('vp', [Tree('v', ['chased']), dp2])

tree = Tree('s', [dp1, vp])

print(tree)


(s (dp (d the) (np dog)) (vp (v chased) (dp (d the) (np cat))))


In [191]:
dp1.label(), dp2.label(), vp.label(), tree.label()

('dp', 'dp', 'vp', 's')

In [192]:
print(tree[1,1,1,0])

cat


In [193]:
tree.pretty_print()

              s               
      ________|_____           
     |              vp        
     |         _____|___       
     dp       |         dp    
  ___|___     |      ___|___   
 d       np   v     d       np
 |       |    |     |       |  
the     dog chased the     cat



In [194]:
tree.pretty_print(unicodelines=True, nodedist=4)

                       s                        
        ┌──────────────┴────────┐                   
        │                       vp              
        │              ┌────────┴──────┐            
        dp             │               dp       
 ┌──────┴──────┐       │        ┌──────┴──────┐     
 d             np      v        d             np
 │             │       │        │             │     
the           dog    chased    the           cat



In [240]:
[ method for method in dir(tree) if not method.startswith("_")]

['append',
 'chomsky_normal_form',
 'clear',
 'collapse_unary',
 'convert',
 'copy',
 'count',
 'draw',
 'extend',
 'flatten',
 'freeze',
 'fromstring',
 'height',
 'index',
 'insert',
 'label',
 'leaf_treeposition',
 'leaves',
 'node',
 'pformat',
 'pformat_latex_qtree',
 'pop',
 'pos',
 'pprint',
 'pretty_print',
 'productions',
 'remove',
 'reverse',
 'set_label',
 'sort',
 'subtrees',
 'treeposition_spanning_leaves',
 'treepositions',
 'un_chomsky_normal_form',
 'unicode_repr']

<hr>

In [196]:
print(tokenized_sentence)

['If you wish to make yourself agreeable to a lady, turn the conversation adroitly upon taste, or art, or books, or persons, or events of the day.']


In [197]:
print(tokenized_words)

[['If', 'you', 'wish', 'to', 'make', 'yourself', 'agreeable', 'to', 'a', 'lady', ',', 'turn', 'the', 'conversation', 'adroitly', 'upon', 'taste', ',', 'or', 'art', ',', 'or', 'books', ',', 'or', 'persons', ',', 'or', 'events', 'of', 'the', 'day', '.']]






In [198]:
tagged_words

[[('If', 'IN'),
  ('you', 'PRP'),
  ('wish', 'VBP'),
  ('to', 'TO'),
  ('make', 'VB'),
  ('yourself', 'PRP'),
  ('agreeable', 'JJ'),
  ('to', 'TO'),
  ('a', 'DT'),
  ('lady', 'NN'),
  (',', ','),
  ('turn', 'VBP'),
  ('the', 'DT'),
  ('conversation', 'NN'),
  ('adroitly', 'RB'),
  ('upon', 'IN'),
  ('taste', 'NN'),
  (',', ','),
  ('or', 'CC'),
  ('art', 'NN'),
  (',', ','),
  ('or', 'CC'),
  ('books', 'NNS'),
  (',', ','),
  ('or', 'CC'),
  ('persons', 'NNS'),
  (',', ','),
  ('or', 'CC'),
  ('events', 'NNS'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('day', 'NN'),
  ('.', '.')]]


In [199]:
sample_text

'Good behavior upon the street, or public promenade, marks the gentleman\nmost effectually; rudeness, incivility, disregard of "what the world\nsays," marks the person of low breeding. We always know, in walking a\nsquare with a man, if he is a gentleman or not. A real gentility never\ndoes the following things on the street, in presence of observers:--\n\nNever picks the teeth, nor scratches the head.\n\nNever swears or talks uproariously.\n\nNever picks the nose with the finger.\n\nNever smokes, or spits upon the walk, to the exceeding annoyance of\nthose who are always disgusted with tobacco in any shape.\n\nNever stares at any one, man or woman, in a marked manner.\n\nNever scans a lady\'s dress impertinently, and makes no rude remarks\nabout her.\n\nNever crowds before promenaders in a rough or hurried way.\n\nNever jostles a lady or gentleman without an "excuse me."\n\nNever treads upon a lady\'s dress without begging pardon.\n\nNever loses temper, nor attracts attention by excit

In [200]:
patterns

'\n    NP: {<JJ>*<NN>+}\n    {<JJ>*<NN><CC>*<NN>+}\n    '

In [201]:

def prepare_text(text):  # 'text' avoids shadowing Python's built-in input()
    sentences = nltk.sent_tokenize(text)  # Tokenize the text into sentences.
    tokenized_words = [nltk.word_tokenize(sentence) for sentence in sentences]  # Tokenize words in sentences.
    tagged_words = [nltk.pos_tag(words) for words in tokenized_words]  # Tag words for POS in each sentence.
    word_tree = [NPChunker.parse(tagged) for tagged in tagged_words]  # Identify NP chunks.
    return word_tree  # Return the tagged & chunked sentences.


In [249]:

prepare_text(sample_text)


[Tree('S', [Tree('NP', [('Good', 'JJ'), ('behavior', 'NN')]), ('upon', 'IN'), ('the', 'DT'), Tree('NP', [('street', 'NN')]), (',', ','), ('or', 'CC'), Tree('NP', [('public', 'JJ'), ('promenade', 'NN')]), (',', ','), ('marks', 'VBZ'), ('the', 'DT'), Tree('NP', [('gentleman', 'NN')]), ('most', 'RBS'), ('effectually', 'RB'), (';', ':'), Tree('NP', [('rudeness', 'NN')]), (',', ','), Tree('NP', [('incivility', 'NN')]), (',', ','), Tree('NP', [('disregard', 'NN')]), ('of', 'IN'), ('``', '``'), ('what', 'WP'), ('the', 'DT'), Tree('NP', [('world', 'NN')]), ('says', 'VBZ'), (',', ','), ("''", "''"), ('marks', 'VBZ'), ('the', 'DT'), Tree('NP', [('person', 'NN')]), ('of', 'IN'), Tree('NP', [('low', 'JJ'), ('breeding', 'NN')]), ('.', '.')]),
 Tree('S', [('We', 'PRP'), ('always', 'RB'), ('know', 'VBP'), (',', ','), ('in', 'IN'), ('walking', 'VBG'), ('a', 'DT'), Tree('NP', [('square', 'NN')]), ('with', 'IN'), ('a', 'DT'), Tree('NP', [('man', 'NN')]), (',', ','), ('if', 'IN'), ('he', 'PRP'), ('is', '

In [254]:

sidebar = prepare_text(sample_text)
for element in sidebar:
    print(element)


(S
  (NP Good/JJ behavior/NN)
  upon/IN
  the/DT
  (NP street/NN)
  ,/,
  or/CC
  (NP public/JJ promenade/NN)
  ,/,
  marks/VBZ
  the/DT
  (NP gentleman/NN)
  most/RBS
  effectually/RB
  ;/:
  (NP rudeness/NN)
  ,/,
  (NP incivility/NN)
  ,/,
  (NP disregard/NN)
  of/IN
  ``/``
  what/WP
  the/DT
  (NP world/NN)
  says/VBZ
  ,/,
  ''/''
  marks/VBZ
  the/DT
  (NP person/NN)
  of/IN
  (NP low/JJ breeding/NN)
  ./.)
(S
  We/PRP
  always/RB
  know/VBP
  ,/,
  in/IN
  walking/VBG
  a/DT
  (NP square/NN)
  with/IN
  a/DT
  (NP man/NN)
  ,/,
  if/IN
  he/PRP
  is/VBZ
  a/DT
  (NP gentleman/NN)
  or/CC
  not/RB
  ./.)
(S
  A/DT
  (NP real/JJ gentility/NN)
  never/RB
  does/VBZ
  the/DT
  (NP following/JJ things/NNS)
  on/IN
  the/DT
  (NP street/NN)
  ,/,
  in/IN
  (NP presence/NN of/IN observers/NNS)
  :/:
  --/:
  Never/RB
  picks/VBZ
  the/DT
  (NP teeth/NN)
  ,/,
  nor/CC
  scratches/VBZ
  the/DT
  (NP head/NN)
  ./.)
(S Never/RB (NP swears/NNS or/CC talks/NNS) uproariously/RB ./.)
(S
  N

At first glance, this output is a mess, and in real-world applications we probably wouldn't ever see it. However, it's worthwhile to take a look at it now to see how the NLTK is using our `patterns` to organize the text and to preview what data will (or should be) extracted by the rest of our code.

Consider the following sentence:

>Never smokes, or spits upon the walk, to the exceeding annoyance of
those who are always disgusted with tobacco in any shape.

Once processed and organized into a Tree, the sentence, indicated by `'S'`, is divided into `(word, part of speech)` tuples. The NLTK's output contains the following NPs:

```python
Tree('NP', [('walk', 'NN')]),
Tree('NP', [('exceeding', 'NN'), ('annoyance', 'NN')]),
Tree('NP', [('tobacco', 'NN')]), 
Tree('NP', [('shape', 'NN')]), ('.', '.')])
```

Notice that each NP is identified as its own Tree, while no other part of speech in the sentence is organized into trees. This is the power of our `patterns`; `nltk.RegexpParser(patterns)` looks for chunks of text we defined as NPs and organizes them as NPs, each of which gets its own tree (or subtree) within the greater Tree that makes up a sentence.

You may also have noticed that `'those who are always disgusted'` was not recognized as an NP, meaning that we need to adjust our `patterns`.

In [203]:
new_patterns = """
    NP:    {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}
           {<NN|NNS|NNP|NNPS><IN>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS><CC>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS>+}
           
    """

new_NPChunker = nltk.RegexpParser(new_patterns)

def prepare_text(text):
    sentences = nltk.sent_tokenize(text)  # Split the text into sentences.
    tokenized = [nltk.word_tokenize(sentence) for sentence in sentences]  # Split each sentence into words.
    tagged = [nltk.pos_tag(tokens) for tokens in tokenized]  # Tag each word with its part of speech.
    chunked = [new_NPChunker.parse(sentence) for sentence in tagged]  # Group the tagged words into NP chunks.
    return chunked  # One chunked Tree per sentence.

In [204]:
new_patterns

'\n    NP:    {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}\n           {<NN|NNS|NNP|NNPS><IN>*<NN|NNS|NNP|NNPS>+}\n           {<JJ>*<NN|NNS|NNP|NNPS><CC>*<NN|NNS|NNP|NNPS>+}\n           {<JJ>*<NN|NNS|NNP|NNPS>+}\n           \n    '

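Notice that our new pattern has been added at the start of our list of NP patterns. The order matters because the NLTK applies the rules in the order they are listed, and words that an earlier rule has already placed in a chunk are not available to later rules. More specific patterns therefore need to come first.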
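
To see why the order matters, here is a minimal sketch. It uses three hand-tagged tokens (an assumption made for illustration, so no tagger is needed) and two toy grammars that differ only in rule order. The helper `np_strings` is a hypothetical name, not part of the NLTK.

```python
import nltk

# Hand-tagged tokens, so no tagger model is needed for this illustration.
tagged = [("man", "NN"), ("or", "CC"), ("woman", "NN")]

# Specific rule first: the coordinated phrase is chunked as one NP.
specific_first = nltk.RegexpParser("""
    NP: {<NN><CC><NN>}
        {<NN>}
    """)

# General rule first: each noun is chunked on its own, so the second
# rule finds no unchunked nouns left to work with.
general_first = nltk.RegexpParser("""
    NP: {<NN>}
        {<NN><CC><NN>}
    """)


def np_strings(tree):
    """Return the words of every NP subtree as a space-joined string."""
    result = []
    for subtree in tree.subtrees():
        if subtree.label() == "NP":
            result.append(" ".join(word for word, tag in subtree.leaves()))
    return result


print(np_strings(specific_first.parse(tagged)))  # ['man or woman']
print(np_strings(general_first.parse(tagged)))   # ['man', 'woman']
```

With the general rule first, the coordinated phrase is never found, which is the situation the ordering of `new_patterns` is meant to avoid.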

Let's try running our function again to see if this worked.

In [205]:
prepare_text(sample_text)

[Tree('S', [Tree('NP', [('Good', 'JJ'), ('behavior', 'NN')]), ('upon', 'IN'), ('the', 'DT'), Tree('NP', [('street', 'NN')]), (',', ','), ('or', 'CC'), Tree('NP', [('public', 'JJ'), ('promenade', 'NN')]), (',', ','), ('marks', 'VBZ'), ('the', 'DT'), Tree('NP', [('gentleman', 'NN')]), ('most', 'RBS'), ('effectually', 'RB'), (';', ':'), Tree('NP', [('rudeness', 'NN')]), (',', ','), Tree('NP', [('incivility', 'NN')]), (',', ','), Tree('NP', [('disregard', 'NN')]), ('of', 'IN'), ('``', '``'), ('what', 'WP'), ('the', 'DT'), Tree('NP', [('world', 'NN')]), ('says', 'VBZ'), (',', ','), ("''", "''"), ('marks', 'VBZ'), ('the', 'DT'), Tree('NP', [('person', 'NN')]), ('of', 'IN'), Tree('NP', [('low', 'JJ'), ('breeding', 'NN')]), ('.', '.')]),
 Tree('S', [('We', 'PRP'), ('always', 'RB'), ('know', 'VBP'), (',', ','), ('in', 'IN'), ('walking', 'VBG'), ('a', 'DT'), Tree('NP', [('square', 'NN')]), ('with', 'IN'), ('a', 'DT'), Tree('NP', [('man', 'NN')]), (',', ','), ('if', 'IN'), ('he', 'PRP'), ('is', '

### Converting Nouns to NPs

In [206]:
# sentences = prepare_text(sample_text)

# def return_a_list_of_NPs(sentences):
#     nps = []  # an empty list in which the NPs will be stored.
#     for sent in sentences:
#         tree = NPChunker.parse(sent)
#         for subtree in tree.subtrees():
#             if subtree.node == 'NP':
#                 t = subtree
#                 t = ' '.join(word for word, tag in t.leaves())
#                 nps.append(t)
#     return nps

In [207]:
sentences = prepare_text(sample_text)

def return_a_list_of_NPs(sentences):
    nps = []  # an empty list in which the NPs will be stored.
    for sent in sentences:
        tree = NPChunker.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                t = subtree
                t = ' '.join(word for word, tag in t.leaves())
                nps.append(t)
    return nps
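
The loop above relies on three `Tree` methods: `subtrees()`, `label()`, and `leaves()`. Here is a small sketch of them on a hand-tagged sentence. The sentence and the one-rule grammar are assumptions chosen for illustration. `subtrees()` also accepts a `filter` function, which can replace the `if` test.

```python
import nltk

# Hand-tagged sentence (an assumption for illustration; no tagger needed).
tagged = [("Good", "JJ"), ("behavior", "NN"), ("marks", "VBZ"),
          ("the", "DT"), ("gentleman", "NN")]

chunker = nltk.RegexpParser("NP: {<JJ>*<NN>+}")
tree = chunker.parse(tagged)


def is_np(subtree):
    """Keep only the subtrees labelled NP."""
    return subtree.label() == "NP"


# subtrees() walks the whole tree, including the root 'S';
# the filter narrows that down to the NP chunks.
for np_tree in tree.subtrees(filter=is_np):
    print(np_tree.label(), np_tree.leaves())

# leaves() returns (word, tag) tuples; keep the word and drop the tag.
noun_phrases = [" ".join(word for word, tag in np_tree.leaves())
                for np_tree in tree.subtrees(filter=is_np)]
print(noun_phrases)  # ['Good behavior', 'gentleman']
```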

In [271]:
sentences

[Tree('S', [Tree('NP', [('Good', 'JJ'), ('behavior', 'NN')]), ('upon', 'IN'), ('the', 'DT'), Tree('NP', [('street', 'NN')]), (',', ','), ('or', 'CC'), Tree('NP', [('public', 'JJ'), ('promenade', 'NN')]), (',', ','), ('marks', 'VBZ'), ('the', 'DT'), Tree('NP', [('gentleman', 'NN')]), ('most', 'RBS'), ('effectually', 'RB'), (';', ':'), Tree('NP', [('rudeness', 'NN')]), (',', ','), Tree('NP', [('incivility', 'NN')]), (',', ','), Tree('NP', [('disregard', 'NN')]), ('of', 'IN'), ('``', '``'), ('what', 'WP'), ('the', 'DT'), Tree('NP', [('world', 'NN')]), ('says', 'VBZ'), (',', ','), ("''", "''"), ('marks', 'VBZ'), ('the', 'DT'), Tree('NP', [('person', 'NN')]), ('of', 'IN'), Tree('NP', [('low', 'JJ'), ('breeding', 'NN')]), ('.', '.')]),
 Tree('S', [('We', 'PRP'), ('always', 'RB'), ('know', 'VBP'), (',', ','), ('in', 'IN'), ('walking', 'VBG'), ('a', 'DT'), Tree('NP', [('square', 'NN')]), ('with', 'IN'), ('a', 'DT'), Tree('NP', [('man', 'NN')]), (',', ','), ('if', 'IN'), ('he', 'PRP'), ('is', '

In [274]:
# for sent in sentences:print(sent)
#     (S
#   (NP Good/JJ behavior/NN)
#   upon/IN
#   the/DT
#   (NP street/NN)
#   ,/,
#   or/CC
#   (NP public/JJ promenade/NN)
#   ,/,
#   marks/VBZ
#   the/DT
#   (NP gentleman/NN)
#   most/RBS
#   effectually/RB
#   ;/:
#   (NP rudeness/NN)
#   ,/,
#   (NP incivility/NN)
#   ,/,
#   (NP disregard/NN)
#   of/IN
#   ``/``
#   what/WP
#   the/DT
#   (NP world/NN)
#   says/VBZ
#   ,/,
#   ''/''
#   marks/VBZ
#   the/DT
#   (NP person/NN)
#   of/IN
#   (NP low/JJ breeding/NN)
#   ./.)
# (S
#   We/PRP
#   always/RB

In [276]:
# for sent in sentences:
#     tree = NPChunker.parse(sent)
#     print(tree)
# (S
#   (NP Good/JJ behavior/NN)
#   upon/IN
#   the/DT
#   (NP street/NN)
#   ,/,
#   or/CC
#   (NP public/JJ promenade/NN)
#   ,/,
#   marks/VBZ
#   the/DT
#   (NP gentleman/NN)
#   most/RBS
#   effectually/RB
#   ;/:
#   (NP rudeness/NN)
#   ,/,
#   (NP incivility/NN)
#   ,/,
#   (NP disregard/NN)
#   of/IN
#   ``/``
#   what/WP
#   the/DT
#   (NP world/NN)
#   says/VBZ
#   ,/,
#   ''/''
#   marks/VBZ
#   the/DT
#   (NP person/NN)
#   of/IN
#   (NP low/JJ breeding/NN)
#   ./.)
# (S
#   We/PRP
#   always/RB
#   know/VBP
    
#     for subtree in tree.subtrees():
#         if subtree.label() == 'NP':
#             t = subtree
#             t = ' '.join(word for word, tag in t.leaves())
#             nps.append(t)
#     print(nps)
    

In [286]:



# jacking around:
# for sent in sentences:
#     tree = NPChunker.parse(sent)
#     for subtree in tree.subtrees():
#         #print(subtree.label())   # will output S NP NP NP S NP NP NP NP S NP NP NP ... 
#         if subtree.label() == 'NP':
#             # print(subtree)   - output below 
#             # print(subtree.leaves())
#             t = subtree
#             t = ' '.join(tag for word, tag in t.leaves())
#             print(t)
# #             #nps.append(t)
# #     #return nps     
            
    
    
    

# print(subtree):    which will only be the subtree.label() == 'NP'        
# (NP Good/JJ behavior/NN)
# (NP street/NN)
# (NP public/JJ promenade/NN)
# (NP gentleman/NN)
# (NP rudeness/NN)
# (NP incivility/NN)
# (NP disregard/NN)
# (NP world/NN)
# (NP person/NN)
# (NP low/JJ breeding/NN)
# (NP square/NN)
# (NP man/NN)
# (NP gentleman/NN)
# (NP real/JJ gentility/NN)
# (NP following/JJ things/NNS)
# (NP street/NN)
# (NP presence/NN of/IN observers/NNS)
# (NP teeth/NN)
   
   

# print(subtree.leaves())
#
# [('Good', 'JJ'), ('behavior', 'NN')]
# [('street', 'NN')]
# [('public', 'JJ'), ('promenade', 'NN')]
# [('gentleman', 'NN')]
# [('rudeness', 'NN')]
# [('incivility', 'NN')]
# [('disregard', 'NN')]
# [('world', 'NN')]
# [('person', 'NN')]
# [('low', 'JJ'), ('breeding', 'NN')]
# [('square', 'NN')]
# [('man', 'NN')]
# [('gentleman', 'NN')]
# [('real', 'JJ'), ('gentility', 'NN')]
# [('following', 'JJ'), ('things', 'NNS')]
# [('street', 'NN')]
# [('presence', 'NN'), ('of', 'IN'), ('observers', 'NNS')]
# [('teeth', 'NN')]
# [('head', 'NN')]




# for sent in sentences:
#     tree = NPChunker.parse(sent)
#     for subtree in tree.subtrees():
#         #print(subtree.label())   # will output S NP NP NP S NP NP NP NP S NP NP NP ... 
#         if subtree.label() == 'NP':
#             # print(subtree)   - output below 
#             # print(subtree.leaves())
#             t = subtree
#             t = ' '.join(word for word, tag in t.leaves())
#             print(t)

# Good behavior
# street
# public promenade
# gentleman
# rudeness
# incivility
# disregard
# world
# person
# low breeding
# square
# man
# gentleman
# real gentility
# following things
# street
# presence of observers
# teeth
# head
# swears or talks
# nose

            

# for sent in sentences:
#     tree = NPChunker.parse(sent)
#     for subtree in tree.subtrees():
#         #print(subtree.label())   # will output S NP NP NP S NP NP NP NP S NP NP NP ... 
#         if subtree.label() == 'NP':
#             # print(subtree)   - output below 
#             # print(subtree.leaves())
#             t = subtree
#             print(t)
            
# (NP Good/JJ behavior/NN)
# (NP street/NN)
# (NP public/JJ promenade/NN)
# (NP gentleman/NN)
# (NP rudeness/NN)
# (NP incivility/NN)
# (NP disregard/NN)
# (NP world/NN)
# (NP person/NN)
# (NP low/JJ breeding/NN)
# (NP square/NN)
# (NP man/NN)
# (NP gentleman/NN)
# (NP real/JJ gentility/NN)
# (NP following/JJ things/NNS)
# (NP street/NN)
# (NP presence/NN of/IN observers/NNS)
# (NP teeth/NN)
# (NP head/NN)
# (NP swears/NNS or/CC talks/NNS)
# (NP nose/NN)
# (NP finger/NN)
# (NP smokes/NNS)
# (NP walk/NN)
# (NP annoyance/NN)
# (NP
#   those/DT
#   who/WP
#   are/VBP
#   always/RB
#   disgusted/VBN
#   with/IN
#   tobacco/NN)
# (NP shape/NN)
# (NP stares/NNS)
# (NP man/NN or/CC woman/NN)
# (NP marked/JJ manner/NN)
# (NP lady/NN)
# (NP dress/NN)
# (NP rude/JJ remarks/NNS)
# (NP crowds/NN before/IN promenaders/NNS)
# (NP hurried/JJ way/NN)
# (NP lady/NN or/CC gentleman/NN)
# (NP excuse/NN)
# (NP lady/NN)


    
# for sent in sentences:
#     tree = NPChunker.parse(sent)
#     for subtree in tree.subtrees():
#         #print(subtree.label())   # will output S NP NP NP S NP NP NP NP S NP NP NP ... 
#         if subtree.label() == 'NP':
#             # print(subtree)   - output below 
#             # print(subtree.leaves())
#             t = subtree
#             t = ' '.join(tag for word, tag in t.leaves())
#             print(t)  # specifically breaking it, to see what the freaking tags are... now i get it 
    
# JJ NN
# NN
# JJ NN
# NN
# NN
# NN
# NN
# NN


    


In [287]:
# official final output ! 
return_a_list_of_NPs(sentences)

# This is the master list, grown by appending one NP at a time. Each entry holds
# only the words of the NP, not their tags.

['Good behavior',
 'street',
 'public promenade',
 'gentleman',
 'rudeness',
 'incivility',
 'disregard',
 'world',
 'person',
 'low breeding',
 'square',
 'man',
 'gentleman',
 'real gentility',
 'following things',
 'street',
 'presence of observers',
 'teeth',
 'head',
 'swears or talks',
 'nose',
 'finger',
 'smokes',
 'walk',
 'annoyance',
 'those who are always disgusted with tobacco',
 'shape',
 'stares',
 'man or woman',
 'marked manner',
 'lady',
 'dress',
 'rude remarks',
 'crowds before promenaders',
 'hurried way',
 'lady or gentleman',
 'excuse',
 'lady',
 'dress',
 'pardon',
 'temper',
 'attracts attention',
 'excited conversation',
 'singular manner',
 'remark',
 'hat',
 'lady acquaintance',
 'male friend',
 'lady',
 'courtesy',
 'lady']

This is a pretty good list of NPs from the text. Let's look at `sample_text` again. This time, see if there are any NPs that _should_ have appeared above but didn't. How could we add to or revise our `patterns` to make sure they are included?

<br>

### Final Code:

```python
import nltk

# Rules are tried in order, so the more specific pattern comes first.
# <NN.*> matches every noun tag (NN, NNS, NNP, NNPS); <NN*> would not.
patterns = """
    NP: {<JJ>*<NN.*><CC>*<NN.*>+}
        {<JJ>*<NN.*>+}
    """

NPChunker = nltk.RegexpParser(patterns)


def prepare_text(text):
    """Split text into sentences, then tokenize, POS-tag, and chunk each one."""
    sentences = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [NPChunker.parse(sent) for sent in sentences]
    return sentences


def parsed_text_to_NP(sentences):
    """Collect the words of every NP subtree from already-chunked sentences."""
    nps = []
    for tree in sentences:  # each sentence is already a chunked Tree
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':  # older NLTK versions used .node
                nps.append(' '.join(word for word, tag in subtree.leaves()))
    return nps


def sent_parse(text):
    """Return the list of NPs found in raw text."""
    sentences = prepare_text(text)
    nps = parsed_text_to_NP(sentences)
    return nps


def find_nps(text):
    """Convenience wrapper around sent_parse()."""
    return sent_parse(text)
```

## Conclusion

I hope the above has given you a clear idea of how the NLTK works and how it can be useful to look at chunks of language instead of single words. If you've got any questions, please don't hesitate to get in touch via [twitter](http://twitter.com/lukewrites/).

<br><br><br>

# Tom's Notes:

* https://www.nltk.org/_modules/nltk/tokenize/treebank.html

#### Tokenize Initial:

In [209]:
parag = "Hi, my name is Tom.  I am very happy to be here in this country.  I hope I last here.  Sometime you die; sometimes you live"

In [210]:
from nltk.tokenize import sent_tokenize
sent_tokenize(parag)

['Hi, my name is Tom.',
 'I am very happy to be here in this country.',
 'I hope I last here.',
 'Sometime you die; sometimes you live']

In [211]:
beta = sent_tokenize(parag)
beta[0]

'Hi, my name is Tom.'

#### Tokenize:

In [212]:
from nltk.tokenize import word_tokenize
word_tokenize("this is tom hacking your system")

['this', 'is', 'tom', 'hacking', 'your', 'system']

* http://www.nltk.org/howto/tokenize.html

In [213]:
s1 = "On a $50,000 mortgage of 30 years at 8 percent, the monthly payment would be $366.88."
nltk.word_tokenize(s1)

['On',
 'a',
 '$',
 '50,000',
 'mortgage',
 'of',
 '30',
 'years',
 'at',
 '8',
 'percent',
 ',',
 'the',
 'monthly',
 'payment',
 'would',
 'be',
 '$',
 '366.88',
 '.']

In [214]:
s11 = "I called Dr. Jones. I called Dr. Jones."
nltk.word_tokenize(s11)

['I', 'called', 'Dr.', 'Jones', '.', 'I', 'called', 'Dr.', 'Jones', '.']

#### Regex Tokenizer:

In [215]:
s = ("Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks.")
s

'Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks.'

In [216]:

s2 = ("Alas, it has not rained today. When, do you think, will it rain again?")
s2


'Alas, it has not rained today. When, do you think, will it rain again?'

In [217]:
nltk.regexp_tokenize(s2, r'[,\.\?!"]\s*', gaps=False)

[', ', '. ', ', ', ', ', '?']

In [218]:
nltk.regexp_tokenize(s2, r'[,\.\?!"]\s*', gaps=True)

['Alas',
 'it has not rained today',
 'When',
 'do you think',
 'will it rain again']

In [219]:
s3 = ("<p>Although this is <b>not</b> the case here, we must not relax our vigilance!</p>")
s3

'<p>Although this is <b>not</b> the case here, we must not relax our vigilance!</p>'

In [220]:
nltk.regexp_tokenize(s3, r'</?(b|p)>', gaps=False)

['p', 'b', 'b', 'p']

In [221]:
nltk.regexp_tokenize(s3, r'</?(b|p)>', gaps=True)

['p',
 'Although this is ',
 'b',
 'not',
 'b',
 ' the case here, we must not relax our vigilance!',
 'p']
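
The bare `'p'` and `'b'` tokens above come from the capturing group `(b|p)`. Python's `re.findall` and `re.split` return the contents of a capturing group instead of (or in addition to) the whole match. The sketch below uses the standard `re` module directly, on the assumption that `regexp_tokenize` delegates to those two functions. A non-capturing group `(?:...)` avoids the effect.

```python
import re

s3 = "<p>Although this is <b>not</b> the case here, we must not relax our vigilance!</p>"

capturing = r"</?(b|p)>"        # the group captures only the tag name
non_capturing = r"</?(?:b|p)>"  # (?:...) groups without capturing

# With one capturing group, findall returns the group contents...
print(re.findall(capturing, s3))      # ['p', 'b', 'b', 'p']
# ...and with no capturing group it returns the whole match.
print(re.findall(non_capturing, s3))  # ['<p>', '<b>', '</b>', '</p>']

# split also injects captured groups into its result, so the
# non-capturing form gives only the text between the tags.
pieces = [piece for piece in re.split(non_capturing, s3) if piece]
print(pieces)
```

Passing the non-capturing pattern to `nltk.regexp_tokenize` should give the same tokens, although that call is not run here.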

#### Chunking:

In [222]:
from nltk.chunk import *
from nltk.chunk.util import *
from nltk.chunk.regexp import *
from nltk import Tree

In [223]:
tagged_text = "[ The/DT cat/NN ] sat/VBD on/IN [ the/DT mat/NN ] [ the/DT dog/NN ] chewed/VBD ./."

In [224]:
gold_chunked_text = tagstr2tree(tagged_text)

In [225]:
unchunked_text = gold_chunked_text.flatten()

Chunking uses a special regexp syntax for rules that delimit the chunks. These rules must be converted to 'regular' regular expressions before a sentence can be chunked.

In [226]:
tag_pattern = "<DT>?<JJ>*<NN.*>"
regexp_pattern = tag_pattern2re_pattern(tag_pattern)
regexp_pattern

'(<(DT)>)?(<(JJ)>)*(<(NN[^\\{\\}<>]*)>)'
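
One practical consequence of this conversion: inside a tag pattern, `.` is rewritten to "any tag character", but `*` keeps its ordinary regex meaning. So `<NN.*>` matches every tag that starts with `NN`, while `<NN*>` only means "N followed by zero or more N". A quick check, using `re.fullmatch` on bracketed tags as a stand-in for how the chunker sees them (that stand-in is an assumption made for illustration):

```python
import re
from nltk.chunk.regexp import tag_pattern2re_pattern

with_dot = tag_pattern2re_pattern("<NN.*>")    # NN followed by any tag characters
without_dot = tag_pattern2re_pattern("<NN*>")  # N followed by zero or more N

for tag in ("NN", "NNS", "NNP"):
    token = "<" + tag + ">"
    print(tag,
          bool(re.fullmatch(with_dot, token)),
          bool(re.fullmatch(without_dot, token)))

# NN True True
# NNS True False
# NNP True False
```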

Construct some new chunking rules.



In [227]:
chunk_rule = ChunkRule(r"<.*>+", "Chunk everything")
chink_rule = ChinkRule(r"<VBD|IN|\.>", "Chink on verbs/prepositions")  # raw string, so \. is not treated as a string escape
split_rule = SplitRule(r"<DT><NN>", r"<DT><NN>", "Split successive determiner/noun pairs")

In [228]:
chunk_parser = RegexpChunkParser([chunk_rule], chunk_label='NP')
chunked_text = chunk_parser.parse(unchunked_text)
print(chunked_text)

(S
  (NP
    The/DT
    cat/NN
    sat/VBD
    on/IN
    the/DT
    mat/NN
    the/DT
    dog/NN
    chewed/VBD
    ./.))


###  ABC

In [229]:
from nltk import pos_tag
sent = 'The pizza was awesome and brilliant'.split()
pos_tag(sent)

[('The', 'DT'),
 ('pizza', 'NN'),
 ('was', 'VBD'),
 ('awesome', 'JJ'),
 ('and', 'CC'),
 ('brilliant', 'JJ')]

In [230]:
sent2 = 'The pizza was awesome and brilliant'
pos_tag(word_tokenize(sent2))  # just odd how that went down 

[('The', 'DT'),
 ('pizza', 'NN'),
 ('was', 'VBD'),
 ('awesome', 'JJ'),
 ('and', 'CC'),
 ('brilliant', 'JJ')]

In [231]:
sent = 'The pizza was good but pasta was bad'.split()
pos_tag(sent)

[('The', 'DT'),
 ('pizza', 'NN'),
 ('was', 'VBD'),
 ('good', 'JJ'),
 ('but', 'CC'),
 ('pasta', 'NN'),
 ('was', 'VBD'),
 ('bad', 'JJ')]

In [232]:

# from nltk import RegexpParser

# sent1 = ['The', 'pizza', 'was', 'awesome', 'and', 'brilliant']
# sent2 = ['The', 'pizza', 'was', 'good', 'but', 'pasta', 'was', 'bad']


# #  I'm trying to capture these types of patterns:
# #      NN VBD JJ CC JJ
# #      NN VBD JJ

# patterns = """P:{<NN><VBD><JJ><CC><JJ>}{<NN><VBD><JJ>}"""

# PChunker = RegexpParser(patterns)


In [233]:

PChunker.parse(pos_tag(sent1))


NameError: name 'PChunker' is not defined

In [None]:

PChunker.parse(pos_tag(sent2))


In [None]:


# >>> patterns = """
# ... P: {<NN><VBD><JJ>(<CC><JJ>)?}
# ... """
# >>> PChunker = RegexpParser(patterns)
# >>> PChunker.parse(pos_tag(sent1))
# Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
# >>> PChunker.parse(pos_tag(sent2))
# Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])


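from nltk import RegexpParser

# Define the two token lists here, since the earlier definitions are commented out.
sent1 = ['The', 'pizza', 'was', 'awesome', 'and', 'brilliant']
sent2 = ['The', 'pizza', 'was', 'good', 'but', 'pasta', 'was', 'bad']

patterns = """P: {<NN><VBD><JJ>(<CC><JJ>)?}"""

PChunker = RegexpParser(patterns)
PChunker.parse(pos_tag(sent1))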


# Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])


In [None]:

PChunker.parse(pos_tag(sent2))
# Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])



#### regex

In [None]:

from nltk.tokenize import regexp_tokenize

sent1 = "I can't do this.  I can't do that.  I don't want to eat food."

word_tokenize(sent1)


In [None]:

regexp_tokenize(sent1, r"[\w']+")
# The apostrophe in the character class keeps contractions such as "can't" together as one token,
# whereas word_tokenize splits them (e.g. "ca" + "n't").


In [None]:

regexp_tokenize(sent1, r"[\w]+")



In [None]:

regexp_tokenize(sent1, r"[\w']")



In [None]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize(sent1)
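
A side-by-side sketch of the three patterns tried above, on a shorter sentence (chosen here for illustration) so the differences are easy to read:

```python
from nltk.tokenize import regexp_tokenize

text = "I can't do this."

# Word characters plus apostrophe, one or more: contractions stay whole.
with_apostrophe = regexp_tokenize(text, r"[\w']+")
print(with_apostrophe)     # ['I', "can't", 'do', 'this']

# Word characters only: the apostrophe becomes a break point.
without_apostrophe = regexp_tokenize(text, r"[\w]+")
print(without_apostrophe)  # ['I', 'can', 't', 'do', 'this']

# No '+': every matching character is its own token.
single_chars = regexp_tokenize(text, r"[\w']")
print(single_chars[:6])    # ['I', 'c', 'a', 'n', "'", 't']
```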


#### Stop Words

In [None]:
from nltk.corpus import stopwords
ensw = stopwords.words('english')

In [None]:
ensw

In [None]:

from nltk.tokenize import word_tokenize

parag1 = "What are you doing exactly right now ?  I am trying to get to my office"

# I want to filter the sentence, removing the stop words

paragArr = word_tokenize(parag1)  # but see how you have stop words ? 


In [None]:
paragArr

In [None]:

filterArr = [item for item in paragArr if item not in ensw]
filterArr  # 'What' and 'I' are still there: the stopword list is lowercase and this comparison is case-sensitive


In [None]:
# Smarter way of parsing:
    

In [None]:
# >>> def content_fraction(text):
# ...     stopwords = nltk.corpus.stopwords.words('english')
# ...     content = [w for w in text if w.lower() not in stopwords]
# ...     return len(content) / len(text)

newsent = "I am so freaking tired and I must slow down.  It is time to now officially go home and say goodbye."

unfiltered_words = nltk.word_tokenize(newsent)

In [None]:

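stop_words = nltk.corpus.stopwords.words('english')  # renamed so it does not shadow the `stopwords` corpus reader imported earlier

cont = [w for w in unfiltered_words if w.lower() not in stop_words]
cont

# "I", "and", "to", etc. are no longer in the list, because each word is lowercased before the comparison.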
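
The difference between the two filtering attempts comes down to case. A sketch with a tiny hand-made stopword set, which is an assumption standing in for `nltk.corpus.stopwords.words('english')` so that no corpus download is needed:

```python
# A tiny hand-made stopword set (illustrative only), all lowercase
# like the NLTK English list.
stop_words = {"what", "are", "you", "i", "am", "to", "my", "now"}

tokens = ["What", "are", "you", "doing", "exactly", "right", "now", "?",
          "I", "am", "trying", "to", "get", "to", "my", "office"]

# Case-sensitive test: capitalized stop words slip through.
case_sensitive = [t for t in tokens if t not in stop_words]
print(case_sensitive)
# ['What', 'doing', 'exactly', 'right', '?', 'I', 'trying', 'get', 'office']

# Lowercasing each token first catches them.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]
print(case_insensitive)
# ['doing', 'exactly', 'right', '?', 'trying', 'get', 'office']
```

Punctuation such as `'?'` survives either way, since it is not a stop word; it would need a separate filter.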


<br>
<hr>
<br>

In [262]:

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]

grammar = "NP: {<DT>?<JJ>*<NN>}"

cp = nltk.RegexpParser(grammar)

result = cp.parse(sentence)

print(result)


(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


In [268]:

import nltk.draw
from nltk.draw.util import CanvasFrame
from nltk.draw import TreeWidget


<br>
<hr>
<br>