# NLP Chunking and POS - Winnie the Pooh

Novels and other texts offer insight into ideas and places that may be unfamiliar to the reader. By reading a written piece, you uncover the author's opinions on their chosen topic and come to understand both the topic and how the author thinks.

By the end of this project, you will have identified the main topics of discussion in a novel of your choosing and can begin to discern some of the author's thoughts and beliefs.

## Import and Preprocess Text Data

In [64]:
from nltk import pos_tag, RegexpParser
from nltk.tokenize import PunktSentenceTokenizer, word_tokenize
from collections import Counter

# import text of Winnie-the-Pooh (from Project Gutenberg) and lowercase it
with open("winnie.txt", encoding='utf-8') as f:
    text = f.read().lower()

In [65]:
# sentence tokenize the text, then word tokenize each sentence
def word_sentence_tokenize(text):
  
  # create a PunktSentenceTokenizer, trained on the text itself
  sentence_tokenizer = PunktSentenceTokenizer(text)
  
  # sentence tokenize text
  sentence_tokenized = sentence_tokenizer.tokenize(text)
  
  # create a list to hold word tokenized sentences
  word_tokenized = list()
  
  # for-loop through each tokenized sentence in sentence_tokenized
  for tokenized_sentence in sentence_tokenized:
    # word tokenize each sentence and append to word_tokenized
    word_tokenized.append(word_tokenize(tokenized_sentence))
    
  return word_tokenized

In [85]:
# function that pulls noun phrase (NP) chunks out of chunked sentences and finds the most common ones
def np_chunk_counter(chunked_sentences):

    # create a list to hold chunks
    chunks = list()

    # for-loop through each chunked sentence to extract noun phrase chunks
    for chunked_sentence in chunked_sentences:
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'NP'):
            chunks.append(tuple(subtree))

    # create a Counter object
    chunk_counter = Counter()

    # for-loop through the list of chunks
    for chunk in chunks:
        # increase counter of specific chunk by 1
        chunk_counter[chunk] += 1

    # return 30 most frequent chunks
    return chunk_counter.most_common(30)


# function that pulls verb phrase (VP) chunks out of chunked sentences and finds the most common ones
def vp_chunk_counter(chunked_sentences):

    # create a list to hold chunks
    chunks = list()

    # for-loop through each chunked sentence to extract verb phrase chunks
    for chunked_sentence in chunked_sentences:
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'VP'):
            chunks.append(tuple(subtree))

    # create a Counter object
    chunk_counter = Counter()

    # for-loop through the list of chunks
    for chunk in chunks:
        # increase counter of specific chunk by 1
        chunk_counter[chunk] += 1

    # return 30 most frequent chunks
    return chunk_counter.most_common(30)
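
The two counter functions differ only in the label they filter on, and the counting loop can be replaced by passing the list straight to `Counter`. The sketch below shows only the counting step on a few hand-written chunks; `count_chunks` and the sample data are illustrative names, not part of the notebook.

```python
from collections import Counter

def count_chunks(chunks, n=30):
    """Return the n most frequent chunks, same result as the explicit loop."""
    return Counter(chunks).most_common(n)

# Hand-written sample chunks in the same shape the notebook produces:
# each chunk is a tuple of (word, tag) pairs.
sample_chunks = [
    (('pooh', 'NN'),),
    (('piglet', 'NN'),),
    (('pooh', 'NN'),),
    (('the', 'DT'), ('forest', 'NN')),
    (('pooh', 'NN'),),
    (('piglet', 'NN'),),
]

# The explicit loop used in the notebook, for comparison.
loop_counter = Counter()
for chunk in sample_chunks:
    loop_counter[chunk] += 1

top_two = count_chunks(sample_chunks, n=2)
print(top_two)
```

Chunks must be hashable to serve as `Counter` keys, which is why the notebook converts each subtree to a tuple before counting.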


In [67]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/nicknut/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [68]:
# sentence and word tokenize text here
word_tokenized_text = word_sentence_tokenize(text)

In [69]:
# store and print any word tokenized sentence here
single_word_tokenized_sentence = word_tokenized_text[707]

print(single_word_tokenized_sentence)

['``', 'good', 'morning', ',', 'pooh', 'bear', ',', "''", 'said', 'eeyore', 'gloomily', '.']


## Part-of-speech Tag Text

In [70]:
# create a list to hold part-of-speech tagged sentences here
pos_tagged_text = []

In [71]:
# loop through each word tokenized sentence
for tokenized_sentence in word_tokenized_text:
    # part-of-speech tag each sentence and append to list of pos-tagged sentences
    pos_tagged_text.append(pos_tag(tokenized_sentence))


In [72]:
# store and print any part-of-speech tagged sentence here
single_pos_sentence = pos_tagged_text[707]

print(single_pos_sentence)

[('``', '``'), ('good', 'JJ'), ('morning', 'NN'), (',', ','), ('pooh', 'NN'), ('bear', 'NN'), (',', ','), ("''", "''"), ('said', 'VBD'), ('eeyore', 'RBR'), ('gloomily', 'RB'), ('.', '.')]
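
The tags in this output are Penn Treebank tags. As a reading aid, the sketch below attaches a short description to each tag that appears in the sentence. The `PENN_TAGS` dictionary and `describe` helper are hypothetical names introduced here, and the dictionary covers only the tags seen in this one sentence.

```python
# Descriptions for the Penn Treebank tags seen in the sample sentence.
PENN_TAGS = {
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "VBD": "verb, past tense",
    "RB": "adverb",
    "RBR": "adverb, comparative",
}

def describe(tagged_sentence):
    """Pair each (word, tag) with a description, if one is known."""
    return [(word, tag, PENN_TAGS.get(tag, "punctuation or other"))
            for word, tag in tagged_sentence]

# The tagged sentence copied from the output above.
single_pos_sentence = [
    ('``', '``'), ('good', 'JJ'), ('morning', 'NN'), (',', ','),
    ('pooh', 'NN'), ('bear', 'NN'), (',', ','), ("''", "''"),
    ('said', 'VBD'), ('eeyore', 'RBR'), ('gloomily', 'RB'), ('.', '.'),
]

described = describe(single_pos_sentence)
for word, tag, meaning in described:
    print(f"{word:10} {tag:4} {meaning}")
```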


## Chunk Sentences

In [73]:
# define noun phrase chunk grammar
np_chunk_grammar = "NP: {<DT>?<JJ>*<NN.*>}"

In [74]:
# create noun phrase RegexpParser
np_chunk_parser = RegexpParser(np_chunk_grammar)

In [75]:
# define verb phrase chunk grammar
vp_chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"


In [76]:
# create verb phrase RegexpParser object
vp_chunk_parser = RegexpParser(vp_chunk_grammar)


In [77]:
# create lists to hold NP and VP chunked sentences
np_chunked_text = []
vp_chunked_text = []

In [78]:
# loop through each pos-tagged sentence
for pos_sentence in pos_tagged_text:
    # NP and VP chunk each sentence and append to respective list
    np_chunked_text.append(np_chunk_parser.parse(pos_sentence))
    vp_chunked_text.append(vp_chunk_parser.parse(pos_sentence))
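
To see what the two grammars actually match, they can be run on a single tagged sentence. The sketch below reuses the tagged sentence printed earlier for the NP grammar, and a short hand-tagged sentence (an assumption made for illustration, not tagger output) for the VP grammar. `chunk_texts` is a hypothetical helper. Only `RegexpParser` is needed, so no NLTK data download is involved.

```python
from nltk import RegexpParser

np_parser = RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
vp_parser = RegexpParser("VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}")

def chunk_texts(tree, label):
    """Return the words of every subtree with the given label, as strings."""
    return [" ".join(word for word, _tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == label)]

# Tagged sentence copied from the part-of-speech output earlier.
tagged = [
    ('``', '``'), ('good', 'JJ'), ('morning', 'NN'), (',', ','),
    ('pooh', 'NN'), ('bear', 'NN'), (',', ','), ("''", "''"),
    ('said', 'VBD'), ('eeyore', 'RBR'), ('gloomily', 'RB'), ('.', '.'),
]

# Hand-tagged sentence, made up for this example.
hand_tagged = [('piglet', 'NN'), ('said', 'VBD'), ('nothing', 'NN'), ('.', '.')]

np_found = chunk_texts(np_parser.parse(tagged), "NP")
vp_found = chunk_texts(vp_parser.parse(hand_tagged), "VP")
print(np_found)
print(vp_found)
```

Because the NP pattern ends in a single `<NN.*>`, consecutive nouns such as `pooh bear` become two separate chunks. The VP pattern requires a noun directly before the verb, so the first sentence yields no VP chunk: `said` follows a quotation mark, not a noun.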

## Analyze Chunks

In [79]:
# store and print the most common NP-chunks here
most_common_np_chunks = np_chunk_counter(np_chunked_text)
print(most_common_np_chunks)


[((('pooh', 'NN'),), 293), ((('i', 'NN'),), 230), ((('piglet', 'NN'),), 179), ((('robin', 'NN'),), 144), ((('rabbit', 'NN'),), 101), ((('christopher', 'NN'),), 65), ((('owl', 'NN'),), 64), ((('kanga', 'NN'),), 60), ((('roo', 'NN'),), 57), ((('_', 'NN'),), 53), ((('something', 'NN'),), 46), ((('eeyore', 'NN'),), 42), ((('honey', 'NN'),), 37), ((('head', 'NN'),), 30), ((('i', 'NNS'),), 28), ((('anything', 'NN'),), 28), ((('home', 'NN'),), 27), ((('winnie-the-pooh', 'NN'),), 24), ((('nothing', 'NN'),), 23), ((('bear', 'NN'),), 22), ((('*', 'NNP'),), 21), ((('house', 'NN'),), 21), ((('the', 'DT'), ('water', 'NN')), 21), ((('course', 'NN'),), 20), ((('the', 'DT'), ('forest', 'NN')), 19), ((('oh', 'NN'),), 19), ((('hallo', 'NN'),), 18), ((('round', 'NN'),), 16), ((('dear', 'NN'),), 16), ((('case', 'NN'),), 15)]


In [80]:
# store and print the most common VP-chunks here
most_common_vp_chunks = vp_chunk_counter(vp_chunked_text)
print(most_common_vp_chunks)

[((('i', 'NN'), ('do', 'VBP'), ("n't", 'RB')), 16), ((('i', 'NN'), ('was', 'VBD')), 15), ((('i', 'NN'), ('said', 'VBD')), 10), ((('pooh', 'NN'), ('was', 'VBD')), 8), ((('i', 'NN'), ("'m", 'VBP')), 8), ((('piglet', 'NN'), ('said', 'VBD')), 8), ((('i', 'NN'), ('am', 'VBP')), 8), ((('i', 'NN'), ('suppose', 'VBP')), 7), ((('i', 'NN'), ('know', 'VBP')), 7), ((('i', 'NN'), ('think', 'VBP')), 7), ((('i', 'NN'), ('did', 'VBD'), ("n't", 'RB')), 7), ((('i', 'NN'), ('thought', 'VBD')), 7), ((('piglet', 'NN'), ('was', 'VBD')), 7), ((('robin', 'NN'), ('had', 'VBD')), 7), ((('i', 'NN'), ('did', 'VBD')), 7), ((('pooh', 'NN'), ('looked', 'VBD')), 6), ((('pooh', 'NN'), ('said', 'VBD')), 6), ((('robin', 'NN'), ('said', 'VBD')), 5), ((('owl', 'NN'), ('was', 'VBD')), 5), ((('i', 'NN'), ('had', 'VBD')), 5), ((('i', 'NN'), ("'ve", 'VBP')), 5), ((('i', 'NN'), ('have', 'VBP')), 4), ((('i', 'NN'), ('do', 'VBP')), 4), ((('rabbit', 'NN'), ('said', 'VBD')), 4), ((('robin', 'NN'), ('is', 'VBZ')), 4), ((('i', 'NN')

## Observations

- Unsurprisingly, `pooh` is the most common NP chunk, with 293 occurrences.
- Many of the other characters are also among the most frequent NPs, which gives us an idea of who the main characters are.
- Eeyore confused the POS tagger in the sample sentence we saw earlier, where `eeyore` is tagged `RBR` (comparative adverb). Elsewhere it is tagged `NN` often enough to appear 42 times in the NP counts.
- `i` is very common, too - much of the book must use the first person, probably in dialogue. It is tagged as a noun rather than a pronoun, possibly because the text was lowercased before tagging.
- The top VPs are `i don't`, `i was`, and `i said`, which fits with the above observation.
- The rest of the VPs tend to describe what the characters are doing: `piglet was`, `pooh said`, etc.
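
The raw counter output is hard to scan because every chunk is a nested tuple of `(word, tag)` pairs. A minimal sketch of one way to make it readable is shown below. `readable_chunks` is a hypothetical helper, and the sample entries are copied from the outputs above.

```python
def readable_chunks(chunk_counts):
    """Turn (((word, tag), ...), count) pairs into (phrase, count) pairs."""
    return [(" ".join(word for word, _tag in chunk), count)
            for chunk, count in chunk_counts]

# A few entries copied from the NP and VP outputs above.
sample_counts = [
    ((('pooh', 'NN'),), 293),
    ((('i', 'NN'),), 230),
    ((('the', 'DT'), ('water', 'NN')), 21),
    ((('i', 'NN'), ('do', 'VBP'), ("n't", 'RB')), 16),
]

phrases = readable_chunks(sample_counts)
for phrase, count in phrases:
    print(f"{count:>4}  {phrase}")
```

Joining on spaces keeps tokenizer splits visible, so `don't` appears as `do n't`.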