___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Parts of Speech Assessment

For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

**1. Create a Doc object from the file `peterrabbit.txt`**<br>
> HINT: Use `with open('../TextFiles/peterrabbit.txt') as f:`

In [2]:

with open('../TextFiles/peterrabbit.txt') as f:
    doc = nlp(f.read())

**2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag.**

In [3]:
for token in list(doc.sents)[2]:
    print(f"{token.text:{10}} {token.pos_:{5}} {token.tag_:{5}} {str(spacy.explain(token.tag_))}")



         SPACE _SP   whitespace
They       PRON  PRP   pronoun, personal
lived      VERB  VBD   verb, past tense
with       ADP   IN    conjunction, subordinating or preposition
their      PRON  PRP$  pronoun, possessive
Mother     PROPN NNP   noun, proper singular
in         ADP   IN    conjunction, subordinating or preposition
a          DET   DT    determiner
sand       NOUN  NN    noun, singular or mass
-          PUNCT HYPH  punctuation mark, hyphen
bank       NOUN  NN    noun, singular or mass
,          PUNCT ,     punctuation mark, comma
underneath ADP   IN    conjunction, subordinating or preposition
the        DET   DT    determiner
root       NOUN  NN    noun, singular or mass
of         ADP   IN    conjunction, subordinating or preposition
a          DET   DT    determiner

          SPACE _SP   whitespace
very       ADV   RB    adverb
big        ADJ   JJ    adjective (English), other noun-modifier (Chinese)
fir        NOUN  NN    noun, singular or mass
-          PUNCT H

In [3]:
# Enter your code here:




They         PRON   PRP    pronoun, personal
lived        VERB   VBD    verb, past tense
with         ADP    IN     conjunction, subordinating or preposition
their        ADJ    PRP$   pronoun, possessive
Mother       PROPN  NNP    noun, proper singular
in           ADP    IN     conjunction, subordinating or preposition
a            DET    DT     determiner
sand         NOUN   NN     noun, singular or mass
-            PUNCT  HYPH   punctuation mark, hyphen
bank         NOUN   NN     noun, singular or mass
,            PUNCT  ,      punctuation mark, comma
underneath   ADP    IN     conjunction, subordinating or preposition
the          DET    DT     determiner
root         NOUN   NN     noun, singular or mass
of           ADP    IN     conjunction, subordinating or preposition
a            DET    DT     determiner

            SPACE         None
very         ADV    RB     adverb
big          ADJ    JJ     adjective
fir          NOUN   NN     noun, singular or mass
-            PUNCT 
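
The column layout in the listings above comes from width specifiers inside the f-string. A minimal sketch of that formatting, using a few rows copied from the output in place of a live `Doc` (the `rows` list and the `format_row` helper are illustrative names, not part of spaCy):

```python
# Rows copied from the tagger output above: (text, POS, fine-grained tag).
rows = [
    ("They", "PRON", "PRP"),
    ("lived", "VERB", "VBD"),
    ("with", "ADP", "IN"),
]

def format_row(text, pos, tag):
    # {text:{10}} pads to a minimum width of 10. Strings are left-aligned
    # by default, so this is equivalent to {text:<10}.
    return f"{text:{10}} {pos:{5}} {tag:{5}}"

for row in rows:
    print(format_row(*row))
```

Widening the first field (for example to 12) only lengthens the padding; text longer than the width is never truncated.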

**3. Provide a frequency list of POS tags from the entire document**

In [4]:
POS_counts = doc.count_by(spacy.attrs.POS)

for key, value in sorted(POS_counts.items()):
    print(f"{key:{5}} {doc.vocab[key].text:{10}} {value:{5}}")

   84 ADJ           56
   85 ADP          124
   86 ADV           63
   87 AUX           49
   89 CCONJ         61
   90 DET           91
   92 NOUN         170
   93 NUM            8
   94 PART          30
   95 PRON         108
   96 PROPN         73
   97 PUNCT        171
   98 SCONJ         20
  100 VERB         135
  103 SPACE         99


83. ADJ  : 83
84. ADP  : 127
85. ADV  : 75
88. CCONJ: 61
89. DET  : 90
91. NOUN : 176
92. NUM  : 8
93. PART : 36
94. PRON : 72
95. PROPN: 75
96. PUNCT: 174
99. VERB : 182
102. SPACE: 99
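
The numeric attribute IDs differ between the two lists above (NOUN is 92 in one and 91 in the other), so keying counts by tag name is less brittle. A minimal sketch with `collections.Counter`, using the coarse POS tags of sixteen tokens from the third sentence as hard-coded stand-in data; in a live session the same pattern would presumably be `Counter(token.pos_ for token in doc)`:

```python
from collections import Counter

# Coarse POS tags for "They lived with their Mother in a sand-bank,
# underneath the root of a", copied from the first output listing for
# question 2 (hard-coded here so the example needs no language model).
pos_tags = [
    "PRON", "VERB", "ADP", "PRON", "PROPN", "ADP", "DET", "NOUN",
    "PUNCT", "NOUN", "PUNCT", "ADP", "DET", "NOUN", "ADP", "DET",
]

# Counter tallies occurrences by tag name, so no numeric IDs are involved.
pos_counts = Counter(pos_tags)

for tag, count in sorted(pos_counts.items()):
    print(f"{tag:{6}} {count:{3}}")
```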


**4. CHALLENGE: What percentage of tokens are nouns?**<br>
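HINT: the attribute ID for 'NOUN' is 91 in the spaCy version this assessment was written for. IDs vary between versions; in the run shown here it is 92 (see the frequency list in question 3).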

In [5]:
print(len(doc))

1258


In [9]:
POS_counts[92]

170

In [11]:
# Compute the share from the stored counts instead of retyping the numbers
POS_counts[92] / len(doc) * 100

13.513513513513514

176/1258 = 13.99%
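
The percentage can also be derived from the counts themselves rather than typed-in numbers. A sketch, assuming the frequency list printed for question 3 (the one totalling 1258 tokens) is copied into a dict keyed by tag name; the `counts` dict and the `share` helper are illustrative names:

```python
# Counts copied from the first frequency list in question 3.
counts = {
    "ADJ": 56, "ADP": 124, "ADV": 63, "AUX": 49, "CCONJ": 61,
    "DET": 91, "NOUN": 170, "NUM": 8, "PART": 30, "PRON": 108,
    "PROPN": 73, "PUNCT": 171, "SCONJ": 20, "VERB": 135, "SPACE": 99,
}

def share(tag, counts):
    # Percentage of all tokens that carry the given tag.
    total = sum(counts.values())
    return counts[tag] / total * 100

print(f"{share('NOUN', counts):.2f}%")  # prints 13.51%
```

Note that the total includes `SPACE` and `PUNCT` tokens, matching `len(doc)`; excluding them would give a higher noun percentage.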


**5. Display the Dependency Parse for the third sentence**

In [12]:
displacy.render(list(doc.sents)[2], style="dep", jupyter=True)

**6. Show the first two named entities from Beatrix Potter's *The Tale of Peter Rabbit***

In [13]:
doc.ents

(Beatrix Potter,
 1902,
 four,
 Rabbits,
 were--
 
           Flopsy,
 one morning,
 McGregor,
 Rabbit,
 five,
 Mopsy,
 Cottontail,
 McGregor,
 French,
 McGregor,
 Peter,
 four,
 McGregor,
 Peter,
 McGregor,
 McGregor,
 Peter,
 three,
 McGregor,
 Peter,
 Benjamin Bunny,
 McGregor,
 Peter,
 McGregor,
 McGregor,
 second,
 Peter,
 One,
 Flopsy,
 Mopsy,
 Cotton,
 END)

In [14]:
doc.ents[:2]

(Beatrix Potter, 1902)

In [16]:
for ent in doc.ents[:2]:
    print(f"{ent.text}  {ent.label_}  {spacy.explain(ent.label_)}")

Beatrix Potter  PERSON  People, including fictional
1902  DATE  Absolute or relative dates or periods


The Tale of Peter Rabbit - WORK_OF_ART - Titles of books, songs, etc.
Beatrix Potter - PERSON - People, including fictional


**7. How many sentences are contained in *The Tale of Peter Rabbit*?**

In [17]:
len(list(doc.sents))

54

56

**8. CHALLENGE: How many sentences contain named entities?**

In [19]:
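# Re-parse each sentence as its own Doc so it can be rendered on its own in question 9
list_of_sents = [nlp(sent.text) for sent in doc.sents]
# Keep only the sentence Docs that contain at least one named entity
list_of_ners = [sent_doc for sent_doc in list_of_sents if sent_doc.ents]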

len(list_of_ners)

24

49
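
Re-parsing every sentence runs the model a second time. An alternative is to reuse the original parse and test which sentence spans contain an entity span. The sketch below shows only that containment logic, with hypothetical token offsets (the `sentences` and `entities` values are made up for illustration); with a spaCy `Doc`, the offsets would correspond to `sent.start`/`sent.end` and `ent.start`/`ent.end`.

```python
# Hypothetical half-open token ranges [start, end) for sentences and entities.
sentences = [(0, 8), (8, 15), (15, 30)]
entities = [(2, 4), (20, 21), (25, 27)]

def sentences_with_entities(sentences, entities):
    # Keep a sentence if at least one entity lies entirely inside it.
    return [
        (s_start, s_end)
        for s_start, s_end in sentences
        if any(s_start <= e_start and e_end <= s_end
               for e_start, e_end in entities)
    ]

hits = sentences_with_entities(sentences, entities)
print(len(hits))  # prints 2
```

A sentence with several entities is still counted once, because `any` stops at the first match.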

**9. CHALLENGE: Display the named entity visualization for `list_of_sents[0]` from the previous problem**

In [20]:
displacy.render(list_of_sents[0], style="ent", jupyter=True)

### Great Job!