# GIAN 4: Resources and Recognition

You should now be familiar with tokenizing text and using frequency measures.

This notebook will show you how to use annotated text resources for tokenizing, lemmatizing, part-of-speech tagging, parsing, and named entity recognition.

In [1]:
# load the libraries required for this session
import re
from collections import Counter, defaultdict
import spacy

from matplotlib import pyplot as plt

In [2]:
# Define some essential functions for this session

def clean_gutenberg_text(text):
    """Remove front and back matter from project Gutenberg texts"""
    m1 = re.search("THE MILLENNIUM FULCRUM EDITION 1.2+\n", text)
    m2 = re.search("End of Project Gutenberg’s The Hunting of the Snark, by Lewis Carroll+\n", text)
    tstart=m1.span()[1]+1 # Text starts one character after the end of the front matter
    tstop=m2.span()[0]  # Text ends one character before the beginning of the back matter
    # spaCy's tokenizer doesn't like newlines
    text=re.sub("\n+", "\n", text[tstart:tstop])
    return(text)



# Running the full NLP pipeline

In [3]:
spacy.util.use_gpu(0)
#nlp=spacy.blank("en")
import en_core_web_sm
nlp = en_core_web_sm.load(disable=["parse", "tag", "ner"])

In [4]:
t0 = open('0.txt', encoding="utf-8").read()
t1 = clean_gutenberg_text(t0)

In [5]:
len(t1)

32317

Now we can run the entire nlp pipeline on the book. 

(Warning: This will take much longer than just tokenizing)

In [6]:
doc1=nlp(t1)

Let's look at the first 100 tokens at the beginning of chapter 1 (which starts at token 368)

In [7]:
tokenn =[]
for token in doc1:
    if(token.pos_ != "SPACE" and token.pos_ != "PUNCT" ):
        print(token.pos,token.text, token.text.lower(), token.pos_)
        tokenn.append(token.text.lower())

89 THE the DET
91 HUNTING hunting NOUN
84 OF of ADP
89 THE the DET
95 SNARK snark PROPN
89 an an DET
95 Agony agony PROPN
84 in in ADP
92 Eight eight NUM
95 Fits fits PROPN
84 by by ADP
95 Lewis lewis PROPN
95 Carroll carroll PROPN
95 PREFACE preface PROPN
84 If if ADP
88 and and CCONJ
89 the the DET
91 thing thing NOUN
99 is is VERB
85 wildly wildly ADV
83 possible possible ADJ
89 the the DET
91 charge charge NOUN
84 of of ADP
99 writing writing VERB
91 nonsense nonsense NOUN
99 were were VERB
85 ever ever ADV
99 brought brought VERB
84 against against ADP
89 the the DET
91 author author NOUN
84 of of ADP
89 this this DET
83 brief brief ADJ
88 but but CCONJ
83 instructive instructive ADJ
91 poem poem NOUN
94 it it PRON
99 would would VERB
99 be be VERB
99 based based VERB
94 I i PRON
99 feel feel VERB
83 convinced convinced ADJ
84 on on ADP
89 the the DET
91 line line NOUN
84 in in ADP
91 p.4 p.4 NOUN
85 Then then ADV
89 the the DET
91 bowsprit bowsprit NOUN
99 got got VERB
99 mixed m

85 perhaps perhaps ADV
99 have have VERB
99 won won VERB
83 more more ADJ
84 than than ADP
83 his his ADJ
91 share-- share-- NOUN
88 But but CCONJ
89 a a DET
95 Banker banker PROPN
99 engaged engaged VERB
84 at at ADP
83 enormous enormous ADJ
91 expense expense NOUN
99 Had had VERB
89 the the DET
91 whole whole NOUN
84 of of ADP
83 their their ADJ
91 cash cash NOUN
84 in in ADP
83 his his ADJ
91 care care NOUN
85 There there ADV
99 was was VERB
85 also also ADV
89 a a DET
95 Beaver beaver PROPN
83 that that ADJ
99 paced paced VERB
84 on on ADP
89 the the DET
91 deck deck NOUN
88 Or or CCONJ
99 would would VERB
99 sit sit VERB
99 making making VERB
91 lace lace NOUN
84 in in ADP
89 the the DET
91 bow bow NOUN
88 And and CCONJ
99 had had VERB
85 often often ADV
89 the the DET
95 Bellman bellman PROPN
99 said said VERB
99 saved saved VERB
94 them them PRON
84 from from ADP
91 wreck wreck NOUN
84 Though though ADP
91 none none NOUN
84 of of ADP
89 the the DET
91 sailors sailors NOUN
99 kne

88 or or CCONJ
84 for for ADP
91 sale sale NOUN
92 Two two NUM
83 excellent excellent ADJ
91 Policies policies NOUN
92 one one NUM
84 Against against ADP
95 Fire fire PROPN
88 And and CCONJ
92 one one NUM
84 Against against ADP
91 Damage damage NOUN
84 From from ADP
95 Hail hail PROPN
85 Yet yet ADV
85 still still ADV
85 ever ever ADV
84 after after ADP
89 that that DET
83 sorrowful sorrowful ADJ
91 day day NOUN
85 Whenever whenever ADV
89 the the DET
95 Butcher butcher PROPN
99 was was VERB
85 by by ADV
89 The the DET
95 Beaver beaver PROPN
99 kept kept VERB
99 looking looking VERB
89 the the DET
83 opposite opposite ADJ
91 way way NOUN
88 And and CCONJ
99 appeared appeared VERB
85 unaccountably unaccountably ADV
83 shy shy ADJ
95 Fit fit PROPN
89 the the DET
95 Second second PROPN
89 THE the DET
95 BELLMAN bellman PROPN
93 ’S ’s PART
95 SPEECH speech PROPN
89 The the DET
95 Bellman bellman PROPN
94 himself himself PRON
94 they they PRON
89 all all DET
99 praised praised VERB
84 to to

99 roused roused VERB
94 him him PRON
84 with with ADP
91 jam jam NOUN
88 and and CCONJ
83 judicious judicious ADJ
95 advice-- advice-- PROPN
94 They they PRON
99 set set VERB
94 him him PRON
91 conundrums conundrums NOUN
93 to to PART
99 guess guess VERB
85 When when ADV
84 at at ADP
91 length length NOUN
94 he he PRON
99 sat sat VERB
93 up up PART
88 and and CCONJ
99 was was VERB
83 able able ADJ
93 to to PART
99 speak speak VERB
83 His his ADJ
83 sad sad ADJ
91 story story NOUN
94 he he PRON
99 offered offered VERB
93 to to PART
99 tell tell VERB
88 And and CCONJ
89 the the DET
95 Bellman bellman PROPN
99 cried cried VERB
95 Silence silence PROPN
85 Not not ADV
85 even even ADV
89 a a DET
91 shriek shriek NOUN
88 And and CCONJ
85 excitedly excitedly ADV
99 tingled tingled VERB
83 his his ADJ
91 bell bell NOUN
85 There there ADV
99 was was VERB
91 silence silence NOUN
85 supreme supreme ADV
85 Not not ADV
89 a a DET
83 shriek shriek ADJ
85 not not ADV
89 a a DET
91 scream scream NOUN

89 a a DET
91 maxim maxim NOUN
83 tremendous tremendous ADJ
88 but but CCONJ
91 trite trite NOUN
88 And and CCONJ
94 you you PRON
99 ’d ’d VERB
85 best best ADV
99 be be VERB
99 unpacking unpacking VERB
89 the the DET
91 things things NOUN
83 that that ADJ
94 you you PRON
99 need need VERB
93 To to PART
99 rig rig VERB
91 yourselves yourselves NOUN
93 out out PART
84 for for ADP
89 the the DET
91 fight fight NOUN
85 Then then ADV
89 the the DET
95 Banker banker PROPN
99 endorsed endorsed VERB
89 a a DET
83 blank blank ADJ
91 cheque cheque NOUN
83 which which ADJ
94 he he PRON
99 crossed crossed VERB
88 And and CCONJ
99 changed changed VERB
83 his his ADJ
83 loose loose ADJ
91 silver silver NOUN
84 for for ADP
91 notes notes NOUN
89 The the DET
95 Baker baker PROPN
84 with with ADP
91 care care NOUN
99 combed combed VERB
83 his his ADJ
91 whiskers whiskers NOUN
88 and and CCONJ
91 hair hair NOUN
88 And and CCONJ
99 shook shook VERB
89 the the DET
91 dust dust NOUN
84 out out ADP
84 of o

84 without without ADP
91 introduction introduction NOUN
99 Would would VERB
99 have have VERB
99 caused caused VERB
83 quite quite ADJ
89 a a DET
91 thrill thrill NOUN
84 in in ADP
95 Society society PROPN
84 As as ADP
93 to to PART
99 temper temper VERB
89 the the DET
95 Jubjub jubjub PROPN
95 ’s ’s PROPN
89 a a DET
83 desperate desperate ADJ
91 bird bird NOUN
84 Since since ADP
94 it it PRON
99 lives lives VERB
84 in in ADP
83 perpetual perpetual ADJ
91 passion passion NOUN
83 Its its ADJ
91 taste taste NOUN
84 in in ADP
91 costume costume NOUN
99 is is VERB
85 entirely entirely ADV
90 absurd-- absurd-- INTJ
94 It it PRON
99 is is VERB
91 ages ages NOUN
85 ahead ahead ADV
84 of of ADP
89 the the DET
91 fashion fashion NOUN
88 But but CCONJ
94 it it PRON
99 knows knows VERB
89 any any DET
91 friend friend NOUN
94 it it PRON
99 has has VERB
99 met met VERB
85 once once ADV
85 before before ADV
94 It it PRON
85 never never ADV
99 will will VERB
99 look look VERB
84 at at ADP
89 a a DET

84 As as ADP
89 the the DET
91 pig pig NOUN
99 had had VERB
99 been been VERB
83 dead dead ADJ
84 for for ADP
89 some some DET
91 years years NOUN
89 The the DET
95 Judge judge PROPN
99 left left VERB
89 the the DET
95 Court court PROPN
99 looking looking VERB
85 deeply deeply ADV
83 disgusted disgusted ADJ
88 But but CCONJ
89 the the DET
95 Snark snark PROPN
84 though though ADP
89 a a DET
83 little little ADJ
91 aghast aghast NOUN
84 As as ADP
89 the the DET
91 lawyer lawyer NOUN
93 to to PART
91 whom whom NOUN
89 the the DET
91 defense defense NOUN
99 was was VERB
99 entrusted entrusted VERB
99 Went went VERB
99 bellowing bellowing VERB
93 on on PART
84 to to ADP
89 the the DET
83 last last ADJ
85 Thus thus ADV
89 the the DET
95 Barrister barrister PROPN
99 dreamed dreamed VERB
84 while while ADP
89 the the DET
91 bellowing bellowing NOUN
99 seemed seemed VERB
93 To to PART
99 grow grow VERB
89 every every DET
91 moment moment NOUN
85 more more ADV
83 clear clear ADJ
84 Till till AD

In [52]:
#tokenn= [token.text for token in doc1]

In [8]:
print(len(tokenn))

5105


In [21]:
import collections
from math import *
counter=collections.Counter(tokenn)

val= list(counter.values())

name= list(counter.keys())
norm_freq=[]
log_freq =[]
for i in  val:
    norm_freq.append(i/len(tokenn))
    log_freq.append(log10(((i+1)/(len(tokenn)+len(counter))*1000000)))

for i in range (0,len(val)-1):
    print(str(name[i])+"  "+ str(val[i]) +"  "+ str(norm_freq[i])+"  "+ str(log_freq[i]))

the  350  0.06856023506366307  4.728601932799308
hunting  4  0.0007835455435847208  2.8822648206695036
of  94  0.01841332027424094  4.161018421622332
snark  32  0.006268364348677767  3.701808756211372
an  9  0.001762977473065622  3.183294816333485
agony  1  0.0001958863858961802  2.4843248119974657
in  93  0.01821743388834476  4.156422669933184
eight  2  0.0003917727717923604  2.660416071053147
fits  1  0.0001958863858961802  2.4843248119974657
by  19  0.0037218413320274243  3.4843248119974657
lewis  1  0.0001958863858961802  2.4843248119974657
carroll  1  0.0001958863858961802  2.4843248119974657
preface  1  0.0001958863858961802  2.4843248119974657
if  17  0.0033300685602350635  3.438567321436791
and  165  0.03232125367286973  4.403402904373539
thing  10  0.0019588638589618022  3.22468750149171
is  43  0.008423114593535749  3.8267474928196723
wildly  1  0.0001958863858961802  2.4843248119974657
possible  2  0.0003917727717923604  2.660416071053147
charge  5  0.0009794319294809011  2.

In [79]:
len(counter)

1452

Since spaCy has tagged the tokens for us, we can now look at the lemmas corresponding to the tokens, ...

In [26]:
d1_lemmas=[token.lemma_ for token in doc1]
print(d1_lemmas[i:i+n])

['remonstrance', 'be', 'impossible', ',', 'and', 'no', 'steer', 'can', 'be', 'do', '\n', 'till', 'the', 'next', 'varnish', 'day', '.', ' ', 'During', 'this', 'bewilder', 'interval', 'the', '\n', 'ship', 'usually', 'sail', 'backwards', '.', '\n', 'As', 'this', 'poem', 'be', 'to', 'some', 'extent', 'connect', 'with', 'the', 'lie', 'of', 'the', 'Jabberwock', ',', '\n', 'let', 'me', 'take', 'this', 'opportunity', 'of', 'answer', 'a', 'question', 'that', 'have', 'often', 'be', '\n', 'ask', 'me', ',', 'how', 'to', 'pronounce', '“', 'slithy', 'toves', '.', '”', ' ', 'The', '“', 'i', '”', 'in', '“', 'slithy', '”', 'be', 'long', ',', '\n', 'a', 'in', '“', 'writhe', '”', ';', 'and', '“', 'toves', '”', 'be', 'pronounce', 'so', 'a', 'to', 'rhyme']


In [27]:
# as an illustration, the following code achieves the same result without list comprehension
d1_lemmas=[]
for token in doc1:
    d1_lemmas.append(token.lemma_)
print(d1_lemmas[i:i+n])

['remonstrance', 'be', 'impossible', ',', 'and', 'no', 'steer', 'can', 'be', 'do', '\n', 'till', 'the', 'next', 'varnish', 'day', '.', ' ', 'During', 'this', 'bewilder', 'interval', 'the', '\n', 'ship', 'usually', 'sail', 'backwards', '.', '\n', 'As', 'this', 'poem', 'be', 'to', 'some', 'extent', 'connect', 'with', 'the', 'lie', 'of', 'the', 'Jabberwock', ',', '\n', 'let', 'me', 'take', 'this', 'opportunity', 'of', 'answer', 'a', 'question', 'that', 'have', 'often', 'be', '\n', 'ask', 'me', ',', 'how', 'to', 'pronounce', '“', 'slithy', 'toves', '.', '”', ' ', 'The', '“', 'i', '”', 'in', '“', 'slithy', '”', 'be', 'long', ',', '\n', 'a', 'in', '“', 'writhe', '”', ';', 'and', '“', 'toves', '”', 'be', 'pronounce', 'so', 'a', 'to', 'rhyme']


the corresponding *general* part-of-speech tags, ...

In [13]:
d1_pos=[token.pos_ for token in doc1]
print(d1_pos[i:i+n])

['', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '', 'SPACE', '', '', '', '', '', '']


and the corresponding *detailed* part-of-speech tags.

In [14]:
d1_pos=[token.tag_ for token in doc1]
print(d1_pos[i:i+n])

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


An overview of the general tags can be found [here](https://spacy.io/usage/linguistic-features) and [here](https://spacy.io/api/annotation) and the detailed tags can be found [here](https://www.researchgate.net/profile/Jinho_Choi3/publication/324940566_Guidelines_for_the_CLEAR_Style_Constituent_to_Dependency_Conversion/links/5aebd3cfa6fdcc8508b6e6e8/Guidelines-for-the-CLEAR-Style-Constituent-to-Dependency-Conversion.pdf).

Meanwhile, here is a small cheat sheet.

![part-of-speech-tags](GIAN4_data/postags.png)

Let's look at the correspondence between token, lemma, and part-of-speech tag in more detail

In [15]:
for token in doc1[i:i+n]:
    if token.orth_.isalpha(): # remove punctuation
        print ("{:s} => {:s} ({:s}, {:s})".format(token.orth_, token.lemma_, token.pos_, token.tag_))

It => It (, )
was => be (, )
the => the (, )
best => well (, )
of => of (, )
times => time (, )
it => it (, )
was => be (, )
the => the (, )
worst => wrong (, )
of => of (, )
times => time (, )
it => it (, )
was => be (, )
the => the (, )
age => age (, )
of => of (, )
wisdom => wisdom (, )
it => it (, )
was => be (, )
the => the (, )
age => age (, )
of => of (, )
foolishness => foolishness (, )
it => it (, )
was => be (, )
the => the (, )
epoch => epoch (, )
of => of (, )
belief => belief (, )
it => it (, )
was => be (, )
the => the (, )
epoch => epoch (, )
of => of (, )
incredulity => incredulity (, )
it => it (, )
was => be (, )
the => the (, )
season => season (, )
of => of (, )
Light => Light (, )
it => it (, )
was => be (, )
the => the (, )
season => season (, )
of => of (, )
Darkness => Darkness (, )
it => it (, )
was => be (, )
the => the (, )
spring => spring (, )
of => of (, )
hope => hope (, )
it => it (, )
was => be (, )
the => the (, )
winter => winter (, )
of => of (, )
de

## Sentences

Because spaCy has parsed the entire text, we can also look at sentences

In [None]:
sentences=list(doc1.sents)
for i, sentence in enumerate(sentences[:100]):
    print(i,sentence)

In [None]:
sentences[58]

In [None]:
from spacy import displacy
displacy.render(sentences[58], style='dep', jupyter=True)

## Named Entity Recognition

Running spaCy's nlp pipeline also includes [named entity recognition](https://spacy.io/usage/linguistic-features#section-named-entities)

Let's look at the first 20 named entities in the book

In [None]:
print([ent.orth_ for ent in doc1.ents[:100]])

The same thing, but now with named entity labels

In [None]:
print([(ent.orth_, ent.label_) for ent in doc1.ents[:100]])

In [None]:
displacy.render(doc1[:500], style='ent', jupyter=True)

We will now make a frequency dictionary of the persons and geopolitical entities in the document

In [None]:
# persons
doc1_person=Counter([(ent.orth_, ent.label_) for ent in doc1.ents if ent.label_=="PERSON"])
# geo-political entities
doc1_gpe=Counter([(ent.orth_, ent.label_) for ent in doc1.ents if ent.label_=="GPE"])
# dates
doc1_dates=Counter([(ent.orth_, ent.label_) for ent in doc1.ents if ent.label_=="DATE"])
# works of art
doc1_woa=Counter([(ent.orth_, ent.label_) for ent in doc1.ents if ent.label_=="WORK_OF_ART"])

This lets us answer who are the most common persons and ...

In [None]:
doc1_person.most_common(20)

... what are the most common places

In [None]:
doc1_gpe.most_common(20)

##  Insights from plotting text data 

As an example of how plotting data can give us insight in the text, we will use the named entity recogntion data to track different persons and locations throughout the text.

Let's make a list of the 10 most common persons in the book ...

In [None]:
person_tracklist=[e for e, f in doc1_person.most_common(10)]
person_tracklist

... and track them over time

In [None]:
person_trackdict=freq_over_time(doc1, person_tracklist)

Now, we can plot the evolution of the tracked entities throughout the text

In [None]:
plt.style.use("fivethirtyeight")
# plt.plot(person_trackdict[('Lorry', 'PERSON')], label="Lorry")
# plt.plot(person_trackdict[('Cruncher', 'PERSON')], label="Cruncher")
plt.plot(person_trackdict[('Manette', 'PERSON')], label="Manette")
plt.plot(person_trackdict[('Miss Pross', 'PERSON')], label="Miss Pross")
# plt.plot(person_trackdict[('Jerry', 'PERSON')], label="Jerry")
plt.plot(person_trackdict[('Lucie', 'PERSON')], label="Lucie")
plt.ylabel("cumulative frequency")
plt.xlabel("occurence in book")
plt.legend(loc="upper left")

In [None]:
gpe_tracklist=[e for e, f in doc1_gpe.most_common(10)]
gpe_trackdict=freq_over_time(doc1, gpe_tracklist)

In [None]:
gpe_tracklist

In [None]:
plt.style.use("fivethirtyeight")
plt.plot(gpe_trackdict[('Paris', 'GPE')], label="Paris")
# plt.plot(gpe_trackdict[('France', 'GPE')], label="France")
plt.plot(gpe_trackdict[('London', 'GPE')], label="London")
# plt.plot(gpe_trackdict[('England', 'GPE')], label="England")
plt.ylabel("cumulative frequency")
plt.xlabel("book progression (entities)")
plt.legend(loc="upper left")

In [None]:
plt.style.use("fivethirtyeight")
plt.plot(gpe_trackdict[('Paris', 'GPE')], label="Paris")
plt.plot(person_trackdict[('Miss Pross', 'PERSON')], label="Miss Pross")
plt.plot(gpe_trackdict[('London', 'GPE')], label="London")
plt.plot(person_trackdict[('Lucie', 'PERSON')], label="Lucie")
plt.ylabel("cumulative frequency")
plt.xlabel("book progression (entities)")
plt.legend(loc="upper left")

## Combining data from parsing and named entity recognition

To conclude, we combine the output of the NLP parsing with the NLP named entity recognition to see what an entity is doing during the text ...

In [None]:
track_entity="Paris"
for entity in doc1.ents: 
    if entity.orth_==track_entity:
        token=entity[0]
        if token.dep_=="dobj":
            print("+ {:s}".format(' '.join([token.orth_ for token in token.head.subtree])))

Or to find out when two entities occur together in a sentence ....

In [None]:
track_set=set(["France", "England"],)
for sentence in doc1.sents:
    subdoc=nlp(sentence.orth_, )
    entities=set([ent.orth_ for ent in subdoc.ents])
    if len(track_set.intersection(entities))==len(track_set):
        print('+',' '.join([token.orth_ for token in sentence]))