# **Text Tagging**

- POS (Part of Speech Tagging)
- NER (Named Entity Recognition)

## POS (Part Of Speech Tagging)

In [1]:
import spacy
import pandas as pd

In [2]:
nlp = spacy.load('en_core_web_sm')
# if you are running this for the first time, or receive the error "Can't find model 'en_core_web_sm'",
# then run the following in your terminal: python -m spacy download en_core_web_sm

In [3]:
emma_ja = "emma woodhouse handsome clever and rich with a comfortable home \
and happy disposition seemed to unite some of the best blessings of existence \
and had lived nearly twentyone years in the world with very little to distress \
or vex her she was the youngest of the two daughters of a most affectionate indulgent \
father and had in consequence of her sisters marriage been mistress of his house from a \
very early period her mother had died too long ago for her to have more than an indistinct \
remembrance of her caresses and her place had been supplied by an excellent woman as governess \
who had fallen little short of a mother in affection sixteen years had miss taylor been in mr \
woodhouses family less as a governess than a friend very fond of both daughters but particularly \
of emma between them it was more the intimacy of sisters even before miss taylor had ceased to hold \
the nominal office of governess the mildness of her temper had hardly allowed her to impose any restraint \
and the shadow of authority being now long passed away they had been living together as friend and friend \
very mutually attached and emma doing just what she liked highly esteeming miss taylors judgment but directed \
chiefly by her own"

In [4]:
# create a spaCy doc from our text - this will generate tokens and their associated POS tags
spacy_doc = nlp(emma_ja)
print(spacy_doc)

emma woodhouse handsome clever and rich with a comfortable home and happy disposition seemed to unite some of the best blessings of existence and had lived nearly twentyone years in the world with very little to distress or vex her she was the youngest of the two daughters of a most affectionate indulgent father and had in consequence of her sisters marriage been mistress of his house from a very early period her mother had died too long ago for her to have more than an indistinct remembrance of her caresses and her place had been supplied by an excellent woman as governess who had fallen little short of a mother in affection sixteen years had miss taylor been in mr woodhouses family less as a governess than a friend very fond of both daughters but particularly of emma between them it was more the intimacy of sisters even before miss taylor had ceased to hold the nominal office of governess the mildness of her temper had hardly allowed her to impose any restraint and the shadow of auth

In [None]:
pos_df = pd.DataFrame(columns=['token', 'pos_tag'])

In [10]:
# NOTE: do not remove stopwords first; they are tagged too and give the tagger context
rows = []
for token in spacy_doc:
    print(token.text, token.pos_)
    rows.append({'token': token.text, 'pos_tag': token.pos_})
# build the DataFrame in one step, so re-running this cell does not append duplicate rows
pos_df = pd.DataFrame(rows, columns=['token', 'pos_tag'])

emma PROPN
woodhouse PROPN
handsome ADJ
clever ADJ
and CCONJ
rich ADJ
with ADP
a DET
comfortable ADJ
home NOUN
and CCONJ
happy ADJ
disposition NOUN
seemed VERB
to PART
unite VERB
some PRON
of ADP
the DET
best ADJ
blessings NOUN
of ADP
existence NOUN
and CCONJ
had AUX
lived VERB
nearly ADV
twentyone NUM
years NOUN
in ADP
the DET
world NOUN
with ADP
very ADV
little ADJ
to PART
distress NOUN
or CCONJ
vex NOUN
her PRON
she PRON
was AUX
the DET
youngest ADJ
of ADP
the DET
two NUM
daughters NOUN
of ADP
a DET
most ADV
affectionate ADJ
indulgent ADJ
father NOUN
and CCONJ
had AUX
in ADP
consequence NOUN
of ADP
her PRON
sisters NOUN
marriage NOUN
been AUX
mistress NOUN
of ADP
his PRON
house NOUN
from ADP
a DET
very ADV
early ADJ
period NOUN
her PRON
mother NOUN
had AUX
died VERB
too ADV
long ADV
ago ADV
for SCONJ
her PRON
to PART
have VERB
more ADJ
than ADP
an DET
indistinct ADJ
remembrance NOUN
of ADP
her PRON
caresses NOUN
and CCONJ
her PRON
place NOUN
had AUX
been AUX
supplied VERB
by ADP
an 

In [14]:
pos_df.head(15)

index,token,pos_tag
0,emma,PROPN
1,woodhouse,PROPN
2,handsome,ADJ
3,clever,ADJ
4,and,CCONJ
5,rich,ADJ
6,with,ADP
7,a,DET
8,comfortable,ADJ
9,home,NOUN


In [23]:
# token frequency count
pos_df_counts = pos_df.groupby(['token', 'pos_tag']).size().reset_index(name='counts').sort_values(by='counts', ascending=False)
# size() computes the number of rows in each group, i.e. how many times each (token, pos_tag) combination occurs
# reset_index() turns the MultiIndex into normal DataFrame columns and names the count column "counts"

pos_df_counts.head(10)

index,token,pos_tag,counts
88,of,ADP,42
49,had,AUX,27
54,her,PRON,27
111,the,DET,24
6,and,CCONJ,24
0,a,DET,18
114,to,PART,15
61,in,ADP,12
13,been,AUX,12
120,very,ADV,12


In [21]:
# number of distinct tokens per pos_tag (pos_df_counts has one row per (token, pos_tag) pair)
pos_df_poscounts = pos_df_counts.groupby(['pos_tag'])['token'].count().sort_values(ascending=False)
pos_df_poscounts.head(10)

pos_tag
NOUN     35
VERB     19
ADJ      18
ADV      18
PRON      9
ADP       8
PROPN     6
DET       5
AUX       4
CCONJ     3
Name: token, dtype: int64
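
Because `pos_df_counts` holds one row per (token, pos_tag) pair, the groupby above counts distinct tokens per tag rather than how often each tag occurs in the text. The following is a minimal sketch of that difference in plain Python, using the first eleven tagged tokens printed earlier; the names `tagged`, `occurrences`, and `distinct` are illustrative and not part of the notebook.

```python
from collections import Counter

# First eleven (token, pos_tag) pairs copied from the tagging output above.
tagged = [
    ("emma", "PROPN"), ("woodhouse", "PROPN"), ("handsome", "ADJ"),
    ("clever", "ADJ"), ("and", "CCONJ"), ("rich", "ADJ"), ("with", "ADP"),
    ("a", "DET"), ("comfortable", "ADJ"), ("home", "NOUN"),
    ("and", "CCONJ"),
]

# Total occurrences of each tag: every tagged token counts.
occurrences = Counter(tag for _, tag in tagged)

# Distinct tokens per tag: each unique (token, tag) pair counts once.
distinct = Counter(tag for _, tag in set(tagged))

print(occurrences["CCONJ"], distinct["CCONJ"])  # 2 1
```

If total occurrences are what you need, something like `pos_df['pos_tag'].value_counts()` on the uncounted DataFrame should give them directly.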

In [None]:
# see most common nouns
nouns = pos_df_counts[pos_df_counts.pos_tag == "NOUN"][0:10]
nouns

index,token,pos_tag,counts
48,governess,NOUN,6
46,friend,NOUN,6
130,years,NOUN,4
35,emma,NOUN,4
28,daughters,NOUN,4
103,sisters,NOUN,4
82,mother,NOUN,4
89,office,NOUN,2
78,mistress,NOUN,2
75,mildness,NOUN,2


## Named Entity Recognition

In [17]:
import spacy
from spacy import displacy
from IPython.display import display, HTML
import re
nlp = spacy.load('en_core_web_sm')

In [18]:
google_text = "Google was founded on September 4, 1998, by computer scientists Larry Page and Sergey Brin while they were PhD students at Stanford University in California. Together they own about 14% of its publicly listed shares and control 56% of its stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly owned subsidiary of Alphabet Inc. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larry Page, who became the CEO of Alphabet. On December 3, 2019, Pichai also became the CEO of Alphabet."
print(google_text)

Google was founded on September 4, 1998, by computer scientists Larry Page and Sergey Brin while they were PhD students at Stanford University in California. Together they own about 14% of its publicly listed shares and control 56% of its stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly owned subsidiary of Alphabet Inc. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larry Page, who became the CEO of Alphabet. On December 3, 2019, Pichai also became the CEO of Alphabet.


In [19]:
spacy_doc = nlp(google_text)
# doc.ents holds the named-entity spans the model found (an entity can cover several tokens)
# 'label_' gives the name of the label as a string, whereas 'label' only gives its integer ID
for ent in spacy_doc.ents:
    print(ent.text, ent.label_)

Google ORG
September 4, 1998 DATE
Larry Page PERSON
Sergey Brin PERSON
PhD WORK_OF_ART
Stanford University ORG
California GPE
about 14% PERCENT
56% PERCENT
IPO ORG
2004 DATE
2015 DATE
Google ORG
Alphabet Inc. ORG
Alphabet ORG
Alphabet ORG
Sundar Pichai PERSON
Google ORG
October 24, 2015 DATE
Larry Page PERSON
Alphabet GPE
December 3, 2019 DATE
Pichai PERSON
Alphabet GPE
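
The output above labels "Alphabet" as ORG in some places and GPE in others. One way to spot such inconsistencies is to collect the set of labels each entity text received. This is a rough sketch in plain Python over pairs copied from the output; the helper `labels_per_entity` is hypothetical, and with a live doc the pairs could presumably be built as `[(ent.text, ent.label_) for ent in spacy_doc.ents]`.

```python
from collections import defaultdict

def labels_per_entity(entities):
    """Map each entity text to the set of labels it received."""
    labels = defaultdict(set)
    for text, label in entities:
        labels[text].add(label)
    return dict(labels)

# A subset of the (text, label) pairs printed above.
entities = [
    ("Google", "ORG"), ("Larry Page", "PERSON"), ("Alphabet Inc.", "ORG"),
    ("Alphabet", "ORG"), ("Alphabet", "ORG"), ("Google", "ORG"),
    ("Larry Page", "PERSON"), ("Alphabet", "GPE"), ("Alphabet", "GPE"),
]

# Keep only the entity texts that were given more than one label.
conflicts = {
    text: sorted(labels)
    for text, labels in labels_per_entity(entities).items()
    if len(labels) > 1
}
print(conflicts)  # {'Alphabet': ['GPE', 'ORG']}
```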


In [22]:
html = displacy.render(spacy_doc,style="ent",jupyter=False)
HTML(html)

***Cleaning the text can make spaCy's tagging less accurate, because the model may rely on punctuation, capitalization, and similar cues in the language to analyze the text.***

In [25]:
google_text_clean = re.sub(r'[^\w\s]', '', google_text).lower() # remove punctuation and lowercase
print(google_text_clean)

google was founded on september 4 1998 by computer scientists larry page and sergey brin while they were phd students at stanford university in california together they own about 14 of its publicly listed shares and control 56 of its stockholder voting power through supervoting stock the company went public via an initial public offering ipo in 2004 in 2015 google was reorganized as a wholly owned subsidiary of alphabet inc google is alphabets largest subsidiary and is a holding company for alphabets internet properties and interests sundar pichai was appointed ceo of google on october 24 2015 replacing larry page who became the ceo of alphabet on december 3 2019 pichai also became the ceo of alphabet
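
To see exactly what the cleaning step strips out, the sketch below applies the same regular expression token by token to one sentence of the original text and lists the tokens that lost punctuation. The helper name `clean` and the variable `changed` are illustrative only.

```python
import re

def clean(text):
    """Same cleaning as above: strip punctuation, then lowercase."""
    return re.sub(r'[^\w\s]', '', text).lower()

sentence = (
    "Together they own about 14% of its publicly listed shares and "
    "control 56% of its stockholder voting power through super-voting stock."
)

# Pair each whitespace-separated token with its cleaned form and keep the
# ones where punctuation was removed (a pure case change does not count).
changed = [
    (tok, clean(tok))
    for tok in sentence.split()
    if clean(tok) != tok.lower()
]
print(changed)
# [('14%', '14'), ('56%', '56'), ('super-voting', 'supervoting'), ('stock.', 'stock')]
```

Losing the `%` sign is a plausible reason why a model would stop treating "14" and "56" as percentages, and losing the sentence-final period removes sentence boundaries altogether.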


In [26]:
spacy_doc_clean = nlp(google_text_clean)
for word in spacy_doc_clean.ents:
    print(word.text, word.label_)

google ORG
september 4 1998 DATE
stanford university ORG
california GPE
about 14 CARDINAL
56 CARDINAL
2004 DATE
2015 DATE
alphabet inc google ORG
google ORG
october 24 2015 DATE
larry PERSON
december 3 2019 DATE
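
The effect of cleaning can be summarized by comparing how often each label appears in the two runs. This is a small sketch that tallies the labels copied from the two outputs above (in printed order); it does not call spaCy, and the variable names are illustrative.

```python
from collections import Counter

# Entity labels from the run on the original text, in printed order.
original = ("ORG DATE PERSON PERSON WORK_OF_ART ORG GPE PERCENT PERCENT ORG "
            "DATE DATE ORG ORG ORG ORG PERSON ORG DATE PERSON GPE DATE "
            "PERSON GPE").split()

# Entity labels from the run on the cleaned text, in printed order.
cleaned = ("ORG DATE ORG GPE CARDINAL CARDINAL DATE DATE ORG ORG DATE "
           "PERSON DATE").split()

before, after = Counter(original), Counter(cleaned)

# Counter subtraction keeps only positive differences.
lost = before - after      # labels found less often after cleaning
gained = after - before    # labels found more often after cleaning

print(len(original), len(cleaned))  # 24 13
print(dict(gained))                 # {'CARDINAL': 2}
```

Here `lost` shows four fewer PERSON and four fewer ORG entities after cleaning, while the two PERCENT entities reappear as CARDINAL.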


In [27]:
HTML(displacy.render(spacy_doc_clean,style="ent",jupyter=False))