## Part-of-Speech (POS) Tagging

Part-of-speech (POS) tagging is a Natural Language Processing (NLP) task in which each word in a text is assigned a grammatical category such as noun, verb, adjective, or adverb. These labels add a layer of syntactic information to the words, which makes it easier to analyze a sentence’s structure and meaning.

POS tagging is useful in NLP applications such as machine translation, named entity recognition, and information extraction. It also helps resolve ambiguity in words that have several meanings and reveals a sentence’s grammatical structure.

![image.png](attachment:a93ea046-aceb-43b0-aeb1-be2ce9c919bf.png)

![IMG-20250111-WA0021.jpg](attachment:298c798f-44c5-4db2-891e-99b16924e209.jpg)

![IMG-20250111-WA0022.jpg](attachment:6fe957bc-b376-4b24-b8a1-e2afa86f45dc.jpg)

![IMG-20250111-WA0023.jpg](attachment:e6dca1c3-aefa-45ef-9df5-a29039669338.jpg)

![IMG-20250111-WA0024.jpg](attachment:8b264593-edfb-462a-97c2-8119a303b935.jpg)

![IMG-20250111-WA0025.jpg](attachment:c745e058-fe5a-4f96-b67e-bac4df215555.jpg)

![IMG-20250111-WA0026.jpg](attachment:95068271-6116-45bb-8273-2507f528efcf.jpg)

### POS tags

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
doc = nlp("Elon flew to mars yesterday. He carried biryani masala with him")

for token in doc:
    print(token ," --> ",token.pos_)

Elon  -->  PROPN
flew  -->  VERB
to  -->  ADP
mars  -->  NOUN
yesterday  -->  NOUN
.  -->  PUNCT
He  -->  PRON
carried  -->  VERB
biryani  -->  ADJ
masala  -->  NOUN
with  -->  ADP
him  -->  PRON
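
For longer texts, the per-token printout above can be collected into a dictionary keyed by POS label. The following is a minimal sketch: `group_by_pos` is a hypothetical helper name, and it only assumes objects that expose `.text` and `.pos_`, as spaCy tokens do. The stand-in tokens reuse tags from the output above so the snippet runs without loading a model.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_pos(tokens):
    """Group token texts by coarse POS label, preserving token order."""
    groups = defaultdict(list)
    for token in tokens:
        groups[token.pos_].append(token.text)
    return dict(groups)


# Stand-in tokens copied from the output above (assumption: used only so the
# sketch runs without a model; pass a spaCy Doc in real use).
sample = [
    SimpleNamespace(text="Elon", pos_="PROPN"),
    SimpleNamespace(text="flew", pos_="VERB"),
    SimpleNamespace(text="to", pos_="ADP"),
    SimpleNamespace(text="mars", pos_="NOUN"),
    SimpleNamespace(text="yesterday", pos_="NOUN"),
]

for pos, words in group_by_pos(sample).items():
    print(pos, "-->", words)
```

With the notebook's pipeline, `group_by_pos(nlp("..."))` should behave the same way, since iterating over a `Doc` yields its tokens.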


In [4]:
doc = nlp("Wow! Dr. Strange made 265 million $ on the very first day")

for token in doc:
    print(token," ---> ",token.pos_," ---> ",spacy.explain(token.pos_))

Wow  --->  INTJ  --->  interjection
!  --->  PUNCT  --->  punctuation
Dr.  --->  PROPN  --->  proper noun
Strange  --->  PROPN  --->  proper noun
made  --->  VERB  --->  verb
265  --->  NUM  --->  numeral
million  --->  NUM  --->  numeral
$  --->  NUM  --->  numeral
on  --->  ADP  --->  adposition
the  --->  DET  --->  determiner
very  --->  ADV  --->  adverb
first  --->  ADJ  --->  adjective
day  --->  NOUN  --->  noun


### Tags

In [5]:
doc = nlp("Wow! Dr. Strange made 265 million $ on the very first day")

for token in doc:
    print(token," ---> ",token.pos_," ---> ",token.tag_," ---> ",spacy.explain(token.tag_))

Wow  --->  INTJ  --->  UH  --->  interjection
!  --->  PUNCT  --->  .  --->  punctuation mark, sentence closer
Dr.  --->  PROPN  --->  NNP  --->  noun, proper singular
Strange  --->  PROPN  --->  NNP  --->  noun, proper singular
made  --->  VERB  --->  VBD  --->  verb, past tense
265  --->  NUM  --->  CD  --->  cardinal number
million  --->  NUM  --->  CD  --->  cardinal number
$  --->  NUM  --->  CD  --->  cardinal number
on  --->  ADP  --->  IN  --->  conjunction, subordinating or preposition
the  --->  DET  --->  DT  --->  determiner
very  --->  ADV  --->  RB  --->  adverb
first  --->  ADJ  --->  JJ  --->  adjective (English), other noun-modifier (Chinese)
day  --->  NOUN  --->  NN  --->  noun, singular or mass
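
The output above pairs each coarse label (`token.pos_`) with a fine-grained tag (`token.tag_`). One way to see how the two levels relate is to collect the fine-grained tags observed under each coarse label. This is a sketch under assumptions: `tags_per_pos` is a hypothetical helper, and the stand-in tokens copy a few rows from the output above so the code runs without a model.

```python
from types import SimpleNamespace


def tags_per_pos(tokens):
    """Map each coarse POS label to the sorted fine-grained tags seen with it."""
    mapping = {}
    for token in tokens:
        mapping.setdefault(token.pos_, set()).add(token.tag_)
    return {pos: sorted(tags) for pos, tags in mapping.items()}


# Stand-in tokens copied from the output above (assumption: illustration only;
# a spaCy Doc can be passed instead).
sample = [
    SimpleNamespace(text="made", pos_="VERB", tag_="VBD"),
    SimpleNamespace(text="265", pos_="NUM", tag_="CD"),
    SimpleNamespace(text="million", pos_="NUM", tag_="CD"),
    SimpleNamespace(text="first", pos_="ADJ", tag_="JJ"),
    SimpleNamespace(text="day", pos_="NOUN", tag_="NN"),
]

for pos, tags in tags_per_pos(sample).items():
    print(pos, "-->", tags)
```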


#### spaCy distinguishes past from present tense for "quit"

In [6]:
doc = nlp("He quits the job")

print(doc[1].text, "|", doc[1].tag_, "|", spacy.explain(doc[1].tag_))

quits | VBZ | verb, 3rd person singular present


In [7]:
doc = nlp("He quit the job")

print(doc[1].text, "|", doc[1].tag_, "|", spacy.explain(doc[1].tag_))

quit | VBD | verb, past tense
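
The tense information lives in the fine-grained tag, which follows the Penn Treebank verb tags. The lookup below is a small sketch of how those tags could be turned into a tense check. The helper names (`describe_verb_tag`, `is_past_tense`, `is_present_tense`) are hypothetical and not part of spaCy.

```python
# Penn Treebank verb tags with a plain-language reading of each.
VERB_TAG_MEANING = {
    "VB": "base form",
    "VBD": "past tense",
    "VBG": "gerund or present participle",
    "VBN": "past participle",
    "VBP": "present, non-3rd person singular",
    "VBZ": "present, 3rd person singular",
}


def describe_verb_tag(tag):
    """Return a short description of a verb tag, or a fallback for other tags."""
    return VERB_TAG_MEANING.get(tag, "not a verb tag")


def is_past_tense(tag):
    """True only for the simple past tag (VBD)."""
    return tag == "VBD"


def is_present_tense(tag):
    """True for the two finite present-tense tags (VBP, VBZ)."""
    return tag in {"VBP", "VBZ"}


for tag in ("VBZ", "VBD"):
    print(tag, "|", describe_verb_tag(tag), "| past:", is_past_tense(tag))
```

In the notebook, `is_past_tense(doc[1].tag_)` would apply the same check to the tagged verb.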


## Removing all SPACE, PUNCT and X tokens from text

In [8]:
earnings_text="""Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:

·         Revenue was $51.7 billion and increased 20%
·         Operating income was $22.2 billion and increased 24%
·         Net income was $18.8 billion and increased 21%
·         Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft."""

doc = nlp(earnings_text)

for token in doc:
    print(token," ---> ",token.pos_," ---> ",spacy.explain(token.pos_))


Microsoft  --->  PROPN  --->  proper noun
Corp.  --->  PROPN  --->  proper noun
today  --->  NOUN  --->  noun
announced  --->  VERB  --->  verb
the  --->  DET  --->  determiner
following  --->  VERB  --->  verb
results  --->  NOUN  --->  noun
for  --->  ADP  --->  adposition
the  --->  DET  --->  determiner
quarter  --->  NOUN  --->  noun
ended  --->  VERB  --->  verb
December  --->  PROPN  --->  proper noun
31  --->  NUM  --->  numeral
,  --->  PUNCT  --->  punctuation
2021  --->  NUM  --->  numeral
,  --->  PUNCT  --->  punctuation
as  --->  SCONJ  --->  subordinating conjunction
compared  --->  VERB  --->  verb
to  --->  ADP  --->  adposition
the  --->  DET  --->  determiner
corresponding  --->  ADJ  --->  adjective
period  --->  NOUN  --->  noun
of  --->  ADP  --->  adposition
last  --->  ADJ  --->  adjective
fiscal  --->  ADJ  --->  adjective
year  --->  NOUN  --->  noun
:  --->  PUNCT  --->  punctuation


  --->  SPACE  --->  space
·  --->  PUNCT  --->  punctuation
          --->

In [9]:
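filtered_tokens = []

for token in doc:
    # Keep everything except whitespace, punctuation, and unclassified ("X") tokens
    if token.pos_ not in ["SPACE", "PUNCT", "X"]:
        filtered_tokens.append(token)
        print(token)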


Microsoft
Corp.
today
announced
the
following
results
for
the
quarter
ended
December
31
2021
as
compared
to
the
corresponding
period
of
last
fiscal
year
Revenue
was
$
51.7
billion
and
increased
20
%
Operating
income
was
$
22.2
billion
and
increased
24
%
Net
income
was
$
18.8
billion
and
increased
21
%
Diluted
earnings
per
share
was
$
2.48
and
increased
22
%
Digital
technology
is
the
most
malleable
resource
at
the
world
’s
disposal
to
overcome
constraints
and
reimagine
everyday
work
and
life
said
Satya
Nadella
chairman
and
chief
executive
officer
of
Microsoft
As
tech
as
a
percentage
of
global
GDP
continues
to
increase
we
are
innovating
and
investing
across
diverse
and
growing
markets
with
a
common
underlying
technology
stack
and
an
operating
model
that
reinforces
a
common
strategy
culture
and
sense
of
purpose
Solid
commercial
execution
represented
by
strong
bookings
growth
driven
by
long
term
Azure
commitments
increased
Microsoft
Cloud
revenue
to
$
22.1
billion
up
32
%
year
over
year
sa
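
The same filter can be wrapped in a reusable function so the excluded labels are easy to change. This is a sketch, not a spaCy API: `keep_content_tokens` and `EXCLUDED_POS` are hypothetical names, and the stand-in tokens copy a few rows from the earnings output above (the whitespace token text is assumed to be two newlines).

```python
from types import SimpleNamespace

# POS labels to drop; a frozenset gives fast membership tests.
EXCLUDED_POS = frozenset({"SPACE", "PUNCT", "X"})


def keep_content_tokens(tokens, excluded=EXCLUDED_POS):
    """Return the tokens whose coarse POS label is not in `excluded`."""
    return [token for token in tokens if token.pos_ not in excluded]


# Stand-in tokens (assumption: illustration only; pass a spaCy Doc in real use).
sample = [
    SimpleNamespace(text="fiscal", pos_="ADJ"),
    SimpleNamespace(text="year", pos_="NOUN"),
    SimpleNamespace(text=":", pos_="PUNCT"),
    SimpleNamespace(text="\n\n", pos_="SPACE"),
    SimpleNamespace(text="·", pos_="PUNCT"),
]

print([token.text for token in keep_content_tokens(sample)])
```

Passing a different `excluded` set, for example one that also contains `"SYM"`, would additionally drop symbols such as `$` and `%`.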

In [10]:
count = doc.count_by(spacy.attrs.POS)
count

{96: 15,
 92: 45,
 100: 23,
 90: 9,
 85: 16,
 93: 16,
 97: 27,
 98: 1,
 84: 20,
 103: 10,
 87: 6,
 99: 5,
 89: 12,
 86: 3,
 94: 3,
 95: 2}
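
The integer keys above are IDs rather than label names. When only the names are needed, an alternative is to count the `token.pos_` strings directly with `collections.Counter`. The sketch below assumes token-like objects with a `.pos_` attribute; `count_pos_labels` is a hypothetical helper, and the stand-in tokens reuse tags from the first example in this notebook.

```python
from collections import Counter
from types import SimpleNamespace


def count_pos_labels(tokens):
    """Count coarse POS labels by name instead of by integer ID."""
    return Counter(token.pos_ for token in tokens)


# Stand-in tokens (assumption: illustration only; pass a spaCy Doc in real use).
sample = [
    SimpleNamespace(text="Elon", pos_="PROPN"),
    SimpleNamespace(text="flew", pos_="VERB"),
    SimpleNamespace(text="to", pos_="ADP"),
    SimpleNamespace(text="mars", pos_="NOUN"),
    SimpleNamespace(text="yesterday", pos_="NOUN"),
    SimpleNamespace(text=".", pos_="PUNCT"),
]

counts = count_pos_labels(sample)
print(counts.most_common())
```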

In [12]:
doc.vocab[96].text

'PROPN'

In [14]:
for k,v in count.items():
    print(doc.vocab[k].text, " ---> ",v)

PROPN  --->  15
NOUN  --->  45
VERB  --->  23
DET  --->  9
ADP  --->  16
NUM  --->  16
PUNCT  --->  27
SCONJ  --->  1
ADJ  --->  20
SPACE  --->  10
AUX  --->  6
SYM  --->  5
CCONJ  --->  12
ADV  --->  3
PART  --->  3
PRON  --->  2
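
To make the distribution easier to read, the counts can be sorted by frequency and shown with each label's share of all tokens. The sketch below copies the counts and the ID-to-label pairs from the outputs above; `ranked_pos_counts` is a hypothetical helper name.

```python
# Counts and ID-to-label pairs copied from the notebook output above.
count = {96: 15, 92: 45, 100: 23, 90: 9, 85: 16, 93: 16, 97: 27, 98: 1,
         84: 20, 103: 10, 87: 6, 99: 5, 89: 12, 86: 3, 94: 3, 95: 2}
labels = {96: "PROPN", 92: "NOUN", 100: "VERB", 90: "DET", 85: "ADP",
          93: "NUM", 97: "PUNCT", 98: "SCONJ", 84: "ADJ", 103: "SPACE",
          87: "AUX", 99: "SYM", 89: "CCONJ", 86: "ADV", 94: "PART", 95: "PRON"}


def ranked_pos_counts(count, labels):
    """Return (label, count, share) tuples sorted by count, highest first."""
    total = sum(count.values())
    ranked = sorted(count.items(), key=lambda item: item[1], reverse=True)
    return [(labels[pos_id], n, n / total) for pos_id, n in ranked]


for label, n, share in ranked_pos_counts(count, labels):
    print(f"{label:<6} {n:>3}  {share:.1%}")
```

In the notebook itself, the `labels` dictionary could be built from the existing lookup with `{k: doc.vocab[k].text for k in count}` instead of being typed out.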
