
POS Tagging

In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [None]:
doc = nlp("Elon flew to mars yesterday. He carried biryani masala with him")

for token in doc:
    print(token, "|", token.pos_, spacy.explain(token.pos_))

Elon | PROPN proper noun
flew | VERB verb
to | ADP adposition
mars | NOUN noun
yesterday | NOUN noun
. | PUNCT punctuation
He | PRON pronoun
carried | VERB verb
biryani | ADJ adjective
masala | NOUN noun
with | ADP adposition
him | PRON pronoun


In [None]:
doc = nlp("Elon flew to mars yesterday. He carried biryani masala with him")

for token in doc:
    print(token, "|", token.pos_, spacy.explain(token.pos_), token.tag_, "|", spacy.explain(token.tag_))

Elon | PROPN proper noun NNP | noun, proper singular
flew | VERB verb VBD | verb, past tense
to | ADP adposition IN | conjunction, subordinating or preposition
mars | NOUN noun NNS | noun, plural
yesterday | NOUN noun NN | noun, singular or mass
. | PUNCT punctuation . | punctuation mark, sentence closer
He | PRON pronoun PRP | pronoun, personal
carried | VERB verb VBD | verb, past tense
biryani | ADJ adjective JJ | adjective (English), other noun-modifier (Chinese)
masala | NOUN noun NN | noun, singular or mass
with | ADP adposition IN | conjunction, subordinating or preposition
him | PRON pronoun PRP | pronoun, personal


In [None]:
doc = nlp("He quit the job")
# Fine-grained tag of the verb (token at index 1)
print(doc[1].text, "|", doc[1].tag_, "|", spacy.explain(doc[1].tag_))

quit | VBD | verb, past tense


In [None]:
doc = nlp("He quits the job")
# Same verb in the present tense gets a different fine-grained tag
print(doc[1].text, "|", doc[1].tag_, "|", spacy.explain(doc[1].tag_))

quits | VBZ | verb, 3rd person singular present


In [None]:
text="""Microsoft Corp. today announced the following results for the quarter ended March 31, 2025, as compared to the corresponding period of last fiscal year:

·        Revenue was $70.1 billion and increased 13% (up 15% in constant currency)

·        Operating income was $32.0 billion and increased 16% (up 19% in constant currency)

·        Net income was $25.8 billion and increased 18% (up 19% in constant currency)

·        Diluted earnings per share was $3.46 and increased 18% (up 19% in constant currency)"""

In [None]:
doc = nlp(text)
filtered_tokens = []

for token in doc:
    if token.pos_ not in ["SPACE", "X", "PUNCT"]:
        filtered_tokens.append(token)


In [None]:
filtered_tokens[:2]

[Microsoft, Corp.]

In [None]:
# Count tokens per coarse POS tag; the keys are integer attribute IDs
COUNT = doc.count_by(spacy.attrs.POS)
COUNT

{96: 3,
 92: 22,
 100: 10,
 90: 3,
 85: 8,
 93: 17,
 97: 15,
 98: 1,
 84: 8,
 103: 8,
 87: 4,
 99: 4,
 89: 4,
 86: 4}

In [None]:
# Map each integer attribute ID back to its POS name via the vocab
for k, v in COUNT.items():
    print(doc.vocab[k].text, "|", v)

PROPN | 3
NOUN | 22
VERB | 10
DET | 3
ADP | 8
NUM | 17
PUNCT | 15
SCONJ | 1
ADJ | 8
SPACE | 8
AUX | 4
SYM | 4
CCONJ | 4
ADV | 4
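
The integer keys returned by `count_by` have to be mapped back to names through `doc.vocab`. As an alternative sketch, the same kind of tally can be built directly from the string tags with `collections.Counter`. The helper name `count_pos` is hypothetical, and it accepts any iterable of tag strings so it can be tried without loading a model.

```python
from collections import Counter


def count_pos(tags):
    """Tally POS tag strings and return (tag, count) pairs, most frequent first."""
    return Counter(tags).most_common()


# In the notebook this would be called as: count_pos(token.pos_ for token in doc)
# The list below is a hand-written tag sequence for illustration, not model output.
print(count_pos(["PRON", "VERB", "DET", "NOUN", "NOUN"]))
```

Working with `token.pos_` strings avoids the ID lookup step, at the cost of iterating over the tokens in Python.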


NER (Named Entity Recognition)

In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [None]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [36]:
from spacy import displacy

displacy.render(doc, style="ent")

In [37]:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

In [38]:
doc = nlp("Michael Bloomberg founded Bloomberg in 1982")
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | GPE | Countries, cities, states
1982 | DATE | Absolute or relative dates or periods


In the output above, spaCy mislabeled the second Bloomberg (the company) as GPE instead of ORG. Let's try a Hugging Face model on the same sentence.

https://huggingface.co/dslim/bert-base-NER?text=Michael+Bloomberg+founded+Bloomberg+in+1982

On that page, also go through the three sample examples for NER.
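
The model linked above can also be called from code instead of the web page. The following is a minimal sketch, assuming the `transformers` package is installed and the model can be downloaded. The function names `hf_entities` and `format_entities` are hypothetical helpers, not part of any library. With `aggregation_strategy="simple"`, the pipeline merges word pieces and returns one dict per entity with keys such as `word` and `entity_group`.

```python
def hf_entities(text, model_name="dslim/bert-base-NER"):
    """Run a Hugging Face token-classification pipeline and return its entity dicts."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",  # group word pieces into whole entities
    )
    return ner(text)


def format_entities(entities):
    """Format pipeline results in the 'text | label' style used in this notebook."""
    return [f"{e['word']} | {e['entity_group']}" for e in entities]
```

To try it on the sentence above, run `print("\n".join(format_entities(hf_entities("Michael Bloomberg founded Bloomberg in 1982"))))`. The first call downloads the model weights, so it needs network access.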

In [39]:
doc = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", ent.start_char, "|", ent.end_char)

Tesla Inc  |  ORG  |  0 | 9
Twitter Inc  |  PERSON  |  30 | 41
$45 billion  |  MONEY  |  46 | 57
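
The `start_char` and `end_char` values printed above are character offsets into the original string, with the end offset exclusive, so slicing the text with them gives back the entity text. A minimal sketch in plain Python, using the offsets copied from the output above (no spaCy needed):

```python
text = "Tesla Inc is going to acquire Twitter Inc for $45 billion"

# (start_char, end_char) pairs copied from the output above
offsets = [(0, 9), (30, 41), (46, 57)]

# Slicing with a half-open [start, end) range recovers each entity's text
recovered = [text[start:end] for start, end in offsets]
print(recovered)  # ['Tesla Inc', 'Twitter Inc', '$45 billion']
```

These offsets are what you would store if the entities need to be highlighted or replaced in the raw text later.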


In [40]:
doc = nlp("Tesla is going to acquire Twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  PERSON
$45 billion  |  MONEY


In [41]:
s = doc[2:5]
s

going to acquire

In [42]:
type(s)

spacy.tokens.span.Span

In [43]:
from spacy.tokens import Span

s1 = Span(doc, 0, 1, label="ORG")
s2 = Span(doc, 5, 6, label="ORG")

doc.set_ents([s1, s2], default="unmodified")

In [44]:
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  ORG
$45 billion  |  MONEY
