### Setup Environment

In [1]:
#!pip install torch torchvision
#!pip install spacy
#!python -m spacy download en_core_web_trf

In [2]:
import pandas as pd

import spacy
import spacy_transformers
from spacy import displacy
from spacy.tokens import Span

transformer = spacy.load('en_core_web_trf') #other options - en_core_web_sm, en_core_web_lg

### NER on text

In [93]:
#source : https://www.nytimes.com/2023/06/09/opinion/ai-big-tech-microsoft-google-duopoly.html

text = "In just a few months, Microsoft broke speed records in establishing ChatGPT, a form of generative artificial intelligence that it plans to invest $10 billion into, as a household name. And last month, Sundar Pichai, C.E.O. of Alphabet/Google, unveiled a suite of A.I. tools — including for email, spreadsheets and drafting all manner of text. While there is some discussion as to whether Meta’s recent decision to give away its A.I. computer code will accelerate its progress, the reality is that all competitors to Alphabet and Microsoft remain far behind."

text2 = "Smoke from wildfires in Canada has drifted down into the U.S. on Wednesday, leading to extremely poor air quality across much of the eastern U.S., with alerts in effect all the way from New England to the Southeast. In all, more than 100 million Americans were affected by air quality alerts, the Environmental Protection Agency said."

text3 = "Global esports qualified."

In [48]:
def show_entities(doc):
    """Return the entities found in `doc` as a DataFrame, one row per entity."""
    columns = ['Word', 'Start', 'End', 'Entity', 'Description']
    if not doc.ents:
        print("Entities not found")
    # Build rows as tuples instead of joining and re-splitting on '|',
    # so entity text that contains '|' cannot break the columns.
    rows = [(ent.text, ent.start_char, ent.end_char, ent.label_, spacy.explain(ent.label_))
            for ent in doc.ents]
    return pd.DataFrame(rows, columns=columns)

#### Example 1

In [75]:
doc = transformer(text)
doc

In just a few months, Microsoft broke speed records in establishing ChatGPT, a form of generative artificial intelligence that it plans to invest $10 billion into, as a household name. And last month, Sundar Pichai, C.E.O. of Alphabet/Google, unveiled a suite of A.I. tools — including for email, spreadsheets and drafting all manner of text. While there is some discussion as to whether Meta’s recent decision to give away its A.I. computer code will accelerate its progress, the reality is that all competitors to Alphabet and Microsoft remain far behind.

In [76]:
displacy.render(doc, style='ent')
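
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below is one way to do that without loading a model: it uses displaCy's manual mode, where the text and entity offsets are supplied as a dictionary. The shortened sentence and the name `manual_doc` are illustrative assumptions, not part of the notebook.

```python
from spacy import displacy

# Hypothetical, shortened input: text plus character offsets and labels
manual_doc = {
    "text": "In just a few months, Microsoft broke speed records.",
    "ents": [
        {"start": 3, "end": 20, "label": "DATE"},
        {"start": 22, "end": 31, "label": "ORG"},
    ],
    "title": None,
}

# manual=True skips the Doc object; jupyter=False returns HTML as a string
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
print(html[:200])
```

The returned string can be written to an `.html` file and opened in a browser.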

In [49]:
res = show_entities(doc)
res

| | Word | Start | End | Entity | Description |
|---|---|---|---|---|---|
| 0 | just a few months | 3 | 20 | DATE | Absolute or relative dates or periods |
| 1 | Microsoft | 22 | 31 | ORG | Companies, agencies, institutions, etc. |
| 2 | ChatGPT | 68 | 75 | PRODUCT | Objects, vehicles, foods, etc. (not services) |
| 3 | $10 billion | 146 | 157 | MONEY | Monetary values, including unit |
| 4 | last month | 189 | 199 | DATE | Absolute or relative dates or periods |
| 5 | Sundar Pichai | 201 | 214 | PERSON | People, including fictional |
| 6 | Alphabet/Google | 226 | 241 | ORG | Companies, agencies, institutions, etc. |
| 7 | Meta | 388 | 392 | ORG | Companies, agencies, institutions, etc. |
| 8 | Alphabet | 516 | 524 | ORG | Companies, agencies, institutions, etc. |
| 9 | Microsoft | 529 | 538 | ORG | Companies, agencies, institutions, etc. |
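
The Start and End values are character offsets into the original string (`ent.start_char` and `ent.end_char`), with End exclusive, so an ordinary Python slice should recover each entity. A small check on the first sentence of `text`, using the first four rows of the table (the names `sentence` and `rows` are only for this sketch):

```python
# First sentence of `text`, copied from the cell above
sentence = (
    "In just a few months, Microsoft broke speed records in establishing "
    "ChatGPT, a form of generative artificial intelligence that it plans "
    "to invest $10 billion into, as a household name."
)

# (Word, Start, End) taken from the first four rows of the table
rows = [
    ("just a few months", 3, 20),
    ("Microsoft", 22, 31),
    ("ChatGPT", 68, 75),
    ("$10 billion", 146, 157),
]

for word, start, end in rows:
    # End is exclusive, so a plain slice recovers the entity text
    print(repr(sentence[start:end]), sentence[start:end] == word)
```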


#### Example 2

In [60]:
doc2 = transformer(text2)
doc2

Smoke from wildfires in Canada has drifted down into the U.S. on Wednesday, leading to extremely poor air quality across much of the eastern U.S., with alerts in effect all the way from New England to the Southeast. In all, more than 100 million Americans were affected by air quality alerts, the Environmental Protection Agency said.

In [61]:
displacy.render(doc2, style='ent')

In [54]:
res2 = show_entities(doc2)
res2

| | Word | Start | End | Entity | Description |
|---|---|---|---|---|---|
| 0 | Canada | 24 | 30 | GPE | Countries, cities, states |
| 1 | U.S. | 57 | 61 | GPE | Countries, cities, states |
| 2 | Wednesday | 65 | 74 | DATE | Absolute or relative dates or periods |
| 3 | U.S. | 141 | 145 | GPE | Countries, cities, states |
| 4 | New England | 186 | 197 | LOC | Non-GPE locations, mountain ranges, bodies of... |
| 5 | Southeast | 205 | 214 | LOC | Non-GPE locations, mountain ranges, bodies of... |
| 6 | more than 100 million | 224 | 245 | CARDINAL | Numerals that do not fall under another type |
| 7 | Americans | 246 | 255 | NORP | Nationalities or religious or political groups |
| 8 | the Environmental Protection Agency | 293 | 328 | ORG | Companies, agencies, institutions, etc. |
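
With the entities in tabular form, a tally by label gives a quick picture of what the model found. The sketch below copies the Entity column from the table into a plain list (the name `labels` is just for illustration) and counts it with the standard library. On the DataFrame itself, `res2['Entity'].value_counts()` would be the pandas equivalent.

```python
from collections import Counter

# Entity column of the table above, in row order
labels = ["GPE", "GPE", "DATE", "GPE", "LOC", "LOC", "CARDINAL", "NORP", "ORG"]

counts = Counter(labels)
for label, n in counts.most_common():
    print(label, n)
```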


### Add custom entity tag

In [94]:
doc3 = transformer(text3)
displacy.render(doc3, style='ent')

Here, "Global Esports" is an organization that the model does not identify.
We can tag it ourselves by creating a `Span` (from `spacy.tokens`) over the relevant tokens and appending it to `doc3.ents`.

In [95]:
# Tokens 0 and 1 ("Global", "esports") form the span; the end index 2 is exclusive
new_ent = Span(doc3, 0, 2, label="ORG")
doc3.ents = list(doc3.ents) + [new_ent]
displacy.render(doc3, style='ent')
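
Counting token indices by hand gets error-prone on longer texts. One alternative, sketched here under the assumption that the phrase to tag is known in advance, is to locate it by character offset and let `Doc.char_span` map it onto tokens. To stay self-contained, the example uses `spacy.blank("en")` (tokenizer only) as a stand-in for the trained pipeline. The names `nlp`, `phrase`, and `span` are illustrative.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; stands in for the trained pipeline
doc = nlp("Global esports qualified.")

phrase = "Global esports"
start = doc.text.find(phrase)

# char_span returns None if the offsets do not line up with token boundaries
span = doc.char_span(start, start + len(phrase), label="ORG")

if span is not None:
    doc.ents = list(doc.ents) + [span]

print([(ent.text, ent.label_) for ent in doc.ents])
```

Checking for `None` matters because a character range that cuts through a token cannot be turned into a span.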