## Part-1: Introduction

In [29]:
## Loading the libraries

import spacy                # open-source NLP library in Python with several pre-trained models
from spacy import displacy  # spaCy's built-in visualizer for named entities and dependency parses
import en_core_web_sm       # English pipeline optimized for CPU
nlp = en_core_web_sm.load()   

In [30]:
## Sample sentence

doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')

print([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]


In the code above, the names refer to:


*   doc.ents: the named-entity spans found in the doc
*   X.text: the original text of the entity
*   X.label_: the entity type as a string label (for example, 'ORG')

Entity types:

* NORP - Nationalities or religious or political groups.
* ORG - Companies, agencies, institutions, etc.
* MONEY - Monetary values, including unit.
* DATE - Absolute or relative dates or periods.
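Once the `(text, label)` pairs are extracted, it is often handy to look entities up by type. The snippet below is a minimal sketch of that idea in plain Python; it reuses the pairs printed for the sample sentence, and the helper name `group_by_label` is hypothetical, not part of spaCy.

```python
from collections import defaultdict

# (text, label) pairs copied from the output of the sample sentence above
entities = [
    ("European", "NORP"),
    ("Google", "ORG"),
    ("$5.1 billion", "MONEY"),
    ("Wednesday", "DATE"),
]

def group_by_label(pairs):
    """Map each entity label to the list of entity texts carrying it.

    Hypothetical helper for illustration; not a spaCy function.
    """
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
print(by_label["ORG"])
```

With a real `Doc`, the same helper could be fed `[(X.text, X.label_) for X in doc.ents]`.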

In [31]:
print(spacy.explain("NORP"))
print(spacy.explain("ORG"))

Nationalities or religious or political groups
Companies, agencies, institutions, etc.


In [32]:
## Visualizing the entities

displacy.render(doc, jupyter=True, style='ent')

In [33]:
## Visualizing the dependency tree
## The dependency visualizer, dep, shows part-of-speech tags and syntactic dependencies.

displacy.render(doc, jupyter = True, style='dep', options = {'distance': 50})

## Part-2: Example

In [34]:
import requests

target_url = "https://raw.githubusercontent.com/sharvitomar/text-file/main/temp.txt"
response = requests.get(target_url)
data = response.text
data

'Technology firms including Microsoft have tried to disrupt a cybercriminal group whose malicious software has been used in ransomware attacks and other hacks around the world, the companies said Wednesday.\n\nThe effort included a court order from the US District Court for the Northern District of Georgia that allowed Microsoft (MSFT) to seize 65 internet domains used by the hacking group behind widely used malware known as ZLoader, Microsoft said.\n\nSince surfacing in 2019, ZLoader has been used in an array of financially motivated hacking schemes — many of them aimed at organizations in North America. The hackers have also been involved in a tool for deploying a type of ransomware that has to be used in hacks against health care organizations, according to Microsoft.\n\nMicrosoft said it identified one of the people involved in the hacking enterprise and that it referred information to law enforcement authorities.\nThe US Justice Department did not respond to a request for comment.

In [35]:
article = nlp(data)
print(article)

Technology firms including Microsoft have tried to disrupt a cybercriminal group whose malicious software has been used in ransomware attacks and other hacks around the world, the companies said Wednesday.

The effort included a court order from the US District Court for the Northern District of Georgia that allowed Microsoft (MSFT) to seize 65 internet domains used by the hacking group behind widely used malware known as ZLoader, Microsoft said.

Since surfacing in 2019, ZLoader has been used in an array of financially motivated hacking schemes — many of them aimed at organizations in North America. The hackers have also been involved in a tool for deploying a type of ransomware that has to be used in hacks against health care organizations, according to Microsoft.

Microsoft said it identified one of the people involved in the hacking enterprise and that it referred information to law enforcement authorities.
The US Justice Department did not respond to a request for comment.

Other 

In [36]:
## 1. Number of entities in the article
len(article.ents)

32

In [37]:
from collections import Counter

## 2. Number of entities per label
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'ORG': 15,
         'DATE': 3,
         'LOC': 2,
         'GPE': 6,
         'CARDINAL': 4,
         'NORP': 1,
         'PERSON': 1})
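Raw counts are easier to compare as proportions. As a rough sketch, the counts shown above can be turned into per-label shares; the numbers are copied from that output rather than recomputed, and the variable names are only illustrative.

```python
from collections import Counter

# Label counts copied from the Counter output above
label_counts = Counter({
    "ORG": 15, "DATE": 3, "LOC": 2, "GPE": 6,
    "CARDINAL": 4, "NORP": 1, "PERSON": 1,
})

# Total number of entities; this should match len(article.ents)
total = sum(label_counts.values())

# Share of each label, most frequent label first
shares = [(label, count / total) for label, count in label_counts.most_common()]

for label, share in shares:
    print(f"{label:<9}{share:6.1%}")
```

In the notebook itself, `label_counts` would simply be `Counter(labels)`.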

In [38]:
## 3. The 3 most frequent entities
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('Microsoft', 6), ('ZLoader', 2), ('Wednesday', 1)]

In [39]:
## 4. Visualise 1 sentence
sentences = [x for x in article.sents]
displacy.render(sentences[2], jupyter=True, style='ent')

In [40]:
## 5. Visualizing the entire article
displacy.render(article, jupyter=True, style='ent')

## Part-3: Adding custom tokens as NE


In [41]:
doc = nlp('Tesla to build a U.K. factory for $6 million')
print([(X.text, X.label_) for X in doc.ents])

[('U.K.', 'GPE'), ('$6 million', 'MONEY')]


Right now, spaCy does not recognize "Tesla" as a company.

In [42]:
from spacy.tokens import Span

# Get the hash value of the ORG entity label
ORG = doc.vocab.strings['ORG']

# Create a Span for the new entity
new_ent = Span(doc, 0, 1, label=ORG)

# Add the entity to the existing Doc object
doc.ents = list(doc.ents) + [new_ent]

In the code above, the arguments passed to Span() are:

*  doc - the Doc object the span belongs to
*  0 - the index of the first token of the span
*  1 - the index of the token at which the span stops (exclusive)
*  label = ORG - the label assigned to our entity
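The start and stop indices behave like an ordinary Python slice over tokens. The sketch below illustrates this with a plain list; note the assumption that a whitespace split stands in for spaCy's tokenizer, which would split some tokens differently (for example, "$6" into "$" and "6").

```python
# Whitespace split used as a rough stand-in for spaCy's tokenizer (an assumption;
# spaCy's real tokenization differs for some tokens, such as "$6").
tokens = "Tesla to build a U.K. factory for $6 million".split()

start, end = 0, 1            # the same values passed to Span(doc, 0, 1, ...)
covered = tokens[start:end]  # end is exclusive, so only token 0 is included
print(covered)

# A span covering two tokens would use end = start + 2
two_tokens = tokens[4:6]
print(two_tokens)
```

Because the stop index is exclusive, a single-token entity at position `i` is always `Span(doc, i, i + 1, ...)`.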



In [43]:
doc.vocab.strings['ORG']


383

In [44]:
print([(X.text, X.label_) for X in doc.ents])

[('Tesla', 'ORG'), ('U.K.', 'GPE'), ('$6 million', 'MONEY')]


In [45]:
displacy.render(doc, jupyter=True, style='ent')