## Custom Named Entity Recognition using spaCy
Identifying Named Entities stands out as a crucial task in Natural Language Processing (NLP), playing a pivotal role in processing data. The goal is to pinpoint and categorize significant information, such as entities, within textual data. These entities encompass words or word sequences, typically proper nouns, consistently representing a specific entity. As an example, a system for entity detection might identify the term "NewsCatcher" in text and label it as an "Organization."

At its core, all entity recognition systems have two steps:

- Detecting the entities in text
- Categorizing the entities into named classes

1. In the first step, NER finds where in the text an entity starts and ends. This is commonly done by tagging each token with an inside-outside-beginning (IOB) scheme.

2. The second step involves putting entities into categories. These categories can change based on what you're looking for, but common ones include people, organizations, locations, time, measurements, and patterns like emails or phone numbers.
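To make the IOB idea concrete, here is a minimal sketch in plain Python. The helper name `spans_to_iob`, the token list, and the `PERSON`/`GPE` labels are illustrative assumptions, not output from any model: `B-` marks the first token of an entity, `I-` marks the tokens that continue it, and `O` marks everything outside an entity.

```python
def spans_to_iob(tokens, spans):
    """Convert token-index entity spans to IOB tags.

    tokens: list of token strings
    spans:  list of (start_token, end_token_exclusive, label)
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the entity
    return tags


# Hand-written example; the spans are assumed, not predicted.
tokens = ["Samantha", "lives", "in", "Green", "Valley", "."]
spans = [(0, 1, "PERSON"), (3, 5, "GPE")]

for token, tag in zip(tokens, spans_to_iob(tokens, spans)):
    print(f"{token:10} {tag}")
```

The tag sequence encodes both steps at once: the `B-`/`I-` prefixes carry the boundaries, and the suffix carries the category.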

While there are some rule-based approaches, most modern systems use machine learning or deep learning. Text is often ambiguous: the word 'Sydney' can be a place or a person's name, so a system has to use the surrounding context to pick the right label, and learned models handle this better than fixed rules.

### Applications of Named Entity Recognition
NER is like a super-smart assistant for dealing with lots of text. It's handy whenever you need the computer to quickly figure out what a bunch of text is all about. A good NER helps the computer get the gist of the subject or main idea in the text and sort documents based on how relevant they are. It's like having a fast and efficient organizer for a mountain of information!

Common applications include:
- Information Extraction And Summarization
- Optimizing Search Engines
- Machine Translation
- Content Classification
- Customer Support

### Practical Implementation

**NER in spaCy**
Think of spaCy as the quick and efficient superhero of Python for dealing with language stuff. It's really fast and comes with handy tools for understanding text. Since v3.0, spaCy also supports transformer-based pipelines and a config-driven training workflow, both of which we use below. When you load a pretrained pipeline, it comes with components for part-of-speech tagging, dependency parsing, and named entity recognition. It's like a one-stop shop for making sense of words!

In [1]:
!pip install spacy[transformers]



In [2]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     587.7/587.7 MB 2.9 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


#### 1. Imports

In [3]:
import spacy
import json

nlp = spacy.load("en_core_web_lg")

print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [4]:
text = "Samantha loves her cozy home in Green Valley. Every morning, she sips hot cocoa on her favorite blue couch and chats with her best friend, Benny the Bunny. Outside, the sun shines on tall trees, and birds sing sweet melodies. In the evening, Samantha and Benny enjoy tasty carrot snacks together. Life in Green Valley is simple and joyful, filled with warmth and friendship."
doc = nlp(text)
print(doc)
print(type(doc))

Samantha loves her cozy home in Green Valley. Every morning, she sips hot cocoa on her favorite blue couch and chats with her best friend, Benny the Bunny. Outside, the sun shines on tall trees, and birds sing sweet melodies. In the evening, Samantha and Benny enjoy tasty carrot snacks together. Life in Green Valley is simple and joyful, filled with warmth and friendship.
<class 'spacy.tokens.doc.Doc'>


In [5]:
# entities
print(doc.ents)

(Samantha, Green Valley, Every morning, Benny the Bunny, Samantha, Benny, Green Valley)


In [6]:
print(type(doc.ents))

<class 'tuple'>


In [7]:
print(doc.ents[0], type(doc.ents[0]), sep="\n")

Samantha
<class 'spacy.tokens.span.Span'>


In [8]:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

#### 2. Data Loading and Processing
The data used in this experiment comes from the [Medical NER dataset on Kaggle](https://www.kaggle.com/datasets/finalepoch/medical-ner).

In [9]:
with open('./Corona2.json', 'r') as f:
    data = json.load(f)

In [10]:
data['examples'][0]

{'id': '18c2f619-f102-452f-ab81-d26f7e283ffe',
 'content': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]",
 'metadata': {},
 'annotations': [{'id': '0825a1

In [11]:
data['examples'][0].keys()

dict_keys(['id', 'content', 'metadata', 'annotations', 'classifications'])

In [12]:
data['examples'][0]['content']

"While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]"

In [13]:
data['examples'][0]['annotations'][0]

{'id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed',
 'tag_id': 'c06bd022-6ded-44a5-8d90-f17685bb85a1',
 'end': 371,
 'start': 360,
 'example_id': '18c2f619-f102-452f-ab81-d26f7e283ffe',
 'tag_name': 'Medicine',
 'value': 'Diosmectite',
 'correct': None,
 'human_annotations': [{'timestamp': '2020-03-21T00:24:32.098000Z',
   'annotator_id': 1,
   'tagged_token_id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed',
   'name': 'Ashpat123',
   'reason': 'exploration'}],
 'model_annotations': []}

Of all these key-value pairs, we only need the text string, the entity start and end indices, and the entity type.
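Because training depends entirely on these character offsets, it can be worth confirming that each `start`/`end` pair really slices out the recorded `value`. The sketch below is one way to do that. The helper name `check_offsets` is hypothetical and the toy record is made up; only the key names (`content`, `annotations`, `start`, `end`, `value`, `tag_name`) come from the dataset shown above.

```python
def check_offsets(example):
    """Return the annotations whose [start:end] slice does not equal 'value'."""
    text = example["content"]
    mismatches = []
    for ann in example["annotations"]:
        if text[ann["start"]:ann["end"]] != ann["value"]:
            mismatches.append(ann)
    return mismatches


# Toy record using the dataset's key names; the text and offsets are invented.
toy = {
    "content": "Loperamide reduces diarrhea.",
    "annotations": [
        {"start": 0, "end": 10, "tag_name": "Medicine", "value": "Loperamide"},
        {"start": 19, "end": 27, "tag_name": "MedicalCondition", "value": "diarrhea"},
        # Deliberately wrong: the slice is "reduc", not "reduces".
        {"start": 11, "end": 16, "tag_name": "Medicine", "value": "reduces"},
    ],
}

for ann in check_offsets(toy):
    print("Offset mismatch:", ann)
```

On the real data this would be called as `check_offsets(example)` for each item in `data['examples']`.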

In [14]:
training_data = {'classes' : ['MEDICINE', "MEDICALCONDITION", "PATHOGEN"], 'annotations' : []}
for example in data['examples']:
  temp_dict = {}
  temp_dict['text'] = example['content']
  temp_dict['entities'] = []
  for annotation in example['annotations']:
    start = annotation['start']
    end = annotation['end']
    label = annotation['tag_name'].upper()
    temp_dict['entities'].append((start, end, label))
  training_data['annotations'].append(temp_dict)

In [15]:
training_data.keys()

dict_keys(['classes', 'annotations'])

In [16]:
print(training_data['annotations'])

[{'text': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]", 'entities': [(360, 371, 'MEDICINE'), (383, 408, 'MEDICINE'), (104, 112, 'MEDICALCONDITION'), (679

spaCy v3 expects training data in its binary `.spacy` format, which is produced by the `DocBin` class. To get our training examples ready, we need to build a `DocBin` object. This class serializes a collection of `Doc` objects into a compact form. It is faster and produces smaller files than pickle, and it lets you load the data back without executing arbitrary Python code.

In [17]:
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en")  # create a blank English pipeline (tokenizer only)
doc_bin = DocBin()  # container that will hold the annotated Doc objects

There are some entity span overlaps, i.e., the indices of some entities overlap. spaCy provides a utility method filter_spans to deal with this.
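As a rough illustration of what resolving overlaps means, the sketch below keeps the longest span and drops anything that collides with it. It is a simplified stand-in written for this walkthrough, not spaCy's implementation: the name `filter_overlaps` is hypothetical, and it works on `(start, end, label)` character tuples instead of `Span` objects.

```python
def filter_overlaps(spans):
    """Keep the longest spans first; drop any span overlapping a kept one.

    spans: list of (start, end, label) tuples with an exclusive end.
    """
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        # A span is safe if it ends before, or starts after, every kept span.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Made-up offsets: the first two spans overlap on characters 3-4.
spans = [(0, 5, "MEDICINE"), (3, 12, "MEDICALCONDITION"), (20, 24, "PATHOGEN")]
print(filter_overlaps(spans))
```

Here the 9-character span wins over the 5-character one it overlaps, and the unrelated third span is kept.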

In [18]:
from spacy.util import filter_spans

for training_example in tqdm(training_data['annotations']):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("./training_data.spacy") # save the docbin object

100%|██████████| 31/31 [00:00<00:00, 241.27it/s]

Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
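Each "Skipping entity" line means `doc.char_span` returned `None` for that annotation. With `alignment_mode="contract"`, a character span is shrunk inward to the tokens that lie completely inside it, and nothing is returned when no whole token fits. The sketch below imitates that idea in plain Python. It assumes whitespace-separated tokens, which is simpler than spaCy's tokenizer, and `contract_to_tokens` is a hypothetical helper name.

```python
import re


def contract_to_tokens(text, start, end):
    """Snap a character span inward to whole tokens.

    Returns (new_start, new_end), or None when no token lies fully inside.
    Tokens are whitespace-separated chunks here, a deliberate simplification.
    """
    inside = [
        m.span()
        for m in re.finditer(r"\S+", text)
        if m.start() >= start and m.end() <= end
    ]
    if not inside:
        return None
    return inside[0][0], inside[-1][1]


text = "acute diarrhea in children"
print(contract_to_tokens(text, 6, 14))  # offsets already on token boundaries
print(contract_to_tokens(text, 5, 15))  # surrounding whitespace is trimmed away
print(contract_to_tokens(text, 8, 12))  # falls inside one token, so nothing fits
```

Annotations that start or end in the middle of a word are the usual cause of the last case, so counting the skipped entities is a quick check on annotation quality.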





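We'll be working with a base config file (`base_config.cfg`) generated with the quickstart widget in spaCy's training documentation. The `init fill-config` command below fills in the remaining default values and writes the result to `config.cfg`.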

In [19]:
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Now we have all that we need to train our model.
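One thing to keep in mind: the training command in the next cell passes the same file to `--paths.train` and `--paths.dev`, so the scores it reports are measured on data the model has already seen. If you want a held-out estimate, one option is to split the examples before building the `DocBin` files. This is a minimal sketch under assumptions: the function name `split_annotations`, the 80/20 ratio, and the fixed seed are illustrative choices, and the stand-in list only mimics the shape of `training_data['annotations']`.

```python
import random


def split_annotations(annotations, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split them into (train, dev)."""
    shuffled = list(annotations)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Stand-in for training_data['annotations'], which holds 31 examples here.
examples = [{"text": f"doc {i}", "entities": []} for i in range(31)]
train, dev = split_annotations(examples)
print(len(train), len(dev))
```

Each half would then go through the same `DocBin` loop shown earlier and be saved to its own `.spacy` file. With only 31 documents the dev set is tiny, so its scores will be noisy.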

In [20]:
!python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy --gpu-id 0

ℹ Saving to output directory: .
ℹ Using GPU: 0
config.json: 100% 481/481 [00:00<00:00, 2.47MB/s]
vocab.json: 100% 899k/899k [00:00<00:00, 8.97MB/s]
merges.txt: 100% 456k/456k [00:00<00:00, 7.34MB/s]
tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 29.3MB/s]
model.safetensors: 100% 499M/499M [00:04<00:00, 109MB/s]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Initialized pipeline
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0        3401.20    594.13    0.36    0.34    0.39    0.00
 66     200      119724.56  33
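The `ENTS_P`, `ENTS_R` and `ENTS_F` columns in the training log report entity-level precision, recall and F-score. As a rough guide to what those numbers mean, the sketch below computes them for exact matches, where a predicted entity counts as correct only if its start, end and label all agree with a gold entity. The function name `entity_prf` and the spans are made up for illustration; this is not spaCy's scorer.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1 over exact (start, end, label) matches."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented spans: one exact match, one wrong label, one missed entity.
gold = [(0, 10, "MEDICINE"), (19, 27, "MEDICALCONDITION"), (40, 48, "PATHOGEN")]
pred = [(0, 10, "MEDICINE"), (19, 27, "MEDICINE")]

p, r, f = entity_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

Note that the second prediction has the right boundaries but the wrong label, so it counts against both precision and recall.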

Let’s load the best-performing model and test it on a piece of text.

In [21]:
nlp_ner = spacy.load("model-best")

In [22]:
doc = nlp_ner("While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.")

colors = {"PATHOGEN": "#F67DE3", "MEDICINE": "#7DF6D9", "MEDICALCONDITION":"#a6e22d"}
options = {"colors": colors}

displacy.render(doc, style="ent", options=options, jupyter=True)

This is how you can train a custom NER model with spaCy. Remember: since we trained the model on a health care dataset, it will likely perform worse on text from other domains. You can use the same process to train a custom NER model for your own application; you'll just need some annotated data. If you can't find an existing dataset for your use case, you can create your own with one of the following data annotation tools:

- [Doccano](https://doccano.herokuapp.com/)
- [LightTag](https://www.lighttag.io/)
- [Prodigy](https://demo.prodi.gy/?=null&view_id=ner_manual)