<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Note:" data-toc-modified-id="Note:-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Note:</a></span></li></ul></div>

In [1]:
import re
import spacy
import en_core_web_lg

In [2]:
# Load pre-existing spacy model
nlp = en_core_web_lg.load()

In [3]:
# Getting the pipeline component
ner = nlp.get_pipe("ner")

In [4]:
# Training data example
# ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]})

In [5]:
# TRAIN_DATA = [
#               ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
#               ("I booked a Lamborghini yesterday", {"entities": [(11, 22, "ORG")]}),
#               ("I recently ordered a book from Amazon", {"entities": [(31,37, "ORG")]})
#               ]

In [6]:
def create_training_data(text, ents):
    """For single line of text"""
    mapped_ents = []
    for ent in ents:
        ent_idx = re.search(ent[0], text).span()
        mapped_ents.append((ent_idx[0], ent_idx[1], ent[1]))
    return (text, {"entities": mapped_ents})

def create_bulk_training_data(data):
    """For list of text"""
    return [create_training_data(item[0], item[1]) for item in data]

In [7]:
create_training_data(
    text = "I recently ordered a book from Amazon",
    ents = [
        ("book", "PRODUCT"), 
        ("Amazon", "ORG")
        ])

('I recently ordered a book from Amazon',
 {'entities': [(21, 25, 'PRODUCT'), (31, 37, 'ORG')]})

In [8]:
data = [
    (
        "Walmart is a leading e-commerce company in the U.S",
        [
            ("e-commerce", "SERVICE_TYPE"),
            ("U.S", "LOC")
        ]),
    (
        "I booked a Lamborghini yesterday",
        [
            ("Lamborghini", "ORG")
            ]),
    (
        "I recently ordered a book from Amazon",
        [
            ("Amazon", "ORG")
            ])
    ]

In [9]:
TRAIN_DATA = create_bulk_training_data(data=data)
TRAIN_DATA

[('Walmart is a leading e-commerce company in the U.S',
  {'entities': [(0, 7, 'ORG')]}),
 ('I booked a Lamborghini yesterday', {'entities': [(11, 22, 'ORG')]}),
 ('I recently ordered a book from Amazon', {'entities': [(31, 37, 'ORG')]})]

In [10]:
# Adding labels to the `ner`
for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

But before you train, remember that apart from ner , the model has other pipeline components. These components should not get affected in training.

So, disable the other pipeline components through `nlp.disable_pipes()` method.

In [11]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [12]:
# Disable pipeline components you dont need to change
pipe_exceptions = ['tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'transformer', 'tok2vec']
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
print(unaffected_pipes)

['ner']


In [13]:
import random
from spacy.util import minibatch
from spacy.training import Example
from pathlib import Path

In [14]:
# To train an ner model, the model has to be looped over the example for sufficient number of iterations (atleast 20).
EPOCHS = 20
optimizer = nlp.create_optimizer()

with nlp.disable_pipes(*unaffected_pipes):
    for epoch in range(EPOCHS):
        # Before every iteration it’s a good practice to shuffle the examples randomly through random.shuffle() function. 
        # This will ensure the model does not make generalizations based on the order of the examples.
        random.shuffle(TRAIN_DATA)
        
        losses={}
        
        # The training data is usually passed in batches. 
        # Use the minibatch() function of spaCy over the training data that will return you data in batches. 
        # The minibatch function takes size parameter to denote the batch size.
        for batch in minibatch(TRAIN_DATA, size=1):
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                # For each iteration , the model or ner is updated through the nlp.update() command.
                nlp.update([example], drop=0.35, sgd=optimizer, losses=losses)
        print(f"Loss after epoch {epoch+1}: ", losses)

Loss after epoch 1:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 2:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 3:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 4:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 5:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 6:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 7:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 8:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 9:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 10:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 11:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 12:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 13:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 14:  {'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0}
Loss after epoch 15:  {'tok2vec': 0.0, 'tag

### Note:

* Parameters of **`nlp.update()`** are :

    * **docs:** 
    This expects a batch of texts as input. You can pass each batch to the zip method, which will return you batches of text and annotations.

    * **golds:** 
    You can pass the annotations we got through zip method here

    * **drop:** 
    This represents the dropout rate.
    
    * **losses:** 
    A Dictionary to hold the losses against each pipeline component. Create an empty dictionary and pass it here.



* At each word, the **`update()`** it makes a prediction. It then consults the annotations to check if the prediction is right. If it isn’t , it adjusts the weights so that the correct action will score higher next time.


All of the training is done within the context of the nlp model with disabled pipeline, to prevent the other components from being involved.

In [15]:
nlp.to_disk("custom_ner")

In [16]:
nlp1 = nlp.from_disk("custom_ner")

In [17]:
ner_doc = nlp1("Today I drove a Lamborghini and ordered a book from Amazon")

In [18]:
from spacy import displacy

In [19]:
displacy.render(ner_doc, jupyter=True, style='ent')

In [20]:
for word in ner_doc.ents:
    print(word.text, word.label_)

Today DATE
Lamborghini ORG
Amazon ORG


In [3]:
import pandas as pd

In [4]:
df = pd.read_json('livemint_2023-01-23.json', lines=True)
df.head(3)

Unnamed: 0,news_link,pub_date,title,keywords,articles,scraped_date
0,https://www.livemint.com/companies/news/it-sta...,2023-01-22T22:58:58+05:30,"IT, startups may cut up to 20,000 jobs in ...","layoffs,job cuts,Indian IT,startups,employees,...","Pressure from investors, an expected global re...",2023-01-23
1,https://www.livemint.com/companies/news/rbitoa...,2023-01-22T22:39:03+05:30,RBI may appeal Bombay HC order on Yes Ba...,"RBI,Bombay high court,Yes Bank,additional tier...",Bombay high court ruling could have a huge sec...,2023-01-23
2,https://www.livemint.com/companies/people/we-w...,2023-01-22T22:36:50+05:30,"‘We want to maximize HNI, retail interest in A...","Adani Enterprises,fpo,sale,adani cfo,retail in...",Adani group has been running investor outreach...,2023-01-23


In [6]:
df.shape

(29, 6)

In [7]:
x = df['articles'][0]
print(x)

Pressure from investors, an expected global recession and the domino impact of global layoffs have resulted in staff cuts  India’s IT and startup sectors may lay off 15,000 to 20,000 employees in the next six months, battling slowing demand after the hiring frenzy of the last two years inflated salary costs.  India’s IT and startup sectors may lay off 15,000 to 20,000 employees in the next six months, battling slowing demand after the hiring frenzy of the last two years inflated salary costs.  Recruitment consultants expect fewer hiring mandates in the months ahead and have decided not to enter new businesses for now. Recruitment consultants expect fewer hiring mandates in the months ahead and have decided not to enter new businesses for now. However, even as some IT and startup companies will shed staff to manage costs, others within the same sectors are hiring, too. “We expect about 20,000 layoffs over the next few quarters. Over the last year, companies faced the fear of missing out

In [9]:
x.split(".")[0]

'Pressure from investors, an expected global recession and the domino impact of global layoffs have resulted in staff cuts  India’s IT and startup sectors may lay off 15,000 to 20,000 employees in the next six months, battling slowing demand after the hiring frenzy of the last two years inflated salary costs'