### What is NER?

Named entity recognition (NER) is a subtask of information extraction (IE) that locates specified entities in a body of text and classifies them into categories. It is also known as entity identification, entity chunking, or entity extraction. NER is used across artificial intelligence (AI), particularly in natural language processing (NLP) and machine learning.

### What is spaCy?

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

### What are named entities?

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

### What entities are predicted by the spaCy models?


| TYPE | DESCRIPTION |
| ----- | ----------- |
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FAC | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws. |
| LANGUAGE | Any named language. |
| DATE | Absolute or relative dates or periods. |
| TIME | Times smaller than a day. |
| PERCENT | Percentage, including “%”. |
| MONEY | Monetary values, including unit. |
| QUANTITY | Measurements, as of weight or distance. |
| ORDINAL | “first”, “second”, etc. |
| CARDINAL | Numerals that do not fall under another type. |

### Quick example of spaCy entity recognition

In [1]:
import spacy

from spacy import displacy

nlp = spacy.load('en_core_web_sm')

doc = nlp("The Group of Eight (G8) refers to the group of eight highly industrialized nations—France, Germany, Italy, the United Kingdom, Japan, the United States, Canada, and Russia—that hold an annual meeting to foster consensus on global issues like economic growth and crisis management, global security, energy, and terrorism.")

for ent in doc.ents:
    print(ent.text, ent.label_)

print('---------------------------------')
displacy.render(doc, style='ent', jupyter=True)

The Group of Eight ORG
eight CARDINAL
France GPE
Germany GPE
Italy GPE
the United Kingdom GPE
Japan GPE
the United States GPE
Canada GPE
Russia GPE
---------------------------------


### Why customize spaCy NER?

This works well for general-purpose entities, but what if we want to predict our own custom-defined entities?

> You'll write your own training from scratch

Let's say we are a third-party product seller for a large corporation. The corporation sends us thousands of notes telling us which products to stop selling and which new parts to start selling. Extracting this information from the notes by hand is tedious and time-consuming. If we don't extract it promptly, our business suffers, and we lose our customers' confidence if we keep selling products that we are not supposed to sell.

### Steps to create our own model and train it to predict custom entities specific to our use case:

#### Step 1: Create the training data

Training data should be created in the format shown below.


> (TEXT, {'entities': [(startindex, endindex, LABEL)]})


- TEXT is the text from which you want the model to predict entities.
- 'entities' is the dictionary key that holds the list of annotated spans.
- startindex is the character offset at which the entity starts.
- endindex is the character offset at which the entity ends. It is exclusive, so TEXT[startindex:endindex] is the entity itself.
- LABEL is the entity type you want the model to predict for that span.
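
Counting character offsets by hand is error-prone, so it can help to compute them instead. The sketch below is one way to do that in plain Python. `make_example` is a hypothetical helper (it is not part of spaCy), and it assumes that each entity string occurs exactly once in the text.

```python
def make_example(text, entities):
    """Build one training example in the (TEXT, {'entities': [...]}) format.

    `entities` is a list of (entity_text, label) pairs. Hypothetical helper;
    assumes every entity_text occurs exactly once in `text`.
    """
    spans = []
    for entity_text, label in entities:
        start = text.find(entity_text)
        if start == -1:
            raise ValueError(f"{entity_text!r} not found in {text!r}")
        end = start + len(entity_text)  # exclusive end offset
        spans.append((start, end, label))
    return (text, {'entities': spans})


example = make_example(
    '87262727 is discontinued and 98782728 is introduced to replace it',
    [('87262727', 'OLDVERSION'), ('98782728', 'NEWVERSION')],
)
print(example[1])
# {'entities': [(0, 8, 'OLDVERSION'), (29, 37, 'NEWVERSION')]}
```

Because the end offset is exclusive, slicing the text with a span returns exactly the entity string.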

In [2]:
training_data = [('92898282 is discontinued', {'entities': [(0, 8, 'OLDVERSION')]}),
 ('87262727 is discontinued and 98782728 is introduced to replace it', {'entities': [(0, 8, 'OLDVERSION'), (29, 37, 'NEWVERSION')]}),
 ('Replaced by new version 89878292', {'entities': [(24, 32, 'NEWVERSION')]}),
 ('78272822 should be stopped selling and start selling 89782821', {'entities': [(0, 8, 'OLDVERSION'), (53, 61, 'NEWVERSION')]}),
 ('87827728 should be stopped selling immediately due to malfunctioning issues and start selling 98987872', {'entities': [(0, 8, 'OLDVERSION'), (94, 102, 'NEWVERSION')]}),
 ('Start selling 98989202', {'entities' : [(14, 22, "NEWVERSION")]}),
 ('Stop selling 87882922', {'entities' : [(13, 21, "OLDVERSION")]}),
 ('Stop using 87882922 and start selling 76798292', {'entities' : [(11, 19, "OLDVERSION"), (38, 46, 'NEWVERSION')]}),
 ('There is issue with 76767289, start selling replacing part 76798292', {'entities' : [(20, 28, "OLDVERSION"), (59, 67, 'NEWVERSION')]}),
 ("97987893 is the new item which is replacing the problematic 98972389", {'entities' : [(0, 8, "NEWVERSION"), (60, 68, 'OLDVERSION')]})] 
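
An off-by-one offset silently teaches the model the wrong span, so a quick sanity check before training is worthwhile. The following is a minimal sketch: `find_bad_spans` is a hypothetical helper, and it assumes that part numbers are purely numeric, as they are in this walkthrough. The `sample` list is illustrative data written for this check.

```python
def find_bad_spans(data):
    """Return (text, span, reason) for every annotation that looks wrong."""
    problems = []
    for text, annotations in data:
        for start, end, label in annotations['entities']:
            span = (start, end, label)
            if not (0 <= start < end <= len(text)):
                problems.append((text, span, 'out of range'))
            elif not text[start:end].isdigit():
                # Assumption: part numbers in this use case are all digits.
                problems.append((text, span, f'covers {text[start:end]!r}'))
    return problems


sample = [
    # Correct: the span covers exactly the part number.
    ('Start selling 98989202', {'entities': [(14, 22, 'NEWVERSION')]}),
    # Off by one: the text is only 21 characters long.
    ('Stop selling 87882922', {'entities': [(14, 22, 'OLDVERSION')]}),
    # Too wide: the span starts at the beginning of the sentence.
    ('Stop using 87882922', {'entities': [(0, 19, 'OLDVERSION')]}),
]

for text, span, reason in find_bad_spans(sample):
    print(f'{text!r}: {span} -> {reason}')
```

Running the same function over `training_data` should return an empty list when every annotation lines up with a part number.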

#### Step 2: Create a blank English model

> spacy.blank - Create a blank model of a given language class.

This creates a blank model for any language that spaCy supports. Check the spaCy documentation for more details.

In [22]:
# Create a blank 'en' model
nlp = spacy.blank('en')

#### Step 3: Create a new entity recognizer and add it to the pipeline

We want to add an NER component to the blank model. Currently there is nothing in the pipeline, so adding NER is all we need for our use case.

In [23]:
print(nlp.pipe_names)

[]


In [24]:
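# spaCy v2 API: create the component, then add it to the pipeline
# (spaCy v3 takes the component name instead: ner = nlp.add_pipe('ner'))
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)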

In [25]:
print(nlp.pipe_names)

['ner']


#### Step 4: Add the labels you want the NER to predict

Once the NER component is in the pipeline, we need to tell it what to predict. We do that by adding labels to it.

In [5]:
# Add the labels 'OLDVERSION' and 'NEWVERSION' to the entity recognizer
ner.add_label("OLDVERSION")
ner.add_label("NEWVERSION")

#### Step 5: Train the model to predict the new custom entities

We have now created the model, added NER to the pipeline, and told the model which labels we want it to predict.

It is time now to train the model:

- Start the training using begin_training
- Iterate through the training dataset 
- Shuffle the dataset at every iteration
- Update the model
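
To make the shuffle-and-batch steps concrete, the sketch below reproduces them in plain Python without spaCy. `simple_minibatch` is a hypothetical stand-in for `spacy.util.minibatch` with a fixed batch size, `toy_data` is made-up data, and the model update is replaced by a `print`.

```python
import random


def simple_minibatch(items, size):
    """Yield consecutive batches of at most `size` items (hypothetical stand-in)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Five placeholder examples in the (TEXT, {'entities': [...]}) format
toy_data = [(f'text {n}', {'entities': []}) for n in range(5)]

random.seed(0)  # fixed seed only so the sketch is repeatable
for itn in range(2):
    random.shuffle(toy_data)  # a different order on each pass
    for batch in simple_minibatch(toy_data, size=2):
        texts = [text for text, ann in batch]
        annotations = [ann for text, ann in batch]
        print(itn, texts)  # a real loop would update the model here
```

With five examples and a batch size of 2, each pass produces batches of 2, 2, and 1 examples. Shuffling changes which examples end up together from one pass to the next.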

In [7]:
import random

# Start the training (spaCy v2 API); this returns the optimizer
optimizer = nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(training_data)

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(training_data, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model with this batch, reusing the same optimizer
        nlp.update(texts, annotations, sgd=optimizer)

#### Step 6: Testing your model and visualizing the predictions

Once the model is trained, all that remains is to test it and check whether it performs well.

In [15]:
testing_data = ['87827728 should be stopped selling immediately due to malfunctioning issues and start selling 98987872',
             "START SELLING 87987298",
             "97987893 is the new item which is replacing the 88798990"]

# Processing each text in testing data
for doc in nlp.pipe(testing_data):
    print(doc.text)
    displacy.render(doc, style='ent', jupyter=True, options={'colors': {'OLDVERSION': '#f5540f', 'NEWVERSION': '#089927'}})
    for ent in doc.ents:
        print(ent.text, ent.label_)
    print('------------------')

87827728 should be stopped selling immediately due to malfunctioning issues and start selling 98987872


87827728 OLDVERSION
98987872 NEWVERSION
------------------
START SELLING 87987298


87987298 NEWVERSION
------------------
97987893 is the new item which is replacing the 88798990


97987893 NEWVERSION
88798990 OLDVERSION
------------------
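
Eyeballing the output works for three sentences, but a larger test set calls for numbers. A minimal sketch of span-level precision, recall, and F1 follows. `score_spans` is a hypothetical helper, and the gold and predicted spans are hand-written for illustration, not produced by the model above. In practice the predicted spans could be built from `(ent.start_char, ent.end_char, ent.label_)` for each `ent` in `doc.ents`.

```python
def score_spans(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative spans for:
# '97987893 is the new item which is replacing the 88798990'
gold = [(0, 8, 'NEWVERSION'), (48, 56, 'OLDVERSION')]
predicted = [(0, 8, 'NEWVERSION'), (48, 56, 'NEWVERSION')]  # second label is wrong

print(score_spans(gold, predicted))  # (0.5, 0.5, 0.5)
```

A span counts as correct only when its start, end, and label all match, so a right span with the wrong label lowers both precision and recall.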


We just saw how simple it is to build and customize your own model.


Any machine learning model is only as good as the data it learns from. If the data is not good enough, the model will not perform well. The way to improve it is to provide more data: not more of the same kind, but varied examples that cover all the scenarios you expect to see.