<h1 style="background-color:DodgerBlue; color:white" >Custom NER using spaCy 3.0+</h1>

Recently, at work, I built a custom NER model with spaCy, a production-grade NLP library.

Building on that experience, this notebook trains a custom transformer-based NER model that detects dataset mentions as entities. This requires spaCy 3.0+.

The whole process is quite straightforward:
1. Build the training dataset by marking the entities in it. spaCy 3.0 expects the DocBin format.
    - For our problem, the training labels tell us where the entities are (the **positive examples**).
    - The remaining sentences can serve as **negative examples**, with the entity start and end indexes both set to 0.
    - **Caution:** In this competition, the train data is not exhaustively labeled, so some of the sentences we mark as negative actually contain dataset mentions. Ideally, you would increase the class-prior weight of the positive examples we already know.
2. Create the full training config from a base config file (**spacy init fill-config** command)
3. Train the spaCy model with the settings in the config file (**spacy train** command)
4. Load the model and use it like any other spaCy pipeline (**spacy.load()** function)
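
As a minimal sketch of step 1, the snippet below builds one positive and one negative example in the `(text, {"entities": [...]})` shape used later in this notebook. The helper name `make_example` and the sample sentences are made up for illustration.

```python
import re

def make_example(sentence, label, tag="DATASET"):
    """Return (text, annotations) with the character offsets of `label`, or (0, 0) if it is absent."""
    match = re.search(re.escape(label), sentence)
    if match is None:
        return (sentence, {"entities": [(0, 0, tag)]})  # negative example
    return (sentence, {"entities": [(match.start(), match.end(), tag)]})  # positive example

label = "national education longitudinal study"  # made-up label
positive = make_example("we use the national education longitudinal study", label)
negative = make_example("the results are discussed in the next section", label)
print(positive[1])
print(negative[1])
```

Slicing the text with the stored offsets should give back the label, which is a cheap way to verify the annotations before training.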


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import json
import glob
import re
from tqdm import tqdm

**Note:** This notebook uses the internet, so it cannot be submitted directly. However, you can take the trained model and use it to make your submissions.

## Install spaCy 3.0+ with Transformers

In [None]:
!pip install -U spacy[transformers]

## Predefined function for preprocessing

For preprocessing, we stick to the given function, which replaces every run of characters other than letters and digits with a single space. However, for training our spaCy model, we do not lowercase the text.

In [None]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

# does not lowercase the text
def clean_text2(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt))
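
A quick check on a made-up sentence: both functions replace the same characters, so for ASCII text the two cleaned strings have equal length, and a match position found in the lowercased text also indexes the cased text. The functions are repeated here so the snippet runs on its own.

```python
import re

def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

def clean_text2(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt))

raw = "Data come from the ADNI study (2010)."  # made-up example sentence
lowered, cased = clean_text(raw), clean_text2(raw)

match = re.search("adni", lowered)                 # search in the lowercased text
span_in_cased = cased[match.start():match.end()]   # reuse the offsets on the cased text

print(lowered)
print(cased)
print(span_in_cased)
```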

# Read train csv and create a sample (for faster demo)

In [None]:
df = pd.read_csv("../input/coleridgeinitiative-show-us-the-data/train.csv")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
# number of unique labels
len(df.cleaned_label.unique()) 

In [None]:
# create a subset for quick demo
sample = df.sample(500)
sample.shape

## Create the training dataset by marking entities

In [None]:
# get positive and negative examples for entities
POSITIVE_DATA = []
NEGATIVE_DATA = []
for idx, row in tqdm(sample.iterrows(), total=len(sample)):
    pub = "../input/coleridgeinitiative-show-us-the-data/train/" + row.Id + ".json"
    with open(pub) as f:  # close the file once it has been read
        data = json.load(f)
    paper_text = " ".join(sec['text'] for sec in data)  # concatenate all section texts
    sentences = paper_text.split(".")  # naive sentence split on periods
    for sentence in sentences:
        sentence2 = clean_text(sentence)  # use given clean_text to find cleaned_label
        cased = clean_text2(sentence)  # same cleaning without lowercasing; used as the training text
        a = re.search(re.escape(row.cleaned_label), sentence2)
        if a is not None:  # if label is found, make it a positive example
            POSITIVE_DATA.append((cased, {"entities": [(a.start(), a.end(), "DATASET")]}))
        elif len(cased) > 20:  # label not found: keep as a negative example if longer than 20 chars
            NEGATIVE_DATA.append((cased, {"entities": [(0, 0, "DATASET")]}))

In [None]:
POSITIVE_DATA[0:10]

In [None]:
len(POSITIVE_DATA)

In [None]:
len(NEGATIVE_DATA)

## We have an IMBALANCED CLASS problem.
#### For brevity, let's downsample the negative class to 2000 examples

In [None]:
import random
# downsample the negative class without replacement (random.choices could pick the same sentence twice)
NEG_SAMPLE = random.sample(NEGATIVE_DATA, k=min(2000, len(NEGATIVE_DATA)))
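
For reference, a small sketch of the difference between sampling with and without replacement, using a made-up list in place of `NEGATIVE_DATA`. `random.choices` draws with replacement and can return the same item more than once, while `random.sample` picks each position at most once, which is usually what downsampling means.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
negatives = [f"sentence {i}" for i in range(10)]  # stand-in for NEGATIVE_DATA

with_replacement = random.choices(negatives, k=8)    # may contain duplicates
without_replacement = random.sample(negatives, k=8)  # 8 distinct items

print(len(set(with_replacement)), "unique items out of", len(with_replacement))
print(len(set(without_replacement)), "unique items out of", len(without_replacement))
```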

In [None]:
TRAIN_DATA = POSITIVE_DATA + NEG_SAMPLE # our train data is positive + negative examples
random.shuffle(TRAIN_DATA) # shuffle the train data in place
len(TRAIN_DATA) # total examples in train data

## spaCy 3.0 uses the DocBin format - convert the train set to this format
#### DocBin is a highly efficient, serializable format used by spaCy 3.0
Use the converter below to change the train set above into the new format.

In [None]:
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in tqdm(TRAIN_DATA): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None: # keep only spans that map onto whole tokens
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object

# Train the spaCy transformer model
https://spacy.io/usage/training#quickstart

In [None]:
# step1: Get baseconfig file from https://spacy.io/usage/training#quickstart
!cp "../input/spacybaseconfigcfg/base_config.cfg" ./

In [None]:
# step2: fill in the base config file.
# The config file contains the training settings.
# spacy init fill-config completes it with default values for all remaining settings
!python -m spacy init fill-config base_config.cfg config.cfg

In [None]:
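# step3: train using the spacy train command
# note: the dev set here is the same file as the train set, so the reported scores are measured on the training data
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy --gpu-id 0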
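
The command above passes the same file as both the train and the dev set. Here is a minimal sketch of one alternative, assuming you want a held-out dev set: split the examples before converting them to DocBin, then write each part to its own file. The function name `split_train_dev` and the 80/20 ratio are illustrative choices, not part of the original notebook.

```python
import random

def split_train_dev(examples, dev_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and split it into (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves the global random state untouched
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]

# stand-in data in the same (text, annotations) shape as TRAIN_DATA
examples = [(f"sentence {i}", {"entities": [(0, 0, "DATASET")]}) for i in range(10)]
train_part, dev_part = split_train_dev(examples)
print(len(train_part), len(dev_part))
```

Each part could then go through the same DocBin loop and be saved under its own name (for example a hypothetical `dev.spacy`), with `--paths.dev` pointing at the dev file.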

### Explaining Training Pipeline Variables

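- E is the epoch number
- LOSS TRANSFORMER is the training loss of the transformer component
- LOSS NER is the training loss of the NER component
- ENTS_F is the entity-level F-score
- ENTS_P is the entity-level precision
- ENTS_R is the entity-level recall
- SCORE is the overall score of the model, used to pick the best model later (`model-best`)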
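
To make the three entity metrics concrete, here is a small sketch that computes precision, recall and F-score from exact matches between gold and predicted `(start, end, label)` spans. The function `prf` and the example spans are made up; spaCy computes these scores itself during training.

```python
def prf(gold, predicted):
    """Precision, recall and F1 over exact (start, end, label) matches."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 4, "DATASET"), (10, 20, "DATASET")]
predicted = [(0, 4, "DATASET"), (30, 35, "DATASET"), (40, 45, "DATASET"), (50, 55, "DATASET")]
p, r, f = prf(gold, predicted)
print(p, r, f)
```

Here one of four predictions is correct (precision 0.25) and one of two gold entities is found (recall 0.5), so a model that over-predicts is penalized on precision even when recall looks acceptable.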

# Load the custom NER model and predict.

In [None]:
from thinc.api import set_gpu_allocator, require_gpu
set_gpu_allocator("pytorch")
require_gpu(0)
# Use spacy.load to load your custom model
custom_ner_model = spacy.load("./output/model-best") # output model is stored as "model-best" and "model-last"

In [None]:
test_pubs = glob.glob("../input/coleridgeinitiative-show-us-the-data/test/*.json")

In [None]:
from spacy import displacy

for pub in test_pubs:
    with open(pub) as f:  # close the file once it has been read
        data = json.load(f)
    paper_text = " ".join(sec['text'] for sec in data)  # concatenate all section texts
    sentences = paper_text.split(".")  # naive sentence split on periods
    for sentence in sentences:
        sentence = clean_text2(sentence)
        doc = custom_ner_model(sentence)
        if len(doc.ents) > 0:  # render only sentences with at least one predicted entity
            displacy.render(doc, style="ent", jupyter=True)
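
The loop above only displays the entities. Below is a minimal sketch of how the predicted spans could be collected instead, assuming you want one deduplicated, lowercased list of dataset strings per publication. The helper `collect_labels` is hypothetical, and the input is a plain list of strings standing in for `[ent.text for ent in doc.ents]`.

```python
import re

def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

def collect_labels(entity_texts):
    """Clean each predicted entity string and return the unique, non-empty ones in sorted order."""
    cleaned = {clean_text(text).strip() for text in entity_texts}
    return sorted(label for label in cleaned if label)

predicted = ["ADNI", "adni", "National Education Longitudinal Study", " "]  # made-up predictions
labels = collect_labels(predicted)
print(labels)
```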
        

# References
1. https://spacy.io/usage/training

#### This is my first notebook on Kaggle. Your feedback and suggestions would be appreciated! - Shivam