# Introduction
This notebook is a continuation of my [EDA Notebook](https://www.kaggle.com/jagdmir/coleridge-ner-using-spacy).
Here I use spaCy's named-entity recognition (NER) model to identify the datasets mentioned in the publications.

Named-entity recognition (NER) is the process of automatically identifying the entities discussed in a text and classifying them into pre-defined categories such as ‘person’, ‘organization’, ‘location’ and so on. 

The spaCy library lets you train NER models either by updating an existing spaCy model to suit the specific context of your text documents or by training a fresh NER model from scratch.

Named Entity Recognition is implemented by the pipeline component `ner`. Most of the models have it in their processing pipeline by default.

In [None]:
import numpy as np
import pandas as pd
import os,re
import glob
import json
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

In [None]:
# load train.csv
train_csv = pd.read_csv("../input/coleridgeinitiative-show-us-the-data/train.csv")
train_csv.head()

In [None]:
# no. of unique labels in the dataset
train_csv.dataset_label.nunique()

In [None]:
# keep one sample row for each dataset label
train_csv.drop_duplicates(subset="dataset_label",inplace=True)

In [None]:
# make a copy to use as the training dataset
train = train_csv.copy()

# Data Preparation

spaCy accepts training data as a list of tuples.

Each tuple should contain the text and a dictionary.

The dictionary should hold the `start` and `end` character indices of each named entity in the text, along with the entity's category or label.

For example, `("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]})`
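As a quick sanity check, the offsets in such a tuple can be verified by slicing the text with them. The sketch below uses only the standard library. `make_example` is a hypothetical helper (it is not part of spaCy) that looks for a label in a sentence and builds the tuple, assuming the label appears verbatim in the text.

```python
import re

def make_example(text, label_text, label="ORG"):
    """Build a (text, {"entities": [...]}) tuple for the first match of label_text.

    Hypothetical helper for illustration only. The entity list is empty
    when label_text does not occur in text.
    """
    # re.escape keeps characters such as "(" or "." in the label literal
    match = re.search(re.escape(label_text), text)
    entities = [(match.start(), match.end(), label)] if match else []
    return (text, {"entities": entities})

example = make_example("Walmart is a leading e-commerce company", "Walmart")
text, annotations = example
start, end, label = annotations["entities"][0]
print(example)
print(text[start:end])  # slicing with the offsets recovers the entity text
```

If the slice does not return the entity text, the offsets are wrong and the example should not be used for training.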

Let's do this!

In [None]:
def clean_sentence(txt):
    # lowercase and replace every run of characters other than letters, digits and "." with a space
    return re.sub('[^A-Za-z0-9.]+', ' ', str(txt).lower())

In [None]:
import nltk

DATA = []
ent_count = 0
empty_count = 0

for idx, row in tqdm(train.iterrows()):
    pub = "../input/coleridgeinitiative-show-us-the-data/train/" + row.Id + ".json"
    with open(pub) as f:  # close the file once it has been read
        data = json.load(f)

    # clean the label the same way as the sentences, and escape it so that
    # characters such as "." are matched literally rather than as regex syntax
    label = re.escape(clean_sentence(row.dataset_label).strip())

    balanced = False

    sentences = nltk.tokenize.sent_tokenize(str(data))

    for sentence in sentences:
        sentence = clean_sentence(sentence).strip()
        a = re.search(label, sentence)
        if a is not None:
            DATA.append((sentence, {"entities": [(a.start(), a.end(), "DATASET")]}))
            ent_count = ent_count + 1
            balanced = True
        elif balanced:
            # keep one entity-free sentence after each match as a negative example
            DATA.append((sentence, {"entities": []}))
            empty_count = empty_count + 1
            balanced = False
print("Text with entities:", ent_count, "Text without entities:", empty_count)

In [None]:
len(DATA)

In [None]:
TRAIN_DATA = DATA

# Model Building

1. To train an NER model, the model has to loop over the examples for a sufficient number of iterations.

2. Before every iteration, it is good practice to shuffle the examples randomly with the `random.shuffle()` function.
   This ensures the model does not make generalizations based on the order of the examples.

3. The training data is usually passed in batches.
   Calling spaCy's `minibatch()` function on the training data returns the data in batches.
   The `minibatch()` function takes a `size` parameter that sets the batch size.

4. In each iteration, the model (here, the `ner` component) is updated through the `nlp.update()` command.

   Parameters of `nlp.update()` are:

   * `docs`: A batch of texts. Unpacking each batch with `zip(*batch)` separates it into the texts and their annotations.

   * `golds`: The batch of annotations obtained from the same `zip(*batch)` call.

   * `drop`: The dropout rate.

   * `losses`: A dictionary that holds the losses for each pipeline component. Create an empty dictionary and pass it here.

At each word, `update()` makes a prediction. It then consults the annotations to check whether the prediction is right. If it isn't, it adjusts the weights so that the correct action scores higher next time.

Finally, all of the training is done inside the `nlp.disable_pipes()` context, which disables the other pipeline components so that only NER is updated.
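The shuffle-then-batch pattern from steps 2 and 3 can be sketched without spaCy. The helpers below, `compounding_sizes` and `make_batches`, are hypothetical pure-Python stand-ins written for illustration. They mimic the idea of a batch size that grows slowly from a start value toward a cap, and they are not spaCy's own `compounding` and `minibatch` implementations.

```python
import random

def compounding_sizes(start, stop, compound):
    """Yield batch sizes that grow from start toward stop by a factor of compound."""
    size = start
    while True:
        yield min(size, stop)
        size *= compound

def make_batches(items, sizes):
    """Split items into consecutive batches whose lengths are drawn from sizes."""
    items = list(items)
    position = 0
    while position < len(items):
        batch_size = int(next(sizes))
        yield items[position:position + batch_size]
        position += batch_size

# Toy training data in the (text, annotations) format used in this notebook
toy_data = [("sentence %d" % i, {"entities": []}) for i in range(10)]

random.seed(0)            # fixed seed so the run is repeatable
random.shuffle(toy_data)  # shuffle before each pass over the data

batches = list(make_batches(toy_data, compounding_sizes(4.0, 16.0, 1.001)))
for batch in batches:
    texts, annotations = zip(*batch)  # unzip into parallel tuples
    print(len(texts), "texts in this batch")
```

With a growth factor as small as 1.001, the batch size stays at 4 for a long time, so a small training set is effectively trained with a fixed batch size.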

In [None]:
import random
import spacy
from spacy.util import minibatch, compounding

# Note: this function uses the spaCy 2.x training API
# (create_pipe, begin_training, update(texts, golds, ...)).
def train_spacy(TRAIN_DATA, iterations, model):
    """Train or fine-tune the NER component and return the nlp object."""
    if model is not None:
        print("Training existing model")
        nlp = spacy.load(model)
        print("Loaded model '%s'" % model)
    else:
        print("Creating new model")
        nlp = spacy.blank('en')  # create blank Language class

    # add the NER component if the pipeline does not have one yet
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # register every entity label found in the training data
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        if model is None:
            optimizer = nlp.begin_training()
        else:
            optimizer = nlp.entity.create_optimizer()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)  # note: shuffles the caller's list in place
            losses = {}
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 16.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    losses=losses,
                    sgd=optimizer
                )
            print(losses)
    return nlp

In [None]:
# Train the model for 10 iterations (use fewer iterations for a faster submission)
model = train_spacy(TRAIN_DATA, 10, "en_core_web_sm")  # pass None instead of "en_core_web_sm" to train a blank model

# Make Predictions

In [None]:
# load sample_submission.csv as the submission template
sub = pd.read_csv("../input/coleridgeinitiative-show-us-the-data/sample_submission.csv")

# get the list of publication ids in the test set
test_pubs = sub.Id
test_pubs

In [None]:
for i, pub in enumerate(test_pubs):
    print("pub:", pub)

    with open("../input/coleridgeinitiative-show-us-the-data/test/" + pub + ".json") as f:
        data = json.load(f)

    predictions = []  # dataset mentions found in this publication

    sentences = nltk.tokenize.sent_tokenize(str(data))

    for sentence in sentences:
        sentence = clean_sentence(sentence).strip()
        doc = model(sentence)
        for ent in doc.ents:
            # the pretrained model also predicts other entity types (ORG, PERSON, ...),
            # so keep only the DATASET entities
            if ent.label_ == "DATASET":
                predictions.append(ent.text)

    # the submission format separates predictions with "|"
    predicted_text = "|".join(predictions).strip()
    print("final:", predicted_text)
    # assign through .loc on the DataFrame to avoid chained assignment
    sub.loc[i, "PredictionString"] = predicted_text

In [None]:
# Finally, write the submission file
sub.to_csv('submission.csv', index=False)
sub
