# spaCy Preprocessing for Named Entity Extraction

Before we can train an NER model, we need to transform our existing data into a format suitable for training. I wanted to share this notebook with the Kaggle community as I found the spaCy docs a bit frustrating, and I hope it helps someone get off the ground a little faster!

In [None]:
import os
import json
import pandas as pd
from tqdm import tqdm

In [None]:
train_df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv')
sample_sub = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/sample_submission.csv')
train_files_path = '../input/coleridgeinitiative-show-us-the-data/train'
test_files_path = '../input/coleridgeinitiative-show-us-the-data/test'

### Appending text to training dataframe

In [None]:
def read_append_return(filename, train_files_path=train_files_path, output='text'):
    """Read one publication's JSON file and return its text as a single string.

    output='text' returns the section bodies, 'head' returns the section
    titles, and any other value returns titles and bodies interleaved.
    """
    json_path = os.path.join(train_files_path, filename + '.json')
    headings = []
    contents = []
    combined = []
    with open(json_path, 'r', encoding='utf-8') as f:
        json_decode = json.load(f)
    for data in json_decode:
        # Fall back to '' so a missing field cannot break str.join below
        title = data.get('section_title') or ''
        text = data.get('text') or ''
        headings.append(title)
        contents.append(text)
        combined.extend([title, text])

    all_headings = ' '.join(headings)
    all_contents = ' '.join(contents)
    all_data = '. '.join(combined)

    if output == 'text':
        return all_contents
    elif output == 'head':
        return all_headings
    else:
        return all_data

tqdm.pandas()
train_df['text'] = train_df['Id'].progress_apply(read_append_return)
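
To make the three `output` modes concrete, here is a minimal sketch on a hand-made publication. The `sections` list and the `join_sections` helper are hypothetical stand-ins for a parsed JSON file and for `read_append_return`. The real files are assumed to be a list of objects with `section_title` and `text` keys, as the function above expects.

```python
# Hypothetical stand-in for one parsed train/<Id>.json file
sections = [
    {"section_title": "Abstract", "text": "We study reading scores."},
    {"section_title": "Data", "text": "We use a national survey."},
]

def join_sections(sections, output="text"):
    """Mirror the three return modes of read_append_return on in-memory data."""
    headings = [s.get("section_title") or "" for s in sections]
    contents = [s.get("text") or "" for s in sections]
    if output == "text":
        return " ".join(contents)
    if output == "head":
        return " ".join(headings)
    # Any other value: titles and bodies interleaved, joined with '. '
    combined = [part for pair in zip(headings, contents) for part in pair]
    return ". ".join(combined)

print(join_sections(sections, "head"))
print(join_sections(sections, "text"))
print(join_sections(sections, "all"))
```

Note that the interleaved mode joins with `'. '`, so a section body that already ends in a full stop is followed by a doubled period in the combined string.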

## Import spaCy

Import spaCy and load the English model. We can use it to generate POS (part-of-speech) tags for our text, which categorise each word grammatically as a noun, verb, and so on.

spaCy also lets you use its EntityRuler component to generate entity recognition tags. We will likely need this later, when we build an NER pipeline.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

## Generate patterns for all of our datasets

Here I convert all of the dataset names into spaCy patterns. These patterns are what label entities as DATASET when we pass text through the spaCy pipeline.

Note: All of the dataset names go into a single set of patterns that is run against every incoming text file. This could introduce false positives if a dataset name appears in another document but refers to something other than the dataset.

In [None]:
patterns = []
for dataset in train_df.cleaned_label.unique():
    # One {"LOWER": ...} dict per token, so matching ignores case.
    # LOWER compares against the lowercased token, so the value must be lowercase too.
    phrase = [{"LOWER": word.lower_} for word in nlp(dataset)]
    patterns.append({"label": "DATASET", "pattern": phrase})
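
For reference, each entry in `patterns` is a dict with a label and a list of per-token constraints. The sketch below builds one by hand. It splits on whitespace instead of calling the spaCy tokenizer, which is a simplification (spaCy would also split off punctuation), and both `to_pattern` and the example name are hypothetical.

```python
def to_pattern(name, label="DATASET"):
    """Build an EntityRuler-style token pattern from a dataset name.

    Assumption: whitespace splitting approximates spaCy's tokenizer,
    which only holds for names without punctuation.
    """
    tokens = name.lower().split()
    return {"label": label, "pattern": [{"LOWER": tok} for tok in tokens]}

# Hypothetical dataset name, for illustration only
example = to_pattern("National Education Longitudinal Study")
print(example)
```

Because every constraint uses `LOWER`, the pattern matches the name regardless of how it is capitalised in the text.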

### Remove default NER from spaCy pipeline

The spaCy pipeline we loaded already has an entity recognition component ("ner"), which we need to remove before adding our own.

In [None]:
nlp.remove_pipe("ner")

### Build spaCy Pipeline

Here we add a component to the spaCy pipeline that matches the patterns we created above. Other components can be added to the pipeline too, which may be of interest later in the project.

In [None]:
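from spacy.pipeline import EntityRuler

# spaCy v2 API: the ruler object is built first, then passed to add_pipe.
# (spaCy v3 changed add_pipe to take the component's string name instead.)
ruler = EntityRuler(nlp, overwrite_ents=True)  # ruler matches win over existing entities
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)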
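
The cell above uses the spaCy v2 API. On spaCy v3, components are added by their registered string name, so a roughly equivalent setup might look like the sketch below. It assumes spaCy v3 is installed and uses a blank English pipeline with a single made-up pattern so that it runs without downloading a model. In the notebook you would keep `en_core_web_sm` and the `patterns` list built earlier. The names `nlp_v3`, `ruler_v3` and `demo_patterns` are illustrative only.

```python
import spacy

# Blank pipeline: tokenizer only, so there is no built-in "ner" to remove
nlp_v3 = spacy.blank("en")

# Hypothetical pattern for illustration
demo_patterns = [
    {"label": "DATASET", "pattern": [{"LOWER": "census"}, {"LOWER": "data"}]},
]

# In v3, add_pipe takes the component name and returns the component
ruler_v3 = nlp_v3.add_pipe("entity_ruler", config={"overwrite_ents": True})
ruler_v3.add_patterns(demo_patterns)

doc = nlp_v3("We analysed Census Data from 2010.")
# Collect the matched span text, label, and character offsets
found = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
print(found)
```

The character offsets collected in `found` are the kind of span boundaries that NER training formats typically expect, so a loop like this could be a starting point for building training examples.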

In [None]:
import time
start = time.time()

In [None]:
indexes = [0]

for index in indexes:
    doc = nlp(train_df.iloc[index]['text'].lower())
    print("{: >20} {: >20} {: >20} {: >20}\n".format('TEXT', 'POS', 'TAG', 'ENTITY'))
    for token in doc:
        # ent_type_ (trailing underscore) is the string label; ent_type is its integer hash
        print("{: >20} {: >20} {: >20} {: >20}".format(token.text, token.pos_, token.tag_, token.ent_type_))

In [None]:
end = time.time()
hours, rem = divmod(end-start, 3600)
minutes, seconds = divmod(rem, 60)
print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds))