# spaCy: Training an Additional NER Tag

This notebook trains a new spaCy English model to recognize a custom named entity recognition (NER) tag for dam names. Both CoreNLP's and spaCy's built-in NER tags were inconsistently tagging the spans we would like to identify as possible dam entities.

For training data we used the Stanford BRAT annotation web application and parse the `x.ann` files to create labels. The workflow is a modified version of what the spaCy docs recommend:

```
# Note: If you're using an existing model, make sure to mix in examples of
# other entity types that spaCy correctly recognized before. Otherwise, your
# model might learn the new type, but "forget" what it previously knew.
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
```

In [1]:
import os
import spacy
import pandas as pd
import plac
import random
from pathlib import Path
from sklearn.metrics import classification_report
print('spacy version: %s' % spacy.__version__)

spacy version: 2.0.3


## Training Data 


__Brat mapping__

We extract the labels from the brat annotations. Note: we copied each sentence into the annotation's notes field for easier access, and reused brat's existing NER tags for our own labels:
```
BRAT NER Tag: "PERSON" = River NER 

BRAT NER Tag: "ORGANIZATION" = Dam NER
```
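
As a minimal sketch of how one annotation/note pair maps onto our labels, assuming the usual tab-separated brat standoff layout (ID, `Type start end`, text) and an `AnnotatorNotes` line that carries the copied sentence. The sample lines, the capitalized tag names `Organization` / `Person`, and the helper `parse_pair` are made up for illustration and are not taken from our data.

```python
# Hypothetical sample lines in brat standoff format (not from our corpus).
entity_line = "T1\tOrganization 21 33\tPhilpott Dam\n"
note_line = "#1\tAnnotatorNotes T1\tFlow is regulated by Philpott Dam upstream .\n"

# Assumed tag names; the notebook reuses brat's built-in tags for our labels.
TAG_MAP = {"Organization": "Dam", "Person": "River"}


def parse_pair(entity_line, note_line):
    """Return (sentence, start, end, label) for one entity/note pair."""
    _, type_and_offsets, text = entity_line.rstrip("\n").split("\t")
    brat_tag = type_and_offsets.split()[0]
    sentence = note_line.rstrip("\n").split("\t")[2]
    # Offsets are recomputed relative to the sentence, not the whole document.
    start = sentence.index(text)
    return sentence, start, start + len(text), TAG_MAP[brat_tag]


sentence, start, end, label = parse_pair(entity_line, note_line)
print(label, repr(sentence[start:end]))
```

Recomputing the offsets with `str.index` finds the first occurrence only, so a sentence that repeats the same name would always point at the first mention.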

### Training Data (spaCy) Format

```
[
    ( "sentence", {
        "entities": 
            [(char_start, char_end, "NER_TAG")] 
                  }
     )
]
```
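
For a concrete instance of this format, here is a small sketch with a made-up sentence. The offsets are character positions with an exclusive end, so slicing the sentence with them should give back the entity text; the sentence and the name `TRAIN_EXAMPLE` are hypothetical.

```python
# Hypothetical sentence; offsets are character positions, end-exclusive.
text = "Operations at Smith Mountain Dam changed in spring ."
entity = "Smith Mountain Dam"
start = text.index(entity)
end = start + len(entity)

TRAIN_EXAMPLE = [(text, {"entities": [(start, end, "Dam")]})]

# Sanity check: slicing with the offsets must return the entity text.
for sentence, annotations in TRAIN_EXAMPLE:
    for s, e, label in annotations["entities"]:
        assert sentence[s:e] == entity
        print(label, (s, e), sentence[s:e])
```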

In [2]:
def parse_brat_ann(directory='./data/train'):
    ''' Function to parse brat annotation files and return a dataframe
    
    Parameters
    ----------
    directory : str
        Path to the brat annotation directory.
    
    Returns
    -------
    pandas dataframe
        Containing document identifier, sentence, extraction word, indexes and ner.
    '''
    # List of annotation files in the requested directory
    files = os.listdir(directory)
    
    # Return struct
    document_id, sentences, extractions, start_idxes, end_idxes, ner_tags = [], [], [], [], [], []
        
    for document in files:
        # Skip the raw text files and macOS .DS_Store metadata
        if document.endswith(('.txt', '.DS_Store')):
            continue
        with open(os.path.join(directory, document)) as fh:
            f = fh.readlines()
        for idx, i in enumerate(f):
            if idx % 2 != 1:
                # Even lines hold the entity annotation; the line that
                # follows holds the note with the copied sentence
                lab, ner, extract = i.split('\t')
                extract = extract.rstrip('\n')
                # Map the reused brat tags onto our labels
                ner = 'Dam' if ner.split()[0] == 'Organization' else 'River'
                sentence = f[idx + 1].split('\t')[2]
                # Character offsets of the extraction within the sentence
                start_idx = sentence.index(extract)
                end_idx = start_idx + len(extract)

                # Toss into list to build dataframe
                ner_tags.append(ner)
                document_id.append(document[:-4])
                sentences.append(sentence)
                extractions.append(extract)
                start_idxes.append(start_idx)
                end_idxes.append(end_idx)
    return pd.DataFrame({'docid': document_id, 
                         'ner': ner_tags,
                         'sentence': sentences,
                         'extraction': extractions,
                         'start_idx': start_idxes, 
                         'end_idx': end_idxes})
df = parse_brat_ann()

# Spot-check the character offsets against the sentence text
# df.iloc[0]['sentence'][140:178]
df.head(2)

| | docid | end_idx | extraction | ner | sentence | start_idx |
|---|---|---|---|---|---|---|
| 0 | 54d4f5abe138238471e7f468 | 337 | South Fork Diversion Dam | Dam | Diversion dams were identified using comments ... | 313 |
| 1 | 54d4f5abe138238471e7f468 | 31 | Smith River | River | For example " " the Smith River " " VA at USGS... | 20 |


#### Keep Only the Dam-Tagged Labels

In [3]:
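# Keep only the rows labelled as dams; note that `test` is reused as the
# input to the train/test split in the next cell
test = df[df['ner'] == 'Dam']
print('%d Dam Labels\n' % test.shape[0])
test.head()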

109 Dam Labels



| | docid | end_idx | extraction | ner | sentence | start_idx |
|---|---|---|---|---|---|---|
| 0 | 54d4f5abe138238471e7f468 | 337 | South Fork Diversion Dam | Dam | Diversion dams were identified using comments ... | 313 |
| 2 | 54d4f5abe138238471e7f468 | 100 | Philpott Dam | Dam | For example " " the Smith River " " VA at USGS... | 88 |
| 4 | 54d4f5abe138238471e7f468 | 101 | Roanoke Rapids Dam | Dam | In contrast " " the Roanoke River " " NC -LRB-... | 83 |
| 6 | 54d4f5abe138238471e7f468 | 133 | John H . Kerr Dam | Dam | Although many of the dams in the Roanoke are u... | 116 |
| 7 | 54d4f5abe138238471e7f468 | 48 | Smith Mountain Dam | Dam | In addition " " operations at Smith Mountain D... | 30 |


### Train Test Split

In [4]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(test, test_size=0.2)
print('Training: %d' %train.shape[0])
print('Testing: %d' %test.shape[0])

Training: 87
Testing: 22
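
The split above is made row by row and without a fixed seed (`train_test_split` accepts a `random_state` argument for that), so the same dam name can end up in both sets. If that matters for the evaluation, one option is to split on the dam names instead. The sketch below is only an illustration in plain Python: the helper `split_by_name`, its parameters, and the sample rows are hypothetical and were not part of this notebook's run.

```python
import random


def split_by_name(rows, test_fraction=0.2, seed=0):
    """Split rows so that each extraction string lands in only one set."""
    names = sorted({row["extraction"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(names)
    n_test = max(1, round(len(names) * test_fraction))
    test_names = set(names[:n_test])
    train_rows = [r for r in rows if r["extraction"] not in test_names]
    test_rows = [r for r in rows if r["extraction"] in test_names]
    return train_rows, test_rows


# Hypothetical rows standing in for the dataframe records.
rows = [{"extraction": name} for name in
        ["Edwards Dam", "Edwards Dam", "Philpott Dam", "Dwinnell Dam",
         "Dwinnell Dam", "Iron Gate Dam", "Plainwell Dam"]]
train_rows, test_rows = split_by_name(rows)
print(len(train_rows), len(test_rows))
```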


### spaCy data format

In [5]:
TRAIN_DATA = []
TEST_DATA = []

# Each row becomes (sentence, {'entities': [(start, end, label)]})
for _, row in train.iterrows():
    TRAIN_DATA.append((row['sentence'],
                       {'entities': [(row['start_idx'], row['end_idx'], row['ner'])]}))
for _, row in test.iterrows():
    TEST_DATA.append((row['sentence'],
                      {'entities': [(row['start_idx'], row['end_idx'], row['ner'])]}))
#TRAIN_DATA[1]
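
The cell above emits one training example per dataframe row, so a sentence that mentions two annotated dams would appear twice, each time with only one of them marked. As a hedged sketch (not what this notebook ran), the rows could instead be grouped by sentence so that each example carries all of its entities. The helper `group_by_sentence` and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_by_sentence(rows):
    """Merge rows sharing a sentence into one (text, {'entities': [...]}) tuple."""
    grouped = defaultdict(list)
    for row in rows:
        span = (row["start_idx"], row["end_idx"], row["ner"])
        if span not in grouped[row["sentence"]]:  # drop exact duplicates
            grouped[row["sentence"]].append(span)
    return [(text, {"entities": sorted(spans)})
            for text, spans in grouped.items()]


# Hypothetical sentence with two dam mentions, one row per mention.
sentence = "Edwards Dam and Iron Gate Dam were both removed ."
rows = []
for name in ["Edwards Dam", "Iron Gate Dam"]:
    start = sentence.index(name)
    rows.append({"sentence": sentence, "start_idx": start,
                 "end_idx": start + len(name), "ner": "Dam"})

GROUPED_DATA = group_by_sentence(rows)
print(GROUPED_DATA)
```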

## Train

In [None]:
# Adapted from the spaCy v2 example for training a new entity type;
# see the spaCy docs for more info
LABEL = 'Dam'
def main(model=None, new_model_name='dam-ner', output_dir='./dam-ner-spacy', n_iter=100):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")

    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe('ner')

    ner.add_label(LABEL)   # add new entity label to entity recognizer

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update([text], [annotations], sgd=optimizer, drop=0.35,
                           losses=losses)
            print(losses)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta['name'] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        #print("Loading from", output_dir)
        #nlp2 = spacy.load(output_dir)
        #doc2 = nlp2(test_text)
        #for ent in doc2.ents:
            #print(ent.label_, ent.text)

main()

In [6]:
# quick test
nlp2 = spacy.load('./dam-ner-spacy')

doc2 = nlp2('There was a dam, the Wiefrich Dam was removed in 2010.')
for ent in doc2.ents:
    print('Label: %s' %ent.label_, '\n\nText: %s' %ent.text)

Label: Dam 

Text: Wiefrich Dam


## Testing

In [15]:
found = 0
pred = []
for i in test.iterrows():
    doc = nlp2(i[1]['sentence'])
    
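    # Compare every predicted entity in the sentence with the gold extraction.
    # Note: pred gets one entry per predicted entity, not one per row, so its
    # length can differ from the number of test rows.
    for ent in doc.ents: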
        if i[1]['extraction'] == ent.text:
            print(ent.text)
            found += 1
            pred.append(ent.text)
        else: 
            print('Missed: ', i[1]['extraction'])
            pred.append('')
        #print('Label: %s' %ent.label_, '\n\nText: %s' %ent.text)
print(found)

Missed:  Otsego City Dam
Mississippi River Lock and Dam No . 26
Plainwell Dam
Roanoke Rapids Dam
Dwinnell Dam
Edwards Dam
Aswan High Dam
Roanoke Rapids Dam
Iron Gate Dam
Otsego City Dam
Otsego City Dam
Dwinnell Dam
Dwinnell Dam
Dwinnell Dam
Otsego City Dam
Dwinnell Dam
Mississippi River Lock and Dam No . 26
Missed:  Otsego City Dam
Otsego City Dam
Dwinnell Dam
Edwards Dam
Philpott Dam
Missed:  Philpott Dam
20


In [22]:
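# pred holds 23 entries for 22 test rows, so the last entry is dropped to
# match lengths; the row-by-row alignment is therefore only approximate
print(classification_report(test['extraction'], pred[:-1]))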

                                        precision    recall  f1-score   support

                                             0.00      0.00      0.00         0
                        Aswan High Dam       1.00      1.00      1.00         1
                          Dwinnell Dam       1.00      1.00      1.00         6
                           Edwards Dam       1.00      1.00      1.00         2
                         Iron Gate Dam       1.00      1.00      1.00         1
Mississippi River Lock and Dam No . 26       1.00      1.00      1.00         2
                       Otsego City Dam       1.00      0.67      0.80         6
                          Philpott Dam       1.00      1.00      1.00         1
                         Plainwell Dam       1.00      1.00      1.00         1
                    Roanoke Rapids Dam       1.00      1.00      1.00         2

                           avg / total       1.00      0.91      0.95        22
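
The report above treats every dam name as its own class and relies on `pred[:-1]` to line the two lists up, so the numbers are an approximation. A more direct measure is span-level precision, recall and F1 over (sentence, entity text) pairs. The sketch below shows the arithmetic only; the function `span_prf` and the gold/predicted pairs are hypothetical, not results from this model.

```python
def span_prf(gold, predicted):
    """Precision, recall and F1 over sets of (sentence_id, entity_text) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical example: three gold spans, three predictions, two exact matches.
gold = [(0, "Edwards Dam"), (1, "Philpott Dam"), (2, "Otsego City Dam")]
predicted = [(0, "Edwards Dam"), (1, "Philpott Dam"), (2, "Otsego City")]
precision, recall, f1 = span_prf(gold, predicted)
print("P=%.2f R=%.2f F1=%.2f" % (precision, recall, f1))
```

In this notebook the gold pairs could be built from each test row's index and `extraction`, and the predicted pairs from the same index and each `ent.text` in `nlp2(sentence).ents`.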





### Test on development document

In [23]:
example_doc = nlp2(open('/Users/Shared/text/Amos_2008_Thesis.txt').read())

for ent in example_doc.ents:
    print(ent.text)

Low Head
Dam
Pete Thompson
Methods of Dam
Passive Dam
Channels Following Dam
Passive Dam
Passive Dam
Stochastic Incipient Motion Criterion
Open Channel Flow
Water Data Coordination
Hawkesville Dam
Hawkesville Dam
Huttonville Dam
Huttonville Dam
ELEVATION(m)
Greenfield Dam
