In [None]:
#all_flags

# 41-generic-framework-for-spacy-training
> Creating the framework for working with Spacy models

**Purpose**  The purpose of this notebook is provide basic code blocks for the following Spacy tasks:
1. [Training](#Training) - train Spacy NER model with pre-labelled text 
2. [Testing](#Testing) - test Spacy NER model

Based on https://spacy.io/usage/training#example-train-ner

The functionality will also allow users to load and save models as desired.

In [None]:
#default_exp modeling

In [15]:
#export
#no_test
#dependencies

#nlp packages
import nltk
import spacy
from spacy.util import minibatch, compounding
from spacy.training import Example

#ssda modules for testing
from ssda_nlp.collate import genSpaCyInput

# manipulation of tables/arrays
import pandas as pd
import numpy as np

# helpers
import random
import warnings
from pathlib import Path

## Modeling with Spacy

The following code describes how we can train a Spacy model.  When loading the model, it expects a file directory or pretrained model (available through Spacy) from which to start from.  If one is not provided, a blank, untrained model will be loaded with some pre-existing pipeline components.  The following components are what are expected as inputs to the following modules.

* training_data = list of tuples (string, JSON of labels).  The demo training data below shows the expected format.  
* output_dir = "/data/p_dsi/ssda/models/model_name"
* example model_name = "es_ssda_sm" per spaCy naming convention: https://spacy.io/models#conventions 

We can also indicate the languages of interest.  We'll be using Spanish and Portugese as indicated below:
* Spanish language code = 'es'
* Portuguese language code = 'pt'



In [16]:
#no_test
spacy.util.fix_random_seed()

In [17]:
#no_test

# demo training data
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

# demo traing data in dataframe form
test_ids = [12]*2 + [31]
test_text = ['I like London and Berlin']*2 + ['Who is Shaka Khan?']
test_ents = ['London', 'Berlin', 'Shaka Khan']
test_types = ['LOC', 'LOC', 'PERSON']
test_starts = [7,18,7]
test_ends = [13,24,17]
test_df = pd.DataFrame({'ID': test_ids, 'text':test_text, 'entity':test_ents, 'label':test_types,
                        'start':test_starts, 'end':test_ends})
test_df.head()

Unnamed: 0,ID,text,entity,label,start,end
0,12,I like London and Berlin,London,LOC,7,13
1,12,I like London and Berlin,Berlin,LOC,18,24
2,31,Who is Shaka Khan?,Shaka Khan,PERSON,7,17


### Loading the model

The following code allows you to load a model.  Note that the model is not a specific file, but a directory or a Spacy model name.  When we load a blank model, we go ahead and initialize the weights with the functionality defined in `begin_training`.  This will allow us to call `resume_training` for any model type during training.

In [18]:
#export 

def load_model(model=None, language="en", verbose=False):
    '''
    Load the Spacy model or create blank model
        model: (default None) directory of any existing model or named Spacy model
        language: (default 'en') two-letter Spacy language code, default is English
        verbose: (default False) boolean reflecting whether to print status of model loading
            
        returns: Spacy Language object
    '''
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model 
        if verbose: print("Loaded model '%s'" % model)
    else:
        #Create new model
        nlp = spacy.blank(language)  # create blank Language model
        
        # defaults to English, unless different language passed to function
        if verbose: print("Created blank '" + language)
        
    return nlp

#### Unit testing: `load model`
Here, I'll make sure the execution paths work as expected.  I'll work on the the actual nlp object returned when I unit test `train_model`.

In [20]:
#no_test
#Ensure loading the model works with spacy model
nlp_from_spacy = load_model('es_core_news_sm', verbose=True)
type(nlp_from_spacy)

Loaded model 'es_core_news_sm'


spacy.lang.es.Spanish

In [21]:
#no_test
#Ensure loading the model works with blank model
nlp_blank = load_model(language='en', verbose=True)
type(nlp_blank)

Created blank 'en


spacy.lang.en.English

In [22]:
#no_test
#Make sure verbose is working
nlp_verb = load_model(model='es_core_news_sm')

Everything seems to be working as expected.

## Training the model
The `train_model` function allows us to train the named entity recognition (NER) pipeline component of the NLP object.

In [23]:
#export    

def train_model(nlp, training_data, n_iter=100, dropout=0.5, compound_params=None, solver_params=None):
    
    '''
    Train the `ner` component of the provided Language object (model)
        nlp: Language object (model) - blank or pretrained.  Usually created by `load_model`.
        training_data: pre-labelled training data in Spacy format
        n_iter: (default 100)  Integer number of training iterations
        dropout: (default 0.5)  Float value of ratio of dropout to prevent network memorization.
        compound_params: (default None) dictionary of keys `start`, `end`, and `cp_rate` (defaults 4, 32, and 1.001).  Refer to the starting 
            number of elements in the batch (start), the maximum number of elements in a batch (end), and the multiplier of `start` to do the 
            compounding (cp_rate).  Pass in a dictionary of one or more of these keys to change those specific default parameters.
        solver_params: (default None) dictionary of keys relating to the parameters of the Adam solver.  Allowable parameters include:
            `learn_rate`, `b1`, `b2`, `L2`, `max_grad_norm`, with defaults 0.001, 0.9, 0.999, 1e-6, and 1.0.  See
            https://github.com/explosion/spaCy/blob/master/spacy/_ml.py : `create_default_parameters` for more information.
        
        returns: trained Language object, pandas Dataframe object with losses per iteration
    '''
    
    # create the built-in pipeline components and add them to the pipeline
    # use ner_init for blank models whose ner weights will need to be initialized later
    ner_init = False
    if "ner" not in nlp.pipe_names:
        #ner = nlp.create_pipe("ner")
        nlp.add_pipe("ner", last=True)
        ner_init = True
        ner = nlp.get_pipe("ner")
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")
    
    for _, annotations in training_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    
    # only train NER
    with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')
        
        # initialize the weights if a blank model was passed in, otherwise, use existing weights.
        if ner_init:
            nlp.begin_training()
        else:
            nlp.resume_training()
            
        # set optimizer values to be the ones passed in if desired
        allowable_params = ['learn_rate', 'b1', 'b2', 'L2', 'max_grad_norm']
        if solver_params is not None:
            for key, val in solver_params.items():
                if key in allowable_params:
                    setattr(nlp._optimizer, key, val)
                else:
                    raise ValueError('Key "{0}" not supported for solver. Only values {1} allowed'.format(key, allowable_params))
        
        # set compounding parameters to be passed in and ensure they are floats
        cp_params = {'start': 4.0, 'end': 32.0, 'cp_rate': 1.001}
        if compound_params is not None:
            for key, val in compound_params.items():
                if key in cp_params:
                    cp_params[key] = float(compound_params[key])
                else:
                    raise ValueError('Key "{0}" not supported for compounding.  Only `start`, `end`, `cp_rate` allowed.'.format(key))
        
        # create dataframe to be returned
        losses_df = pd.DataFrame(np.zeros(shape=(n_iter, 1)), columns=['epoch_loss'])
        
        # train batches of data for n_iter iterations
        examples = []
        for text, annots in training_data:    
            examples.append(Example.from_dict(nlp.make_doc(text), annots))
            #nlp.initialize(lambda: examples)
        
        for itn in range(n_iter):
            random.shuffle(examples)
            losses = {}
            
            # Create variable size minibatch
            batches = minibatch(examples, size=compounding(cp_params['start'], cp_params['end'], cp_params['cp_rate']))
            for batch in batches:                
                #implement dropout decay?
                nlp.update(
                    batch,
                    drop = dropout,  
                    losses = losses,
                )               
                
            # Update df with loss stats
            losses_df.loc[itn, 'epoch_loss'] = losses['ner']
     
    return nlp, losses_df

#### Unit testing: `train_model`: reproducibility
The purpose of this section is to explore how we can create reproducible models.  This should essentially come down to setting a seed, but let's make sure that is actually the case.

In [24]:
#no_test
#load load model 1
nlp1 = load_model()
nlp1, losses1 = train_model(nlp1, TRAIN_DATA, n_iter=10)
print("Losses from model 1:")
display(losses1)

Losses from model 1:


Unnamed: 0,epoch_loss
0,9.899999
1,9.654533
2,9.307241
3,9.048512
4,8.749421
5,8.319594
6,8.014254
7,7.699785
8,7.399865
9,6.83906


The results should be as follows for an initial run:

| epoch_loss |
| ----------- |
| 9.899999 |
| 9.727400 |
| 9.412539 |
| 9.161101 |
| 8.956753 |
| 8.842925 |
| 8.512361 |
| 7.856238 |
| 7.357123 |
| 6.403693 |

And they are!  However, this functionality may suffer from the behavior described in [this Spacy issue](https://github.com/explosion/spaCy/issues/5551) which we have an issue for [in our repo.](https://github.com/vanderbilt-data-science/ssda-entity-extraction/issues/84)

#### Unit testing: `train_model`: solver parameters
This uses the models previously created.  All tests below run as expected.

In [None]:
#no_test
def validate_opt_params(opt, test_params):
    '''
    validate_opt_params: a simple function helper making sure the passed in parameters are the same as
        what was used in the default Adam solver
        opt: Spacy Language object (model) optimizer
        test_params: dictionary of solver parameters to be tested against
    '''
    nlp_opt_params = {'learn_rate' : opt.learn_rate, 'b1':opt.b1, 'b2':opt.b2, 'L2':opt.L2,
                      'max_grad_norm':opt.max_grad_norm}
    
    #fix test parameters to make sure they add any missing fields;
    #this keeps the values of test_params if they also appear in nlp_opt_params
    all_test_params = {**nlp_opt_params, **test_params}
    
    #make sure the result is correct
    assert nlp_opt_params == all_test_params
    
    #show dataframe for sanity check
    display(pd.DataFrame({'optimizer_params':nlp_opt_params, 'user_params':all_test_params}))

In [None]:
#no_test
# Original values for optimizer
nlp2, losses = train_model(nlp_from_spacy, TRAIN_DATA, n_iter=5, solver_params = None)
validate_opt_params(nlp2._optimizer, {})

Unnamed: 0,optimizer_params,user_params
learn_rate,0.001,0.001
b1,0.9,0.9
b2,0.999,0.999
L2,1e-06,1e-06
max_grad_norm,1.0,1.0


In [None]:
#no_test
# Ensure that setting parameters works correctly
solver_params_correct = {'learn_rate':0.003, 'b1':0.8, 'b2':0.99, 'L2':1e-5, 'max_grad_norm':1.1}
nlp2, losses = train_model(nlp_from_spacy, TRAIN_DATA, n_iter=5, solver_params = solver_params_correct)
validate_opt_params(nlp2._optimizer, solver_params_correct)

Unnamed: 0,optimizer_params,user_params
learn_rate,0.003,0.003
b1,0.8,0.8
b2,0.99,0.99
L2,1e-05,1e-05
max_grad_norm,1.1,1.1


In [None]:
#no_test
# Ensure that it works for only a few specified parameters
solver_params_sub = {'learn_rate':0.004}
nlp2, losses = train_model(nlp_from_spacy, TRAIN_DATA, n_iter=5, solver_params = solver_params_sub)
validate_opt_params(nlp2._optimizer, solver_params_sub)

Unnamed: 0,optimizer_params,user_params
learn_rate,0.004,0.004
b1,0.9,0.9
b2,0.999,0.999
L2,1e-06,1e-06
max_grad_norm,1.0,1.0


Take a look at the result above.  When we trained the same model and modified only the learning rate, we actually maintained the same optimizer parameters from the previous call to the model.  This makes intuitive sense, and it is nice to see a confirmation of this result.  This is something we'll have to keep in mind, though.

In [None]:
#no_test
# Make sure it works for the blank models too; first lets make sure it maintains the defaults
solver_params_correct = {'learn_rate':0.004}
nlp2, losses = train_model(nlp_blank, TRAIN_DATA, n_iter=5, solver_params = solver_params_correct)
validate_opt_params(nlp2._optimizer, solver_params_correct)

Unnamed: 0,optimizer_params,user_params
learn_rate,0.004,0.004
b1,0.9,0.9
b2,0.999,0.999
L2,1e-06,1e-06
max_grad_norm,1.0,1.0


In [None]:
#no_test
# Make sure it works for the blank models for whole parameter list
solver_params_correct = {'learn_rate':0.003, 'b1':0.8, 'b2':0.99, 'L2':1e-5, 'max_grad_norm':1.1}
nlp2, losses = train_model(nlp_blank, TRAIN_DATA, n_iter=5, solver_params = solver_params_correct)
validate_opt_params(nlp2._optimizer, solver_params_correct)

Unnamed: 0,optimizer_params,user_params
learn_rate,0.003,0.003
b1,0.8,0.8
b2,0.99,0.99
L2,1e-05,1e-05
max_grad_norm,1.1,1.1


In [None]:
#no_test
# Ensure it fails for wrong values
solver_params_incorrect = {'lr':0.004}
try:
    nlp2, losses = train_model(nlp_from_spacy, TRAIN_DATA, n_iter=5, solver_params = solver_params_incorrect)
except Exception as e:
    warnings.warn('Failed with exception: {0}'.format(e))

  import sys


#### Unit testing: `train_model`: compounding
Here, I'm just going to make sure the parameters are accepted correctly.  All tests appear to run correctly.

In [None]:
#no_test
#compounding parameters
cmp_params_correct = {'start':5, 'end':20, 'cp_rate':2}
cmp_params_sub = {'start':6}
cmp_params_incorrect = {'begin':10}

In [None]:
#no_test
#for full parameters
nlp2, losses = train_model(nlp_from_spacy, TRAIN_DATA, n_iter=5, compound_params = cmp_params_correct)

In [None]:
#no_test
#for partially specified parameters
nlp2, losses = train_model(nlp_from_spacy, TRAIN_DATA, n_iter=5, compound_params = cmp_params_sub)

In [None]:
#no_test
# incorrectly specified fields should fail
try:
    nlp2, losses = train_model(nlp_from_spacy, TRAIN_DATA, n_iter=5, compound_params = cmp_params_incorrect)
except Exception as e:
    warnings.warn('Failed with exception: {0}'.format(e))

  


#### Unit testing: `train_model`: losses

In [None]:
#no_test
nlp_blank = load_model()
nlp_ls, loss_df = train_model(nlp_blank, TRAIN_DATA*10, n_iter=5)
loss_df

Unnamed: 0,epoch_loss
0,94.074417
1,75.67691
2,50.471448
3,46.0485
4,40.819608


As you can see, the train data has been replicated 10 times so that a batch size greater than 4 could be obtained.  Here, you see the total batch loss per iteration as `epoch_loss`.

## Saving the model
Save model requires a directory and the model (pipeline) of interest.

In [None]:
#export 

def save_model(nlp_model, output_dir):

    '''
    Save the Language object model to directory specified by `output_dir`
       nlp_model: Language (pipeline) object
       output_dir: output directory string - relative or absolute
       returns: none - saves model to directory and prints directory
    '''
    
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir(parents=True)
        nlp_model.to_disk(output_dir)  # nlp.to_disk('/data/p_dsi/ssda/models/model_name')
        print("Saved model to", output_dir)

    return

#### Unit testing: `save_model`
By default, the line `output_dir.mkdir()` requires a `parents=True` statement if any parent files of the directory need to be created.  For example, if I need to create `models/test_model` and I don't have `models` already created, the call will fail.  Verified that this works will full filepaths as well.

In [None]:
#no_test
save_dir = 'new_dir/test_model'
save_model(nlp2, save_dir)

Saved model to new_dir\test_model


In [None]:
#no_test
blank_model = load_model(language="es")

In [None]:
#no_test
save_model(blank_model, "blank_spanish")

Saved model to blank_spanish


## Testing the model

### Code development
Here, we test the model perfomance using *labeled* data.  This function will return performance metrics as well as a dataframe of predictions.  The anticipated use case will be that we will operate on the original dataframe of labelled entities with the source identifier present.  I use the `test_df` created in the beginning cell with `TRAIN_DATA`.

In [None]:
#no_test
# create new model
nlp_en = load_model()
nlp_en, loss_df = train_model(nlp_en, TRAIN_DATA, n_iter=50)

We can process directly on this dataframe using `apply`.

In [32]:
#export
def _test_model(row, nlp_model, id_colname, text_colname):
    '''
    Internal helper function which uses the nlp model to predict the `text` rows of the dataframe and match with their ID.
        *shouldn't be used directly; is used with an `apply` statement for a pandas DF
        row: row of dataframe which contains at least columns `ID` and `text`
        nlp_model: Spacy Language object (model) to perform predictions
        id_colname: String name of the column where the ID is stored
        text_colname: String name of the column where the text is stored
        returns: Long pandas dataframe with columns reflecting entities recognized, entity types, and spans
    '''
    
    doc = nlp_model(row[text_colname])
    doc_ents = [(ents.text, ents.label_, ents.start_char, ents.end_char) for ents in doc.ents]
    
    # Make sure that actual entities were extracted or else, return None for everything
    if doc_ents != []:
        pred_ent_names, pred_ent_types, pred_ent_start, pred_ent_end = zip(*doc_ents)
        res_sz = len(pred_ent_names)
    else:
        pred_ent_names, pred_ent_types, pred_ent_start, pred_ent_end = None, None, None, None
        res_sz = 1
    
    df_res = pd.DataFrame({id_colname: [row[id_colname]]* res_sz, #to fix when doc_ents is None
                           'pred_entity': pred_ent_names,
                           'pred_label': pred_ent_types,
                           'pred_start': pred_ent_start,
                           'pred_end': pred_ent_end})
    return df_res

In [None]:
#no_test
# Get unique entries in the test_df
entries_df = test_df[['ID', 'text']].drop_duplicates()
display(entries_df.head())

#Get predicted entities and show
preds = entries_df.apply(_test_model, axis=1, args=[nlp_en, 'ID', 'text']).tolist()
preds_df = pd.concat(preds, ignore_index=True)
preds_df.head()

Unnamed: 0,ID,text
0,12,I like London and Berlin
2,31,Who is Shaka Khan?


Unnamed: 0,ID,pred_entity,pred_label,pred_start,pred_end
0,12,London,LOC,7,13
1,12,Berlin,LOC,18,24
2,31,Shaka Khan,PERSON,7,17


What's interesting is that now, this `preds_df` can be joined with `test_df`, and some further analysis of the performance can be done.  An example of this join is shown below:

In [None]:
#no_test
res_df = pd.merge(test_df, preds_df, left_on=['ID', 'entity', 'label'], right_on=['ID', 'pred_entity', 'pred_label'], how='outer')
res_df

Unnamed: 0,ID,text,entity,label,start,end,pred_entity,pred_label,pred_start,pred_end
0,12,I like London and Berlin,London,LOC,7,13,London,LOC,7,13
1,12,I like London and Berlin,Berlin,LOC,18,24,Berlin,LOC,18,24
2,31,Who is Shaka Khan?,Shaka Khan,PERSON,7,17,Shaka Khan,PERSON,7,17


### Module code version
Here, the `evaluate` and `Scorer` have been utilized straightforwardly from Spacy.
- For more information on `evaluate`, see here:  [Language.evaluate](https://spacy.io/api/language#evaluate)
- For more information on `Scorer`, see here: [Scorer](https://spacy.io/api/scorer)

In [49]:
#export

# test the trained model
def test_model(nlp_model, testing_df, id_colname, text_colname, score_model=True):    
    '''
    Use the model to predict the entities and labels of the testing data using the model specified; optionally return precision/recall metrics 
        nlp_model: nlp object to be evaluated
        testing_df: original dataframe with columns `ID`, `text`, `entity*`, `label*`, `start*`, and `end*`.
            `ID`: identifier for each entry/text
            `text: the text of each text
            `entity`: the person, place, etc within the text to be identified (note: this data frame will be long since there can be
                multiple entities for the same entry)
            `label`: the entity type (e.g., PER, LOC)
            `start`: starting character of the entity in the entry
            `end`: one past the ending character of the entity in the entry
            `*` indicates columns that are required only if score_model=True
        id_colname: String name of the column where the entry ID is stored
        text_colname: String name of the column where the text is stored
        score_model: (default True) boolean indicating whether to return precision/recall metrics associated with the model predictions.  Must pass
            *'ed columns in the dataframe.

        returns: dataframe of prediction results,
                 dataframe of overall precision, recall, and fscore (None if score_model is False)
                 dataframe of per-entity precision, recall, and fscore (None if score_model is False)
    '''
    
    # get unique entries/texts from the dataframe as a dataframe
    # using id_colname and text_colname both here should be OK as they are 1:1
    unique_entries_df = testing_df[[id_colname, text_colname]].drop_duplicates()
    
    # Get predicted entities
    preds = unique_entries_df.apply(_test_model, axis=1, args=[nlp_model, id_colname, text_colname]).tolist()
    preds_df = pd.concat(preds, ignore_index=True)
    
    # Get model performance
    pred_metrics = None
    per_ent_metrics = None
    
    if score_model:
        spacy_test_data = genSpaCyInput(testing_df)
        
        examples = []
        for text, annots in spacy_test_data:    
            examples.append(Example.from_dict(nlp_model.make_doc(text), annots))
        
        nlp_scorer = nlp_model.evaluate(examples)

        # Build dataframe of results
        pred_metrics = pd.DataFrame({'precision': [nlp_scorer['ents_p'] * 100],
                                     'recall': [nlp_scorer['ents_r'] * 100],
                                     'f_score': [nlp_scorer['ents_f'] * 100],
                                    })

        per_ent_metrics = pd.DataFrame({**nlp_scorer['ents_per_type']})
    
    return preds_df, pred_metrics, per_ent_metrics    

#### Unit testing: `test_model`

Below, we can see the expected use case, performance of the model, and expected outputs.  Here, we use a blank english model and train it 10-50 iterations.  I've verified that when no entities are extracted, the code still works fine.

In [52]:
#no_test
#load and train model
nlp_en = load_model(verbose=False)
nlp_en, losses = train_model(nlp_en, TRAIN_DATA, n_iter=30)

#test model
ent_preds_df, metrics_df, per_ent_metrics = test_model(nlp_en, test_df, 'ID', 'text')

#show performance
print('Predicted entities:')
display(ent_preds_df)

print('\nOverall performance:')
display(metrics_df)

print('\nEntity-specific performance:')
display(per_ent_metrics)

Predicted entities:


Unnamed: 0,ID,pred_entity,pred_label,pred_start,pred_end
0,12,,,,
1,31,,,,



Overall performance:


Unnamed: 0,precision,recall,f_score
0,0.0,0.0,0.0



Entity-specific performance:


Unnamed: 0,LOC,PERSON
p,0.0,0.0
r,0.0,0.0
f,0.0,0.0


Now, I try it without the scorer and with a subset of the columns of the dataframe, which works as expected.

In [None]:
#no_test
#test model
ent_preds_df,_,_ = test_model(nlp_en, test_df[['ID', 'text']], 'ID', 'text', score_model=False)

#show performance
print('Predicted entities:')
display(ent_preds_df)

Predicted entities:


Unnamed: 0,ID,pred_entity,pred_label,pred_start,pred_end
0,12,London,LOC,7.0,13.0
1,31,,,,


## Entire pipeline example

In [None]:
#no_test
save_dir = 'test_dir/en_model'

#load and train model
new_model = load_model(verbose=False)
new_model, losses = train_model(new_model, TRAIN_DATA, n_iter=50)
print('Training losses:')
display(losses.head())

Training losses:


Unnamed: 0,epoch_loss
0,9.899999
1,9.730163
2,9.271213
3,9.219586
4,8.519299


In [None]:
#no_test
#test model
pred_df, mets_df, per_ent_df = test_model(new_model, test_df, 'ID', 'text')
print('Entity predictions:')
display(preds_df)
print('\nOverall performance:')
display(mets_df)
print('\nPer entity performance:')
display(per_ent_df)

#save model
save_model(new_model, save_dir)

Entity predictions:


Unnamed: 0,ID,pred_entity,pred_label,pred_start,pred_end
0,12,London,LOC,7,13
1,12,Berlin,LOC,18,24
2,31,Shaka Khan,PERSON,7,17



Overall performance:


Unnamed: 0,precision,recall,f_score
0,66.666667,66.666667,66.666667



Per entity performance:


Unnamed: 0,LOC,PERSON
p,100.0,0.0
r,100.0,0.0
f,100.0,0.0


Saved model to test_dir\en_model


In [53]:
#no_test
from nbdev.export import notebook2script
notebook2script();

Converted 12-ssda-xml-parser.ipynb.
Converted 31-collate-xml-entities-spans.ipynb.
Converted 33-split-data.ipynb.
Converted 41-generic-framework-for-spacy-training.ipynb.
Converted 42-initial-model.ipynb.
Converted 51-data-preprocessing.ipynb.
Converted 52-unstructured-to-markup.ipynb.
Converted 53-markup-to-spatial-historian.ipynb.
Converted 54-utility-functions.ipynb.
Converted 61-prodigy-output-training-demo.ipynb.
Converted 62-full-model-application-demo.ipynb.
Converted 63-pt-model-training.ipynb.
Converted 64-es-model-training.ipynb.
Converted 65-all-annotations-model-training.ipynb.
Converted 66-es-guatemala-model-training.ipynb.
Converted 67-death-and-birth-records-together.ipynb.
Converted 70-exhaustive-training.ipynb.
Converted 71-relationship-builder.ipynb.
Converted 72-full-volume-processor.ipynb.
