# Adding Custom Components and Factories to spaCy and Packaging Model

In [1]:
import spacy
from spacy.language import Language



## Video

In [4]:
%%HTML
<center><iframe width="560" height="315" src="https://www.youtube.com/embed/2xi9SvgFLks" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></center>

## Creating the Custom Component

Here we will add our custom component. In spaCy there are two ways to do this: as a class or as a function. For a class, you register a **factory** with `@Language.factory`. For a function, you register a **component** with `@Language.component`. Our component needs no settings or state, so a function is enough here.

In [2]:
# Remove PERSON entities from the pipeline's output
@Language.component("remove_person")
def remove_person(doc):
    final = []
    for ent in doc.ents:
        # Keep every entity that is not labeled PERSON
        if ent.label_ != "PERSON":
            final.append(ent)
    doc.ents = final
    return doc
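
For comparison, the class-based route could look roughly like the sketch below. It is not part of the original notebook: the factory name `remove_label`, the class `LabelRemover`, and the `label` setting are hypothetical names chosen for illustration. A blank pipeline and hand-set entities are used so the sketch runs without a downloaded model.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span


class LabelRemover:
    """Drop every entity whose label matches `label`."""

    def __init__(self, label: str):
        self.label = label

    def __call__(self, doc):
        doc.ents = [ent for ent in doc.ents if ent.label_ != self.label]
        return doc


# Hypothetical factory: spaCy passes `nlp` and `name`, plus our own setting.
@Language.factory("remove_label", default_config={"label": "PERSON"})
def create_label_remover(nlp, name, label: str):
    return LabelRemover(label)


# Blank pipeline: no trained model needed for this illustration.
nlp = spacy.blank("en")
nlp.add_pipe("remove_label", config={"label": "GPE"})

# Set two entities by hand, then run only our component on the doc.
doc = nlp.make_doc("Tom lives in the USA.")
doc.ents = [Span(doc, 0, 1, label="PERSON"), Span(doc, 4, 5, label="GPE")]
doc = nlp.get_pipe("remove_label")(doc)
remaining = [(ent.text, ent.label_) for ent in doc.ents]
print(remaining)
```

The main reason to prefer a factory is configuration: the same class can be added with a different `label` without writing a new function.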

## Loading the Base Model

Now that we have our component ready, let's bring in a model that we intend to add it to. In our case, we will add it to the spaCy small English model, so let's load it in.

In [3]:
nlp = spacy.load("en_core_web_sm")

Let's see which entities the unmodified model extracts from this sentence.

In [4]:
text = "Tom and Jerry are not friends, but they do run around a lot together in the USA."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Tom PERSON
Jerry PERSON
USA GPE


Everything is as expected. Now, let's add our custom component to the pipeline.

## Adding Component to Pipeline

Now that we have the model loaded, let's add our new component to the end of the pipeline, which is where `add_pipe` places a component by default. We want it to sit after the `ner` component because our goal is to adjust the NER output.

In [5]:
nlp.add_pipe("remove_person")

<function __main__.remove_person(doc)>
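
If the position matters more than "somewhere at the end", `add_pipe` also accepts placement arguments such as `after`. The sketch below illustrates this on a blank pipeline, with an `entity_ruler` standing in for the trained NER so that no model download is needed. The component name `drop_person_demo` and the patterns are assumptions made for this example only.

```python
import spacy
from spacy.language import Language


# Hypothetical name, chosen so it does not clash with "remove_person".
@Language.component("drop_person_demo")
def drop_person_demo(doc):
    doc.ents = [ent for ent in doc.ents if ent.label_ != "PERSON"]
    return doc


nlp = spacy.blank("en")

# Rule-based stand-in for the statistical NER component.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Tom"},
    {"label": "PERSON", "pattern": "Jerry"},
    {"label": "GPE", "pattern": "USA"},
])
nlp.add_pipe("sentencizer")

# Place the filter directly after the component that sets doc.ents.
nlp.add_pipe("drop_person_demo", after="entity_ruler")

doc = nlp("Tom and Jerry run around the USA.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(nlp.pipe_names)
print(ents)
```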

## Analyze the Pipeline

In [6]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'remove_person': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
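
In the summary above, `remove_person` reports empty `assigns` and `requires` lists because the decorator was given no metadata. As a minimal sketch, assuming we want the analysis to reflect what the component touches, the decorator's `assigns` and `requires` arguments can declare it. The name `remove_person_annotated` is hypothetical, and a blank pipeline is used to keep the example self-contained.

```python
import spacy
from spacy.language import Language


# Same filtering logic, but with declared metadata for analyze_pipes().
@Language.component(
    "remove_person_annotated",
    requires=["doc.ents"],
    assigns=["doc.ents"],
)
def remove_person_annotated(doc):
    doc.ents = [ent for ent in doc.ents if ent.label_ != "PERSON"]
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("remove_person_annotated")

meta = nlp.get_pipe_meta("remove_person_annotated")
analysis = nlp.analyze_pipes()
print(analysis["summary"]["remove_person_annotated"])
```

With `requires` declared, `analyze_pipes` has enough information to point out a pipeline in which no earlier component sets `doc.ents`.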

## Testing the Pipeline

In [7]:
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

USA GPE


Voilà. Our pipe is working. If we saved this model and tried to load it in another script, however, it would break, because spaCy would not know how to handle the `remove_person` pipe unless that script also defined our custom component. This becomes highly problematic when we package and distribute our models in production. We will work through how to solve that problem in the next section. First, let's save the pipeline to disk.

In [8]:
nlp.to_disk("test_pipe")

## Packaging Model with Code

First we need to make a Python file that has the component inside of it. You can see this file in the repo; it is named "my_component.py". Once that file is created, we can pass it to our packaging command in the terminal. Because this is a Jupyter Notebook, we can run the command from the cell below by prefixing it with "!". The command works like this:
1) We call Python first, with the -m flag, so the terminal knows to run a Python module.<br>
2) Next we specify that the module is the spaCy library, which has its own set of terminal commands.<br>
3) We state that we want to run package from spaCy. This is the command to package a model.<br>
4) We give the directory that holds the saved pipeline we want to package. In this case, "test_pipe".<br>
5) We then specify the directory in which we want to drop the packaged model. In this case, "compiled".<br>
6) Finally, we tell package that this model is special. It takes the option --code followed by the file that holds the code. In this case, my_component.py. The packaged spaCy model now has the code bundled with it.
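
The file passed to --code is not reproduced in this notebook. A minimal sketch of what "my_component.py" presumably contains, assuming it mirrors the component cell above, is shown here. The important detail is that the file carries its own import, since it has to work outside the notebook.

```python
# my_component.py (assumed contents, mirroring the notebook's component cell)
from spacy.language import Language


# Registering under the same name the saved pipeline refers to
# lets spaCy resolve "remove_person" when the package is loaded.
@Language.component("remove_person")
def remove_person(doc):
    doc.ents = [ent for ent in doc.ents if ent.label_ != "PERSON"]
    return doc
```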

In [9]:
!python -m spacy package test_pipe compiled --code my_component.py

running sdist
running egg_info
creating en_core_web_sm.egg-info
writing en_core_web_sm.egg-info\PKG-INFO
writing dependency_links to en_core_web_sm.egg-info\dependency_links.txt
writing entry points to en_core_web_sm.egg-info\entry_points.txt
writing requirements to en_core_web_sm.egg-info\requires.txt
writing top-level names to en_core_web_sm.egg-info\top_level.txt
writing manifest file 'en_core_web_sm.egg-info\SOURCES.txt'
reading manifest file 'en_core_web_sm.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'en_core_web_sm.egg-info\SOURCES.txt'
running check
creating en_core_web_sm-3.0.0
creating en_core_web_sm-3.0.0\en_core_web_sm
creating en_core_web_sm-3.0.0\en_core_web_sm.egg-info
creating en_core_web_sm-3.0.0\en_core_web_sm\en_core_web_sm-3.0.0
creating en_core_web_sm-3.0.0\en_core_web_sm\en_core_web_sm-3.0.0\attribute_ruler
creating en_core_web_sm-3.0.0\en_core_web_sm\en_core_web_sm-3.0.0\lemmatizer
creating en_core_web_sm-3.0.0\en_core_web_s




In spaCy 2, we had to build the tar.gz ourselves, but spaCy 3 does it for us. If you look in the dist folder of the model inside compiled, you will notice that the tar.gz is there. This tar.gz contains the code for your custom component. You can pip install the tar.gz like any other model. Just make sure that you give pip the full file name, including the .tar.gz extension. Once installed, that model will know how to handle your custom component.
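
To avoid typing the archive name by hand, the install command can be assembled from whatever the packaging step produced. The helper names below are hypothetical, and the sketch assumes the layout suggested by the log above: one package folder inside compiled, with a dist subfolder holding the archive.

```python
import sys
from pathlib import Path


def find_sdists(output_dir):
    """Return the .tar.gz archives in any <package>/dist folder under output_dir."""
    return sorted(Path(output_dir).glob("*/dist/*.tar.gz"))


def pip_install_command(archive):
    """Build a pip command that targets the current Python interpreter."""
    return [sys.executable, "-m", "pip", "install", str(archive)]


# Print the command for each archive found; nothing is installed here.
for archive in find_sdists("compiled"):
    print(" ".join(pip_install_command(archive)))
```

Note that the log shows the package being built as en_core_web_sm-3.0.0, the name and version carried over from the base model, so installing it would presumably take the place of a stock en_core_web_sm in the same environment. The package command accepts --name and --version options if a distinct name is preferred.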