<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/Custom_pipelines_and_extensions_for_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Custom pipelines and extensions for spaCy

This notebook contains the code from the following post on the spaCy blog: https://explosion.ai/blog/spacy-v2-pipelines-extensions

It shows how to customize the processing pipeline and add extension attributes in spaCy.

If this is your first run, install the spaCy model first.

In [1]:
! python -m spacy download es_core_news_sm

✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')


**Note**: after installing the model, restart the runtime (re-initialize the environment) before loading it.

In [0]:
import spacy

In [0]:
# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("es_core_news_sm")

## Custom pipeline components

In [4]:
print('Default PIPELINE components: ')
nlp.pipeline

Default PIPELINE components: 


[('tagger', <spacy.pipeline.pipes.Tagger at 0x7fcce0dcff60>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fcce0d4c9a8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7fcce0d4ca08>)]

At a minimum, a component needs to be a callable that takes a Doc and returns it:

In [0]:
def my_component(doc):
  print("The doc is {} characters long and has {} tokens.".format(len(doc.text), len(doc)))
  return doc

def my_component2(doc):
  print("The first word in the text is '{}'.".format(doc[0]))
  return doc

def my_component3(doc):
  print("This is a component executed after the pipe 'parser'")
  return doc

The component can then be added at any position of the pipeline using the nlp.add_pipe() method. The arguments before, after, first, and last let you specify component names to insert the new component before or after, or tell spaCy to insert it first (i.e. directly after tokenization) or last in the pipeline.

In [6]:
#Add your components to the nlp pipeline in the order you want
if 'print_length' not in nlp.pipe_names:
  nlp.add_pipe(my_component, name='print_length', last=True)
if 'print_first' not in nlp.pipe_names:
  nlp.add_pipe(my_component2, name='print_first', first=True)
if 'after_parser' not in nlp.pipe_names:
  nlp.add_pipe(my_component3, name='after_parser', after='parser')

print('New PIPELINE components: ')
nlp.pipeline

New PIPELINE components: 


[('print_first', <function __main__.my_component2>),
 ('tagger', <spacy.pipeline.pipes.Tagger at 0x7fcce0dcff60>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fcce0d4c9a8>),
 ('after_parser', <function __main__.my_component3>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7fcce0d4ca08>),
 ('print_length', <function __main__.my_component>)]

Check that the custom components are called, in pipeline order, when spaCy processes a text.

In [7]:
doc = nlp(u"Esta es una frase.")

The first word in the text is 'Esta'.
This is a component executed after the pipe 'parser'
The doc is 18 characters long and has 5 tokens.
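
To make the ordering rules concrete, here is a toy model in plain Python. It is a minimal sketch, not spaCy's implementation: `ToyPipeline` and `make_component` are hypothetical names, and the "doc" is just a list that records which component touched it. It shows that a pipeline is an ordered list of callables, each receiving what the previous one returned, and how `first`, `last` and `after` decide where a new component is inserted.

```python
# Toy model of an ordered pipeline; ToyPipeline is a hypothetical class,
# not part of spaCy.
class ToyPipeline:
    def __init__(self):
        self.pipeline = []  # (name, component) pairs, in execution order

    @property
    def pipe_names(self):
        return [name for name, _ in self.pipeline]

    def add_pipe(self, component, name, before=None, after=None,
                 first=False, last=False):
        # At most one position may be given; the default is the end.
        given = sum([before is not None, after is not None, bool(first), bool(last)])
        if given > 1:
            raise ValueError("use only one of before, after, first, last")
        if name in self.pipe_names:
            raise ValueError("'{}' already exists in the pipeline".format(name))
        if first:
            index = 0
        elif before is not None:
            index = self.pipe_names.index(before)
        elif after is not None:
            index = self.pipe_names.index(after) + 1
        else:
            index = len(self.pipeline)
        self.pipeline.insert(index, (name, component))

    def __call__(self, doc):
        # Each component receives the object returned by the previous one.
        for _, component in self.pipeline:
            doc = component(doc)
        return doc


def make_component(label):
    # Build a component that records its label and returns the "doc".
    def component(doc):
        doc.append(label)
        return doc
    return component


toy = ToyPipeline()
for builtin in ("tagger", "parser", "ner"):
    toy.add_pipe(make_component(builtin), name=builtin)
toy.add_pipe(make_component("print_length"), name="print_length", last=True)
toy.add_pipe(make_component("print_first"), name="print_first", first=True)
toy.add_pipe(make_component("after_parser"), name="after_parser", after="parser")

execution_order = toy([])
print(execution_order)
# ['print_first', 'tagger', 'parser', 'after_parser', 'ner', 'print_length']
```

The recorded order matches the pipeline listing shown earlier, which is why the three messages are printed in that sequence.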


Finally, remove the custom components:

In [8]:
if 'print_length' in nlp.pipe_names:
  nlp.remove_pipe('print_length')
if 'print_first' in nlp.pipe_names:
  nlp.remove_pipe('print_first')
if 'after_parser' in nlp.pipe_names:
  nlp.remove_pipe('after_parser')

print('New PIPELINE components: ')
nlp.pipeline

New PIPELINE components: 


[('tagger', <spacy.pipeline.pipes.Tagger at 0x7fcce0dcff60>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fcce0d4c9a8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7fcce0d4ca08>)]

## Extension attributes on Doc, Token and Span

spaCy v2.0 introduces a mechanism for registering your own attributes, properties and methods, which then become available in the ._ namespace, for example doc._.my_attr. Three main types of extension can be registered via the set_extension() method:


1.   Attribute extensions. Set a default value for an attribute, which can be overwritten.
2.   Property extensions. Define a getter and an optional setter function.
3.   Method extensions. Assign a function that becomes available as an object method.

In [9]:
from spacy.tokens import Doc

def get_value(doc):
  return 'value'

def set_value(doc, value):
  # spaCy calls the setter with the object and the new value
  print(f"Set Value {value}")

Doc.set_extension('hello_attr', default=True, force=True)
Doc.set_extension('hello_property', getter=get_value, setter=set_value, force=True)
Doc.set_extension('hello_method', method=lambda doc, name: 'Hi {}!'.format(name), force=True)

print(doc._.hello_attr)            # True
print(doc._.hello_property)        # return value of get_value
print(doc._.hello_method('Ines'))  # 'Hi Ines!'

True
value
Hi Ines!
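
The difference between the three extension types is easier to see in a stripped-down model. The following is a plain-Python sketch, not spaCy's code: `ToyDoc` and `ToyUnderscore` are hypothetical classes that only imitate the `._` namespace. An attribute extension stores a per-object value with a default, a property extension calls its getter on every read and its setter on every write, and a method extension is called with the object as its first argument.

```python
# Toy model of the three extension types; ToyDoc and ToyUnderscore are
# hypothetical names, not spaCy classes.
class ToyUnderscore:
    extensions = {}  # name -> (default, method, getter, setter), shared by all docs

    def __init__(self, obj):
        # Bypass our own __setattr__ while wiring up internal state.
        object.__setattr__(self, "_obj", obj)
        object.__setattr__(self, "_values", {})

    def __getattr__(self, name):
        if name not in self.extensions:
            raise AttributeError(name)
        default, method, getter, setter = self.extensions[name]
        if getter is not None:
            return getter(self._obj)  # property: computed on every access
        if method is not None:
            # method: the object is passed as the first argument
            return lambda *args, **kwargs: method(self._obj, *args, **kwargs)
        return self._values.get(name, default)  # attribute: stored value or default

    def __setattr__(self, name, value):
        if name not in self.extensions:
            raise AttributeError(name)
        default, method, getter, setter = self.extensions[name]
        if setter is not None:
            setter(self._obj, value)  # property: the setter gets (object, value)
        elif getter is not None or method is not None:
            raise AttributeError("'{}' is read-only".format(name))
        else:
            self._values[name] = value  # attribute: overwrite the default


class ToyDoc:
    def __init__(self, text):
        self.text = text
        self._ = ToyUnderscore(self)

    @classmethod
    def set_extension(cls, name, default=None, method=None, getter=None, setter=None):
        ToyUnderscore.extensions[name] = (default, method, getter, setter)


ToyDoc.set_extension("hello_attr", default=True)
ToyDoc.set_extension("hello_property",
                     getter=lambda doc: doc.text.upper(),
                     setter=lambda doc, value: setattr(doc, "text", value))
ToyDoc.set_extension("hello_method", method=lambda doc, name: "Hi {}!".format(name))

toy_doc = ToyDoc("hola")
toy_doc._.hello_attr = False        # overwrites the default for this object only
toy_doc._.hello_property = "adios"  # routed through the setter
print(toy_doc._.hello_attr, toy_doc._.hello_property, toy_doc._.hello_method("Ines"))
print(ToyDoc("otro")._.hello_attr)  # a fresh object still sees the default
```

Note that the setter takes two arguments, the object and the new value, which is the same shape spaCy expects from a setter passed to set_extension().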


The following example shows a simple pipeline component that fetches all countries from the REST Countries API and finds the country names in the document. It merges the matched spans, assigns them the entity label GPE (geopolitical entity), and adds the country's capital, its latitude/longitude coordinates and a boolean is_country flag as token attributes.

In [0]:
import requests
from spacy.tokens import Token, Span
from spacy.matcher import PhraseMatcher

class Countries(object):
    name = 'countries'  # component name shown in pipeline

    def __init__(self, nlp, label='GPE'):
        # request all country data from the API
        r = requests.get('https://restcountries.eu/rest/v2/all')
        self.countries = {c['name']: c for c in r.json()}  # create dict for easy lookup
        # initialise the matcher and add patterns for all country names
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add('COUNTRIES', None, *[nlp(c) for c in self.countries.keys()])
        self.label = nlp.vocab.strings[label] # get label ID from vocab
        # register extensions on the Token
        if not Token.has_extension('is_country'):
            Token.set_extension('is_country', default=False)
        if not Token.has_extension('country_capital'):
            Token.set_extension('country_capital', default="")
        if not Token.has_extension('country_latlng'):
            Token.set_extension('country_latlng', default="")

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # create Span for matched country and assign label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            for token in entity:  # set values of token attributes
                token._.set('is_country', True)
                token._.set('country_capital', self.countries[entity.text]['capital'])
                token._.set('country_latlng', self.countries[entity.text]['latlng'])
        doc.ents = list(doc.ents) + spans  # overwrite doc.ents and add entities – don't replace!
        for span in spans:
            span.merge()  # merge all spans at the end to avoid mismatched indices
        return doc  # don't forget to return the Doc!

In [12]:
component = Countries(nlp)
nlp.add_pipe(component, before='tagger')
doc = nlp(u"texto sobre Colombia y la Czech Republic")

print([(ent.text, ent.label_) for ent in doc.ents])
# [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]

print([(token.text, token._.country_capital) for token in doc if token._.is_country])
# [('Colombia', 'Bogotá'), ('Czech Republic', 'Prague')]

ValueError: ignored
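
The component collects all matched spans first and merges them only at the end "to avoid mismatched indices". The sketch below illustrates that reasoning on a plain list of strings. It is an assumption-laden toy, not how spaCy merges tokens: `merge_spans` is a hypothetical helper, and "Costa Rica" is used only because the first span must cover more than one token for the problem to show up. Merging a span shortens the sequence, so any offsets to its right that were computed beforehand become stale.

```python
def merge_spans(tokens, spans):
    """Merge each (start, end) span of `tokens` into a single token.

    Hypothetical helper operating on plain strings, not spaCy tokens.
    Spans are assumed not to overlap; `end` is exclusive.
    """
    merged = list(tokens)
    # Right to left: merging a span only shifts indices to its right,
    # so the offsets of the spans still to be processed stay valid.
    for start, end in sorted(spans, reverse=True):
        merged[start:end] = [" ".join(merged[start:end])]
    return merged


tokens = ["texto", "sobre", "Costa", "Rica", "y", "la", "Czech", "Republic"]
spans = [(2, 4), (6, 8)]  # offsets computed on the unmerged tokens

result = merge_spans(tokens, spans)
print(result)
# ['texto', 'sobre', 'Costa Rica', 'y', 'la', 'Czech Republic']

# Naive left-to-right merging with the same offsets: after the first merge
# the list is one item shorter, so (6, 8) no longer covers "Czech Republic".
naive = list(tokens)
for start, end in spans:
    naive[start:end] = [" ".join(naive[start:end])]
print(naive)
# ['texto', 'sobre', 'Costa Rica', 'y', 'la', 'Czech', 'Republic']
```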

Using getters and setters, you can also implement attributes on the Doc and Span that reference custom Token attributes – for example, whether a document contains countries. 

In [0]:
has_country = lambda tokens: any([token._.is_country for token in tokens])
Doc.set_extension('has_country', getter=has_country)
Span.set_extension('has_country', getter=has_country)

In [20]:
print(doc)
print(f' - Exist a country in the text = {doc._.has_country}')

Texto sobre Colombia y la Czech Republic
 - Exist a country in the text = True


In [21]:
span_ = doc[0:3]
print(span_)
print(f' - Exist a country in the span = {span_._.has_country}')

Texto sobre Colombia
 - Exist a country in the span = True


In [22]:
span_ = doc[0:2]
print(span_)
print(f' - Exist a country in the span = {span_._.has_country}')

Texto sobre
 - Exist a country in the span = False


## Example: Emoji handling with spacymoji

In [24]:
! pip install spacymoji

Collecting spacymoji
  Downloading https://files.pythonhosted.org/packages/e6/42/b4460030eb06504451973609ab8a95b8c0106090aca7a4d657baf9b2611d/spacymoji-2.0.0-py3-none-any.whl
Collecting emoji<1.0.0,>=0.4.5
  Downloading https://files.pythonhosted.org/packages/40/8d/521be7f0091fe0f2ae690cc044faf43e3445e0ff33c574eae752dd7e39fa/emoji-0.5.4.tar.gz (43kB)
     |████████████████████████████████| 51kB 2.5MB/s 
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-0.5.4-cp36-none-any.whl size=42175 sha256=f7d5d5f374dcff11680af29f2d66a4258cbb47ea66ef32bdd8cb0669acecb623
  Stored in directory: /root/.cache/pip/wheels/2a/a9/0a/4f8e8cce8074232aba240caca3fade315bb49fac68808d1a9c
Successfully built emoji
Installing collected packages: emoji, spacymoji
Successfully installed emoji-0.5.4 spacymoji-2.0.0


In [0]:
import spacy
from spacymoji import Emoji

nlp = spacy.load('en')
emoji = Emoji(nlp)
nlp.add_pipe(emoji, first=True)

doc  = nlp(u"This is a test 😻 👍🏿")
assert doc._.has_emoji
assert len(doc._.emoji) == 2
assert doc[2:5]._.has_emoji
assert doc[4]._.is_emoji
assert doc[5]._.emoji_desc == u'thumbs up dark skin tone'
assert doc._.emoji[1] == (u'👍🏿', 5, u'thumbs up dark skin tone')

In [26]:
doc[4]._.emoji_desc

'smiling cat face with heart-eyes'