# Simple Component

This example shows a custom component that prints the number of tokens in a document. Complete it as follows:

1. Complete the component function so that it gets the doc’s length.
2. Add the length_component to the existing pipeline as the first component.
3. Try out the new pipeline by processing any text with the nlp object, for example “This is a sentence.”

In [6]:
import spacy

# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc


# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
# (passing the function itself is the spaCy v2 API; spaCy v3 expects a registered component name)
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp('This is a sentence.')

['length_component', 'tagger', 'parser', 'ner']
This document is 5 tokens long.
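
The ordering behaviour used above can be sketched without spaCy. The block below is a toy model, not spaCy's implementation: `ToyPipeline` and `lowercase_component` are hypothetical names invented for illustration, and the "doc" is just a list of whitespace-separated strings. It only shows that a component is a function that receives the doc and returns it, and that `first=True` places the component at the front of the run order.

```python
# Toy model of an ordered pipeline. This is NOT how spaCy is implemented;
# ToyPipeline and lowercase_component are hypothetical names for illustration.
class ToyPipeline:
    def __init__(self):
        # (name, function) pairs, stored in the order they will run
        self.components = []

    def add_pipe(self, component, first=False):
        entry = (component.__name__, component)
        if first:
            self.components.insert(0, entry)  # run before everything else
        else:
            self.components.append(entry)     # default: run last

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def __call__(self, text):
        # Stand-in for tokenization: split on whitespace only
        doc = text.split()
        for _, component in self.components:
            doc = component(doc)  # each component must return the doc
        return doc


def lowercase_component(doc):
    return [token.lower() for token in doc]


def length_component(doc):
    print("This document is {} tokens long.".format(len(doc)))
    return doc


toy_nlp = ToyPipeline()
toy_nlp.add_pipe(lowercase_component)
toy_nlp.add_pipe(length_component, first=True)
print(toy_nlp.pipe_names)
result = toy_nlp("This is a sentence.")
```

Because the toy tokenizer splits on whitespace only, the token count it prints differs from the count spaCy reports for the same text.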


# Complex Component

In this exercise, you’ll write a custom component that uses the PhraseMatcher to find animal names in the document and adds the matched spans to doc.ents. A PhraseMatcher with the animal patterns has already been created as the variable matcher.

1. Define the custom component and apply the matcher to the doc.
2. Create a Span for each match, assign it the label 'ANIMAL', and overwrite doc.ents with the new spans.
3. Add the new component to the pipeline after the 'ner' component.
4. Process the text and print the entity text and entity label for the entities in doc.ents.

In [1]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

In [2]:
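# Load the small English model
nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
# Turn each animal name into a Doc object to use as a match pattern
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
# Create the matcher and register the patterns under the key "ANIMAL"
# (spaCy v2 signature: add(key, on_match, *patterns); None means no callback)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)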

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]


In [3]:
# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label='ANIMAL') for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc

In [4]:
# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(animal_component, after='ner')
print(nlp.pipe_names)

['tagger', 'parser', 'ner', 'animal_component']


In [5]:
# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
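
To make the match-to-span step more concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization and a naive exact-match search; `find_phrases` is a hypothetical helper, not part of spaCy, and it does not reflect how the PhraseMatcher works internally. What it does share with the component above is the shape of a match: a start and an exclusive end token index that can be sliced out of the document.

```python
def find_phrases(tokens, phrases, label):
    """Return (label, start, end) for every phrase found in tokens.

    Hypothetical helper for illustration; `end` is exclusive, so
    tokens[start:end] is the matched phrase.
    """
    matches = []
    for phrase in phrases:
        n = len(phrase)
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == phrase:
                matches.append((label, start, start + n))
    # Order the matches by their position in the text
    return sorted(matches, key=lambda match: match[1])


# Whitespace split as a stand-in for real tokenization
tokens = "I have a cat and a Golden Retriever".split()
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
phrases = [animal.split() for animal in animals]

matches = find_phrases(tokens, phrases, "ANIMAL")
# Slice the tokens for each match, similar in spirit to Span(doc, start, end, label=...)
entities = [(" ".join(tokens[start:end]), label) for label, start, end in matches]
print(entities)
# [('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
```

One caveat when adapting the real component: doc.ents cannot hold overlapping spans, so a pattern list in which one phrase contains another (for example "Golden Retriever" and "Retriever") would need the overlaps filtered out before assignment.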
