## Inspecting the pipeline

Let’s inspect the small English model’s pipeline!

- Load the en_core_web_sm model and create the nlp object.
- Print the names of the pipeline components using nlp.pipe_names.
- Print the full pipeline of (name, component) tuples using nlp.pipeline.

In [3]:
import spacy

# Load the en_core_web_sm model
nlp = spacy.load("en_core_web_sm")

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x0000018EFC3F7100>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x0000018EFC3F7A60>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x0000018EFC1674A0>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x0000018EFC40DB40>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x0000018EFC403E40>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x0000018EFC167660>)]
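
Because nlp.pipeline is a plain list of (name, component) tuples, you can unpack it in a loop and look up individual components by name. Here is a minimal sketch, assuming spaCy v3. It swaps en_core_web_sm for a blank English pipeline with the built-in sentencizer, purely so that it runs without a model download; that substitution is an assumption of this example, not part of the exercise.

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_sm
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Each entry of nlp.pipeline is a (name, component) tuple
for name, component in nlp.pipeline:
    print(name, type(component).__name__)

# Components can also be checked for and fetched by name
print(nlp.has_pipe("sentencizer"))
sentencizer = nlp.get_pipe("sentencizer")
print(sentencizer is nlp.pipeline[0][1])
```

The same loop works on the full en_core_web_sm pipeline and gives a more readable listing than printing the raw list of objects.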


The example shows a custom component that prints the number of tokens in a document. Can you complete it?

- Complete the component function with the doc’s length.
- Add the length_component to the existing pipeline as the first component.
- Try out the new pipeline by processing any text with the nlp object, for example “This is a sentence.”

In [12]:
import spacy
from spacy.language import Language

# Define the custom component
@Language.component("info_component")
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc


# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe("info_component", name="length_component", first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("This is a sentence.")

['length_component', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
This document is 5 tokens long.
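
first=True is one of four placement options that nlp.add_pipe accepts; last=True, before= and after= work the same way. The sketch below assumes spaCy v3 and uses a blank English pipeline (tokenizer only) so that it runs without a model. The component names demo_a, demo_b and demo_c are hypothetical and exist only for this illustration.

```python
import spacy
from spacy.language import Language


# Three do-nothing components (hypothetical names) to show placement
@Language.component("demo_a")
def demo_a(doc):
    print("demo_a saw", len(doc), "tokens")
    return doc


@Language.component("demo_b")
def demo_b(doc):
    return doc


@Language.component("demo_c")
def demo_c(doc):
    return doc


# Assumption: a blank pipeline stands in for en_core_web_sm
nlp = spacy.blank("en")
nlp.add_pipe("demo_b")                  # appended at the end by default
nlp.add_pipe("demo_a", first=True)      # placed at the front
nlp.add_pipe("demo_c", after="demo_a")  # placed directly after demo_a
print(nlp.pipe_names)

# Components run in pipeline order when the text is processed
doc = nlp("This is a sentence.")
```

The resulting order is demo_a, demo_c, demo_b. Only one of the four placement arguments can be used per call.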


In this exercise, you’ll write a custom component that uses the PhraseMatcher to find animal names in the document and adds the matched spans to doc.ents. A PhraseMatcher with the animal patterns has already been created as the variable matcher.

- Define the custom component and apply the matcher to the doc.
- Create a Span for each match, assign it the label "ANIMAL", and overwrite doc.ents with the new spans.
- Add the new component to the pipeline after the "ner" component.
- Process the text and print the entity text and entity label for the entities in doc.ents.

In [20]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
# Pass the patterns as a list (spaCy v3 signature)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the "ner" component
nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
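
Note that doc.ents = spans replaces whatever the "ner" component predicted. If you would rather keep those entities, one option is to combine both lists and pass them through spacy.util.filter_spans, which removes overlapping spans and prefers the longer one. This is a sketch under stated assumptions, not part of the exercise solution: it uses a blank English pipeline so that it runs without a model, sets a PERSON entity by hand as a stand-in for a real "ner" prediction, and the component name animal_merge_component is hypothetical.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: tokenizer-only pipeline, so there is no trained "ner"
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", [nlp.make_doc(text) for text in ["Golden Retriever", "cat"]])


@Language.component("animal_merge_component")  # hypothetical name
def animal_merge_component(doc):
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for _, start, end in matcher(doc)]
    # Keep the existing entities and drop any overlapping spans
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc


doc = nlp.make_doc("Alice has a cat and a Golden Retriever")
# Stand-in for an entity that a trained "ner" component might have predicted
doc.ents = [Span(doc, 0, 1, label="PERSON")]
# A component is just a callable, so it can be applied to a doc directly
doc = animal_merge_component(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

In a real pipeline, you would add this component after "ner" exactly as in the exercise, and doc.ents would already hold the model's predictions when the component runs.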
