This notebook was created from Chapter 3 of the [Advanced NLP with spaCy](https://course.spacy.io/en/chapter3) course.

# Chapter 3: Processing Pipelines

## Processing pipelines

Welcome back! This chapter is dedicated to processing pipelines: a series of functions applied to a doc to add attributes like part-of-speech tags, dependency labels or named entities.

In this lesson, you'll learn about the pipeline components provided by spaCy, and what happens behind the scenes when you call nlp on a string of text.

### What happens when you call nlp?

You've already written this plenty of times by now: pass a string of text to the `nlp` object, and receive a `Doc` object.

But what does the `nlp` object actually do?

First, the tokenizer is applied to turn the string of text into a `Doc` object. Next, a series of pipeline components is applied to the doc in order. In this case, the tagger, then the parser, then the entity recognizer. Finally, the processed doc is returned, so you can work with it.

![pipeline](./img/pipeline.png)
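
The flow described above (tokenizer first, then each component in order, then the finished doc is returned) can be mimicked in a few lines of plain Python. This is only a sketch to make the mechanics visible, not spaCy's actual implementation: `ToyDoc`, `ToyLanguage`, `toy_tokenizer`, `toy_tagger` and `toy_counter` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyDoc:
    """Hypothetical stand-in for spaCy's Doc: tokens plus a dict of annotations."""
    tokens: List[str]
    annotations: Dict[str, object] = field(default_factory=dict)


def toy_tokenizer(text: str) -> ToyDoc:
    # Whitespace splitting only; spaCy's real tokenizer is far more sophisticated
    return ToyDoc(tokens=text.split())


def toy_tagger(doc: ToyDoc) -> ToyDoc:
    # Placeholder "tags": mark capitalized tokens, label everything else "X"
    doc.annotations["tags"] = ["CAP" if t[:1].isupper() else "X" for t in doc.tokens]
    return doc


def toy_counter(doc: ToyDoc) -> ToyDoc:
    # Relies on the tags set by the previous component, so order matters
    doc.annotations["n_cap"] = doc.annotations["tags"].count("CAP")
    return doc


class ToyLanguage:
    """Hypothetical stand-in for the nlp object."""

    def __init__(self, pipeline: List[Tuple[str, Callable[[ToyDoc], ToyDoc]]]):
        self.pipeline = pipeline

    def __call__(self, text: str) -> ToyDoc:
        # Step 1: the tokenizer turns the string into a doc
        doc = toy_tokenizer(text)
        # Step 2: every component receives the doc and returns it, in order
        for name, component in self.pipeline:
            doc = component(doc)
        # Step 3: the processed doc is returned
        return doc


toy_nlp = ToyLanguage([("tagger", toy_tagger), ("counter", toy_counter)])
toy_doc = toy_nlp("This is a sentence.")
print(toy_doc.tokens)
print(toy_doc.annotations)
```

Swapping the two components in the list would make `toy_counter` fail, because the tags it reads would not exist yet. The same reasoning explains why the order of components in a real pipeline matters.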

In [7]:
# Import spaCy
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("This is a sentence.")

In [8]:
doc

This is a sentence.

### Built-in pipeline components

spaCy ships with a variety of built-in pipeline components. Here are some of the most common ones that you'll want to use in your projects.

The part-of-speech tagger sets the `token.tag` and `token.pos` attributes.

The dependency parser adds the `token.dep` and `token.head` attributes and is also responsible for detecting sentences and base noun phrases, also known as noun chunks.

The named entity recognizer adds the detected entities to the `doc.ents` property. It also sets entity type attributes on the tokens that indicate if a token is part of an entity or not.

Finally, the text classifier sets category labels that apply to the whole text, and adds them to the `doc.cats` property.

Because text categories are always very specific, the text classifier is not included in any of the trained pipelines by default. But you can use it to train your own system.

Name | Description | Creates
-----|-------------|--------
tagger | Part-of-speech tagger | `Token.tag`, `Token.pos`
parser | Dependency parser | `Token.dep`, `Token.head`, `Doc.sents`, `Doc.noun_chunks`
ner | Named entity recognizer | `Doc.ents`, `Token.ent_iob`, `Token.ent_type`
textcat | Text classifier | `Doc.cats`

### Under the hood

All pipeline packages you can load into spaCy include several files and a `config.cfg`.

The config defines things like the language and pipeline. This tells spaCy which components to instantiate and how they should be configured.

The built-in components that make predictions also need binary data. The data is included in the pipeline package and loaded into the component when you load the pipeline.

![Package Meta](./img/package_meta.png)

- Pipeline components are defined, in order, in the pipeline package's `config.cfg`
- Built-in components need binary data to make predictions
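
To make the role of the config more concrete, here is a minimal sketch that reads the language and the component order from a config-style string. The fragment is a hypothetical, heavily trimmed stand-in for a real `config.cfg`, and the parsing uses Python's standard `configparser` and `json` modules purely for illustration. spaCy reads the file through its own config system, so this is not how the library loads it.

```python
import configparser
import json

# Hypothetical, heavily trimmed fragment in the style of a pipeline's config.cfg
CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["tok2vec","tagger","parser","attribute_ruler","lemmatizer","ner"]

[components.tagger]
factory = "tagger"
"""

# Disable interpolation so the values are read exactly as written
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)

# The values in this fragment are JSON-formatted, so json.loads can decode them
lang = json.loads(parser["nlp"]["lang"])
pipeline = json.loads(parser["nlp"]["pipeline"])
tagger_factory = json.loads(parser["components.tagger"]["factory"])

print("Language:", lang)
print("Components, in order:", pipeline)
print("Factory for 'tagger':", tagger_factory)
```

The list under `pipeline` is what determines which components get created and in which order they run.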

### Pipeline attributes

To see the names of the pipeline components present in the current nlp object, you can use the `nlp.pipe_names` attribute.

For a list of component name and component function tuples, you can use the `nlp.pipeline` attribute.

The component functions are the functions applied to the doc to process it and set attributes – for example, part-of-speech tags or named entities.

- `nlp.pipe_names`: list of pipeline component names

In [9]:
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


- `nlp.pipeline`: list of `(name, component)` tuples

In [18]:
for x in nlp.pipeline:
    print(x)

('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7d470a3e5430>)
('tagger', <spacy.pipeline.tagger.Tagger object at 0x7d470a3e72f0>)
('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7d470a217df0>)
('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7d47071a85d0>)
('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7d470a44ca10>)
('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7d470a217ca0>)


## What happens when you call nlp?

What does spaCy do when you call nlp on a string of text?

In [13]:
doc = nlp("This is a sentence.")

The tokenizer turns a string of text into a `Doc` object. spaCy then applies every component in the pipeline to the doc, in order.

## Inspecting the pipeline

Let’s inspect the small English pipeline!

- Load the `en_core_web_sm` pipeline and create the `nlp` object.
- Print the names of the pipeline components using `nlp.pipe_names`.
- Print the full pipeline of `(name, component)` tuples using `nlp.pipeline`.

In [17]:
import spacy

# Load the en_core_web_sm pipeline
nlp = spacy.load('en_core_web_sm')

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
for x in nlp.pipeline:
    print(x)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7d470a3e5430>)
('tagger', <spacy.pipeline.tagger.Tagger object at 0x7d470a3e72f0>)
('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7d470a217df0>)
('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7d47071a85d0>)
('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7d470a44ca10>)
('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7d470a217ca0>)


## Custom pipeline components

### Why custom components?

After the text is tokenized and a `Doc` object has been created, pipeline components are applied in order. spaCy supports a range of built-in components, but also lets you define your own.

Custom components are executed automatically when you call the `nlp` object on a text.

They're especially useful for adding your own custom metadata to documents and tokens.

You can also use them to update built-in attributes, like the named entity spans.

![pipeline](./img/pipeline.png)

- Make a function execute automatically when you call `nlp`
- Add your own metadata to documents and tokens
- Update built-in attributes like `doc.ents`

### Anatomy of a component (1)

Fundamentally, a pipeline component is a function or callable that takes a doc, modifies it and returns it, so it can be processed by the next component in the pipeline.

To tell spaCy where to find your custom component and how it should be called, you can decorate it using the `@Language.component` decorator. Just add it to the line right above the function definition.

Once a component is registered, it can be added to the pipeline using the `nlp.add_pipe` method. The method takes at least one argument: the string name of the component.

- Function that takes a `doc`, modifies it and returns it
- Registered using the `Language.component` decorator
- Can be added using the `nlp.add_pipe` method

In [20]:
from spacy.language import Language

@Language.component("custom_component")
def custom_component_function(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe("custom_component")

<function __main__.custom_component_function(doc)>

### Anatomy of a component (2)

To specify *where* to add the component in the pipeline, you can use the following keyword arguments:

Setting `last` to `True` will add the component last in the pipeline. This is the default behavior.

Setting `first` to `True` will add the component first in the pipeline, right after the tokenizer.

The `before` and `after` arguments let you define the name of an existing component to add the new component before or after. For example, `before="ner"` will add it before the named entity recognizer.

The other component to add the new component before or after needs to exist, though – otherwise, spaCy will raise an error.

Argument | Description | Example
-----------|-----------|--------
last | If True, add last | nlp.add_pipe("component", last=True)
first | If True, add first | nlp.add_pipe("component", first=True)
before | Add before component | nlp.add_pipe("component", before="ner")
after | Add after component | nlp.add_pipe("component", after="tagger")
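
The positioning rules in the table can be sketched as a small plain-Python function that works on a list of component names. This is an illustration of the behavior described above, not spaCy's code: `add_component` is a hypothetical helper, and the check that only one positional argument is used at a time is an assumption about sensible behavior.

```python
from typing import List, Optional


def add_component(
    names: List[str],
    new: str,
    *,
    first: bool = False,
    last: bool = False,
    before: Optional[str] = None,
    after: Optional[str] = None,
) -> List[str]:
    """Return a new list of names with `new` inserted at the requested position."""
    chosen = sum([first, last, before is not None, after is not None])
    if chosen > 1:
        raise ValueError("Use only one of first, last, before or after")
    if first:
        index = 0
    elif before is not None or after is not None:
        target = before if before is not None else after
        # The referenced component has to exist, otherwise raise an error
        if target not in names:
            raise ValueError(f"No component {target!r} in pipeline: {names}")
        index = names.index(target) + (0 if before is not None else 1)
    else:
        # Default behavior: add the component last
        index = len(names)
    return names[:index] + [new] + names[index:]


pipeline = ["tagger", "parser", "ner"]
print(add_component(pipeline, "component"))
print(add_component(pipeline, "component", first=True))
print(add_component(pipeline, "component", before="ner"))
print(add_component(pipeline, "component", after="tagger"))

# Referring to a component that is not in the pipeline raises an error
try:
    add_component(pipeline, "component", before="textcat")
except ValueError as err:
    print("Error:", err)
```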

### Example: a simple component (1)

Here's an example of a simple pipeline component.

We start off with the small English pipeline.

We then define the component – a function that takes a `Doc` object and returns it.

Let's do something simple and print the length of the doc that passes through the pipeline.

Don't forget to return the doc so it can be processed by the next component in the pipeline! The doc created by the tokenizer is passed through all components, so it's important that they all return the modified doc.

To tell spaCy about the new component, we register it using the `@Language.component` decorator and call it "custom_component".

We can now add the component to the pipeline. Let's add it to the very beginning right after the tokenizer by setting `first=True`.

When we print the pipeline component names, the custom component now shows up at the start. This means it will be applied first when we process a doc.

In [22]:
# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Define a custom component
@Language.component("custom_component")
def custom_component_function(doc):
    # Print the doc's length
    print("Doc length:", len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe("custom_component", first=True)

# Print the pipeline component names
print("Pipeline:", nlp.pipe_names)

Pipeline: ['custom_component', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


### Example: a simple component (2)

Now when we process a text using the nlp object, the custom component will be applied to the doc and the length of the document will be printed.

In [23]:
# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Define a custom component
@Language.component("custom_component")
def custom_component_function(doc):
    # Print the doc's length
    print("Doc length:", len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe("custom_component", first=True)

# Process a text
doc = nlp("Hello world!")

Doc length: 3


## Use cases for custom components

Which of these problems can be solved by custom pipeline components? Choose all that apply!

1. Updating the trained pipelines and improving their predictions
2. Computing your own values based on tokens and their attributes
3. Adding named entities, for example based on a dictionary
4. Implementing support for an additional language

Answer: 2 and 3

Custom components are great for adding custom values to documents, tokens and spans, and for customizing `doc.ents`.

## Simple components

The example shows a custom component that prints the number of tokens in a document. Can you complete it?

- Complete the component function with the `doc`’s length.
- Add the `"length_component"` to the existing pipeline as the **first** component.
- Try out the new pipeline and process any text with the `nlp` object – for example “This is a sentence.”.

In [26]:
import spacy
from spacy.language import Language

# Define the custom component
@Language.component("length_component")
def length_component_function(doc):
    # Get the doc's length
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc


# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe("length_component", first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("This is a sentence")

['length_component', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
This document is 4 tokens long.


## Complex components

In this exercise, you’ll write a custom component that uses the `PhraseMatcher` to find animal names in the document and adds the matched spans to `doc.ents`. A `PhraseMatcher` with the animal patterns has already been created as the variable `matcher`.

- Define the custom component and apply the `matcher` to the `doc`.
- Create a `Span` for each match, assign the label ID for `"ANIMAL"` and overwrite the `doc.ents` with the new spans.
- Add the new component to the pipeline *after* the `"ner"` component.
- Process the text and print the entity text and entity label for the entities in `doc.ents`.
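
One possible way to approach the steps above is sketched below. Several parts are assumptions made for illustration: the list of animal names, the component name `"animal_component"` and the example sentence are placeholders, and the `matcher` is built inline here, whereas the exercise provides it. The sketch uses a blank English pipeline so that it runs without a trained package; a blank pipeline has no `"ner"` component, so the new component is simply added last.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Blank English pipeline: tokenizer only, no trained components (assumption)
nlp = spacy.blank("en")

# Placeholder animal names used to build the patterns
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))

# The exercise provides this matcher; it is recreated here for completeness
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)


@Language.component("animal_component")
def animal_component_function(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# No "ner" component exists in a blank pipeline, so add the component last
nlp.add_pipe("animal_component")
print(nlp.pipe_names)

# Process the text and print the text and label of each entity
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])
```

With a trained pipeline such as `en_core_web_sm`, the only change to the last step would be passing `after="ner"` to `nlp.add_pipe`. Because the component overwrites `doc.ents`, any entities predicted earlier in the pipeline are replaced by the matched spans.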