# Processing pipelines

This chapter covers spaCy's processing pipelines: a series of functions applied to a Doc to add attributes such as part-of-speech tags, dependency labels or named entities. You will learn how to write custom components and how to use custom attributes to add your own metadata to documents, spans and tokens.

In [2]:
import os
import spacy

model_path = os.path.join(os.getcwd(), 'models/en_core_web_md')
nlp = spacy.load(model_path)

## Built-in pipeline components

- tagger: part-of-speech tagger
- parser: dependency parser
- ner: named entity recognizer
- textcat: text classifier

## Pipeline attributes

In [11]:
print(nlp.pipe_names)

['tagger', 'parser', 'ner']


In [8]:
print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f4cb7dac210>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f4c7a4528a0>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f4c7a452b40>)]


## Why custom components?

![Pipeline](images/pipeline.png)

- Run a function automatically whenever you call nlp on a text
- Add your own metadata to documents and tokens
- Update built-in attributes like doc.ents

## Anatomy of a component

In [12]:
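def custom_component(doc):
    # Do something to the doc here
    return doc

# Add the component only once, so re-running this cell does not raise an error
if 'custom_component' not in nlp.pipe_names:
    nlp.add_pipe(custom_component)  # added at the end of the pipeline by default
    # Alternative positions (use at most one of these keywords):
    # nlp.add_pipe(custom_component, last=True)
    # nlp.add_pipe(custom_component, first=True)
    # nlp.add_pipe(custom_component, before='ner')
    # nlp.add_pipe(custom_component, after='tagger')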
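
To make the skeleton above concrete, here is a minimal sketch of a component that overwrites `doc.ents`. The component name `animal_component`, the word list `ANIMALS` and the `ANIMAL` label are hypothetical, and `spacy.blank('en')` stands in for the loaded model so the sketch runs without a model download. The version check is an assumption about the installed spaCy: v2 accepts the function itself in `add_pipe`, as in the code above, while v3 expects the name of a registered component.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span

ANIMALS = {'cat', 'dog'}  # hypothetical word list, for illustration only

def animal_component(doc):
    # Wrap every matching token in a one-token Span labelled ANIMAL
    spans = [Span(doc, token.i, token.i + 1, label='ANIMAL')
             for token in doc if token.lower_ in ANIMALS]
    # Overwrite the built-in doc.ents attribute
    doc.ents = spans
    return doc

nlp = spacy.blank('en')  # tokenizer only, no trained components

if spacy.__version__.startswith('2.'):
    # spaCy v2: pass the function itself
    nlp.add_pipe(animal_component, name='animal_component', last=True)
else:
    # spaCy v3 and later: register the function under a name, then add it by name
    Language.component('animal_component', func=animal_component)
    nlp.add_pipe('animal_component', last=True)

doc = nlp('I have a cat and a dog')
print(nlp.pipe_names)
print([(ent.text, ent.label_) for ent in doc.ents])
```

Because the component is part of the pipeline, it runs on every text passed to `nlp` without being called explicitly.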

## Setting custom attributes

In [20]:
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None, force=True)
Token.set_extension('is_color', default=False, force=True)
Span.set_extension('has_color', default=False, force=True)

doc = nlp('Hello!')
token = doc[0]
span = doc[0:1]

doc._.title = 'My document'
token._.is_color = True
span._.has_color = False
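
The extensions above store a fixed default that is overwritten by hand. `set_extension` also accepts a `getter` (the value is computed on access) or a `method` (the attribute becomes callable). The sketch below is an illustration under assumptions: the `COLORS` word list and the helper functions are hypothetical, and `spacy.blank('en')` is used so that only the tokenizer is needed.

```python
import spacy
from spacy.tokens import Doc, Token, Span

COLORS = {'red', 'green', 'blue'}  # hypothetical word list

def get_is_color(token):
    # Getter: computed every time token._.is_color is read
    return token.text.lower() in COLORS

def get_has_color(span):
    # Getter: True if any token in the span is a color
    return any(token._.is_color for token in span)

def has_token(doc, token_text):
    # Method: the Doc is passed in as the first argument
    return token_text in [token.text for token in doc]

# force=True replaces any earlier registration under the same name
Token.set_extension('is_color', getter=get_is_color, force=True)
Span.set_extension('has_color', getter=get_has_color, force=True)
Doc.set_extension('has_token', method=has_token, force=True)

nlp = spacy.blank('en')
doc = nlp('The sky is blue.')

print(doc[3]._.is_color)        # token "blue"
print(doc[1:4]._.has_color)     # span "sky is blue"
print(doc[0:2]._.has_color)     # span "The sky"
print(doc._.has_token('blue'))
```

Getter-based attributes stay in sync with the text, at the cost of being recomputed on each access; an attribute with only a getter cannot be assigned to.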

## Processing large volumes of text

In [22]:
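# BAD: calls nlp once per text, which is slow for many texts
# docs = [nlp(text) for text in LOTS_OF_TEXTS]

# GOOD: nlp.pipe processes the texts as a stream and yields Doc objects
# docs = list(nlp.pipe(LOTS_OF_TEXTS))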
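
The commented-out lines above can be tried on a small input. This is a sketch under assumptions: the `texts` list is a hypothetical stand-in for `LOTS_OF_TEXTS`, `spacy.blank('en')` replaces the loaded model, and `batch_size=2` is chosen only to show the parameter.

```python
import spacy

nlp = spacy.blank('en')  # tokenizer only, so no model download is needed

# Hypothetical stand-in for LOTS_OF_TEXTS
texts = ['First text.', 'Second text.', 'Third text.']

# nlp.pipe returns a generator that yields Doc objects in input order
for doc in nlp.pipe(texts, batch_size=2):
    print(len(doc), doc.text)

# Wrap the generator in list() when all Docs are needed at once
docs = list(nlp.pipe(texts))
```

Iterating over the generator directly avoids holding every Doc in memory, which matters more than the list form when the input is large.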

In [24]:
# Passing in context
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])

This is a text 15
And another text 16


In [27]:
Doc.set_extension('id', default=None, force=True)
Doc.set_extension('page_number', default=None, force=True)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']
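
The loop above sets the attributes but prints nothing. A minimal sketch for checking the result is to collect the Docs and read the values back; `spacy.blank('en')` is an assumption here, standing in for the loaded model.

```python
import spacy
from spacy.tokens import Doc

Doc.set_extension('id', default=None, force=True)
Doc.set_extension('page_number', default=None, force=True)

nlp = spacy.blank('en')  # tokenizer only

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

docs = []
for doc, context in nlp.pipe(data, as_tuples=True):
    # Copy the context values onto the Doc so they travel with it
    doc._.id = context['id']
    doc._.page_number = context['page_number']
    docs.append(doc)

for doc in docs:
    print(doc._.id, doc._.page_number, doc.text)
```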