<a href="https://colab.research.google.com/github/timgluz/colab_notebooks/blob/master/SpacyCourse_chapter3_pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Processing pipelines

A processing pipeline is a series of functions applied to a doc to add attributes such as part-of-speech tags, dependency labels, or named entities.


![alt text](https://imgur.com/2JAJevyl.png)

All models you can load into spaCy include several files and a **meta.json**.

The `meta.json` defines things like the language and the pipeline, which tells spaCy which components to instantiate.

The built-in components that make predictions also need binary data. The data is included in the model package and loaded into the component when you load the model.
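
To make the role of the meta file concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: the JSON content is illustrative, and `make_component` and `toy_doc` are hypothetical names invented for this example. It only shows the idea that the `pipeline` entry lists component names, and that each name becomes a function applied to the doc in order.

```python
import json

# Illustrative meta.json content (assumed values, written for this sketch).
meta_json = """
{
  "lang": "en",
  "name": "core_web_sm",
  "pipeline": ["tagger", "parser", "ner"]
}
"""

meta = json.loads(meta_json)


def make_component(name):
    # Hypothetical factory: returns a function that records its own name on the doc.
    def component(doc):
        doc.setdefault("applied", []).append(name)
        return doc

    return component


# Build (name, function) tuples in the order given by the meta.
pipeline = [(name, make_component(name)) for name in meta["pipeline"]]

# A plain dict stands in for a real Doc object here.
toy_doc = {"text": "This is a sentence"}
for name, component in pipeline:
    toy_doc = component(toy_doc)

print(meta["lang"], [name for name, _ in pipeline])
print(toy_doc["applied"])
```

The sketch leaves out the binary data that real components load from the model package.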




In [None]:
# install spaCy 2.2.x, which matches the v2.2.0 model package and the v2 API used in this notebook
!pip install "spacy>=2.2.0,<2.3.0"

In [None]:
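# install the small English model package (en_core_web_sm 2.2.0) from its release archive
# to use the medium model instead, uncomment the next line and load "en_core_web_md"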
#!python -m spacy download en_core_web_md

!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

In [None]:
import spacy
print(f"spaCy version: {spacy.__version__}")

# load the small model installed above
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence")

### Pipeline attributes

To see the names of the pipeline components present in the current `nlp` object, you can use the `nlp.pipe_names` attribute.

For a list of `(name, component)` tuples, you can use the `nlp.pipeline` attribute.

The component functions are the functions applied to the doc to process it and set attributes – for example, part-of-speech tags or named entities.


`nlp.pipe_names`: list of pipeline component names

`nlp.pipeline`: list of `(name, component)` tuples


In [None]:
print(nlp.pipe_names)
print(nlp.pipeline)  # (name, component) tuples in processing order

### Custom pipeline components

Custom pipeline components let you add your own function to the spaCy pipeline. The function is executed automatically when you call the `nlp` object on a text, for example to modify the doc and add more data to it.

They're especially useful for adding your own custom metadata to documents and tokens.

You can also use them to update built-in attributes, like the named entity spans.

To specify where to add the component in the pipeline, you can use the following keyword arguments:

* Setting `last` to True will add the component last in the pipeline. This is the default behavior.

* Setting `first` to True will add the component first in the pipeline, right after the tokenizer.

* The `before` and `after` arguments let you define the name of an existing component to add the new component before or after. For example, `before="ner"` will add it before the named entity recognizer.
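
The effect of these arguments on component order can be sketched with a plain list of names. This is a toy model, not spaCy's code: `insert_component` is a hypothetical helper written for this note, and it only mimics where a new name would land.

```python
def insert_component(pipe_names, name, first=False, last=False, before=None, after=None):
    """Return a new list of names with `name` inserted at the requested position."""
    # Only one positioning argument may be used at a time.
    if sum([first, last, before is not None, after is not None]) > 1:
        raise ValueError("Use only one of first, last, before, after")
    names = list(pipe_names)  # copy, so the input list is left unchanged
    if first:
        names.insert(0, name)
    elif before is not None:
        names.insert(names.index(before), name)
    elif after is not None:
        names.insert(names.index(after) + 1, name)
    else:
        # No argument given, or last=True: append at the end.
        names.append(name)
    return names


base = ["tagger", "parser", "ner"]
print(insert_component(base, "custom"))                  # ['tagger', 'parser', 'ner', 'custom']
print(insert_component(base, "custom", first=True))      # ['custom', 'tagger', 'parser', 'ner']
print(insert_component(base, "custom", before="ner"))    # ['tagger', 'parser', 'custom', 'ner']
print(insert_component(base, "custom", after="tagger"))  # ['tagger', 'custom', 'parser', 'ner']
```

The tokenizer does not appear in the list because it always runs first and is not part of the component names.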

In [None]:
# function that takes a doc, modifies it and returns it
def custom_component(doc):
    # Do something to the doc here
    print(f"Doc length: {len(doc)}")

    return doc

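# add custom_component at the start of the pipeline
# the guard keeps this cell safe to re-run, because adding
# a component under an existing name raises an error
if "custom_component" not in nlp.pipe_names:
    nlp.add_pipe(custom_component, first=True)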

# print the pipeline component names
print("Pipeline:", nlp.pipe_names)

# process a text
doc = nlp("Hello world!")

#### Complex components example


In [None]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the "ner" component
nlp.add_pipe(animal_component, after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

### Extension attributes

Custom attributes let you add any metadata to docs, tokens and spans. The data can be added once, or it can be computed dynamically.

Custom attributes are available via the `._` (dot underscore) property. This *makes it clear* that they were added by the user and are not built-in attributes such as `token.text`.

Attributes need to be registered on the global `Doc`, `Token` and `Span` classes, which you can import from `spacy.tokens`. You've already worked with those in the previous chapters. To register a custom attribute on a `Doc`, `Token` or `Span`, use the `set_extension` method.

There are three types of extensions:
* attribute extensions
* property extensions
* method extensions
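
Registering the same extension name twice raises an error, which matters when a notebook cell is re-run. The sketch below shows two ways to guard against that, using `has_extension` and the `force` argument of `set_extension`. The name `is_color_demo` and the variables `blank_nlp` and `demo_doc` are hypothetical, chosen so they do not clash with anything else in this notebook. A blank English pipeline is assumed to be enough here, since only tokenization is needed.

```python
import spacy
from spacy.tokens import Token

# Option 1: register only if the name is not taken yet.
if not Token.has_extension("is_color_demo"):
    Token.set_extension("is_color_demo", default=False)

# Option 2: force=True replaces an existing definition instead of raising an error.
Token.set_extension("is_color_demo", default=False, force=True)

# A blank pipeline only tokenizes, so no model package is required.
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("The sky is blue.")

# Overwrite the default on one token; the other tokens keep the default value.
demo_doc[3]._.is_color_demo = True
print(demo_doc[3].text, demo_doc[3]._.is_color_demo)
print(demo_doc[1].text, demo_doc[1]._.is_color_demo)
```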

In [None]:
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension("title", default=None)
Token.set_extension("is_color", default=False)
Span.set_extension("has_color", default=False)

In [None]:
# add custom metadata to documents, tokens and spans

doc._.title = "My document"

token = doc[1]
token._.is_color = True

span = doc[0:3]
span._.has_color = False



In [None]:
# Attribute extensions
# Set extension on the Token with default value
Token.set_extension("is_color2", default=False)

doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color2 = True
print(doc[3]._.is_color2, "-", doc[3].text)

In [None]:
# Property extensions
# Define getter function
def get_is_color(token):
    colors = ["red", "yellow", "blue"]
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension("is_color3", getter=get_is_color)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color3, "-", doc[3].text)

In [None]:
# Property extensions on Span

from spacy.tokens import Span

# Define getter function
def get_has_color4(span):
    colors = ["red", "yellow", "blue"]
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension("has_color4", getter=get_has_color4)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color4, "-", doc[1:4].text)
print(doc[0:2]._.has_color4, "-", doc[0:2].text)

In [None]:
# Method extensions

from spacy.tokens import Doc

# Define method with arguments
def has_token5(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension("has_token5", method=has_token5)

doc = nlp("The sky is blue.")
print(doc._.has_token5("blue"), "- blue")
print(doc._.has_token5("cloud"), "- cloud")

In [None]:
# combine custom extension attributes with the model’s predictions and create
# an attribute getter that returns a Wikipedia search URL if the span is
# a person, organization, or location.

import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")


def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    # (the English models label non-GPE locations as "LOC")
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text


# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)
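
The getter above builds the URL by plain string concatenation, which is fine for names like "David Bowie". As a hedged variation, assuming entity text may also contain characters such as `&`, the query string can be escaped with the standard library. `build_search_url` is a hypothetical helper, not part of spaCy or of the original getter.

```python
from urllib.parse import urlencode


def build_search_url(entity_text):
    # Keep the underscore convention from the getter, then percent-encode the value.
    query = urlencode({"search": entity_text.replace(" ", "_")})
    return "https://en.wikipedia.org/w/index.php?" + query


print(build_search_url("David Bowie"))  # https://en.wikipedia.org/w/index.php?search=David_Bowie
print(build_search_url("AT&T"))         # https://en.wikipedia.org/w/index.php?search=AT%26T
```

Inside the getter, the return line could then call `build_search_url(span.text)`.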