## Processing Pipelines

### Built-in Pipeline Components:

| Name    | Description              | Creates                                           |
| :------ | :----------------------: | :-----------------------------------------------: |
| tagger  | Part-of-speech tagger    | Token.tag                                         |
| parser  | Dependency parser        | Token.dep, Token.head, Doc.sents, Doc.noun_chunks |
| textcat | Text classifier          | Doc.cats                                          |
| ner     | Named entity recognizer  | Doc.ents, Token.ent_iob, Token.ent_type           |

### Pipeline Attributes:

In [2]:
import spacy 
nlp = spacy.load('en_core_web_sm')

In [2]:
print(nlp.pipe_names)

['tagger', 'parser', 'ner']


In [3]:
print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x000001D8150EBE10>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x000001D8191EEFA8>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x000001D819208048>)]
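
As the output shows, `nlp.pipeline` is a list of `(name, component)` pairs and `nlp.pipe_names` lists only the names. The sketch below is a toy, pure-Python model of that idea and not spaCy's implementation: the dict-based document and the `make_doc`, `toy_tagger`, `toy_ner`, and `run` helpers are all hypothetical stand-ins. It only illustrates that each component receives the document, modifies it, and returns it for the next component.

```python
# Toy model of a processing pipeline (NOT spaCy's implementation).
# Every name below is a hypothetical stand-in for illustration.

def make_doc(text):
    # Stand-in for the tokenizer: split on whitespace
    return {"tokens": text.split(), "tags": None, "ents": None}

def toy_tagger(doc):
    # Assign a placeholder tag to every token
    doc["tags"] = ["X"] * len(doc["tokens"])
    return doc

def toy_ner(doc):
    # Treat tokens starting with an uppercase letter as "entities"
    doc["ents"] = [tok for tok in doc["tokens"] if tok[:1].isupper()]
    return doc

# Like nlp.pipeline: an ordered list of (name, component) pairs
pipeline = [("tagger", toy_tagger), ("ner", toy_ner)]

# Like nlp.pipe_names: just the names, in order
pipe_names = [name for name, _ in pipeline]

def run(text):
    # Tokenize first, then pass the doc through each component in order
    doc = make_doc(text)
    for _name, component in pipeline:
        doc = component(doc)
    return doc

doc = run("Hello World again")
print(pipe_names)
print(doc["tags"], doc["ents"])
```

Because every component returns the document it was given, the order of the list decides which annotations are available to later components.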


### Custom Pipeline Components:

In [5]:
# Add a component to the nlp pipeline
import spacy
nlp = spacy.load('en_core_web_sm')

# Define a custom component: a function that takes a Doc and returns it
def custome_component(doc):
    # Print the number of tokens in the doc
    print('Doc length: ', len(doc))
    # Return the doc so the next component can process it
    return doc

# Add the component at the start of the pipeline
nlp.add_pipe(custome_component, first=True)

# Print the pipe names and the (name, component) pairs
print("Pipelines: ", nlp.pipe_names)
print(nlp.pipeline)

Pipelines:  ['custome_component', 'tagger', 'parser', 'ner']
[('custome_component', <function custome_component at 0x000001D81AAB2AE8>), ('tagger', <spacy.pipeline.pipes.Tagger object at 0x000001D81AA2F518>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x000001D81DD6DC48>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x000001D81DD6DCA8>)]


In [6]:
# Process a text; the custom component runs first and prints the doc length
doc = nlp("Hello World!")

Doc length:  3
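
Passing `first=True` placed the component at the front of the pipeline. In spaCy v2, `nlp.add_pipe` also accepts `last`, `before`, and `after` to control placement. The following is a minimal pure-Python sketch of that placement logic, assuming the pipeline is a plain list of `(name, component)` pairs. `toy_add_pipe` and the three small components are hypothetical helpers, not spaCy's code.

```python
# Toy placement logic (NOT spaCy's implementation); all names are hypothetical.

def toy_add_pipe(pipeline, component, name=None,
                 before=None, after=None, first=False, last=False):
    # Default the pipe name to the function name
    name = name or component.__name__
    names = [n for n, _ in pipeline]
    if name in names:
        raise ValueError("'{}' is already in the pipeline".format(name))
    # Only one placement option may be given at a time
    if sum(bool(opt) for opt in (before, after, first, last)) > 1:
        raise ValueError("Use only one of before, after, first, last")
    entry = (name, component)
    if first:
        pipeline.insert(0, entry)
    elif before is not None:
        pipeline.insert(names.index(before), entry)
    elif after is not None:
        pipeline.insert(names.index(after) + 1, entry)
    else:
        # No option (or last=True): append to the end
        pipeline.append(entry)

def noop(doc):
    return doc

def length_printer(doc):
    print("Doc length:", len(doc))
    return doc

def cleanup(doc):
    return doc

pipeline = [("tagger", noop), ("parser", noop), ("ner", noop)]
toy_add_pipe(pipeline, length_printer, first=True)
toy_add_pipe(pipeline, cleanup, after="parser")

names = [n for n, _ in pipeline]
print(names)
```

Placement matters whenever a component depends on annotations set earlier: a component that reads entities, for example, has to come after the one that sets them.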


### Example of Custom Component:

In [7]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label='ANIMAL') for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(animal_component, after='ner')
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


In [10]:
import json
from spacy.lang.en import English
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

with open("countries.json") as f:
    COUNTRIES = json.load(f)

with open("capitals.json") as f:
    CAPITALS = json.load(f)

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", None, *list(nlp.pipe(COUNTRIES)))


def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label='GPE') for match_id, start, end in matches]
    return doc


# Add the component to the pipeline
nlp.add_pipe(countries_component)
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
def get_capital(span):
    return CAPITALS.get(span.text)

# Register the Span extension attribute 'capital' with the getter get_capital
Span.set_extension('capital', getter=get_capital)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

['countries_component']
[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]
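
A getter-based extension is computed each time the attribute is read, rather than being stored on the span. The toy classes below mimic that behaviour in plain Python. `ToySpan`, `ToyUnderscore`, and the two-entry `CAPITALS` dictionary are hypothetical stand-ins and do not reflect how spaCy implements `._` internally.

```python
# Toy model of a getter-based extension attribute (NOT spaCy's implementation).

# Small stand-in for the contents of capitals.json
CAPITALS = {"Czech Republic": "Prague", "Slovakia": "Bratislava"}

class ToyUnderscore:
    # Registry of extension names mapped to getter functions
    getters = {}

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        getters = type(self).getters
        if name not in getters:
            raise AttributeError(name)
        # Call the getter on every access instead of storing a value
        return getters[name](self._obj)

class ToySpan:
    def __init__(self, text):
        self.text = text

    @classmethod
    def set_extension(cls, name, getter):
        ToyUnderscore.getters[name] = getter

    @property
    def _(self):
        return ToyUnderscore(self)

def get_capital(span):
    # dict.get returns None when the country is unknown
    return CAPITALS.get(span.text)

ToySpan.set_extension("capital", getter=get_capital)

known = ToySpan("Slovakia")._.capital
unknown = ToySpan("Atlantis")._.capital
print(known, unknown)
```

Using `dict.get` in the getter means a span with no matching entry yields `None` instead of raising a `KeyError`.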


In [13]:
import json
from spacy.lang.en import English
from spacy.tokens import Doc

with open("bookquotes.json") as f:
    DATA = json.load(f)

nlp = English()

# Register the Doc extension 'author' (default None)
Doc.set_extension('author', default=None)

# Register the Doc extension 'book' (default None)
Doc.set_extension('book', default=None)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context['book']
    doc._.author = context['author']

    # Print the text and custom attribute data
    print(doc.text, "\n", "— '{}' by {}".format(doc._.book, doc._.author), "\n")

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '1984' by George Orwell 

Nowadays people know the price of everything and the value of nothing. 
 — 'The Picture Of Dorian Gray' by Oscar Wilde 
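
With `as_tuples=True`, `nlp.pipe` takes `(text, context)` pairs and yields `(doc, context)` pairs, so metadata travels alongside each text. A minimal pure-Python sketch of that pattern follows. `toy_pipe` and the whitespace-splitting `process` function are hypothetical, and the single record assumes that `bookquotes.json` holds `[text, {"author": ..., "book": ...}]` pairs.

```python
# Toy model of the as_tuples pattern (NOT spaCy's implementation).

def toy_pipe(items, process, as_tuples=False):
    # Lazily process each item, like a streaming pipeline
    for item in items:
        if as_tuples:
            # Unpack the pair, process only the text, pass the context through
            text, context = item
            yield process(text), context
        else:
            yield process(item)

def process(text):
    # Stand-in for the real pipeline: whitespace tokenization
    return text.split()

# Assumed shape of one record in bookquotes.json
DATA = [
    ("It was the best of times, it was the worst of times.",
     {"author": "Charles Dickens", "book": "A Tale of Two Cities"}),
]

results = list(toy_pipe(DATA, process, as_tuples=True))
for tokens, context in results:
    print(len(tokens), "tokens:",
          "'{}' by {}".format(context["book"], context["author"]))
```

Keeping the context next to its text avoids a separate lookup to match each processed document back to its metadata.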

