
# Processing Pipelines

## Inspecting the pipeline

In [97]:
# Create an NLP object
from spacy.lang.en import English
nlp = English()

In [98]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

In [99]:
import spacy

In [100]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [101]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f18b2473da0>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f18af891d68>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f18af891dc8>)]


## Custom pipeline components

### Simple components

In [102]:
# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc

# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

['length_component', 'tagger', 'parser', 'ner']


In [103]:
# Load the small English model and add the component first in the pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(length_component, first=True)

# Process a text
doc = nlp("This is a sentence.")

This document is 5 tokens long.
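
As a rough mental model of what happens when nlp is called on a text, the sketch below mimics a pipeline in plain Python, without spaCy. The names simple_tokenizer, lowercase_component, and run_pipeline are hypothetical and exist only for this illustration; whitespace splitting stands in for spaCy's real tokenizer, and a list of strings stands in for a Doc. It shows only the general idea that each component receives the doc, may modify it, and returns it for the next component.

```python
def simple_tokenizer(text):
    # Assumption: whitespace splitting stands in for spaCy's tokenizer
    return text.split()


def length_component(doc):
    # Same idea as the component above: report the length, return the doc
    print("This document is {} tokens long.".format(len(doc)))
    return doc


def lowercase_component(doc):
    # Hypothetical second component that returns a modified doc
    return [token.lower() for token in doc]


def run_pipeline(text, tokenizer, components):
    # Tokenize first, then call each (name, component) pair in order
    doc = tokenizer(text)
    for name, component in components:
        doc = component(doc)
    return doc


pipeline = [
    ("length_component", length_component),
    ("lowercase_component", lowercase_component),
]
result = run_pipeline("I have a cat", simple_tokenizer, pipeline)
print(result)
```

Because every component returns the doc, the order of the list decides what each component sees, which is why add_pipe offers arguments such as first=True and after='ner'.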


### Complex components

In [104]:
# Load the small English model and add the component first in the pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(length_component, first=True)

# Process a text
doc = nlp("I have a cat and a Golden Retriever")

This document is 8 tokens long.


In [105]:
# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label='ANIMAL')
             for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc

# Add the component to the pipeline after the 'ner' component 
nlp.add_pipe(animal_component, after='ner')
print(nlp.pipe_names)

['length_component', 'tagger', 'parser', 'ner', 'animal_component']


In [106]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [107]:
# Create the match patterns
# 'LOWER' is compared against the lowercased token text, so its value must be lowercase
pattern1 = [{'LOWER': 'golden'}, {'IS_TITLE': True, 'POS': 'NOUN'}]
pattern2 = [{'LOWER': 'cat'}, {'TEXT': '-'}, {'LOWER': 'free'}, {'POS': 'NOUN'}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', None, pattern1)
matcher.add('PATTERN2', None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

In [108]:
# Import spaCy and the PhraseMatcher class
import spacy
from spacy.matcher import PhraseMatcher

# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Create the PhraseMatcher; attr='LOWER' makes matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

# The list of phrases to match
terminology_list = ["Golden Retriever", "Cat"]

# Convert the phrases into Doc objects; nlp.make_doc only tokenizes,
# which is faster than running the full pipeline
patterns = [nlp.make_doc(text) for text in terminology_list]
# Add the patterns to the matcher without a callback
matcher.add("Phrase Matching", None, *patterns)
# Process the input text into a Doc object
doc = nlp("I have a cat and a Golden Retriever")

# Call the matcher on the doc; it returns (match_id, start, end) tuples
matches = matcher(doc)
# Print the details of each match
for match_id, start, end in matches:
    # Get the string representation of the match ID
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

11356100181062323261 Phrase Matching 3 4 cat
11356100181062323261 Phrase Matching 6 8 Golden Retriever


In [109]:
# Import spaCy and the EntityRuler
import spacy
from spacy.pipeline import EntityRuler
# Create a blank English pipeline
nlp = spacy.blank('en')
# Process the input sentence into a Doc object
doc = nlp("I have a cat and a Golden Retriever")
# Print the text and label of each entity in the doc
print([(ent.text, ent.label_) for ent in doc.ents])

[]


We imported the EntityRuler, created a blank English pipeline, and processed the input string into a Doc object. Printing the text and label of each entity in doc.ents returns an empty list, because a blank pipeline contains only the tokenizer and has no entity recognizer.
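
The empty result above can be checked directly. The following is a minimal sketch, assuming spaCy is installed; the variable name blank_nlp is chosen here for illustration. It inspects the blank pipeline's component names and confirms that the text is still tokenized even though no entities are set.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
blank_nlp = spacy.blank("en")

# No component names are registered on a blank pipeline
print(blank_nlp.pipe_names)

# The text is still split into tokens
doc = blank_nlp("I have a cat and a Golden Retriever")
print([token.text for token in doc])

# Nothing in the pipeline sets entities, so doc.ents is empty
print(len(doc.ents))
```

Entities only appear once a component that sets doc.ents, such as the EntityRuler, has been added to the pipeline.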

In [110]:
# Instantiate the EntityRuler
ruler = EntityRuler(nlp)
# Define the pattern
patterns = [{"label": "ANIMAL", "pattern": "Golden Retriever"}]

# Add the pattern to the ruler
ruler.add_patterns(patterns)

# Add the ruler as a new component to the pipeline
nlp.add_pipe(ruler)
# Process the input sentence with the updated pipeline
doc = nlp("I have a cat and a Golden Retriever")
# Print the entities in the sentence after adding the EntityRuler
print([(ent.text, ent.label_) for ent in doc.ents])

[('Golden Retriever', 'ANIMAL')]


In [112]:
# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [113]:
# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['Golden Retriever', 'cat', 'turtle', 'dog'], spaces=[True, True, True, False])

# Create a span for "Golden Retriever" from the doc and assign it the label "Animal"
span = Span(doc, 0, 1, label='Animal')
print(span.text, span.label_)

Golden Retriever Animal


In [114]:
# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Golden Retriever', 'ANIMAL')]
