# spaCy's Pipelines

Based on **Dr. William Mattingly**'s video: https://www.youtube.com/watch?v=dIUTsFT2MeQ&t

and his Jupyter Book: http://spacy.pythonhumanities.com/01_04_pipelines.html

## Standard Pipes (Components and Factories) Available from spaCy

+ spaCy serves as both an NLP framework and a system for constructing and executing intricate pipelines.
+ A pipeline in spaCy is a series of pipes or actors that perform operations on data, either modifying it or extracting relevant information.
+ Some pipes in the pipeline rely on the output generated by preceding pipes, while others can operate independently.

The image below shows an example spaCy pipeline, with its components and the flow of data between them.

<div style="display: flex; justify-content: center">
<img src="images/spacy_03_01.png" alt="spaCy pipeline" width="700" height="275">
</div>

+ In the described pipeline, an input sentence enters from the left.
+ The pipeline consists of two activated pipes:
    + An EntityRuler, which is a rule-based NER component that identifies entities in the sentence.
    + An EntityLinker pipe, which performs toponym resolution to determine the specific entity.
+ After passing through these pipes, the sentence is output with annotations indicating the identified entities.
+ The spaCy feature **doc.ents** can be used to extract the entities from the annotated sentence.
+ In more sophisticated pipelines, spaCy utilizes a Tok2Vec input layer to vectorize the input sentence. This enables machine learning pipes to make predictions based on the vectorized representations.
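
The flow described above can be sketched with a blank English pipeline that contains only an EntityRuler. This is a minimal illustration, not the pipeline from the image: the example sentence and the single "GPE" pattern for "Paris" are assumptions made up for this sketch, and the EntityLinker is left out because it needs a knowledge base.

```python
import spacy

# Start from an empty English pipeline (tokenizer only)
nlp = spacy.blank("en")

# Add a rule-based entity component and give it one illustrative pattern
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "Paris"}])

# The sentence enters the pipeline and comes out annotated
doc = nlp("I visited Paris last year.")

# doc.ents holds the entities found by the pipes
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # [('Paris', 'GPE')]
```

Because the pipeline has no statistical components, only the text covered by the pattern is annotated.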

### Attribute Rulers

**Attribute Rulers** in spaCy are components that allow you to add or modify token attributes based on specific rules. They provide a way to programmatically define and apply rules to tokens in a text. Attribute Rulers can be used to set, overwrite, or extend token attributes such as POS tags, entity labels, dependency labels, and more. They are useful for customizing and fine-tuning token attributes to suit specific NLP tasks or domain-specific requirements.
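
As a hedged sketch of how such a rule might look, the snippet below adds an "attribute_ruler" pipe to a blank English pipeline and sets the tag and POS of every token whose lowercase form is "the". The pattern, the attribute values, and the sentence are assumptions chosen for illustration only.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# One token pattern (matches "the" in any casing) and the attributes to set on it
patterns = [[{"LOWER": "the"}]]
attrs = {"TAG": "DT", "POS": "DET"}
ruler.add(patterns=patterns, attrs=attrs)

doc = nlp("The cat sat on the mat.")

# Only the matched tokens receive the attributes; the blank pipeline has no tagger
for token in doc:
    print(token.text, token.pos_, token.tag_)
```

In a trained pipeline the same mechanism is typically placed after the tagger so that the rules can correct or extend its predictions.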

+ **DependencyParser**: Analyzes the grammatical structure and relationships between words in a sentence.
+ **EntityLinker**: Links recognized entities to knowledge bases or ontologies for further analysis.
+ **EntityRecognizer**: Identifies and classifies named entities in text, such as person names, organizations, or locations.
+ **EntityRuler**: Matches and annotates pre-defined entities or patterns in text.
+ **Lemmatizer**: Converts words to their base or dictionary form (lemmas).
+ **Morphologizer**: Provides morphological analysis of words, including their POS and grammatical features.
+ **SentenceRecognizer**: Recognizes sentence boundaries in text.
+ **Sentencizer**: Splits text into sentences.
+ **SpanCategorizer**: Assigns categories or labels to spans of text.
+ **Tagger**: Assigns POS tags to words.
+ **TextCategorizer**: Classifies text into predefined categories or labels.
+ **Tok2Vec**: Converts text into numeric vectors using token-level representations.
+ **Tokenizer**: Splits text into individual tokens (words, punctuation, etc.).
+ **TrainablePipe**: A trainable processing component that can be customized and fine-tuned for specific tasks.
+ **Transformer**: Applies a transformer-based model for text processing, such as language modeling or text classification.

### Matchers

**Matchers** in spaCy are components that allow you to define rules to match and extract linguistic patterns or structures from text. They are designed to identify and extract specific combinations of tokens, based on criteria such as token attributes, POS tags, dependencies, and more. Matchers can be used to extract entities, noun phrases, verb phrases, or any other specific patterns of interest from the text. They provide a powerful tool for pattern-based information extraction and can be combined with other spaCy components to build complex NLP pipelines. Matchers in spaCy include PhraseMatcher, DependencyMatcher, and the token-based Matcher.

+ **DependencyMatcher**: DependencyMatcher is a matcher in spaCy that allows you to define rules based on dependency relationships between tokens in a sentence. It provides a way to match and extract subtrees or patterns based on the syntactic dependencies of tokens. DependencyMatcher can be useful for extracting specific syntactic structures or capturing relationships between tokens in a sentence.
+ **Matcher**: Matcher is a versatile matcher in spaCy that allows you to define rules based on token attributes, POS tags, dependencies, and more. It provides a way to match and extract specific combinations of tokens based on the defined criteria. Matcher can be used for various tasks such as entity recognition, phrase extraction, or any other pattern-based extraction from the text.
+ **PhraseMatcher**: PhraseMatcher is a matcher in spaCy that allows you to define rules based on sequences of tokens or phrases. It provides a way to match and extract specific phrases or sequences of interest from the text. PhraseMatcher is useful for tasks such as NER, extracting predefined phrases, or capturing specific linguistic patterns in the text.
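
The difference between Matcher and PhraseMatcher can be sketched on a blank English pipeline, which is enough because both rules below rely only on the token text. The sentence, the rule names "TWO_TITLE" and "CITY", and the patterns are assumptions made up for this example. DependencyMatcher is not shown because it needs a pipeline with a dependency parser.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.blank("en")
doc = nlp("Barack Obama visited New York in 2015.")

# Matcher: rules are written per token, here two consecutive title-cased tokens
matcher = Matcher(nlp.vocab)
matcher.add("TWO_TITLE", [[{"IS_TITLE": True}, {"IS_TITLE": True}]])
token_matches = [doc[start:end].text for _, start, end in matcher(doc)]

# PhraseMatcher: rules are whole phrases, given as tokenized Doc objects
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("CITY", [nlp.make_doc("New York")])
phrase_matches = [doc[start:end].text for _, start, end in phrase_matcher(doc)]

print(token_matches)
print(phrase_matches)
```

The token-based rule is more general and also picks up the person name, while the phrase rule finds only the exact phrase it was given.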

## Adding Pipes

+ In most scenarios, utilizing an existing pre-trained spaCy model is sufficient.
+ However, there are cases where an off-the-shelf model may not meet specific requirements or may perform certain tasks slowly.
+ An example is sentence tokenization, particularly when dealing with a lengthy document containing a large number of sentences.
+ Even with the small English model, processing a large number of sentences can be time-consuming.
+ To address this, a blank English model can be created, and only the Sentencizer component needs to be added to it.
+ By doing so, unnecessary computational resources and time are saved since only the Sentencizer component will be activated and perform the sentence tokenization task.
+ This approach can significantly reduce the processing time from potentially hours using the small model to just minutes.

```python
import spacy

# To create a blank model, pass the two-letter language code
nlp = spacy.blank("en")

# Add the sentencizer pipe
nlp.add_pipe("sentencizer")
```

```
<spacy.pipeline.sentencizer.Sentencizer at 0x10f936dc0>
```
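
A quick way to confirm that the blank pipeline now splits sentences is to run it on a short text. The sample text below is an assumption chosen for illustration, and the pipeline setup is repeated so that the snippet runs on its own.

```python
import spacy

# Same setup as above: blank English pipeline with only the sentencizer
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is one sentence. This is another! Is this a third?")

# doc.sents yields one Span per detected sentence
sentences = [sent.text for sent in doc.sents]
print(sentences)
# ['This is one sentence.', 'This is another!', 'Is this a third?']
```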

```python
import requests
from bs4 import BeautifulSoup

# Download the Shakespeare text and flatten it into a single string
s = requests.get("https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt")
soup = BeautifulSoup(s.content).text.replace("-\n", "").replace("\n", " ")
# Raise the character limit so the whole text can be processed in one call
nlp.max_length = 5278439
```

```python
%%time
doc1 = nlp(soup)
print(len(list(doc1.sents)))
```

```
94134
CPU times: user 5.78 s, sys: 64.8 ms, total: 5.85 s
Wall time: 5.84 s
```


```python
nlp2 = spacy.load("en_core_web_sm")
nlp2.max_length = 5278439
```

```python
%%time
doc2 = nlp2(soup)
print(len(list(doc2.sents)))
```

```
85957
CPU times: user 33.8 s, sys: 14 s, total: 47.9 s
Wall time: 56.9 s
```


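The small model takes almost ten times as long: 56.9 s of wall time versus 5.84 s for the blank model with the Sentencizer. The sentence counts also differ (85957 versus 94134) because the small model derives sentence boundaries from its parser rather than from punctuation rules.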

Often we need to find sentences quickly, not necessarily accurately, and in those cases the Sentencizer alone is enough.

## Examining a Pipeline

Examining a pipeline means inspecting its components and the order in which they process the text. It includes understanding how data flows through the pipeline and what each stage produces. By examining the pipeline, one can see how the data is processed, identify which components are active, and make informed decisions for better performance. In spaCy, **nlp.analyze_pipes()** reports each component together with the attributes it assigns and requires.
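
For a first look at which pipes are present and in what order, the names and component objects can be listed directly. The sketch below assumes a blank English pipeline with two pipes added purely for illustration.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

# Names of the pipes, in the order in which they are applied
print(nlp.pipe_names)  # ['sentencizer', 'entity_ruler']

# nlp.pipeline pairs each name with its component object
for name, component in nlp.pipeline:
    print(name, type(component).__name__)
```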

```python
nlp.analyze_pipes()
```

```
{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []},
  'doc.sents': {'assigns': ['sentencizer'], 'requires': []}}}
```

```python
nlp2.analyze_pipes()
```

```
{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att
```

+ The dictionary provides information about the pipeline components and their order.
+ Each key under "summary" represents a pipe in the pipeline, and the corresponding value is a dictionary providing additional details.
+ The value dictionary for each pipe contains, among others, the following keys:
    + **assigns**: Specifies the attributes that the particular pipe assigns to the token and doc as they pass through the pipeline.
    + **scores**: Lists the metrics used to evaluate the machine learning model, such as "tag_acc" for the tagger. The list is empty for pipes that are not scored.
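
The structure described above can be walked programmatically. The following sketch assumes a blank English pipeline with only the sentencizer, so that it runs without a downloaded model, and prints the "assigns" and "scores" entries of each pipe.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# analyze_pipes() returns a plain dictionary
analysis = nlp.analyze_pipes()

# Each key under "summary" is a pipe name; its value describes that pipe
for name, info in analysis["summary"].items():
    print(name)
    print("  assigns:", info["assigns"])
    print("  scores: ", info["scores"])
```

The same loop works on a loaded model such as the small English one, where it lists one entry per pipe.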

```python
nlp3 = spacy.load("en_core_web_md")
nlp3.analyze_pipes()
```

```
{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att
```