# Build a text processing pipeline with spaCy

- Installation
- Download pre-trained Models
- Doc/Token/Span
- Token Extracting / Removing / Transforming (Tokenization, Lemmatization)
- Sentence Segmentation
- Part of Speech Tagging
- Named Entity Recognition
- Dependency Parsing
- Word Vectors
- Sentence Similarity
- Customize pipeline components
- How the pipeline works

# Installation 

- spaCy is compatible with 64-bit CPython 2.7 / 3.5+ 
- Runs on Unix/Linux, macOS/OS X and Windows
- The latest spaCy releases are available over pip and conda

In [1]:
# !pip install -U spacy

In [2]:
# !conda install -c conda-forge spacy

# Download pre-trained Models

The easiest way to download a model is via spaCy’s download command.  

It takes care of finding the best-matching model compatible with your spaCy installation.

In [3]:
# !python -m spacy download en_core_web_sm

spaCy’s models can also be installed as Python packages.

In [4]:
# With external URL
# !pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz

# With local file
# !pip install /Users/you/en_core_web_sm-2.2.5.tar.gz

A model package bundles the following:

- Configuration
- Binary weights
- Lexical entries
- Data files
- Word vectors

    Directory structure
    ├── en_core_web_sm-2.2.5
    │   ├── accuracy.json
    │   ├── meta.json
    │   ├── ner
    │   │   ├── cfg
    │   │   ├── model
    │   │   └── moves
    │   ├── parser
    │   ├── tagger
    │   ├── tokenizer
    │   └── vocab
    │       ├── key2row
    │       ├── lexemes.bin
    │       ├── lookups.bin
    │       ├── strings.json
    │       └── vectors
    ├── __init__.py

# Doc/Token/Span

In [5]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [6]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [7]:
doc = nlp("Hello World!")
[token.text for token in doc]

['Hello', 'World', '!']

In [8]:
first_token = doc[0]
print(first_token.text)

Hello


In [9]:
span = doc[0:2]
[token.text for token in span]

['Hello', 'World']

# Tokenization

### using white space

In [10]:
text = "Rare bird’s detection highlights promise of ‘environmental DNA’"

In [11]:
token_list = text.split()
for token in token_list:
    print(token)

Rare
bird’s
detection
highlights
promise
of
‘environmental
DNA’


### using spaCy

In [12]:
doc = nlp(text)

In [13]:
for token in doc:
    print(token.text)

Rare
bird
’s
detection
highlights
promise
of
‘
environmental
DNA
’
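
The two outputs above differ because spaCy splits punctuation off the words it is attached to, while `str.split()` only cuts on whitespace. A rough approximation of that idea, using only Python's `re` module, might look like the sketch below. The pattern and the `simple_tokenize` name are assumptions made for illustration; spaCy's tokenizer relies on language-specific prefix, suffix, infix and exception rules, so its results differ (for example, this sketch splits ’s into two tokens, whereas spaCy keeps it as one).

```python
import re

# Hypothetical pattern: a token is either a run of word characters
# or a single non-space, non-word character (punctuation, quotes, ...).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

text = "Rare bird’s detection highlights promise of ‘environmental DNA’"
tokens = simple_tokenize(text)
for token in tokens:
    print(token)
```

Unlike whitespace splitting, the quotation marks around "environmental DNA" come out as separate tokens here.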


# Lemmatization

Lemmatization reduces the inflected forms of a word, and sometimes its derivationally related forms, to a common base form. This base form is called a lemma.
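
As a minimal sketch of the idea, a lemmatizer can be thought of as a lookup from surface forms to base forms. The table and the `lookup_lemma` helper below are toy assumptions for illustration, not spaCy's data or implementation; real lemmatizers also use part-of-speech information and rules.

```python
# Toy lookup table (made up for illustration, not spaCy's lemma data).
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be", "been": "be",
    "looks": "look", "looked": "look",
    "has": "have", "bound": "bind",
}

def lookup_lemma(word):
    """Return the base form if the word is in the table, else the word itself."""
    return LEMMA_TABLE.get(word.lower(), word)

lemmas = [lookup_lemma(w) for w in "am are is look looks looked".split()]
print(lemmas)
```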

In [14]:
text = "am are is"
[token.lemma_ for token in nlp(text)]

['be', 'be', 'be']

In [15]:
text = "look looks looked"
doc = nlp(text)
for token in doc:
    print("token:{} -> lemma:{}".format(token.text,token.lemma_ ))

token:look -> lemma:look
token:looks -> lemma:look
token:looked -> lemma:look


In [16]:
text = "Syracuse University has never been bound by convention."

In [17]:
doc = nlp(text)

In [18]:
for token in doc:
    print("token:{} -> lemma:{}".format(token.text,token.lemma_ ))

token:Syracuse -> lemma:Syracuse
token:University -> lemma:University
token:has -> lemma:have
token:never -> lemma:never
token:been -> lemma:be
token:bound -> lemma:bind
token:by -> lemma:by
token:convention -> lemma:convention
token:. -> lemma:.


# Word Frequency

In [19]:
from collections import Counter

In [20]:
text = '''Since 1870, Syracuse University has upheld the idea that “brains and heart shall have a fair chance” at achieving a University degree. We were pioneers in recognizing women’s achievements, the academic validity of the fine arts, the emergence of information studies and the importance of educational opportunities that serve veterans. Our founding values are embedded in 150 years of educational innovation, expansion, discovery and change. See a timeline of events in Syracuse University's 150 years of history.'''
doc = nlp(text)
words = [token.text for token in doc if not token.is_punct]

In [21]:
word_counter = Counter(words)
word_counter.most_common(5)

[('of', 6), ('the', 5), ('University', 3), ('and', 3), ('a', 3)]

# Token Extracting / Removing / Transforming


| Attribute Name     | Type    | Description                                                                                                                        |
|:--------------------:|:---------:|:------------------------------------------------------------------------------------------------------------------------------------:|
| lemma              | int     | Base form of the token, with no inflectional suffixes.                                                                             |
| lemma_             | unicode | Base form of the token, with no inflectional suffixes.                                                                             |
| norm               | int     | The token’s norm, i.e. a normalized form of the token text. Usually set in the language’s tokenizer exceptions or norm exceptions. |
| norm_              | unicode | The token’s norm, i.e. a normalized form of the token text. Usually set in the language’s tokenizer exceptions or norm exceptions. |
| lower              | int     | Lowercase form of the token.                                                                                                       |
| lower_             | unicode | Lowercase form of the token text. Equivalent to Token.text.lower().                                                                |
| shape              | int     | Transform of the token’s string, to show orthographic features. For example, “Xxxx” or “dd”.                                       |
| shape_             | unicode | Transform of the token’s string, to show orthographic features. For example, “Xxxx” or “dd”.                                       |
| prefix             | int     | Hash value of a length-N substring from the start of the token. Defaults to N=1.                                                   |
| prefix_            | unicode | A length-N substring from the start of the token. Defaults to N=1.                                                                 |
| suffix             | int     | Hash value of a length-N substring from the end of the token. Defaults to N=3.                                                     |
| suffix_            | unicode | Length-N substring from the end of the token. Defaults to N=3.                                                                     |
| is_alpha           | bool    | Does the token consist of alphabetic characters? Equivalent to token.text.isalpha().                                               |
| is_ascii           | bool    | Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text).                                   |
| is_digit           | bool    | Does the token consist of digits? Equivalent to token.text.isdigit().                                                              |
| is_lower           | bool    | Is the token in lowercase? Equivalent to token.text.islower().                                                                     |
| is_upper           | bool    | Is the token in uppercase? Equivalent to token.text.isupper().                                                                     |
| is_title           | bool    | Is the token in titlecase? Equivalent to token.text.istitle().                                                                     |
| is_punct           | bool    | Is the token punctuation?                                                                                                          |
| is_left_punct      | bool    | Is the token a left punctuation mark, e.g. (?                                                                                      |
| is_right_punct     | bool    | Is the token a right punctuation mark, e.g. )?                                                                                     |
| is_space           | bool    | Does the token consist of whitespace characters? Equivalent to token.text.isspace().                                               |
| is_bracket         | bool    | Is the token a bracket?                                                                                                            |
| is_quote           | bool    | Is the token a quotation mark?                                                                                                     |
| is_currency        | bool    | Is the token a currency symbol?                                                                                                    |
| like_url           | bool    | Does the token resemble a URL?                                                                                                     |
| like_num           | bool    | Does the token represent a number? e.g. “10.9”, “10”, “ten”, etc.                                                                  |
| like_email         | bool    | Does the token resemble an email address?                                                                                          |
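
Several of the string-valued attributes in the table can be approximated with plain Python string operations. The sketch below is illustrative only: `word_shape` is a hand-written approximation of the shape transform (runs of the same shape character are capped at four), not spaCy's own code, and the prefix and suffix lengths follow the defaults listed in the table.

```python
def word_shape(text):
    """Approximate a token shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = []
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as-is
        # Skip the character if the last four shape characters are already the same.
        if shape[-4:] == [mapped] * 4:
            continue
        shape.append(mapped)
    return "".join(shape)

token_text = "Syracuse"
features = {
    "lower_": token_text.lower(),
    "shape_": word_shape(token_text),
    "prefix_": token_text[:1],    # length-N substring from the start, N=1
    "suffix_": token_text[-3:],   # length-N substring from the end, N=3
    "is_alpha": token_text.isalpha(),
    "is_title": token_text.istitle(),
}
for name, value in features.items():
    print(name, value)
```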

### Extracting

Get all the punctuation tokens

In [22]:
text='''The future opens wide for those who can master our ever-evolving,
data-driven world – come learn that mastery from the experts
at the Syracuse University iSchool!'''

In [23]:
punct_list = []
doc = nlp(text)
for token in doc:
    if token.is_punct:
        punct_list.append(token)
for punct in punct_list:
    print(punct)

-
,
-
–
!


### Removing

Remove the stop words

In [24]:
from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS

In [25]:
len(spacy_stopwords)

326

In [26]:
list(spacy_stopwords)[:8]

['twenty', 'front', "'re", 'n‘t', 'another', 'mostly', 'am', 'through']

In [27]:
doc = nlp("The quick brown fox jumps over the lazy dog")

In [28]:
no_stop_words = [token for token in doc if not token.is_stop]

In [29]:
no_stop_words

[quick, brown, fox, jumps, lazy, dog]

### Transforming

Convert text to lowercase

In [30]:
doc = nlp("The quick brown fox jumps over the lazy dog")

In [31]:
lowercased = [token.lower_ for token in doc]

In [32]:
lowercased

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

# Token Attributes

In [33]:
import pandas as pd

In [34]:
cols = ("text", "lemma_","is_punct", "is_stop", "is_alpha", "is_space", "lower_")

In [35]:
rows = [] 
for t in doc:
    row = [t.text, t.lemma_,  t.is_punct,  t.is_stop,  t.is_alpha,  t.is_space,  t.lower_]
    rows.append(row)
attri_pdf = pd.DataFrame(rows, columns=cols)

In [36]:
attri_pdf

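|   | text  | lemma_ | is_punct | is_stop | is_alpha | is_space | lower_ |
|:-:|:-----:|:------:|:--------:|:-------:|:--------:|:--------:|:------:|
| 0 | The   | the    | False    | True    | True     | False    | the    |
| 1 | quick | quick  | False    | False   | True     | False    | quick  |
| 2 | brown | brown  | False    | False   | True     | False    | brown  |
| 3 | fox   | fox    | False    | False   | True     | False    | fox    |
| 4 | jumps | jump   | False    | False   | True     | False    | jumps  |
| 5 | over  | over   | False    | True    | True     | False    | over   |
| 6 | the   | the    | False    | True    | True     | False    | the    |
| 7 | lazy  | lazy   | False    | False   | True     | False    | lazy   |
| 8 | dog   | dog    | False    | False   | True     | False    | dog    |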


# Sentence Segmentation

In [37]:
text = '''In some cases, eDNA analyses are being used to enforce policy. For example, in 2014, the UK government approved the use of eDNA analysis for detecting the endangered great crested newt in land-use surveys that are required by law.'''

In [38]:
text

'In some cases, eDNA analyses are being used to enforce policy. For example, in 2014, the UK government approved the use of eDNA analysis for detecting the endangered great crested newt in land-use surveys that are required by law.'

In [39]:
doc = nlp(text)

In [40]:
for sent in doc.sents:
    print("start_pos={}, end_pos={}, text:{}".format(sent.start, sent.end, sent.text))

start_pos=0, end_pos=13, text:In some cases, eDNA analyses are being used to enforce policy.
start_pos=13, end_pos=46, text:For example, in 2014, the UK government approved the use of eDNA analysis for detecting the endangered great crested newt in land-use surveys that are required by law.


# Part of Speech Tagging

In [41]:
doc = nlp("The quick brown fox jumps over the lazy dog")

In [42]:
for token in doc:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

The DT DET determiner
quick JJ ADJ adjective
brown JJ ADJ adjective
fox NN NOUN noun, singular or mass
jumps NNS NOUN noun, plural
over IN ADP conjunction, subordinating or preposition
the DT DET determiner
lazy JJ ADJ adjective
dog NN NOUN noun, singular or mass


# Dependency Parsing

In [43]:
doc = nlp("The quick brown fox jumps over the lazy dog")
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

The/DT <--det-- jumps/NNS
quick/JJ <--amod-- jumps/NNS
brown/JJ <--amod-- jumps/NNS
fox/NN <--compound-- jumps/NNS
jumps/NNS <--ROOT-- jumps/NNS
over/IN <--prep-- jumps/NNS
the/DT <--det-- dog/NN
lazy/JJ <--amod-- dog/NN
dog/NN <--pobj-- over/IN


In [44]:
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

# Named Entity Recognition

| TAG | ID | DESCRIPTION                           |
|:-----:|:----:|:---------------------------------------:|
| I   | 1  | Token is inside an entity.            |
| O   | 2  | Token is outside an entity.           |
| B   | 3  | Token begins an entity.               |
|     | 0  | No entity tag is set (missing value). |
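
To make the scheme concrete, the sketch below groups a tagged token sequence into entities: B opens an entity, I continues it, and O (or a missing tag) closes it. The `iob_to_spans` helper and the hand-written tag lists are assumptions for illustration; in spaCy the tags would come from token.ent_iob_ and token.ent_type_.

```python
def iob_to_spans(tokens, iob_tags, ent_types):
    """Group tokens into (text, label) pairs according to their I/O/B tags."""
    spans = []
    current_tokens, current_label = [], None
    for token, iob, label in zip(tokens, iob_tags, ent_types):
        if iob == "B" or (iob == "I" and not current_tokens):
            # A new entity starts here; close the open one first, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif iob == "I":
            current_tokens.append(token)
        else:
            # "O" or a missing tag ends any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans

# Hand-written tags for illustration.
tokens = ["Syracuse", "University", "’s", "150", "years", "of", "impact", "to", "Chicago"]
iob_tags = ["B", "I", "O", "B", "I", "O", "O", "O", "B"]
ent_types = ["ORG", "ORG", "", "DATE", "DATE", "", "", "", "GPE"]

spans = iob_to_spans(tokens, iob_tags, ent_types)
print(spans)
```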

In [45]:
text='''We’re bringing the celebration of Syracuse University’s 150 years of impact to Chicago'''

In [46]:
print(text)

We’re bringing the celebration of Syracuse University’s 150 years of impact to Chicago


In [47]:
doc = nlp(text)

In [48]:
for ent in doc.ents:
    print("{}, [{},{}), {}".format(ent.text, ent.start_char, ent.end_char, ent.label_))
    for token in ent.as_doc():        
        print("    {} {} {}".format(token, token.ent_iob_, token.ent_type_))

Syracuse University, [34,53), ORG
    Syracuse B ORG
    University I ORG
150 years, [56,65), DATE
    150 B DATE
    years I DATE
Chicago, [79,86), GPE
    Chicago B GPE


- __text__ gives the Unicode text representation of the entity.
- __start_char__ denotes the character offset for the start of the entity.
- __end_char__ denotes the character offset for the end of the entity.
- __label___ gives the label of the entity.
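
The character offsets are half-open, so slicing the original string with them recovers each entity's text. The short check below uses only plain Python and the offsets copied from the output above; the `entities` list is typed in by hand rather than produced by a model.

```python
text = "We’re bringing the celebration of Syracuse University’s 150 years of impact to Chicago"

# (start_char, end_char, label) triples copied from the printed output.
entities = [(34, 53, "ORG"), (56, 65, "DATE"), (79, 86, "GPE")]

# end_char is exclusive, so text[start:end] is exactly the entity text.
recovered = [(text[start:end], label) for start, end, label in entities]
print(recovered)
```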

In [49]:
from spacy import displacy
displacy.render(doc, style="ent")

# Word Vectors

In [50]:
# !python -m spacy download en_core_web_lg
# Doc.similarity(), Span.similarity() and Token.similarity()

In [51]:
nlp_lg = spacy.load('en_core_web_lg')

apple = nlp_lg.vocab["apple"]
banana = nlp_lg.vocab["banana"]
car = nlp_lg.vocab["car"]

In [52]:
apple.vector

array([-3.6391e-01,  4.3771e-01, -2.0447e-01, -2.2889e-01, -1.4227e-01,
        2.7396e-01, -1.1435e-02, -1.8578e-01,  3.7361e-01,  7.5339e-01,
       -3.0591e-01,  2.3741e-02, -7.7876e-01, -1.3802e-01,  6.6992e-02,
       -6.4303e-02, -4.0024e-01,  1.5309e+00, -1.3897e-02, -1.5657e-01,
        2.5366e-01,  2.1610e-01, -3.2720e-01,  3.4974e-01, -6.4845e-02,
       -2.9501e-01, -6.3923e-01, -6.2017e-02,  2.4559e-01, -6.9334e-02,
       -3.9967e-01,  3.0925e-02,  4.9033e-01,  6.7524e-01,  1.9481e-01,
        5.1488e-01, -3.1149e-01, -7.9939e-02, -6.2096e-01, -5.3277e-03,
       -1.1264e-01,  8.3528e-02, -7.6947e-03, -1.0788e-01,  1.6628e-01,
        4.2273e-01, -1.9009e-01, -2.9035e-01,  4.5630e-02,  1.0120e-01,
       -4.0855e-01, -3.5000e-01, -3.6175e-01, -4.1396e-01,  5.9485e-01,
       -1.1524e+00,  3.2424e-02,  3.4364e-01, -1.9209e-01,  4.3255e-02,
        4.9227e-02, -5.4258e-01,  9.1275e-01,  2.9576e-01,  2.3658e-02,
       -6.8737e-01, -1.9503e-01, -1.1059e-01, -2.2567e-01,  2.41

In [53]:
apple.similarity(banana)

0.5831844

In [54]:
banana.similarity(apple)

0.5831844

In [55]:
apple.similarity(car)

0.21747091

In [56]:
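# Note: scipy.spatial.distance.cosine returns the cosine distance (1 - similarity),
# so cosine similarity is defined by hand in the next cell.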

In [73]:
import numpy as np
def cosine(x,y):
    return np.dot(x,y) / (np.sqrt(np.dot(x,x)) * np.sqrt(np.dot(y,y)))

In [74]:
man = nlp_lg.vocab["man"].vector
woman = nlp_lg.vocab["woman"].vector
king = nlp_lg.vocab["king"].vector
queen = nlp_lg.vocab["queen"].vector

In [75]:
cosine(king-man+woman,queen)

0.7880845

# Sentence Similarity

By default, Token.vector returns the vector for its underlying Lexeme, while Doc.vector and Span.vector return an average of the vectors of their tokens.
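
One consequence of averaging is that word order is ignored: two texts made of the same words get (up to floating-point rounding) the same vector. The sketch below illustrates this with made-up 3-dimensional vectors and plain Python; `toy_vectors`, `average_vector` and `cosine_similarity` are hypothetical names, and the numbers do not come from any model.

```python
import math

# Made-up word vectors for illustration only.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "fox": [0.8, 0.3, 0.1],
    "jumps": [0.1, 0.9, 0.2],
}

def average_vector(words):
    """Average the toy vectors of the given words, dimension by dimension."""
    dims = len(toy_vectors[words[0]])
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(toy_vectors[word]):
            total[i] += value
    return [value / len(words) for value in total]

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

vec_a = average_vector(["fox", "jumps", "dog"])
vec_b = average_vector(["dog", "jumps", "fox"])
similarity = cosine_similarity(vec_a, vec_b)
print(math.isclose(similarity, 1.0))
```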

In [76]:
doc1 = nlp_lg("The quick brown fox jumps over the lazy dog")
doc2 = nlp_lg("The lazy dog jumps over the quick brown fox")

In [77]:
doc1.similarity(doc2)

1.0

In [78]:
doc1_vec=doc1.vector
doc2_vec=doc2.vector

In [79]:
cosine(doc2_vec,doc1_vec)

1.0

In [81]:
doc3 = nlp("I like snow")
doc4 = nlp("I hate snow")

In [82]:
cosine(doc3.vector,doc4.vector)

0.85214496

# Processing pipelines

<center><img style="text-align:center;" src="img/pipeline.png"></center>

| NAME      | COMPONENT         | CREATES                                             | DESCRIPTION                                      |
|:-----------|:-------------------|:-----------------------------------------------------|:--------------------------------------------------|
| tokenizer | Tokenizer         | Doc                                                 | Segment text into tokens.                        |
| tagger    | Tagger            | Doc[i].tag                                          | Assign part-of-speech tags.                      |
| parser    | DependencyParser  | Doc[i].head, Doc[i].dep, Doc.sents, Doc.noun_chunks | Assign dependency labels.                        |
| ner       | EntityRecognizer  | Doc.ents, Doc[i].ent_iob, Doc[i].ent_type           | Detect and label named entities.                 |
| textcat   | TextCategorizer   | Doc.cats                                            | Assign document labels.                          |
| …         | custom components | Doc._.xxx, Token._.xxx, Span._.xxx                  | Assign custom attributes, methods or properties. |

# How the pipeline works

Calling spacy.load("en_core_web_sm") returns a Language object, usually named nlp. A spaCy model consists of three components:

- the weights, i.e. binary data loaded in from a directory,
- a pipeline of functions called in order, and
- language data like the tokenization rules and annotation scheme.

In [None]:
      # meta.json
      {
        "lang":"en",
        "name":"core_web_sm",
        "pipeline":["tagger", "parser", "ner"],
        ...
      }

In [None]:
    lang = "en"
    pipeline = ["tagger", "parser", "ner"]
    data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"

    cls = spacy.util.get_lang_class(lang)   # 1. Get Language class, e.g. English
    nlp = cls()                             # 2. Initialize it
    for name in pipeline:
        component = nlp.create_pipe(name)   # 3. Create the pipeline components
        nlp.add_pipe(component)             # 4. Add the component to the pipeline
    nlp.from_disk(data_path)                # 5. Load in the binary data

In [None]:
class Language(object):
    # Simplified excerpt of the Language class
    def __init__(
        self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs
    ):
        self.vocab = vocab
        self.tokenizer = make_doc
        self.pipeline = []
        self.max_length = max_length
        self._optimizer = None

    def make_doc(self, text):
        return self.tokenizer(text)

    @property
    def pipe_names(self):
        # Accessed as an attribute (self.pipe_names) in add_pipe
        return [pipe_name for pipe_name, _ in self.pipeline]

In [None]:
    def add_pipe(
        self, component, name=None, before=None, after=None, first=None, last=None
    ):
        pipe = (name, component)
        if last or not any([first, before, after]):
            self.pipeline.append(pipe)
        elif first:
            self.pipeline.insert(0, pipe)
        elif before and before in self.pipe_names:
            self.pipeline.insert(self.pipe_names.index(before), pipe)
        elif after and after in self.pipe_names:
            self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
        else:
            raise ValueError(
                Errors.E001.format(name=before or after, opts=self.pipe_names)
            )

In [None]:
    def __call__(self, text, disable=[], component_cfg=None):
        doc = self.make_doc(text)
        if component_cfg is None:
            component_cfg = {}  # avoid calling .get() on None below
        for name, proc in self.pipeline:
            if name in disable:
                continue
            if not hasattr(proc, "__call__"):
                raise ValueError(Errors.E003.format(component=type(proc), name=name))
            doc = proc(doc, **component_cfg.get(name, {}))
            if doc is None:
                raise ValueError(Errors.E005.format(name=name))
        return doc
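
The excerpts above depend on spaCy internals, so they do not run on their own. The toy class below is a self-contained sketch of the same mechanics: components are stored as (name, component) pairs, `add_pipe` decides where a component goes, and calling the object runs the tokenizer followed by every enabled component in order. `MiniPipeline`, `lowercase` and `drop_short` are hypothetical names, and the "doc" here is just a list of strings rather than a spaCy Doc.

```python
class MiniPipeline:
    """Toy stand-in for a pipeline: a tokenizer plus ordered components."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.pipeline = []  # list of (name, component) pairs

    @property
    def pipe_names(self):
        return [name for name, _ in self.pipeline]

    def add_pipe(self, component, name=None, first=False, before=None, after=None):
        pipe = (name or component.__name__, component)
        if first:
            self.pipeline.insert(0, pipe)
        elif before is not None:
            self.pipeline.insert(self.pipe_names.index(before), pipe)
        elif after is not None:
            self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
        else:
            self.pipeline.append(pipe)  # default: add last

    def __call__(self, text, disable=()):
        doc = self.tokenizer(text)
        for name, component in self.pipeline:
            if name in disable:
                continue
            doc = component(doc)  # each component receives and returns the doc
        return doc


def lowercase(doc):
    return [token.lower() for token in doc]

def drop_short(doc):
    return [token for token in doc if len(token) > 2]

mini = MiniPipeline(tokenizer=str.split)
mini.add_pipe(drop_short)
mini.add_pipe(lowercase, first=True)

print(mini.pipe_names)
print(mini("This is a sentence"))
print(mini("This is a sentence", disable=["drop_short"]))
```

Passing a name in `disable` skips that component for a single call without removing it from the pipeline.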


In [85]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [86]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f3d3c46c2e8>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f3d06ae4be8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f3d06ae4c48>)]

## Disabling and modifying pipeline components

In [87]:
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])

In [None]:
nlp.remove_pipe("parser")
nlp.rename_pipe("ner", "entityrecognizer")
nlp.replace_pipe("tagger", my_custom_tagger)

In [None]:
def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)

# Customize pipeline components

A component receives a Doc object and can modify it. By adding a component to the pipeline, you’ll get access to the Doc at any point during processing – instead of only being able to modify it afterwards.

| ARGUMENT | TYPE | DESCRIPTION                                          |
|----------|------|------------------------------------------------------|
| doc      | Doc  | The Doc object processed by the previous component.  |
| RETURNS  | Doc  | The Doc object processed by this pipeline component. |

| ARGUMENT | TYPE    | DESCRIPTION                                                        |
|----------|---------|--------------------------------------------------------------------|
| last     | bool    | If set to True, component is added last in the pipeline (default). |
| first    | bool    | If set to True, component is added first in the pipeline.          |
| before   | unicode | String name of component to add the new component before.          |
| after    | unicode | String name of component to add the new component after.           |

In [89]:
import spacy
from spacy.tokens import Doc

def remove_stopwords(doc):
    print("Before stopwords_removal, this doc is: {}".format(doc))
    space_list = [t.whitespace_  for t in doc if not t.is_stop]
    new_doc = Doc(doc.vocab,
              words=[t.orth_ for t in doc if not t.is_stop],
              spaces=space_list
              )
    return new_doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(remove_stopwords, name="stopwords_removal", first=True)
print(nlp.pipe_names)  # ['stopwords_removal', 'tagger', 'parser', 'ner']
doc = nlp("This is a sentence.")
print("After stopwords_removal, this doc is: {}".format(doc))

['stopwords_removal', 'tagger', 'parser', 'ner']
Before stopwords_removal, this doc is: This is a sentence.
After stopwords_removal, this doc is: sentence.


In [90]:
import spacy
from spacy.tokens import Doc

def remove_stopwords(doc):
    print("Before stopwords_removal, this doc is: {}".format(doc))
    space_list = [t.whitespace_  for t in doc if not t.is_stop]
    new_doc = Doc(doc.vocab,
              words=[t.orth_ for t in doc if not t.is_stop],
              spaces=space_list
              )
    return new_doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(remove_stopwords, name="stopwords_removal", before='tagger')
print(nlp.pipe_names)  # ['stopwords_removal', 'tagger', 'parser', 'ner']
doc = nlp("This is a sentence.")
print("After stopwords_removal, this doc is: {}".format(doc))

['stopwords_removal', 'tagger', 'parser', 'ner']
Before stopwords_removal, this doc is: This is a sentence.
After stopwords_removal, this doc is: sentence.


In [91]:
import spacy
from spacy.tokens import Doc

def remove_stopwords(doc):
    print("Before stopwords_removal, this doc is: {}".format(doc))
    space_list = [t.whitespace_  for t in doc if not t.is_stop]
    new_doc = Doc(doc.vocab,
              words=[t.orth_ for t in doc if not t.is_stop],
              spaces=space_list
              )
    return new_doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(remove_stopwords, name="stopwords_removal", after='ner')
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner', 'stopwords_removal']
doc = nlp("This is a sentence.")
print("After stopwords_removal, this doc is: {}".format(doc))

['tagger', 'parser', 'ner', 'stopwords_removal']
Before stopwords_removal, this doc is: This is a sentence.
After stopwords_removal, this doc is: sentence.


# Extension attributes

# Attribute extensions

In [None]:
 
Doc.set_extension("hello", default=True)
assert doc._.hello
doc._.hello = False

In [None]:
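from typing import Dict

def recovery_xref_inner_text(doc):
    # Assumption: doc._.xref_mapping maps token positions to rid values,
    # and a Token extension named "rid" has already been registered.
    xref_mapping: Dict = doc._.xref_mapping
    for in_pos, rid in xref_mapping.items():
        doc[in_pos]._.rid = rid
    return doc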

# Property extensions

In [None]:
Doc.set_extension("hello", getter=get_hello_value, setter=set_hello_value)
assert doc._.hello
doc._.hello = "Hi!"

# Method extensions

In [None]:
Doc.set_extension("hello", method=lambda doc, name: "Hi {}!".format(name))
assert doc._.hello("Bob") == "Hi Bob!"