# spaCy



## 1. Introduction

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and currently offers statistical neural network models for English, German, Greek, Spanish, Portuguese, French, Italian, Dutch, Lithuanian, Norwegian and multi-language NER, as well as tokenization for various other languages.

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

## 2. Features

### 2.1 Tokenization

Tokenization is the process of segmenting text into words, punctuation marks, and similar units.

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

sample = "Texas A&M's College Station campus, one of the largest in America, spans 5,200 acres \
plus 350 acres for Research Park. The university is part of the Bryan-College Station \
metropolitan area located within Brazos County in the Brazos Valley (Southeast Central Texas) region, \
an area often referred to as \"Aggieland\". According to the U.S. Census Bureau, as of 2008, the \
population of Brazos County is estimated at 175,122."

doc = nlp(sample)
sample_tokenized = []
for token in doc:
    sample_tokenized.append(token.text)
print(sample_tokenized)

['Texas', 'A&M', "'s", 'College', 'Station', 'campus', ',', 'one', 'of', 'the', 'largest', 'in', 'America', ',', 'spans', '5,200', 'acres', 'plus', '350', 'acres', 'for', 'Research', 'Park', '.', 'The', 'university', 'is', 'part', 'of', 'the', 'Bryan', '-', 'College', 'Station', 'metropolitan', 'area', 'located', 'within', 'Brazos', 'County', 'in', 'the', 'Brazos', 'Valley', '(', 'Southeast', 'Central', 'Texas', ')', 'region', ',', 'an', 'area', 'often', 'referred', 'to', 'as', '"', 'Aggieland', '"', '.', 'According', 'to', 'the', 'U.S.', 'Census', 'Bureau', ',', 'as', 'of', '2008', ',', 'the', 'population', 'of', 'Brazos', 'County', 'is', 'estimated', 'at', '175,122', '.']


spaCy’s tokenization is non-destructive: after processing, the Doc object keeps all information about the original text, including whitespace, so the input can be reconstructed exactly.
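
The non-destructive property can be checked directly. The sketch below is a minimal example, assuming spaCy is installed; it uses `spacy.blank("en")` (tokenizer only, no trained model) and a shortened sentence chosen for illustration rather than the full sample above.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer,
# so no trained model has to be downloaded for this check.
nlp = spacy.blank("en")

text = "Texas A&M's campus spans 5,200 acres plus 350 acres."
doc = nlp(text)

# text_with_ws is the token text plus any trailing whitespace,
# so joining all tokens should give back the original string.
reconstructed = "".join(token.text_with_ws for token in doc)
print(reconstructed == text)

# token.idx is the character offset of the token in the original text.
offsets_ok = all(text[t.idx:t.idx + len(t.text)] == t.text for t in doc)
print(offsets_ok)
```

Because every token records its offset and trailing whitespace, results such as entities or sentences can always be mapped back to character positions in the input.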

### 2.2 Sentence Boundary Detection (SBD)

Sentence boundary detection is the process of finding sentence boundaries and segmenting text into individual sentences.

In [3]:
for sent in doc.sents:
    print(sent)

Texas A&M's College Station campus, one of the largest in America, spans 5,200 acres plus 350 acres for Research Park.
The university is part of the Bryan-College Station metropolitan area located within Brazos County in the Brazos Valley (Southeast Central Texas) region, an area often referred to as "Aggieland".
According to the U.S. Census Bureau, as of 2008, the population of Brazos County is estimated at 175,122.
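
In the cell above, `doc.sents` depends on sentence boundaries set by the loaded pipeline (for `en_core_web_sm`, this is typically the dependency parser). When no trained model is available, a rule-based alternative may be enough. The sketch below assumes spaCy v3, where the built-in `sentencizer` component can be added by name; the two-sentence text is a shortened illustration, not the original sample.

```python
import spacy

# Assumption: spaCy v3, where built-in components are added by string name.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # splits on sentence-final punctuation

text = "The campus spans 5,200 acres. The university is part of the metropolitan area."
doc = nlp(text)

sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

The rule-based approach is faster but less robust than parser-based segmentation, so it is best suited to text with regular punctuation.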


### 2.3 Part-of-speech (POS) Tagging

POS tagging assigns a word type, such as verb or noun, to each token.

In [4]:
for token in doc:
    print(token.text, token.pos_, token.tag_, token.shape_, token.is_alpha, token.is_stop)

Texas PROPN NNP Xxxxx True False
A&M PROPN NNP X&X False False
's PART POS 'x False False
College PROPN NNP Xxxxx True False
Station PROPN NNP Xxxxx True False
campus NOUN NN xxxx True False
, PUNCT , , False False
one NUM CD xxx True True
of ADP IN xx True True
the DET DT xxx True True
largest ADJ JJS xxxx True False
in ADP IN xx True True
America PROPN NNP Xxxxx True False
, PUNCT , , False False
spans VERB VBZ xxxx True False
5,200 NUM CD d,ddd False False
acres NOUN NNS xxxx True False
plus CCONJ CC xxxx True False
350 NUM CD ddd False False
acres NOUN NNS xxxx True False
for ADP IN xxx True True
Research PROPN NNP Xxxxx True False
Park PROPN NNP Xxxx True False
. PUNCT . . False False
The DET DT Xxx True False
university NOUN NN xxxx True False
is VERB VBZ xx True True
part NOUN NN xxxx True True
of ADP IN xx True True
the DET DT xxx True True
Bryan PROPN NNP Xxxxx True False
- PUNCT HYPH - False False
College PROPN NNP Xxxxx True False
Station PROPN NNP Xxxxx True False
metropoli

The attributes of the token object represent the following:

* **text**: The original word text.
* **pos_**: The coarse-grained part-of-speech tag.
* **tag_**: The fine-grained part-of-speech tag.
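* **shape_**: A transform of the token’s string that shows orthographic features, for example `Xxxxx` for a capitalized word or `d,ddd` for a number such as 5,200.
* **is_alpha**: Whether the token consists only of alphabetic characters.
* **is_stop**: Whether the token is a stop word, that is, one of the most common words of the language.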
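
To make the `shape_` values in the output above easier to read, here is a plain-Python approximation of the transform. It is a sketch written for illustration, not spaCy's implementation, and the function name `word_shape` is chosen here; the rule it assumes is that uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, other characters stay as they are, and runs of the same shape character are capped at four.

```python
def word_shape(text: str) -> str:
    """Approximate the orthographic shape of a token string."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            shape_char = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            shape_char = "d"
        else:
            shape_char = ch  # punctuation and symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= 4:  # cap repeated shape characters at four
            shape.append(shape_char)
    return "".join(shape)


for word in ["Texas", "A&M", "'s", "university", "5,200"]:
    print(word, word_shape(word), word.isalpha())
```

The cap on repeated characters explains why both `campus` and `university` appear as `xxxx` in the output above, and why `A&M` is not alphabetic: the `&` is not a letter.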

### 2.4 Named Entity Recognition (NER)

NER is the process of labelling named “real-world” objects, like persons, companies or locations.

In [5]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Texas A&M's ORG
College Station ORG
one CARDINAL
America GPE
5,200 acres plus 350 acres QUANTITY
Research Park FAC
the Bryan-College Station ORG
Brazos County GPE
the Brazos Valley LOC
Southeast Central Texas LOC
Aggieland PERSON
the U.S. Census Bureau ORG
2008 DATE
Brazos County GPE
175,122 CARDINAL
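
The label abbreviations are not always self-explanatory. As a small sketch, assuming spaCy is installed, `spacy.explain` can look up a short description for each label that appears in the output above; it needs no trained model.

```python
import spacy

# Labels copied from the NER output above.
labels = ["ORG", "GPE", "LOC", "FAC", "PERSON", "QUANTITY", "CARDINAL", "DATE"]

# spacy.explain returns a short description string for known labels.
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:<9} {description}")
```

Note that the predictions come from a statistical model and can be wrong: in the output above, "Aggieland" is labelled `PERSON` although the text describes it as an area.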


### 2.5 Lemmatization

Lemmatization assigns each word its base form (lemma). For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.

In [6]:
sample_lemma = []
for token in doc:
    sample_lemma.append([token.text, token.lemma_])
print(sample_lemma)

[['Texas', 'texas'], ['A&M', 'a&m'], ["'s", "'s"], ['College', 'college'], ['Station', 'station'], ['campus', 'campus'], [',', ','], ['one', 'one'], ['of', 'of'], ['the', 'the'], ['largest', 'large'], ['in', 'in'], ['America', 'america'], [',', ','], ['spans', 'span'], ['5,200', '5,200'], ['acres', 'acre'], ['plus', 'plus'], ['350', '350'], ['acres', 'acre'], ['for', 'for'], ['Research', 'research'], ['Park', 'park'], ['.', '.'], ['The', 'the'], ['university', 'university'], ['is', 'be'], ['part', 'part'], ['of', 'of'], ['the', 'the'], ['Bryan', 'bryan'], ['-', '-'], ['College', 'college'], ['Station', 'station'], ['metropolitan', 'metropolitan'], ['area', 'area'], ['located', 'locate'], ['within', 'within'], ['Brazos', 'brazos'], ['County', 'county'], ['in', 'in'], ['the', 'the'], ['Brazos', 'brazos'], ['Valley', 'valley'], ['(', '('], ['Southeast', 'southeast'], ['Central', 'central'], ['Texas', 'texas'], [')', ')'], ['region', 'region'], [',', ','], ['an', 'an'], ['area', 'area'], [

### 2.6 Chunking

Chunking is the process of extracting noun phrases from the text.

In [7]:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

Texas A&M's College Station campus campus nsubj spans
America America pobj in
5,200 acres acres dobj spans
350 acres acres conj acres
Research Park Park pobj for
The university university nsubj is
part part attr is
the Bryan-College Station metropolitan area area pobj of
Brazos County County pobj within
the Brazos Valley (Southeast Central Texas) region region pobj in
an area area nsubj referred
Aggieland Aggieland pobj as
the U.S. Census Bureau Bureau pobj to
the population population nsubjpass estimated
Brazos County County pobj of


The attributes of each noun chunk (a Span object) represent the following:

* **text**: The original noun chunk text.
* **root.text**: The original text of the word connecting the noun chunk to the rest of the parse.
* **root.dep_**: Dependency relation connecting the root to its head.
* **root.head.text**: The text of the root token’s head.

### 2.7 Dependency Parsing

Dependency parsing is the process of assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.

In [8]:
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_, [child for child in token.children])

Texas compound A&M PROPN []
A&M poss campus NOUN [Texas, 's]
's case A&M PROPN []
College compound Station PROPN []
Station compound campus NOUN [College]
campus nsubj spans VERB [A&M, Station, ,, one, ,]
, punct campus NOUN []
one appos campus NOUN [of]
of prep one NUM [largest]
the det largest ADJ []
largest pobj of ADP [the, in]
in prep largest ADJ [America]
America pobj in ADP []
, punct campus NOUN []
spans ROOT spans VERB [campus, acres, .]
5,200 nummod acres NOUN []
acres dobj spans VERB [5,200, plus, acres]
plus cc acres NOUN []
350 nummod acres NOUN []
acres conj acres NOUN [350, for]
for prep acres NOUN [Park]
Research compound Park PROPN []
Park pobj for ADP [Research]
. punct spans VERB []
The det university NOUN []
university nsubj is VERB [The]
is ccomp referred VERB [university, part]
part attr is VERB [of]
of prep part NOUN [area]
the det area NOUN []
Bryan compound College PROPN []
- punct College PROPN []
College nmod Station PROPN [Bryan, -]
Station nmod area NOUN [C

The attributes of the token object represent the following:

* **text**: The original word text.
* **dep_**: The syntactic relation connecting child to head.
* **head.text**: The original text of the token head.
* **head.pos_**: The part-of-speech tag of the token head.
* **children**: The immediate syntactic dependents of the token.
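
The head and children relations describe a tree. To make that concrete, the sketch below re-encodes a few rows of the output above as plain tuples and walks them without spaCy. The tuple layout and the helper names `children` and `path_to_root` are assumptions made for this illustration; the indices are positions in the list, not spaCy token indices.

```python
# (text, dep, head_index) rows copied by hand from the output above.
# Only a subset of the first sentence is included.
rows = [
    ("Texas", "compound", 1),   # head: A&M
    ("A&M", "poss", 3),         # head: campus
    ("'s", "case", 1),          # head: A&M
    ("campus", "nsubj", 4),     # head: spans
    ("spans", "ROOT", 4),       # the root token is its own head
]


def children(i):
    """Return the texts of all tokens whose head is token i."""
    return [text for j, (text, _, head) in enumerate(rows) if head == i and j != i]


def path_to_root(i):
    """Follow head pointers from token i until the root is reached."""
    path = [rows[i][0]]
    while rows[i][2] != i:
        i = rows[i][2]
        path.append(rows[i][0])
    return path


print(children(1))
print(path_to_root(0))
```

Following head pointers upward always ends at the token labelled `ROOT`, which is why the output above shows `spans` with itself as head.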

spaCy also provides a way to visualize the dependency parse, using its built-in visualizer, displaCy.

In [9]:
from spacy import displacy

visual_sample = "Texas A&M University is a public research university in College Station, Texas."
visual_doc = nlp(visual_sample)

displacy.render(visual_doc, style='dep', jupyter=True, options={'distance': 125})

### 2.8 Word Vectors Similarity

Similarity is determined by comparing the word vectors of words, text spans, and documents to see how similar they are to each other.

In [11]:
# The large model ships with word vectors, which the small model loaded earlier lacks.
nlp = spacy.load('en_core_web_lg')

tokens = nlp('phone computer laptop tablet iPad keyboard cat')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

phone phone 1.0
phone computer 0.43940938
phone laptop 0.4505272
phone tablet 0.36992612
phone iPad 0.38192368
phone keyboard 0.39879575
phone cat 0.20879981
computer phone 0.43940938
computer computer 1.0
computer laptop 0.677216
computer tablet 0.39069566
computer iPad 0.36855808
computer keyboard 0.54257494
computer cat 0.25593635
laptop phone 0.4505272
laptop computer 0.677216
laptop laptop 1.0
laptop tablet 0.51973575
laptop iPad 0.49825644
laptop keyboard 0.5842703
laptop cat 0.2226368
tablet phone 0.36992612
tablet computer 0.39069566
tablet laptop 0.51973575
tablet tablet 1.0
tablet iPad 0.52852917
tablet keyboard 0.44440588
tablet cat 0.16365997
iPad phone 0.38192368
iPad computer 0.36855808
iPad laptop 0.49825644
iPad tablet 0.52852917
iPad iPad 1.0
iPad keyboard 0.4158086
iPad cat 0.17459533
keyboard phone 0.39879575
keyboard computer 0.54257494
keyboard laptop 0.5842703
keyboard tablet 0.44440588
keyboard iPad 0.4158086
keyboard keyboard 1.0
keyboard cat 0.19105299
cat phon

Doc, Span, and Token objects each provide a `.similarity` method. By default it computes the cosine similarity of the objects' vectors, which can in principle range from -1 to 1; all scores shown in the output above fall between 0 and 1. More closely related objects get a higher score, and less related objects get a lower one.
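
For readers unfamiliar with cosine similarity, the following plain-Python sketch shows the calculation on small hand-written vectors. The function name and the vectors are chosen for illustration; real word vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])      # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])    # opposite direction

print(parallel, orthogonal, opposite)
```

The score depends only on the direction of the vectors, not on their length, so a score near 1 means the two vectors point the same way.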

Besides words, we can also compute the similarity between two sentences.

In [15]:
doc1 = nlp("Texas A&M University is a public research university in College Station, Texas.")
doc2 = nlp("William Marsh Rice University, commonly known as Rice University, is a private research university in Houston, Texas. ")
doc3 = nlp("Duke University is a private research university in Durham, North Carolina.")
doc4 = nlp("iPhone is the best phone in the world.")

print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
print(doc1.similarity(doc4))
print(doc2.similarity(doc3))
print(doc2.similarity(doc4))
print(doc3.similarity(doc4))

0.9149662098845188
0.9394194747654675
0.7161376005043467
0.9133130517126389
0.7099489517609865
0.687619863757093


## 3. Summary

spaCy is a capable NLP library with a large number of features, designed especially for production use. It offers strong performance, efficiency, and accuracy, with state-of-the-art algorithms for NLP tasks. This spotlight shows only a small part of spaCy's full potential; there is much more to discover in the library.