**spaCy** is a free and open source library for doing advanced Natural Language Processing (NLP).

- Supports more than 20 languages.
- Provides ready-made statistical models for many languages.
- Interoperates with popular machine learning libraries such as PyTorch, TensorFlow, and scikit-learn.
- Runs cross-platform.
- Offers a syntactic parser that its developers have described as the fastest in the world.

**Features:**

- spaCy is written in Python and Cython.
- Unlike NLTK, which is oriented more toward teaching and research, spaCy is built for performance and is hence more common in production environments.
- Helps to process and analyze large collections of text data.

The features offered by spaCy include:

 - Tokenization
 - Part-of-speech (POS) tagging
 - Lemmatization
 - Sentence Boundary Detection (SBD)
 - Named Entity Recognition (NER)
 - Similarity
    
Note: Stemming and Lemmatization are Normalization techniques in NLP.

In [1]:
#! python -m spacy download en 
#! python -m spacy download fr

In [2]:
import spacy
nlp = spacy.load('en')

In [3]:
docx = nlp(u'SpaCy is an amazing tool like NLTK')
docx

SpaCy is an amazing tool like NLTK

In [4]:
tokens = [token.text for token in docx]
tokens

['SpaCy', 'is', 'an', 'amazing', 'tool', 'like', 'NLTK']

**Tokenization**

In spaCy, tokens are represented by objects of the Token class. spaCy can tokenize the data into individual tokens based on the rules of the language so that they can be analyzed more efficiently.

Although tokenization may at first seem as simple as splitting sentences on spaces, it can be a much more complicated task that is closely tied to the language in question.

For example, in "What's your name?", the word "What's" is split into the two tokens "What" and "'s" rather than kept as the single token "What's".
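The example above can be checked with a short sketch. It assumes spaCy is installed and uses `spacy.blank("en")`, which builds an English pipeline containing only the rule-based tokenizer, so no trained model is required. The variable name `nlp_blank` is chosen here only to avoid clashing with the `nlp` object used in the rest of this notebook.

```python
import spacy

# A blank English pipeline: tokenizer only, no tagger, parser, or NER.
nlp_blank = spacy.blank("en")

doc = nlp_blank("What's your name?")
tokens = [token.text for token in doc]
print(tokens)
```

The contraction and the question mark each come out as separate tokens, which a plain split on spaces would not achieve.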

spaCy stores the strings associated with the tokens in **hashed form** to save space.

**Sentence Boundary Detection**

The input text data often needs to be split into sentences for NLP. The process of identifying parts of the input data that can be classified as sentences is known as sentence boundary detection.

This task can be complicated due to the presence of dots as part of names, web addresses, etc.

For example,

The author, H. G. Wells, wrote a lot of other books.

A naive detection algorithm may be fooled into thinking that each dot ends a sentence, which is not the case.

spaCy intelligently performs sentence boundary detection considering such details.
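To see why the naive approach fails, here is a minimal plain-Python sketch that splits on every dot. It does not use spaCy at all, and the second sentence in the sample text is an assumption added only so that there is a real sentence boundary to find.

```python
# Sample text: the first sentence is the example above; the second
# sentence is a hypothetical addition for illustration.
text = "The author, H. G. Wells, wrote a lot of other books. He lived in England."

# Naive rule: every dot ends a sentence.
naive_sentences = [part.strip() for part in text.split(".") if part.strip()]

for sentence in naive_sentences:
    print(repr(sentence))
```

The text contains two sentences, but the naive rule produces four fragments because the dots in the initials "H." and "G." are treated as sentence ends.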

In [5]:
def mycustom_boundary(docx):
    # Mark the token that follows each '---' as the start of a new sentence.
    for token in docx[:-1]:
        if token.text == '---':
            docx[token.i+1].is_sent_start = True
    return docx

In [6]:
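# Adding the rule before parsing
# (spaCy v2 API: v3 expects the function to be registered with
# @Language.component and added by its registered name instead)
nlp.add_pipe(mycustom_boundary, before='parser')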

In [7]:
mysentence = nlp(u"Hello world---NLP---Spacy")

In [8]:
for sentence in mysentence.sents:
    print(sentence)

Hello world---
NLP---
Spacy


Note: In spaCy v2, **from spacy.pipeline import SentenceSegmenter** offers another way to define custom sentence boundaries.

### Named Entity Recognition

A **named entity** is a real-world object that is given a name. For example, a person who is given a name like 'John' or a book with the title 'One Hundred Years of Solitude'.

Named entity recognition (NER) is the process of identifying the parts of the input text that refer to named entities and classifying them into a set of pre-defined categories.

spaCy is capable of performing NER with the assistance of statistical models.

spaCy has some pre-trained models, but they may need to be fine-tuned to fit a specific use case.
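The sketch below shows how entities are represented on a Doc, without running any statistical model. The entity spans and labels are assigned by hand, so this is an assumption-laden illustration of the data structure rather than a demonstration of spaCy's predictions. It assumes spaCy is installed; `nlp_blank` is a name chosen here to keep it separate from the notebook's `nlp`.

```python
import spacy
from spacy.tokens import Span

# Tokenizer-only pipeline; nothing is predicted here.
nlp_blank = spacy.blank("en")
doc = nlp_blank("Leo Tolstoy wrote War and Peace")

# Hand-written annotations (token offsets and labels are our own choice).
person = Span(doc, 0, 2, label="PERSON")
work = Span(doc, 3, 6, label="WORK_OF_ART")
doc.ents = [person, work]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

A trained pipeline fills `doc.ents` in the same form, which is why the later cells in this notebook can read `ent.text` and `ent.label_` from it.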

**Similarity**

spaCy can compare two objects and predict how similar they are. This ability is highly useful in identifying duplicate entries and in finding recommendations.

Every Token object has a similarity() method that returns a number (Doc and Span objects have one too). The higher this number, the greater the similarity.

This means that a token compared with itself gives the maximum similarity score (which may not always be exactly 1 because of vector math and floating-point imprecision).

The table below shows an example with three words:

               dog     cat     banana
    dog        1.00    0.80    0.24
    cat        0.80    1.00    0.28
    banana     0.24    0.28    1.00

In [9]:
doc1 = nlp("wolf")
doc2 = nlp("dog")
doc1.similarity(doc2)

0.6759108589205962
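By default, spaCy's similarity score is a cosine similarity between word vectors. The plain-Python sketch below shows that calculation on its own. The three-dimensional vectors are made up for illustration and are not taken from any spaCy model, so the resulting numbers will not match the table above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (real word vectors have hundreds of dimensions).
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
banana = [0.1, 0.2, 0.9]

print(round(cosine_similarity(dog, dog), 2))
print(round(cosine_similarity(dog, cat), 2))
print(round(cosine_similarity(dog, banana), 2))
```

A vector compared with itself scores 1 up to floating-point error, and vectors pointing in similar directions score higher than vectors pointing in different ones.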

### SpaCy Architecture

#### Processing Pipeline

When a spaCy model is applied to input text, spaCy first tokenizes the text to produce an object of the **Doc** class. This Doc object is then passed through a series of processing steps known as the processing pipeline.

spaCy has some default models that are optimized for performance. However, custom models can also be developed.

The default pipeline consists of:

- a tagger for identifying the POS tags,
- a parser, and
- an entity recognizer.

Each of these components takes the Doc object, processes it, and returns it so that it can be passed on to the next component in the pipeline.
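The flow described above can be sketched by running the steps by hand. This assumes spaCy is installed and uses `spacy.blank("en")`, whose pipeline is empty, so the loop below does nothing here; with a loaded model it would call the tagger, parser, and entity recognizer in turn. `nlp_blank` is a name chosen for this sketch.

```python
import spacy

nlp_blank = spacy.blank("en")
text = "spaCy processes text in stages."

# Step 1: the tokenizer turns the raw text into a Doc.
doc = nlp_blank.make_doc(text)

# Step 2: each pipeline component receives the Doc and returns it.
for name, component in nlp_blank.pipeline:
    doc = component(doc)

print(nlp_blank.pipe_names)
print([token.text for token in doc])
```

Calling `nlp_blank(text)` directly performs the same two steps in one call.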

The primary data structures in spaCy are:

- Doc
- Vocab
    
The Doc object holds the collection of tokens from the input text data.

The Vocab object provides a set of look-up tables, which ensure that common information is available across documents.

Strings and word vectors are stored in a centralized manner and there is only one source of this data, which ensures integrity.

A Doc object is a collection of Token objects.

**Span** and **Token** objects are effectively mere views of the parts of the Doc object. The real data is owned by the Doc.

A Doc object also carries **metadata**, including **annotations** and other **linguistic** information.

Slicing a Doc object produces Span objects.

In [10]:
# doc is a Doc object and doc_slice is a Span object.
doc = nlp('Spacy is used for natural language processing.')
doc_slice = doc[4:]
doc_slice

natural language processing.

In [11]:
doc = nlp("The quick brown fox jumps over the lazy dog")
span = doc[1:3]
span

quick brown
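The claim that Span and Token objects are views of the Doc can be checked directly. This is a small sketch that assumes spaCy is installed; it uses a tokenizer-only pipeline (`nlp_blank`, a name chosen here) because slicing needs no trained model.

```python
import spacy
from spacy.tokens import Doc, Span, Token

nlp_blank = spacy.blank("en")
doc = nlp_blank("The quick brown fox jumps over the lazy dog")

span = doc[1:3]   # slicing a Doc gives a Span
token = span[0]   # indexing a Span gives a Token

print(type(doc).__name__, type(span).__name__, type(token).__name__)
# Both the Span and the Token point back to the Doc that owns the data.
print(span.doc is doc, token.doc is doc)
# Positions are expressed relative to the parent Doc.
print(span.start, span.end, token.i)
```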

In [12]:
# Tokens
# Note that even the full stop denoting the end of the sentence will be a Token.
doc = nlp('Spacy is used for doing natural language processing.')
tokens = [token for token in doc]
tokens

[Spacy, is, used, for, doing, natural, language, processing, .]

#### String Store

spaCy stores all the strings in the data that it is handling in a centralized manner in a location called the **String Store**.

spaCy handles strings in terms of their **hashes**, which saves space.

When the string version of a 64-bit hash is needed, spaCy consults the string store to obtain it.

The 'single-source' way of storing the string ensures integrity and consistency.

If an attribute of an object gives the hash of a string, we can get the string itself by appending an underscore (_) to the attribute's name.

For example, we can get the hash form of the POS tag of a Token object via its **pos** attribute, and the string version via **pos_**.
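The same underscore convention can be tried without a trained model by using the `orth` / `orth_` pair, which hold the token's verbatim text. This sketch assumes spaCy is installed; `nlp_blank` and the sample sentence are chosen here for illustration.

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank("I love coffee")
token = doc[2]

hash_id = token.orth   # integer hash of the token text
text = token.orth_     # the same value as a string

# The string store maps in both directions.
round_trip = nlp_blank.vocab.strings[hash_id]
lookup = nlp_blank.vocab.strings["coffee"]

print(hash_id, text, round_trip, lookup == hash_id)
```

Attributes such as `pos` and `pos_` follow the same pattern once a pipeline with a tagger has been run.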

#### Vocab and Lexeme

An object of the Vocab class in spaCy stores the words or vocabulary along with other data shared across a particular language.

Each entry in an object of the Vocab class is known as a **Lexeme**. Unlike a Token, a Lexeme has **no contextual information** like POS tag. It is just a word type.
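A Lexeme can be looked up in the Vocab by its string. The sketch below assumes spaCy is installed and uses a tokenizer-only pipeline (`nlp_blank`, a name chosen here); it prints a few context-independent attributes of a Lexeme.

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank("I love coffee")

# Look up the vocabulary entry for a word type.
lexeme = nlp_blank.vocab["coffee"]

# Lexical attributes do not depend on the sentence the word appears in.
print(lexeme.text, lexeme.is_alpha, lexeme.is_stop)

# The token in the Doc and the Lexeme share the same hash.
same_entry = lexeme.orth == doc[2].orth
print(same_entry)
```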

#### POS Tagging

spaCy can use statistical models to analyze the input text data and predict the tag or label of each constituent word. This process is known as Part-of-Speech tagging or POS tagging.

The input text must first have been made into a Doc object.

The statistical model is trained by showing it numerous examples. For instance, it can learn that the word that comes after a "the" in a sentence is probably a noun (NN).

In [13]:
d = nlp('Spacy is used for doing natural language processing.')
# POS tags predicted by the English model.
print([(token, token.pos_) for token in d])

[(Spacy, 'PROPN'), (is, 'VERB'), (used, 'VERB'), (for, 'ADP'), (doing, 'VERB'), (natural, 'ADJ'), (language, 'NOUN'), (processing, 'NOUN'), (., 'PUNCT')]


In [14]:
# Help tool for POS tags
spacy.explain('ADP')

'adposition'

In [15]:
ex = nlp('Sally likes Sam')
for word in ex:
    print(word.text, word.pos_, word.tag_, word.dep_)

Sally PROPN NNP advmod
likes VERB VBZ ROOT
Sam PROPN NNP dobj


In [16]:
spacy.explain('advmod')

'adverbial modifier'

#### Visualize Dependency using displaCy

displaCy is used for visualising the syntactic dependencies and POS tags.

displaCy ENT is used to visualize the named entities. It highlights the named entities and their labels.

In [17]:
from spacy import displacy
# Below code currently not working with the installed version...
#displacy.render(ex, style='dep', jupyter=True)
# The below will create a simple web server with the visualization which we can view using a web browser.
#displacy.serve(ex, style='dep')

In [18]:
doc1 = nlp('A demo of displaCy.')
#displacy.serve(doc1, style='dep')
displacy.render(doc1, style='ent', jupyter=True)

In [19]:
doc2 = nlp("Leo Tolstoy wrote 'War and Peace'.")
#displacy.serve(doc2, style='ent')
displacy.render(doc2, style='ent', jupyter=True)

#### Visualization as Raw HTML

If we do not want to set up a web server just for the sake of seeing the visualization, displaCy can give us the HTML code that can be used to display it.

The displacy.render() function, called with its **page** argument set to **True**, can be used for this.

In [20]:
# html = displacy.render([doc1, doc2], style='dep', page=True)
# Now the variable html will have the HTML code for generating the visualization.
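As a minimal sketch of getting raw HTML without a trained model, displaCy's `manual=True` mode accepts a plain dictionary instead of a Doc. The text and the entity offsets below are written by hand for illustration, and `jupyter=False` is passed so that the markup is returned as a string rather than displayed.

```python
from spacy import displacy

# Hand-written input for displaCy's manual mode (not model output).
example = {
    "text": "Leo Tolstoy wrote War and Peace.",
    "ents": [{"start": 0, "end": 11, "label": "PERSON"}],
    "title": None,
}

# page=True wraps the visualization in a complete HTML page.
html = displacy.render(example, style="ent", manual=True, page=True, jupyter=False)
print(html[:200])
```

The returned string can be written to a file and opened in a browser.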

In [21]:
temp = nlp("Generations to come may not believe that such a man lived.")
[(word, word.pos_) for word in temp]

[(Generations, 'NOUN'),
 (to, 'PART'),
 (come, 'VERB'),
 (may, 'VERB'),
 (not, 'ADV'),
 (believe, 'VERB'),
 (that, 'ADP'),
 (such, 'ADJ'),
 (a, 'DET'),
 (man, 'NOUN'),
 (lived, 'VERB'),
 (., 'PUNCT')]

In [22]:
doc  = nlp('Spacy is used for natural language processing.')
print([(ent.text, ent.label_) for ent in doc.ents])

[('Spacy', 'GPE')]


In [23]:
test = nlp("Leo Tolstoy, the great Russian writer, is well known for his magnum opus 'War and Peace'.")
print([(ent.text, ent.label_) for ent in test.ents])

[('Leo Tolstoy', 'PERSON'), ('Russian', 'NORP'), ("War and Peace'", 'WORK_OF_ART')]


In [24]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

#### Stop words

Removing stop words was one of the first steps in **early information retrieval systems**, which were usually based on keyword searching.

Stop words are the words that appear most frequently in the sentences of a language, such as conjunctions ("and", "or", "but", etc.).

Removing stop words from the input text before it is analyzed can still be useful.

In [25]:
from spacy.lang.en.stop_words import STOP_WORDS

In [26]:
print(nlp.vocab['the'].is_stop)

True


In [27]:
# Custom setting of the stop-words
nlp.vocab['the'].is_stop = False
print(nlp.vocab['the'].is_stop)

False
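Stop-word removal itself can be sketched as a simple filter over the tokens. This assumes spaCy is installed, uses a tokenizer-only pipeline (`nlp_blank`, a name chosen here) and checks each token against the imported `STOP_WORDS` set; the sample sentence is an assumption for illustration.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp_blank = spacy.blank("en")
doc = nlp_blank("the cat sat on the mat")

# Keep only the tokens whose lowercase form is not a stop word.
content_words = [token.text for token in doc if token.text.lower() not in STOP_WORDS]
print(content_words)
```

Checking against the `STOP_WORDS` set directly means the result is not affected by any `is_stop` flags changed on a particular vocabulary.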


#### Chunking

The process of dividing sentences into segments that do not overlap is known as **text chunking**.

Specific extraction of noun phrases is known as **Noun phrase chunking** or **NP chunking**. Noun phrases are important as they are often the keywords and are highly useful in information retrieval systems.

Examples of chunks include "the azure blue sky", "world's largest river", etc.

In [28]:
doc  = nlp('Spacy is used for natural language processing.')
print([x for x in doc.noun_chunks])
print([x.text for x in doc.noun_chunks])
print([x.root.text for x in doc.noun_chunks])
# Tuple root.head.text - connecting text
print([(x.root.text, x.root.head.text) for x in doc.noun_chunks])

[Spacy, natural language processing]
['Spacy', 'natural language processing']
['Spacy', 'processing']
[('Spacy', 'used'), ('processing', 'for')]


#### Parser

spaCy's syntactic dependency parser is one of its main features and a large part of what makes spaCy stand out. This parser also performs **sentence boundary detection**.

We can check whether a Doc object has been parsed using its **is_parsed** boolean attribute, which is True if parsing has been performed. (In spaCy v3 this attribute was replaced by doc.has_annotation("DEP").)

#### Dependency Parsing

spaCy constructs a parse tree to find out the **dependencies** between the words in the input text. This dependency tree is useful for **text chunking**.

Every token in the parse tree other than the root has exactly one parent. The parent of a token in the parse tree is known as its head in spaCy terminology. The child tokens of a token, if any, are called its children.

The syntactic relation between the head and the child is known as the **dep**. It can be obtained from a Token's **dep** attribute (as a hash) or its **dep_** attribute (as a string).

**Subtree of Parse Tree**

We can get the whole phrase associated with a Token by using its subtree attribute, which gives a generator over that token and all of its descendants in the parse tree.

In [29]:
# Dependency Parsing
# children - a generator over the token's children in the parse tree
# text - the text of the token
# dep_ - dependency relation as a string
# Note: dep would give the hash, as spaCy uses hashes to save space.
# head.text - parent in the dependency graph as a string
doc = nlp('Spacy is used for natural language processing.')
for t in doc:
    print(t.text, t.dep_, t.head.text, [child for child in t.children])

Spacy nsubjpass used []
is auxpass used []
used ROOT used [Spacy, is, for, .]
for prep used [processing]
natural amod language []
language compound processing [natural]
processing pobj for [language]
. punct used []


In [30]:
spacy.explain('auxpass')

'auxiliary (passive)'

In [31]:
doc  = nlp('Spacy is used for natural language processing.')
print(doc[-2])
print([x for x in doc[-2].subtree])

processing
[natural, language, processing]


**Navigating the Parse Tree**

- **left_edge** and **right_edge** attributes of a Token give the left-most (i.e., the first) and the right-most (i.e., the last) tokens of its subtree respectively.
- **lefts** and **rights** attributes of a Token return generators over its immediate children that appear to its left and to its right respectively in the syntactic parse tree.
- Similarly, **n_lefts** and **n_rights** attributes give the number of immediate children to its left and to its right respectively.
- **ancestors** attribute can be used to iterate over the ancestors of a token.
- **is_ancestor()** method can be used to check whether a token is an ancestor of another token.
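To make these attributes concrete, here is a plain-Python sketch of the same ideas on a list of head indices. It does not call spaCy; the words and heads are copied from the parse of 'Spacy is used for natural language processing.' printed earlier, and the helper functions are hypothetical stand-ins named after the Token attributes they imitate.

```python
words = ["Spacy", "is", "used", "for", "natural", "language", "processing", "."]
# heads[i] is the index of the parent of token i; the root points to itself.
heads = [2, 2, 2, 2, 5, 6, 3, 2]

def children(i):
    return [j for j, h in enumerate(heads) if h == i and j != i]

def lefts(i):
    return [j for j in children(i) if j < i]

def rights(i):
    return [j for j in children(i) if j > i]

def subtree(i):
    """Token i plus all of its descendants, in sentence order."""
    found = [i]
    for child in children(i):
        found.extend(subtree(child))
    return sorted(found)

def ancestors(i):
    """Walk up the head links until the root is reached."""
    found = []
    while heads[i] != i:
        i = heads[i]
        found.append(i)
    return found

def as_text(indices):
    return [words[i] for i in indices]

used, processing, natural = 2, 6, 4
print(as_text(lefts(used)), as_text(rights(used)))
print(as_text(subtree(processing)))
print(words[subtree(processing)[0]], words[subtree(processing)[-1]])
print(as_text(ancestors(natural)))
```

The first and last entries of the subtree correspond to `left_edge` and `right_edge`, and the lengths of the `lefts` and `rights` lists correspond to `n_lefts` and `n_rights`.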

In [32]:
# Lemma Example
ra = nlp(u'ran')
for token in ra:
    print(token.lemma_)

run
