# spaCy  


Overview:  
-spaCy Install (cuda support)      
-English language Install  
-Tokenization and Parts-of-Speech Tagging     
-Using word vectors and Similarity   

# spaCy is another popular NLP library

spaCy is a Python NLP library focused on production use: it is fast, integrates with deep learning frameworks, and handles very large text files well.  (NLTK is more of a starter tool for language processing.)  

Capabilities:    
-Non-destructive tokenization  
-Named entity recognition  
-Support for 55+ languages  
-17 statistical models for 11 languages  
-pretrained word vectors  
-State-of-the-art speed  
-Easy deep learning integration  
-Part-of-speech tagging  
-Labeled dependency parsing  
-Syntax-driven sentence segmentation  
-Built in visualizers for syntax and NER  
-Convenient string-to-hash mapping  
-Export to numpy data arrays  
-Efficient binary serialization  
-Easy model packaging and deployment  
-Robust, rigorously evaluated accuracy  
   
https://spacy.io/docs/#examples

**Excellent article comparing Python NLP libraries:**    
https://medium.com/activewizards-machine-learning-company/comparison-of-top-6-python-nlp-libraries-c4ce160237eb  

**cuda support for spaCy**

**spaCy v2.0** comes with neural network models that are implemented in the machine learning library Thinc. For GPU support, we’ve been grateful to use the work of Chainer’s CuPy module which provides a numpy-compatible interface for GPU arrays.    
**spaCy** can be installed with GPU support by specifying either the generic extra or the one matching your CUDA version (here CUDA 11.2):  
`pip install spacy[cuda]`  
`pip install spacy[cuda112]`

Or any other cuda version you have: spacy[cuda90], spacy[cuda91], spacy[cuda92] or spacy[cuda100]       
If you know your cuda version, using the more explicit specifier allows `cupy` to be installed via wheel saving some compilation time. The specifiers should install `cupy`.  

If you have issues with **spaCy for CUDA (GPU)**, uninstall it:  
`!pip uninstall spacy[cuda90]`

To activate the GPU, use:  
`spacy.prefer_gpu()`
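
A minimal sketch of GPU activation, assuming spaCy v3 is installed. `spacy.prefer_gpu()` returns `True` when a GPU was activated and `False` when spaCy falls back to the CPU, so the call is safe on machines without CUDA. The blank English pipeline is only a stand-in to avoid a model download.

```python
import spacy

# Try to use the GPU; fall back to the CPU if none is available.
gpu_active = spacy.prefer_gpu()
print("Running on GPU:", gpu_active)

# Load pipelines after the call so they are allocated on the chosen device.
nlp = spacy.blank("en")  # stand-in for spacy.load("en_core_web_lg")
```

If the code must not run without a GPU, `spacy.require_gpu()` raises an error instead of falling back.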

In [1]:
# To install reference:  https://spacy.io/usage
# pip install -U spacy
#pip install --upgrade spacy  #alternatively
#
# or uncomment the following and install in the python notebook
#!pip install -U spacy

In [2]:
#If you have not updated python pip command run the following command:
#!python -m pip install --upgrade pip

Some commands, such as the following, may require administrator privileges; otherwise you will receive the warning: You do not have sufficient privilege to perform this operation.      
Open a cmd window as an Administrator, run the command, and then open a Jupyter notebook.  
On Linux, run system-wide installations with sudo.  
#!python -m spacy download en  
#!sudo python -m spacy download en  # Linux  

# Install Language Modules
https://spacy.io/models/en  
Language models must be installed separately. They come in three sizes: small (sm), medium (md), and large (lg).  
To install the large English model:  
    
    $ python -m spacy download en_core_web_lg
    
spaCy has 4 English models:    
* en_core_web_sm (assigns context-specific token vectors, POS tags, dependency parse and named entities. Contains no word vectors.)  
* en_core_web_md  (assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.)    
* en_core_web_lg  (same components as md, with a larger vector table)    
* en_vectors_web_lg (a vectors-only package from spaCy v2. Contains > 1 million unique vectors.)  

**Important: Download models from a command window run as Administrator (at least on Windows).**  
Once you run the above command, you will be informed that you can load the model by its package name, e.g. `spacy.load('en_core_web_lg')`.  

In [3]:
#python -m spacy download en
!python -m spacy download en_core_web_lg


In [4]:
#Make sure one of the languages is downloaded e.g., python -m spacy download en_core_web_lg
import spacy
#parser = spacy.load('en') #specifies the language model
nlp = spacy.load("en_core_web_lg")

# Tokenization in spaCy

Tokenization is the process of breaking up a sequence of text into words, keywords, phrases, symbols, or sentences.   
Each individual part is known as a **token**.
Some tokenizers discard punctuation; spaCy keeps punctuation marks as tokens of their own.
The input to the tokenizer is a unicode text, and the output is a Doc object.
To construct a Doc object directly, you need a Vocab instance, a sequence of word strings, and optionally a sequence of `spaces` booleans, which let you keep the tokens aligned with the original string.
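
As an illustration of building a Doc by hand, here is a minimal sketch assuming spaCy v3. The blank English pipeline is used only to obtain a `Vocab`; the words and spaces are made up for the example.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # only the shared vocab is needed here

words = ["Hello", ",", "world", "!"]
# One boolean per word: is the token followed by a space?
spaces = [False, True, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)
print([token.text for token in doc])
```

Because the `spaces` flags record where whitespace belongs, `doc.text` comes back as `Hello, world!` rather than the words joined by single spaces.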

Important note: spaCy’s tokenization is non-destructive, which means that you’ll always be able to reconstruct the original input from the tokenized output. Whitespace information is preserved in the tokens and no information is added or removed during tokenization. This is a core principle of spaCy’s Doc object: `doc.text == input_text` should always hold true.
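
The round-trip property can be checked directly. This sketch assumes spaCy v3 and uses a blank English pipeline (tokenizer only); the sample text, with its double space and newline, is invented for the test.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no model download needed

text = "Mr. Smith  loves tacos.\nHe has a Ph.D. in taco-ology."
doc = nlp(text)

# Each token remembers its trailing whitespace in text_with_ws.
rebuilt = "".join(token.text_with_ws for token in doc)

print(doc.text == text)
print(rebuilt == text)
```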

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them.

**Tokenizer exception**   
A special-case rule to split a string into several tokens, or to prevent a token from being split, when punctuation rules are applied.
* Prefix: Character(s) at the beginning, e.g. $, (, “, ¿.
* Suffix: Character(s) at the end, e.g. km, ), ”, !.
* Infix: Character(s) in between, e.g. -, --, /, ….

First, the raw text is split on whitespace characters, similar to `text.split(' ')`. Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

* Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
* Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.

If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
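
To see which rule produced each token, recent spaCy versions offer `nlp.tokenizer.explain`. The sketch below assumes spaCy v3 and a blank English pipeline; the sample sentence is made up, and the exact rule names printed may vary between versions.

```python
import spacy

nlp = spacy.blank("en")  # English tokenizer rules, no trained components

text = "(Don't go to the U.K.!)"

# explain() yields (rule_name, token_text) pairs in token order.
explained = list(nlp.tokenizer.explain(text))
for rule, piece in explained:
    print(f"{piece!r:10} <- {rule}")

pieces = [piece for _, piece in explained]
```

The opening bracket is reported as a prefix, and the pieces of "Don't" are reported as coming from a special-case rule.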

In [5]:
#Preamble to the U.S. Declaration of Independence, 1776
preamble='We hold these truths to be self-evident, that all men are created equal, that they are endowed, by their Creator, with certain unalienable Rights, that among these are Life, Liberty, and the pursuit of Happiness.  That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or abolish it, and to institute new Government, laying its foundation on such principles, and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness.'
print(preamble)

We hold these truths to be self-evident, that all men are created equal, that they are endowed, by their Creator, with certain unalienable Rights, that among these are Life, Liberty, and the pursuit of Happiness.  That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or abolish it, and to institute new Government, laying its foundation on such principles, and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness.


In [6]:
#tokens = parser(preamble) # same call, if the pipeline was loaded as `parser`
tokens = nlp(preamble) # process the text into a Doc, a sequence of Token objects

# orth_ returns the verbatim token text; isspace() checks whether that string is only whitespace
token_words = [token.orth_ for token in tokens if not token.orth_.isspace()]
print(token_words)

['We', 'hold', 'these', 'truths', 'to', 'be', 'self', '-', 'evident', ',', 'that', 'all', 'men', 'are', 'created', 'equal', ',', 'that', 'they', 'are', 'endowed', ',', 'by', 'their', 'Creator', ',', 'with', 'certain', 'unalienable', 'Rights', ',', 'that', 'among', 'these', 'are', 'Life', ',', 'Liberty', ',', 'and', 'the', 'pursuit', 'of', 'Happiness', '.', 'That', 'to', 'secure', 'these', 'rights', ',', 'Governments', 'are', 'instituted', 'among', 'Men', ',', 'deriving', 'their', 'just', 'powers', 'from', 'the', 'consent', 'of', 'the', 'governed', ',', 'That', 'whenever', 'any', 'Form', 'of', 'Government', 'becomes', 'destructive', 'of', 'these', 'ends', ',', 'it', 'is', 'the', 'Right', 'of', 'the', 'People', 'to', 'alter', 'or', 'abolish', 'it', ',', 'and', 'to', 'institute', 'new', 'Government', ',', 'laying', 'its', 'foundation', 'on', 'such', 'principles', ',', 'and', 'organizing', 'its', 'powers', 'in', 'such', 'form', ',', 'as', 'to', 'them', 'shall', 'seem', 'most', 'likely', 't

In [7]:
# Printing a Doc shows the original text it was created from
tokens = nlp(preamble)
print(tokens)

We hold these truths to be self-evident, that all men are created equal, that they are endowed, by their Creator, with certain unalienable Rights, that among these are Life, Liberty, and the pursuit of Happiness.  That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or abolish it, and to institute new Government, laying its foundation on such principles, and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness.


# Merging and splitting tokens  
The `Doc.retokenize` context manager lets you merge and split tokens. Modifications to the tokenization are stored and performed all at once when the context manager exits. To merge several tokens into one single token, pass a Span to retokenizer.merge. An optional dictionary of attrs lets you set attributes that will be assigned to the merged token – for example, the lemma, part-of-speech tag or entity type. By default, the merged token will receive the same attributes as the merged span’s root.

In [8]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in New York")
print("Before:", [token.text for token in doc])

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new york"})
print("After:", [token.text for token in doc])

Before: ['I', 'live', 'in', 'New', 'York']
After: ['I', 'live', 'in', 'New York']
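
Splitting works the other way round. The sketch below assumes spaCy v3 and builds the Doc by hand so that no model is needed; the run-together token "NewYork" is a made-up input. `retokenizer.split` needs the new token texts (which must join back to the original token) and a head for each new token.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["I", "live", "in", "NewYork"])
print("Before:", [token.text for token in doc])

with doc.retokenize() as retokenizer:
    # (doc[3], 1) attaches "New" to the second subtoken, "York";
    # doc[2] attaches "York" to the existing token "in".
    heads = [(doc[3], 1), doc[2]]
    retokenizer.split(doc[3], ["New", "York"], heads=heads)

print("After:", [token.text for token in doc])
```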


# Part of Speech Tagging
After tokenization, spaCy can parse and tag a given document object. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in the context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.

Linguistic annotations are available as Token attributes. **Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency**. To obtain the readable string representation of an attribute we need to add an underscore _ to its name:
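
The hash/string pairing can be inspected on any token. This is a small sketch assuming spaCy v3 with a blank English pipeline; the sentence is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee")
token = doc[2]

# Without the underscore you get the integer hash, with it the string.
print(token.orth, token.orth_)

# The vocab's StringStore maps in both directions.
coffee_hash = nlp.vocab.strings["coffee"]
print(coffee_hash == token.orth)
print(nlp.vocab.strings[coffee_hash])
```

The same pattern applies to other attributes, for example `token.pos` / `token.pos_` and `token.dep` / `token.dep_`.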

In [9]:
print("Noun phrases:",[chunk.text for chunk in tokens.noun_chunks])

Noun phrases: ['We', 'these truths', 'all men', 'they', 'their Creator', 'certain unalienable Rights', 'these', 'Life', 'Liberty', 'the pursuit', 'Happiness', 'That', 'these rights', 'Governments', 'Men', 'their just powers', 'the consent', 'any Form', 'Government', 'these ends', 'it', 'the Right', 'the People', 'it', 'new Government', 'its foundation', 'such principles', 'its powers', 'such form', 'them', 'their Safety', 'Happiness']


In [10]:
print("Verb phrases:",[token.lemma_ for token in tokens if token.pos_ =="VERB"])

Verb phrases: ['hold', 'create', 'endow', 'secure', 'institute', 'derive', 'govern', 'become', 'alter', 'abolish', 'institute', 'lay', 'organize', 'seem', 'effect']


In [11]:
#Find named entities, phrases and concepts
for entity in tokens.ents:
    print(entity.text, entity.label_)

Rights ORG
Happiness ORG
Men ORG
the Right of the People WORK_OF_ART
institute new Government ORG
Safety and Happiness ORG


In [12]:
nlp = spacy.load("en_core_web_lg")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
a a DET DT det x True True
U.K. U.K. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


In [13]:
#spaCy does not make a sentence segmentation error and treats Ph.D. correctly just like NLTK
text2 = "Mr. Smith loves tacos. He has a Ph.D. in taco-ology."
tokens = nlp(text2)
tokens = [token.orth_ for token in tokens if not token.orth_.isspace()]
print(tokens)

['Mr.', 'Smith', 'loves', 'tacos', '.', 'He', 'has', 'a', 'Ph.D.', 'in', 'taco', '-', 'ology', '.']


A comparison to **nltk**.  Notice how the - in taco-ology separates words in **spaCy**.

In [15]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [16]:

text2 = "Mr. Smith loves tacos. He has a Ph.D. in taco-ology."
tokens = nltk.word_tokenize(text2)
print(tokens)

['Mr.', 'Smith', 'loves', 'tacos', '.', 'He', 'has', 'a', 'Ph.D.', 'in', 'taco-ology', '.']


<p>In earlier versions of spaCy, exceptions for English (`spacy.en`) were added like this:    

<p>spacy.en.English.Defaults.tokenizer_exceptions["Ph.D."] = [{"F": "Ph.D."}]  

<p>What you're doing here is telling the tokenizer, "when you see this chunk, 'Ph.D.', process it into these tokens".
<p>The list you're giving spaCy has just one token, and you've specified its form (with the F key).
<p>You can also specify its POS and lemma (with P and L).
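
In current spaCy versions, the equivalent is `Tokenizer.add_special_case`. The sketch below assumes spaCy v3 and a blank English pipeline; "gimme" is just an example string. Note that special cases in v3 describe the token texts only; they do not set POS or lemma.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
print([t.text for t in nlp("gimme that")])

# When the tokenizer sees "gimme", emit these two tokens instead.
# The ORTH values must join back to the original string.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

after = [t.text for t in nlp("gimme that")]
print(after)
```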

In [17]:
# `tokens` now holds plain Python strings (the output of nltk.word_tokenize above), so dir() lists str methods.
dir(tokens[0])

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',


In [18]:
# Define a function that prints the most essential attributes of a token.
def print_token(token):
    print("==========================")
    print("value:", token.text)    # verbatim text
    print("lemma:", token.lemma_)  # lemma is the base (dictionary) form of a word
    print("shape:", token.shape_)  # shape abstracts capitalization, digits and punctuation

In [19]:
# Note the lemma for "was/be" and "flew/fly"
text3 = "Jack flew to the store because he was hungry and forgot milk and eggs."
tokens = nlp(text3)
for token in tokens:
    print_token(token)

value: Jack
lemma: Jack
shape: Xxxx
value: flew
lemma: fly
shape: xxxx
value: to
lemma: to
shape: xx
value: the
lemma: the
shape: xxx
value: store
lemma: store
shape: xxxx
value: because
lemma: because
shape: xxxx
value: he
lemma: he
shape: xx
value: was
lemma: be
shape: xxx
value: hungry
lemma: hungry
shape: xxxx
value: and
lemma: and
shape: xxx
value: forgot
lemma: forgot
shape: xxxx
value: milk
lemma: milk
shape: xxxx
value: and
lemma: and
shape: xxx
value: eggs
lemma: egg
shape: xxxx
value: .
lemma: .
shape: .


In [20]:
# Alternative, load the German language
!python -m spacy download de_core_news_sm
import de_core_news_sm
nlp2 = de_core_news_sm.load()   # note that we are using module_name.load() here

Collecting de-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.6.0/de_core_news_sm-3.6.0-py3-none-any.whl (14.6 MB)
Installing collected packages: de-core-news-sm
Successfully in

In [21]:
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)
print(doc)

When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously. “I can tell you very senior CEOs of major American car companies would shake my hand and turn away because I wasn’t worth talking to,” said Thrun, in an interview with Recode earlier this week.


In [22]:
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print('******************')
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
print('******************')
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the company', 'him', 'I', 'you', 'very senior CEOs', 'major American car companies', 'my hand', 'I', 'Thrun', 'an interview', 'Recode']
******************
Verbs: ['start', 'work', 'drive', 'take', 'tell', 'shake', 'turn', 'talk', 'say']
******************
Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE


In [23]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP dobj X.X. False False
startup startup NOUN NN dep xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


Text: The original word text.   
Lemma: The base form of the word.   
POS: The simple part-of-speech tag.   
Tag: The detailed part-of-speech tag.   
Dep: Syntactic dependency, i.e. the relation between tokens.   
Shape: The word shape – capitalization, punctuation, digits.   
is_alpha: Does the token consist of alphabetic characters?   
is_stop: Is the token part of a stop list, i.e. among the most common words of the language?  

To look up what a tag or label means, use:  
**spacy.explain**

In [24]:
spacy.explain("VBZ")

'verb, 3rd person singular present'

# Visualizer

Visualizing a dependency parse or named entities in a text is not only a fun NLP demo – it can also be incredibly helpful in speeding up development and debugging your code and training process.  
Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its dependencies look like:

In [25]:
spacy.displacy

<module 'spacy.displacy' from '/usr/local/lib/python3.10/dist-packages/spacy/displacy/__init__.py'>

In [26]:
### This cell might stall your notebook: displacy.serve starts a web server and blocks until it is interrupted.

import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
displacy.serve(doc, style="dep")  #ent or dep


# https://spacy.io/usage/visualizers


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
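
A non-blocking alternative is `displacy.render`, which returns the markup instead of starting a server. This sketch assumes spaCy v3; to avoid a model download, the Doc is built by hand and the heads and dependency labels are written manually for illustration, not predicted by a parser.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written parse for illustration only.
words = ["Autonomous", "cars", "shift", "insurance", "liability"]
heads = [1, 2, 2, 4, 2]  # index of each token's head
deps = ["amod", "nsubj", "ROOT", "compound", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# jupyter=False forces a return value (an SVG string) instead of inline display.
svg = displacy.render(doc, style="dep", jupyter=False)
print(svg[:60])
```

Inside a notebook, calling `displacy.render(doc, style="dep")` without `jupyter=False` displays the diagram inline.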


# Dependency Parsing
spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or “chunks”. You can check whether a Doc object has been parsed with `doc.has_annotation("DEP")`, which returns a boolean value (spaCy v2 exposed this as the `doc.is_parsed` attribute, which v3 no longer provides). If it returns False, the default sentence iterator will raise an exception.
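
A small sketch of checking for a parse, assuming spaCy v3, where the check is written `Doc.has_annotation("DEP")`. A blank pipeline yields an unparsed Doc; the parsed Doc is constructed by hand with manually written heads and labels, purely for illustration.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer only, no parser

unparsed = nlp("Autonomous cars shift insurance liability")
print(unparsed.has_annotation("DEP"))

try:
    list(unparsed.sents)  # no sentence boundaries have been set
except ValueError as err:
    print("Cannot iterate sentences:", type(err).__name__)

# Hand-written parse for illustration only.
words = ["Autonomous", "cars", "shift", "insurance", "liability"]
heads = [1, 2, 2, 4, 2]
deps = ["amod", "nsubj", "ROOT", "compound", "dobj"]
parsed = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

print(parsed.has_annotation("DEP"))
sentences = [sent.text for sent in parsed.sents]
print(sentences)
```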

# Noun chunks
Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”. To get the noun chunks in a document, simply iterate over `Doc.noun_chunks`.

In [27]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text,'-', chunk.root.text, '-', chunk.root.dep_,
            chunk.root.head.text)

Autonomous cars - cars - nsubj shift
insurance liability - liability - dobj shift
manufacturers - manufacturers - pobj toward


Text: The original noun chunk text.  
Root text: The original text of the word connecting the noun chunk to the rest of the parse.   
Root dep: Dependency relation connecting the root to its head.  
Root head text: The text of the root token’s head.

| TEXT | ROOT.TEXT | ROOT.DEP_ | ROOT.HEAD.TEXT |
| --- | --- | --- | --- |
| Autonomous cars | cars | nsubj | shift |
| insurance liability | liability | dobj | shift |
| manufacturers | manufacturers | pobj | toward |

# Navigating the parse tree
spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of .dep is a hash value. You can get the string value with .dep_.

In [28]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

Autonomous amod cars NOUN []
cars nsubj shift VERB [Autonomous]
shift ROOT shift VERB [cars, liability, toward]
insurance compound liability NOUN []
liability dobj shift VERB [insurance]
toward prep shift VERB [manufacturers]
manufacturers pobj toward ADP []


Text: The original token text.   
Dep: The syntactic relation connecting child to head.   
Head text: The original text of the token head.   
Head POS: The part-of-speech tag of the token head.   
Children: The immediate syntactic dependents of the token.



Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest — from below:

In [29]:
import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")

# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)

{shift}


In [30]:
# Finding a verb with a subject from above — less good
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break
print(verbs)

[shift]


In [31]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"bright red apples on the tree")
print([token.text for token in doc[2].lefts])  # ['bright', 'red']
print([token.text for token in doc[2].rights])  # ['on']
print(doc[2].n_lefts)  # 2
print(doc[2].n_rights)  # 1

['bright', 'red']
['on']
2
1


# Named Entity Recognition  
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title.     

spaCy recognizes named entities in a document by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.    

spaCy features an extremely fast statistical entity recognition system, that assigns labels to contiguous spans of tokens. The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.  

In [32]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion from John, Peter from Harvard University")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
John 60 64 PERSON
Peter 66 71 PERSON
Harvard University 77 95 ORG


Text: The original entity text.     
Start: Character offset of the start of the entity in the Doc.   
End: Character offset of the end of the entity in the Doc.    
Label: Entity label, i.e. type.  

| TEXT | START | END | LABEL | DESCRIPTION |
| --- | --- | --- | --- | --- |
| Apple | 0 | 5 | ORG | Companies, agencies, institutions. |
| U.K. | 27 | 31 | GPE | Geopolitical entity, i.e. countries, cities, states. |
| $1 billion | 44 | 54 | MONEY | Monetary values, including unit. |
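
To make the character offsets concrete, the sketch below sets the same three entities by hand rather than predicting them. It assumes spaCy v3 and a blank English pipeline (no NER component); `Doc.char_span` turns character offsets into a labelled Span, provided the offsets line up with token boundaries.

```python
import spacy

nlp = spacy.blank("en")  # no NER component; entities are assigned manually
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Character offsets copied from the table above.
org = doc.char_span(0, 5, label="ORG")
gpe = doc.char_span(27, 31, label="GPE")
money = doc.char_span(44, 54, label="MONEY")
doc.ents = [org, gpe, money]

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

Setting `doc.ents` like this is useful for building test fixtures or training examples; it does not teach the model anything by itself.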

In [33]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


# Sentence Segmentation
A Doc object’s sentences are available via the `Doc.sents` property.


spaCy uses the dependency parser to determine sentence boundaries.
**This is usually more accurate than a rule-based approach, but it also means you’ll need a statistical model and accurate predictions.** If your texts are closer to general-purpose news or web text, this should work well out-of-the-box. For social media or conversational text that doesn’t follow the same rules, your application may benefit from a custom rule-based implementation. You can either use the built-in Sentencizer or plug an entirely custom rule-based function into your processing pipeline.

spaCy’s dependency parser respects already set boundaries, so you can preprocess your Doc using custom rules before it’s parsed. Depending on your text, this may also improve accuracy, since the parser is constrained to predict parses consistent with the sentence boundaries.

To view a Doc’s sentences, you can iterate over the `Doc.sents`, a generator that yields Span objects.

In [34]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)

This is a sentence.
This is another sentence.


# Rule-based pipeline component
The Sentencizer component is a pipeline component that splits sentences on punctuation like ., ! or ?. You can plug it into your pipeline if you only need sentence boundaries without the dependency parse.
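
A minimal sketch of the rule-based route, assuming the spaCy v3 API in which built-in components are added by their string name. The pipeline here is blank, so the `sentencizer` is the only component that runs.

```python
import spacy

nlp = spacy.blank("en")      # tokenizer only
nlp.add_pipe("sentencizer")  # rule-based sentence boundaries, no parser needed

doc = nlp("This is a sentence. This is another sentence.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

This is much faster than running the full parser when sentence boundaries are all that is needed, at the cost of relying on punctuation alone.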

# Word Vectors and Semantic Similarity
**TRAINING WORD VECTORS**    
Dense, real valued vectors representing distributional similarity information are now a cornerstone of practical NLP. The most common way to train these vectors is the word2vec family of algorithms. If you need to train a word2vec model, we recommend the Gensim implementation.  

Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec.  A few examples:

array([2.0228000e-01, -7.66180009e-02, 3.70219992e-01, ... dtype=float32)  
Data in text8:   
71291 300  
santamaria -0.328541 0.143057 0.200979 ...

NOTE:  
**spaCy’s** small models (packages that end in sm) don’t ship with word vectors for compactness purposes. They only include context-sensitive tensors. This means you can still use the `similarity()` methods to compare documents, spans and tokens – but the result won’t be as good, and individual tokens won’t have any vectors assigned. So in order to use real word vectors, you need to download a larger model:  `en_core_web_md` or `en_core_web_lg`.

In [35]:
import spacy
from spacy.lang.en import English
nlp = spacy.load("en_core_web_sm")
banana = nlp(u"banana")
banana.vector

array([-1.43172   , -1.0467288 , -0.780927  ,  0.45387337,  0.22313687,
       -0.4822587 ,  0.7796049 ,  1.9291902 ,  0.5263105 , -0.8860629 ,
        0.31216276, -0.8671374 , -0.97779465,  0.62443805, -0.24205405,
        0.70003366, -1.1135058 , -1.335113  ,  1.551111  , -0.27649075,
       -0.49291223,  0.21266821, -1.2886872 , -0.4015749 ,  1.4845917 ,
        0.26011398,  0.244566  ,  1.1891397 , -0.4271981 ,  0.32347038,
       -0.2853762 ,  0.9585209 ,  1.3764503 ,  0.02244225, -0.4406012 ,
       -2.0229402 ,  0.14719443,  1.6322699 ,  0.8777428 , -1.1011673 ,
       -0.5580491 ,  0.16557175, -0.49441516,  0.7951485 , -0.7971488 ,
       -0.04855184, -0.6682936 ,  0.65149677,  0.28960374,  0.02061486,
        0.97740465,  0.2379916 ,  0.72662365, -0.71118784,  0.10751665,
       -0.0982407 ,  0.66061974, -0.23507589,  0.304424  , -0.65600216,
       -1.2755834 , -0.41828483,  1.0467389 , -0.20455568,  0.1618619 ,
        0.6822717 , -0.06672494, -0.16852754,  0.12145358, -0.42

Models that come with built-in word vectors make them available as the `Token.vector` attribute. `Doc.vector` and `Span.vector` will default to an average of their token vectors. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalize vectors.
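
To make the averaging and the L2 norm concrete, here is a minimal sketch in plain Python. The vectors are made-up three-dimensional toy values, not output from a spaCy model, and the helper names `mean_vector` and `l2_norm` are hypothetical, chosen only for this illustration.

```python
import math

# Toy 3-dimensional "token vectors" (made-up values, not from a spaCy model).
token_vectors = [
    [1.0, 2.0, 2.0],
    [3.0, 0.0, 4.0],
]

def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def l2_norm(vector):
    """Square root of the sum of the squared components."""
    return math.sqrt(sum(x * x for x in vector))

# A document-level vector built the same way the text describes: the mean of the token vectors.
doc_vector = mean_vector(token_vectors)
print(doc_vector)                            # [2.0, 1.0, 3.0]
print([l2_norm(v) for v in token_vectors])   # [3.0, 5.0]

# Dividing a vector by its L2 norm gives a unit-length (normalized) vector.
first = token_vectors[0]
unit = [x / l2_norm(first) for x in first]
```

Normalizing by the L2 norm is what makes vectors of different lengths comparable by direction alone.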

In [36]:
import spacy

nlp = spacy.load('en_core_web_sm')
tokens = nlp(u'dog cat banana afskfsd')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 6.814786 True
cat True 7.3709016 True
banana True 7.6460695 True
afskfsd True 7.192256 True


-**Text:** The original token text.  
-**Has vector:** Does the token have a vector representation?  
-**Vector norm:** The L2 norm of the token’s vector (the square root of the sum of the squared values).  
-**OOV:** Is the token out-of-vocabulary?  

Note that the output above comes from the small model, which has no real word vectors: every token reports a vector (backed by the context-sensitive tensors) and every token is flagged as out-of-vocabulary. With a model that ships word vectors, the picture changes. The words “dog”, “cat” and “banana” are all pretty common in English, so they’re part of the model’s vocabulary and come with a vector. The word “afskfsd”, on the other hand, is a lot less common and out-of-vocabulary – so its vector representation consists of 300 dimensions of 0, which means it’s practically nonexistent. If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger models or loading in a full vector package. Example: `en_vectors_web_lg`, which includes over 1 million unique vectors.  
  
spaCy is able to compare two objects and predict how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest content to a user that’s similar to what they’re currently looking at, or label a support ticket as a duplicate if it’s very similar to an existing one.  

Use:  
**.similarity()**

In [37]:
import spacy

nlp = spacy.load('en_core_web_lg')  # make sure to use larger model!
tokens = nlp(u'dog cat banana ship')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

dog dog 1.0
dog cat 0.8220816850662231
dog banana 0.20909050107002258
dog ship 0.07517027109861374
cat dog 0.8220816850662231
cat cat 1.0
cat banana 0.2235882580280304
cat ship 0.040839508175849915
banana dog 0.20909050107002258
banana cat 0.2235882580280304
banana banana 1.0
banana ship 0.05500999465584755
ship dog 0.07517027109861374
ship cat 0.040839508175849915
ship banana 0.05500999465584755
ship ship 1.0


In this case, the model’s predictions are pretty on point. A dog is very similar to a cat, whereas a banana is not very similar to either of them. Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).

In [38]:
# !pip install notebook --upgrade

### Importing Embedded Vector Models
Models with names ending in `_lg` (large) contain from several hundred thousand to over a million 300-dimensional embedding vectors.

In [39]:
import spacy

nlp = spacy.load("en_core_web_lg")
tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 75.254234 False
cat True 63.188496 False
banana True 31.620354 False
afskfsd False 0.0 True


`is_oov` identifies words that are not in the model's vocabulary, i.e., out-of-vocabulary (OOV) words.

#### Similarity method
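The `similarity()` method returns the cosine similarity between the vectors of two tokens: values near 1.0 mean the vectors point in almost the same direction, and values near 0 mean the tokens are unrelated.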

In [40]:
nlp = spacy.load("en_core_web_lg")  # make sure to use larger model!
tokens = nlp("dog cat banana")

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

dog dog 1.0
dog cat 0.8220816850662231
dog banana 0.20909050107002258
cat dog 0.8220816850662231
cat cat 1.0
cat banana 0.2235882580280304
banana dog 0.20909050107002258
banana cat 0.2235882580280304
banana banana 1.0
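
For reference, cosine similarity is the dot product of two vectors divided by the product of their L2 norms. The sketch below computes it in plain Python on made-up toy vectors. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions of this example, not a description of spaCy's internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption for this sketch: a zero vector is treated as similar to nothing.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors (made-up values).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # 0.0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])                 # -1.0
zero_case = cosine_similarity([0.0, 0.0], [1.0, 1.0])                 # 0.0 by the convention above
print(same_direction, orthogonal, opposite, zero_case)
```

Because the result depends only on direction, scaling a vector (as in the first call) does not change its similarity to another vector.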


In [43]:
import numpy as np

def load_glove_model(File):
    print("Loading Glove Model")
    glove_model = {}
    with open(File,'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array(split_line[1:], dtype=np.float64)
            glove_model[word] = embedding
    print(f"{len(glove_model)} words loaded!")
    return glove_model
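
The loader above expects the plain-text GloVe layout: one word per line, followed by its vector components separated by whitespace. As a rough sketch of that format, the snippet below parses a tiny in-memory sample instead of a real file. The sample values and the helper name `parse_glove_lines` are invented for illustration.

```python
import io

# Two made-up lines in GloVe's text layout: "<word> <v1> <v2> ... <vN>".
sample = io.StringIO("the 0.1 0.2 0.3\nking -0.5 0.25 1.0\n")

def parse_glove_lines(lines):
    """Map each word to its list of float components."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if not parts:  # skip blank lines
            continue
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors

toy_glove = parse_glove_lines(sample)
print(len(toy_glove))     # 2
print(toy_glove["king"])  # [-0.5, 0.25, 1.0]
```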

In [44]:
glove = load_glove_model('glove.6B.100d.txt')

Loading Glove Model
400000 words loaded!


In [45]:
type(glove)

dict

In [46]:
glove.get('king')

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [47]:
# You can also load GloVe vectors with pandas
import csv
import pandas as pd

words = pd.read_table('glove.6B.100d.txt', sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

In [48]:
# Then to get the vector for a word:

def vec(w):
  return words.loc[w]

In [49]:
vec('the')

1     -0.038194
2     -0.244870
3      0.728120
4     -0.399610
5      0.083172
         ...   
96    -0.510580
97    -0.520280
98    -0.145900
99     0.827800
100    0.270620
Name: the, Length: 100, dtype: float64

In [50]:
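words_matrix = words.to_numpy()

# And to find the closest word to a vector:
def find_closest_word(v):
  # Squared Euclidean distance from v to every row of the embedding matrix
  diff = words_matrix - v
  delta = np.sum(diff * diff, axis=1)
  i = np.argmin(delta)  # row index of the nearest embedding
  return words.iloc[i].name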
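
The same nearest-neighbour idea can be tried without downloading GloVe. The following sketch runs on a three-word toy vocabulary with made-up two-dimensional vectors. The names `squared_distance` and `closest_word`, and the `exclude` parameter, are hypothetical additions for this illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest_word(vectors, query, exclude=()):
    """Return the word whose vector is nearest to `query`, skipping excluded words."""
    candidates = (word for word in vectors if word not in exclude)
    return min(candidates, key=lambda word: squared_distance(vectors[word], query))

# Made-up 2-dimensional embeddings (not real GloVe values).
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "apple": [-0.7, 0.1],
}

nearest = closest_word(toy_vectors, [1.0, 0.7])
neighbour = closest_word(toy_vectors, toy_vectors["king"], exclude={"king"})
print(nearest)    # king
print(neighbour)  # queen
```

Excluding the query word itself is a common design choice when searching for neighbours, since a word's own vector is always at distance zero.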