In [1]:
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_lg')

In [2]:
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
doc

Apple is looking at buying U.K. startup for $1 billion

# Tokenization

What is a Token?
A token is a single unit that a sentence is split into for analysis, typically a word, a number or a punctuation mark. The task of splitting a sentence into tokens is called "tokenisation".

Example: The following sentence can be tokenised by splitting up the sentence into individual words.

	"Cytora is going to PyCon!"
	["Cytora","is","going","to","PyCon!"]
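
To make the idea concrete, here is a minimal sketch in plain Python that tokenises the example sentence in two ways. Neither is spaCy's tokenizer: the whitespace split reproduces the list above, and the regular expression is an illustrative assumption that additionally separates punctuation.

```python
import re

sentence = "Cytora is going to PyCon!"

# Naive approach: split on whitespace only, so punctuation stays glued to words.
whitespace_tokens = sentence.split()

# Slightly finer: runs of word characters, or single non-space symbols.
# This pattern is a simplification for illustration, not spaCy's rule set.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)  # ['Cytora', 'is', 'going', 'to', 'PyCon!']
print(regex_tokens)       # ['Cytora', 'is', 'going', 'to', 'PyCon', '!']
```

A pattern this simple would also break "U.K." into four pieces, which is why a real tokenizer needs language-specific exceptions.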

In [3]:
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


# Part-of-speech tagging

- Text: The original word text.
- Lemma: The base form of the word.
- POS: The simple part-of-speech tag.
- Tag: The detailed part-of-speech tag.
- Dep: Syntactic dependency, i.e. the relation between tokens.
- Shape: The word shape – capitalisation, punctuation, digits.
- is alpha: Does the token consist of alphabetic characters?
- is stop: Is the token part of a stop list, i.e. the most common words of the language?
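
The word shape is the least self-explanatory attribute in this list, so here is a rough sketch of how such a shape string could be computed. The function `word_shape` is a hypothetical helper, not spaCy's API, and the rule that a run of identical shape characters is cut off after four is an assumption that happens to reproduce shapes such as `Xxxxx` for "Apple".

```python
def word_shape(text):
    """Approximate a word shape: X/x for upper/lower letters, d for digits."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept unchanged
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:  # assumption: keep at most four repeats of one shape character
            shape.append(shape_char)
    return "".join(shape)

for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, word_shape(word))
```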

In [4]:
pd.DataFrame({
    'text':  [x.text   for x in doc],
    'lemma': [x.lemma_ for x in doc],
    'pos':   [x.pos_   for x in doc],
    'tag':   [x.tag_   for x in doc],
    'dep':   [x.dep_   for x in doc],
    'shape': [x.shape_ for x in doc],
    'is_alpha': [x.is_alpha for x in doc],
    'is_stop':  [x.is_stop  for x in doc]
}).loc[:, ['text', 'lemma', 'pos', 'tag', 'dep', 'shape', 'is_alpha', 'is_stop']]

      text    lemma    pos  tag       dep  shape  is_alpha  is_stop
0    Apple    apple  PROPN  NNP     nsubj  Xxxxx      True    False
1       is       be   VERB  VBZ       aux     xx      True    False
2  looking     look   VERB  VBG      ROOT   xxxx      True    False
3       at       at    ADP   IN      prep     xx      True    False
4   buying      buy   VERB  VBG     pcomp   xxxx      True    False
5     U.K.     u.k.  PROPN  NNP  compound   X.X.     False    False
6  startup  startup   NOUN   NN      dobj   xxxx      True    False
7      for      for    ADP   IN      prep    xxx      True    False
8        $        $    SYM    $  quantmod      $     False    False
9        1        1    NUM   CD  compound      d     False    False


Most of the tags and labels look pretty abstract, and they vary between languages. spacy.explain() will show you a short description – for example, spacy.explain("VBZ") returns "verb, 3rd person singular present".

In [5]:
spacy.explain('NNP')

'noun, proper singular'

In [6]:
spacy.displacy.render(doc, jupyter=True)

# Dependency parsing

## Noun chunks

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world’s largest tech fund".

In [7]:
for chunk in doc.noun_chunks:
    print(chunk.text, '|',
          chunk.root.text, '|',
          chunk.root.dep_, '|',
          chunk.root.head.text)

Apple | Apple | nsubj | looking
U.K. startup | startup | dobj | buying


- Text: The original noun chunk text.
- Root text: The original text of the word connecting the noun chunk to the rest of the parse.
- Root dep: Dependency relation connecting the root to its head.
- Root head text: The text of the root token's head.

## Navigating the parse tree

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of .dep is a hash value. You can get the string value with .dep_.

In [8]:
for token in doc:
    print(token.text, '|',
          token.dep_, '|',
          token.head.text, '|',
          token.head.pos_, '|',
          [child for child in token.children])

Apple | nsubj | looking | VERB | []
is | aux | looking | VERB | []
looking | ROOT | looking | VERB | [Apple, is, at]
at | prep | looking | VERB | [buying]
buying | pcomp | at | ADP | [startup, for]
U.K. | compound | startup | NOUN | []
startup | dobj | buying | VERB | [U.K.]
for | prep | buying | VERB | [billion]
$ | quantmod | billion | NUM | []
1 | compound | billion | NUM | []
billion | pobj | for | ADP | [$, 1]
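
The head/child relationship can be pictured without spaCy at all. The sketch below stores each token's dependency label and head index, copied by hand from the output above, and derives the children by inverting the heads. The `parse` list and `children` dictionary are illustrative names, not spaCy objects.

```python
# (text, dep, head_index) triples copied from the output above.
parse = [
    ("Apple", "nsubj", 2), ("is", "aux", 2), ("looking", "ROOT", 2),
    ("at", "prep", 2), ("buying", "pcomp", 3), ("U.K.", "compound", 6),
    ("startup", "dobj", 4), ("for", "prep", 4), ("$", "quantmod", 10),
    ("1", "compound", 10), ("billion", "pobj", 7),
]

# Each token stores exactly one head, so children are found by inverting the heads.
children = {i: [] for i in range(len(parse))}
for i, (_, _, head) in enumerate(parse):
    if head != i:  # the root points at itself and is nobody's child
        children[head].append(i)

for i, (text, dep, head) in enumerate(parse):
    print(text, "|", dep, "|", parse[head][0], "|",
          [parse[c][0] for c in children[i]])
```

Storing one head per token is enough to describe the whole tree, so the child lists never need to be stored separately.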


Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest: start from the child ("from below") and check its head:

In [9]:
from spacy.symbols import nsubj, VERB

# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)

In [10]:
verbs

{looking}

If you try to match from above, you'll have to iterate twice: once for the head, and then again through the children:

In [11]:
# Finding a verb with a subject from above — less good
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break

In [12]:
verbs

[looking]

To iterate through the children, use the token.children attribute, which provides a sequence of Token objects.

## Iterating around the local tree

A few more convenience attributes are provided for iterating around the local tree from the token. The Token.lefts and Token.rights attributes provide sequences of syntactic children that occur before and after the token. Both sequences are in sentence order. There are also two integer-typed attributes, Token.n_lefts and Token.n_rights, that give the number of left and right children.

In [13]:
doc = nlp(u'bright red apples on the tree')
assert [token.text for token in doc[2].lefts] == [u'bright', u'red']
assert [token.text for token in doc[2].rights] == ['on']
assert doc[2].n_lefts == 2
assert doc[2].n_rights == 1

You can get a whole phrase by its syntactic head using the Token.subtree attribute. This returns an ordered sequence of tokens. You can walk up the tree with the Token.ancestors attribute, and check dominance with Token.is_ancestor().

In [14]:
doc = nlp(u'Credit and mortgage account holders must submit their requests')
root = [token for token in doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts, descendant.n_rights,
          [ancestor.text for ancestor in descendant.ancestors])

Credit nmod 0 2 ['account', 'holders', 'submit']
and cc 0 0 ['Credit', 'account', 'holders', 'submit']
mortgage conj 0 0 ['Credit', 'account', 'holders', 'submit']
account compound 1 0 ['holders', 'submit']
holders nsubj 1 0 ['submit']
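
Walking up the tree amounts to following head pointers until the root is reached. The sketch below imitates that with a plain list of head indices, which is an assumption read off the parse of this sentence as printed in this section. The helpers `ancestors` and `is_ancestor` are stand-ins written for illustration, not the spaCy methods themselves.

```python
words = ["Credit", "and", "mortgage", "account", "holders",
         "must", "submit", "their", "requests"]
# heads[i] is the index of word i's head; the root ("submit") points at itself.
heads = [3, 0, 0, 4, 6, 6, 6, 8, 6]

def ancestors(i):
    """Follow head pointers from token i up to the root."""
    chain = []
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain

def is_ancestor(candidate, descendant):
    """True if `candidate` lies on the path from `descendant` to the root."""
    return candidate in ancestors(descendant)

for i in range(5):
    print(words[i], [words[a] for a in ancestors(i)])
```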


Finally, the .left_edge and .right_edge attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a Span object for a syntactic phrase. Note that .right_edge gives a token within the subtree, so if you use it as the end-point of a range, remember to add 1 to its index.

In [15]:
doc = nlp(u'Credit and mortgage account holders must submit their requests')
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
span.merge()
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Credit and mortgage account holders NOUN nsubj submit
must VERB aux submit
submit VERB ROOT submit
their ADJ poss requests
requests NOUN dobj submit
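
The off-by-one around the right edge is easy to see on a plain list. The sketch below uses hand-written head indices for the unmerged sentence (an assumption based on the parse shown in this section) and a hypothetical helper `subtree_indices`. It then slices the word list with and without the extra 1.

```python
words = ["Credit", "and", "mortgage", "account", "holders",
         "must", "submit", "their", "requests"]
# heads[i] is the index of word i's head; the root ("submit") points at itself.
heads = [3, 0, 0, 4, 6, 6, 6, 8, 6]

def subtree_indices(i):
    """Return the indices of token i and all its descendants, in sentence order."""
    found = {i}
    grew = True
    while grew:  # keep adding tokens whose head is already in the subtree
        grew = False
        for j, head in enumerate(heads):
            if head in found and j not in found:
                found.add(j)
                grew = True
    return sorted(found)

indices = subtree_indices(4)  # subtree of "holders"
left_edge, right_edge = indices[0], indices[-1]

too_short = words[left_edge:right_edge]      # half-open slice drops "holders"
full_span = words[left_edge:right_edge + 1]  # adding 1 includes the right edge

print(too_short)
print(full_span)
```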


In [16]:
spacy.displacy.render(doc, style='dep', jupyter=True)

# Named entities

spaCy features an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

In [17]:
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

In [18]:
for ent in doc.ents:
    print(ent.text, '|',
          ent.start_char, '|',
          ent.end_char, '|',
          ent.label_)

Apple | 0 | 5 | ORG
U.K. | 27 | 31 | GPE
$1 billion | 44 | 54 | MONEY


- Text: The original entity text.
- Start: Character offset of the start of the entity in the Doc's text.
- End: Character offset of the end of the entity in the Doc's text (exclusive).
- Label: Entity label, i.e. type.
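
Because the start and end values printed above are character offsets, they can be used to slice the raw string directly. The short check below uses only the sentence and the offsets from the output above, with no spaCy involved.

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the output above.
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# The end offset is exclusive, so an ordinary Python slice recovers the entity text.
recovered = [text[start:end] for start, end, _ in entities]

for (start, end, label), entity_text in zip(entities, recovered):
    print(entity_text, "|", start, "|", end, "|", label)
```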

In [19]:
?spacy.displacy.render

In [20]:
spacy.displacy.render(doc, style='ent', jupyter=True)

# Word vectors and similarity

spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest content to a user that's similar to what they're currently looking at, or flag a support ticket as a duplicate if it's very similar to an existing one.

In [21]:
tokens = nlp(u'dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        print(token1, token2)
        print(token1.similarity(token2))

dog dog
1.0
dog cat
0.80168563
dog banana
0.24327643
cat dog
0.80168563
cat cat
1.0
cat banana
0.28154364
banana dog
0.24327643
banana cat
0.28154364
banana banana
1.0
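
The scores above come from comparing word vectors. By default this comparison is a cosine similarity, which is why every word scores 1.0 against itself and the table is symmetric. The sketch below computes it by hand. The three-dimensional vectors in `toy_vectors` are made up for illustration and are not the model's real vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, chosen so that "dog" and "cat" point in a similar direction.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

for word1, vec1 in toy_vectors.items():
    for word2, vec2 in toy_vectors.items():
        print(word1, word2, round(cosine_similarity(vec1, vec2), 3))
```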


In [22]:
nlp = spacy.load('en_core_web_lg')
tokens = nlp(u'dog cat banana sasquatch')
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
banana True 6.700014 False
sasquatch True 6.9789977 False


- Text: The original token text.
- has vector: Does the token have a vector representation?
- Vector norm: The L2 norm of the token's vector (the square root of the sum of the squared values).
- is OOV: Is the word out-of-vocabulary?
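
As a quick illustration of the vector norm described above, the following sketch computes an L2 norm by hand. `l2_norm` and `toy_vector` are hypothetical names with made-up values, not the model's vectors.

```python
import math

def l2_norm(vector):
    """Square root of the sum of the squared components."""
    return math.sqrt(sum(x * x for x in vector))

toy_vector = [3.0, 4.0]  # hypothetical two-dimensional vector
norm = l2_norm(toy_vector)
print(norm)  # 5.0

# Dividing by the norm yields a unit-length vector pointing the same way.
unit_vector = [x / norm for x in toy_vector]
print(unit_vector, l2_norm(unit_vector))
```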

# Unigram probabilities

In [23]:
# For every token in doc, print the log-probability of the word, estimated from counts in a large corpus
for token in doc:
    print(token, ',', token.prob)

Apple , -10.153641700744629
is , -4.457748889923096
looking , -7.911639213562012
at , -5.763442516326904
buying , -9.383744239807129
U.K. , -14.15816879272461
startup , -12.337535858154297
for , -4.8801093101501465
$ , -7.450106620788574
1 , -7.639832973480225
billion , -10.603442192077637
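
Log-probabilities are easier to interpret once they are converted back to probabilities. The sketch below does this for three of the values printed above, under the assumption that they are natural logarithms. If a different base were used, the absolute numbers would change but the ordering would not.

```python
import math

# Log-probabilities copied from the output above.
log_probs = {
    "is": -4.457748889923096,
    "Apple": -10.153641700744629,
    "U.K.": -14.15816879272461,
}

# Assumption: natural logarithms, so exp() recovers the probability estimate.
probs = {word: math.exp(lp) for word, lp in log_probs.items()}

# Print from most to least frequent.
for word, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {p:.2e}")
```

Values closer to zero mean more frequent words, which is why "is" ranks far above "U.K.".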


# Rule-based matching

spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_) and flags (e.g. IS_PUNCT). The rule matcher also lets you pass in a custom callback to act on matches – for example, to merge entities and apply custom labels. You can also associate patterns with entity IDs, to allow some basic entity linking or disambiguation. To match large terminology lists, you can use the PhraseMatcher, which accepts Doc objects as match patterns.

In [24]:
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
# add match ID "HelloWorld" with no callback and one pattern
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher.add('HelloWorld', None, pattern)

doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)

In [25]:
matches

[(15578876784678163569, 0, 3)]

In [26]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
    span = doc[start:end]                    # the matched span

In [27]:
string_id

'HelloWorld'

In [28]:
span

Hello, world
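
To show why only the first "Hello, world" matches, here is a toy re-implementation of the idea in plain Python. It is a sketch, not how the Matcher works internally: `token_matches` and `find_matches` are hypothetical helpers, the token list is written by hand, and only the two attributes used in the pattern above are supported.

```python
import string

def token_matches(token, spec):
    """Check one token against one pattern dict (LOWER and IS_PUNCT only)."""
    for key, value in spec.items():
        if key == "LOWER" and token.lower() != value:
            return False
        if key == "IS_PUNCT":
            is_punct = all(ch in string.punctuation for ch in token)
            if is_punct != value:
                return False
    return True

def find_matches(tokens, pattern):
    """Slide the pattern over the tokens and collect (start, end) pairs."""
    size = len(pattern)
    matches = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(token_matches(tok, spec) for tok, spec in zip(window, pattern)):
            matches.append((start, start + size))
    return matches

# Hand-written tokenisation of 'Hello, world! Hello world!'
tokens = ["Hello", ",", "world", "!", "Hello", "world", "!"]
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
print(find_matches(tokens, pattern))  # [(0, 3)]
```

The second "Hello world" is skipped because the pattern requires a punctuation token between the two words.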

If you need to match large terminology lists, you can also use the PhraseMatcher and create Doc objects instead of token patterns, which is much more efficient overall. The Doc patterns can contain single or multiple tokens.

In [29]:
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en')
matcher = PhraseMatcher(nlp.vocab)
terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
patterns = [nlp(text) for text in terminology_list]
matcher.add('TerminologyList', None, *patterns)

doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
          u"converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)

In [30]:
matches

[(3766102292120407359, 2, 4),
 (3766102292120407359, 7, 9),
 (3766102292120407359, 19, 22)]
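
The (start, end) pairs above are token indices. The sketch below reproduces them with an exact lookup of token sequences in a set. This illustrates the concept under a hand-written tokenisation and is not a description of the PhraseMatcher's actual implementation.

```python
# Hand-written tokenisation of the sentence above (an assumption for illustration).
tokens = ["German", "Chancellor", "Angela", "Merkel", "and", "US", "President",
          "Barack", "Obama", "converse", "in", "the", "Oval", "Office", "inside",
          "the", "White", "House", "in", "Washington", ",", "D.C."]

# Each phrase is stored as a tuple of tokens so it can be looked up in a set.
phrases = {("Barack", "Obama"), ("Angela", "Merkel"), ("Washington", ",", "D.C.")}
lengths = sorted({len(phrase) for phrase in phrases})

matches = []
for start in range(len(tokens)):
    for size in lengths:
        # Slices running past the end are shorter than `size` and never match.
        if tuple(tokens[start:start + size]) in phrases:
            matches.append((start, start + size))

print(matches)  # [(2, 4), (7, 9), (19, 22)]
for start, end in matches:
    print(" ".join(tokens[start:end]))
```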