In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

**Vocab**: stores data shared across multiple documents.

**StringStore**: a lookup table that works in both directions (string to hash, and hash to string).

To save memory, spaCy encodes all strings as hash values. Each string is stored only once, in the StringStore, which is available as nlp.vocab.strings.

In [6]:
coffee_hash = nlp.vocab.strings["coffee"]
coffee_hash

3197928453018144401

In [7]:
# Hashes can't be reversed, which is why the shared vocab has to hold the string
# Raises an error if the string has not been seen before
string = nlp.vocab.strings[3197928453018144401]

KeyError: ignored

In [8]:
doc = nlp("I love coffee")
print("hash value:", nlp.vocab.strings["coffee"])
print("string value:", nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee


In [9]:
# The doc also exposes the shared vocab and its strings
doc.vocab.strings["coffee"]

3197928453018144401
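
The two-way lookup can also be tried on a standalone StringStore, without loading a pipeline. This is only a minimal sketch: the names store and unknown_hash are introduced here for illustration, and the point it tries to show is that a hash can be resolved back to a string only after the string has been added to the store.

```python
from spacy.strings import StringStore

# An empty store that is not attached to any pipeline
store = StringStore()

# add() registers the string and returns its hash
coffee_hash = store.add("coffee")

# Lookup works in both directions once the string is stored
assert store["coffee"] == coffee_hash
assert store[coffee_hash] == "coffee"

# Looking up a string only computes its hash; it does not store the string
unknown_hash = store["tea"]
assert "tea" not in store

# Resolving a hash whose string was never added raises a KeyError
try:
    store[unknown_hash]
except KeyError:
    print("hash not found in this store")
```

This mirrors the KeyError above: the hash for a string can always be computed, but the reverse lookup needs the string to have been seen.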

A **Lexeme** object is an entry in the vocabulary.

It contains only context-independent information about a word, such as its text, its hash, and lexical attributes like is_alpha.

In [13]:
doc = nlp("I love coffee too.")
lexeme = nlp.vocab["too"]

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha, lexeme.is_stop)

too 12286903790479710773 True True
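
To illustrate what context-independent means, the sketch below compares two tokens of the same word. It assumes a blank English pipeline (spacy.blank) so that no trained model is needed; the names blank_nlp and demo_doc and the example sentence are illustrative only.

```python
import spacy

# Blank English pipeline: tokenizer and vocab only, no trained components
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("coffee and more coffee")

first, last = demo_doc[0], demo_doc[3]

# Tokens carry context: each one knows its own position in the doc
print(first.i, last.i)

# Both tokens point to the same vocabulary entry
lexeme = blank_nlp.vocab["coffee"]
assert first.orth == last.orth == lexeme.orth

# The lexeme only knows word-level facts such as its text and is_alpha
print(lexeme.text, lexeme.is_alpha)
```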


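The Doc is one of the central data structures in spaCy. It's created automatically when you process a text with the nlp object, but you can also instantiate the class manually by passing the shared vocab, the words, and a list of booleans that says whether each word is followed by a space.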

In [14]:
# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

In [16]:
doc.text

'Hello world!'
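
The spaces list decides how the words are joined back into doc.text. A minimal sketch of that relationship, using an empty Vocab instead of nlp.vocab (an assumption made here so the snippet runs without a loaded pipeline; demo_doc and rebuilt are illustrative names):

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Hello", "world", "!"]
spaces = [True, False, False]
demo_doc = Doc(Vocab(), words=words, spaces=spaces)

# Each token stores its trailing whitespace, taken from the spaces list
trailing = [token.whitespace_ for token in demo_doc]
print(trailing)

# Joining each word with its trailing whitespace reproduces doc.text
rebuilt = "".join(token.text_with_ws for token in demo_doc)
assert rebuilt == demo_doc.text
```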

In [17]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words, spaces)
print(doc.text)

I like David Bowie


In [18]:
# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

David Bowie PERSON
[('David Bowie', 'PERSON')]
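
A span can also be built from character offsets rather than token indices. The sketch below uses Doc.char_span; the offsets 7 and 18 were counted by hand for this particular sentence, and an empty Vocab stands in for nlp.vocab so that no pipeline has to be loaded.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]
demo_doc = Doc(Vocab(), words=words, spaces=spaces)

# "David Bowie" covers characters 7 to 18 of "I like David Bowie"
person_span = demo_doc.char_span(7, 18, label="PERSON")

# char_span returns None if the offsets do not line up with token boundaries
assert person_span is not None
assert (person_span.start, person_span.end) == (2, 4)

# The resulting span can be used as an entity like any other Span
demo_doc.ents = [person_span]
print([(ent.text, ent.label_) for ent in demo_doc.ents])
```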


# Good Practice

In [19]:
doc = nlp("Berlin looks like a nice city")

In [25]:
%%time

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun and is not the last token
    if pos == "PROPN" and index + 1 < len(pos_tags):
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

Found proper noun before a verb: Berlin
CPU times: user 178 µs, sys: 0 ns, total: 178 µs
Wall time: 154 µs


In [26]:
%%time
# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun and is not the last token
    if token.pos_ == "PROPN" and token.i + 1 < len(doc):
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin
CPU times: user 108 µs, sys: 0 ns, total: 108 µs
Wall time: 113 µs


*Always convert the results to strings as late as possible, and try to use native token attributes to keep things consistent.*
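
One way to follow this advice is to keep Token objects until the final step. In the sketch below, the helper name propn_before_verb is hypothetical, and the part-of-speech tags are set by hand so that it runs without a trained pipeline; with a real pipeline the doc would simply come from nlp(text).

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def propn_before_verb(doc):
    """Return the proper-noun tokens that are directly followed by a verb."""
    # Pair each token with its successor; the last token has no successor,
    # so no index can run past the end of the doc
    return [
        token
        for token, next_token in zip(doc, doc[1:])
        if token.pos_ == "PROPN" and next_token.pos_ == "VERB"
    ]


words = ["Berlin", "looks", "like", "a", "nice", "city"]
demo_doc = Doc(Vocab(), words=words)

# Hand-assigned tags, standing in for a tagger's predictions
for token, pos in zip(demo_doc, ["PROPN", "VERB", "ADP", "DET", "ADJ", "NOUN"]):
    token.pos_ = pos

matches = propn_before_verb(demo_doc)

# Convert to strings only at the very end
print([token.text for token in matches])
```

Because the helper returns tokens rather than strings, callers can still reach attributes such as token.i or token.pos_ on the results.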