## Data Structures (1) : Vocabs, Lexemes and StringStore

### Shared vocab and string store

* spaCy stores all shared data in a vocabulary, the Vocab
* This includes words, but also the label schemes for tags and entities
* To save memory, all strings are encoded to hash IDs
* If a word occurs more than once, we don't need to save it every time
* Instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store
* The string store is available as nlp.vocab.strings
* It's a lookup table that works in both directions
* You can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs
* Hash IDs can't be reversed, though
* If a word is not in the vocabulary, there's no way to get its string from the hash alone
* That's why we always need to pass around the shared vocab
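The two-way lookup and the "can't be reversed" caveat can be sketched with a standalone string store. This is a minimal sketch, assuming a recent spaCy release (v2.1 or later) where `StringStore` can be created empty and looking up an unknown string returns its hash without storing it. The helper variable names are only for illustration.

```python
from spacy.strings import StringStore

# A standalone store, independent of any pipeline
store = StringStore()

# Looking up a string returns its hash, even if the string is not stored yet
coffee_hash = store["coffee"]

# The reverse lookup fails: the hash alone cannot be turned back into text
try:
    store[coffee_hash]
    reverse_lookup_failed = False
except KeyError:
    reverse_lookup_failed = True

# Once the string is added, the lookup works in both directions
added_hash = store.add("coffee")
coffee_string = store[coffee_hash]

print(reverse_lookup_failed, added_hash == coffee_hash, coffee_string)
```

The hash is computed from the string itself, so two stores always agree on the ID for "coffee". Only a store that has actually seen the string can map the ID back to text.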

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
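# Looking up a string always returns its hash
coffee_hash = nlp.vocab.strings['coffee']
# The reverse lookup raises a KeyError if 'coffee' is not in the string store yet
coffee_string = nlp.vocab.strings[coffee_hash]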

KeyError: "[E018] Can't retrieve string for hash '3197928453018144401'. This usually refers to an issue with the `Vocab` or `StringStore`."

In [4]:
# Raises an error if we haven't seen the string before
string = nlp.vocab.strings[3197928453018144401]

KeyError: "[E018] Can't retrieve string for hash '3197928453018144401'. This usually refers to an issue with the `Vocab` or `StringStore`."

* To get the hash for a string, we can look it up in nlp.vocab.strings
* To get the string representation of a hash, we can look up the hash
* A Doc object also exposes its vocab and strings

In [5]:
# Processing the text adds its words to the shared string store
doc = nlp("I love coffee")
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee


In [6]:
doc = nlp("I love coffee")
print('hash value:', doc.vocab.strings['coffee'])

hash value: 3197928453018144401
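To make the sharing explicit, here is a small sketch that checks whether the Doc and the pipeline point at the same Vocab. It assumes a blank `English()` pipeline (tokenizer only) instead of `en_core_web_sm`, so that no trained model has to be installed. The variable names are illustrative.

```python
from spacy.lang.en import English

# A blank English pipeline (tokenizer only), so no trained model is required
nlp = English()

doc = nlp("I love coffee")

# The Doc holds a reference to the same Vocab object as the pipeline
same_vocab = doc.vocab is nlp.vocab

# Processing the text added "coffee" to the shared string store,
# so the hash can now be resolved through either object
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = doc.vocab.strings[coffee_hash]

print(same_vocab, coffee_string)
```

Because the store is shared, a hash read from one Doc can be resolved by any other object that was created with the same vocab.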


### Lexemes : entries in the vocabulary

* Lexemes are context-independent entries in the vocabulary
* You can get a lexeme by looking up a string or a hash ID in the vocab
* Lexemes expose attributes, just like tokens
* They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters
* Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context

In [7]:
doc = nlp("I love coffee")
lexeme = nlp.vocab['coffee']

In [8]:
# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True
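The difference between a lexeme and a token can be checked directly. The sketch below assumes a blank `English()` pipeline and uses `hasattr` to probe for a context-dependent attribute. The variable names are only for illustration.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I love coffee")

# The token is the word in context, the lexeme is its vocabulary entry
token = doc[2]
lexeme = nlp.vocab["coffee"]

# Token and lexeme share the same context-independent values
same_hash = token.orth == lexeme.orth
same_flag = token.is_alpha == lexeme.is_alpha

# Context-dependent attributes such as the part-of-speech tag
# are defined on the token, not on the lexeme
token_has_pos = hasattr(token, "pos_")
lexeme_has_pos = hasattr(lexeme, "pos_")

print(same_hash, same_flag, token_has_pos, lexeme_has_pos)
```

With a blank pipeline the token's `pos_` is simply empty, since no tagger has run, but the attribute still exists on the token.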


### Vocabs, lexemes and hashes

<img src="images/vocabs_lexemes_hashes.png"/>

## Strings to hashes

### Part 1

* Look up the string “cat” in nlp.vocab.strings to get the hash
* Look up the hash to get back the string

In [9]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


### Part 2

* Look up the string label “PERSON” in nlp.vocab.strings to get the hash
* Look up the hash to get back the string

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("David Bowie is a PERSON")

# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(person_hash)

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)

380
PERSON


## Data Structures (2) : Doc, Span and Token

### The Doc object

* The Doc is one of the central data structures in spaCy
* It's created automatically when you process a text with the nlp object
* But you can also instantiate the class manually
* After creating the nlp object, we can import the Doc class from spacy.tokens
* The Doc class takes three arguments: the shared vocab, the words and the spaces

In [11]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

In [12]:
# Import the Doc class
from spacy.tokens import Doc

In [13]:
# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

In [15]:
# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
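To see what the spaces argument does, the following sketch rebuilds the same doc and inspects the resulting text. Each boolean says whether the corresponding word is followed by a space. It assumes a blank `English()` pipeline, as in the cells above.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# One boolean per word: is the word followed by a space?
words = ["Hello", "world", "!"]
spaces = [True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# The text is reassembled from the words and their trailing whitespace
print(doc.text)
print([token.whitespace_ for token in doc])
```

Only "Hello" is followed by a space, so the reassembled text has no space before the exclamation mark.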

### The Span object

<img src="images/span_object_2.png"/>

* A Span is a slice of a Doc consisting of one or more tokens
* A Span takes at least three arguments: the doc it refers to, and the start and end token index of the span (the end index is exclusive)
* To create a Span manually, we can also import the class from spacy.tokens
* We can then instantiate it with the doc and the span's start and end index, and an optional label argument
* doc.ents is writable, so we can add entities manually by overwriting it with a list of spans

In [17]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

In [18]:
# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

In [19]:
# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

In [20]:
# Create a span manually
span = Span(doc, 0, 2)

In [21]:
# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

In [22]:
# Add span to the doc.ents
doc.ents = [span_with_label]
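As a quick check, the sketch below repeats the steps in a single block and reads the entities back from the doc. It assumes a blank `English()` pipeline. "GREETING" is just the example label used above, not a label from any trained model.

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span

nlp = English()
doc = Doc(nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])

# Tokens 0 and 1 ("Hello world"); the end index 2 is exclusive
span_with_label = Span(doc, 0, 2, label="GREETING")

# Overwrite the doc's entities with our manually created span
doc.ents = [span_with_label]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```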