## Using spaCy: Part 1
##### By Ruben SeoaneB
Based on official spaCy documentation at: https://spacy.io/usage/spacy-101

### Linguistic annotations
Linguistic annotations provide insight into a text's grammatical structure, including word types, such as parts of speech (POS), and the relationships between words.

In [1]:
# Loading models
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Suki raises $20M to create a voice assistant for doctors')
for token in doc:
    print(token.text, token.pos_, token.dep_)

Suki PROPN nsubj
raises VERB ROOT
$ SYM quantmod
20 NUM compound
M NOUN dobj
to PART aux
create VERB advcl
a DET det
voice NOUN compound
assistant NOUN dobj
for ADP prep
doctors NOUN pobj


### Tokenization
The process of segmenting a text into words, punctuation and so on.

In [2]:
for token in doc:
    print(token.text)

Suki
raises
$
20
M
to
create
a
voice
assistant
for
doctors


An example of how the ruleset works:
![title](https://spacy.io/assets/img/tokenization.svg)
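
To make the idea of a tokenization ruleset more concrete, here is a minimal sketch in plain Python. It is not spaCy's tokenizer: the function name `toy_tokenize` is hypothetical, and the only rules it applies are "split on whitespace, then peel punctuation off both ends of each chunk". spaCy's real rules also cover language-specific exceptions and other cases.

```python
import string

# Assumption: any ASCII punctuation character counts as a prefix or suffix.
PUNCT = set(string.punctuation)


def toy_tokenize(text):
    """Split on whitespace, then split punctuation off both ends of each chunk."""
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel leading punctuation, one character at a time.
        while chunk and chunk[0] in PUNCT:
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        # Peel trailing punctuation, keeping the original order.
        while chunk and chunk[-1] in PUNCT:
            suffixes.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(prefixes)
        if chunk:
            tokens.append(chunk)
        tokens.extend(suffixes)
    return tokens


print(toy_tokenize("(Hello, world!)"))
# ['(', 'Hello', ',', 'world', '!', ')']
```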

### POS tags and dependencies
After tokenization, spaCy can **parse** and **tag** a given document. Here spaCy uses a statistical model to predict which tag or label most likely applies in the given context.

In [3]:
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
         token.shape_, token.is_alpha, token.is_stop)

Apple apple PROPN NNP nsubj Xxxxx True False
is be VERB VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. u.k. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


- **Text:** The original word text.
- **Lemma:** The base form of the word.
- **POS:** The simple part-of-speech tag.
- **Tag:** The detailed part-of-speech tag.
- **Dep:** Syntactic dependency, i.e. the relation between tokens.
- **Shape:** The word shape – capitalisation, punctuation, digits.
- **is alpha:** Does the token consist of alphabetic characters?
- **is stop:** Is the token part of a stop list, i.e. the most common words of the language?
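
The **Shape** attribute can be approximated in a few lines of plain Python. The sketch below is an illustration, not spaCy's own code: the function name `word_shape` and the `max_run` parameter are assumptions. It maps uppercase letters to `X`, lowercase letters to `x` and digits to `d`, keeps other characters as they are, and cuts runs of the same shape character after four, which matches the shapes printed above.

```python
def word_shape(text, max_run=4):
    """Approximate a word shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        # Count how many times the same shape character repeats in a row.
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Only keep the first `max_run` characters of each run.
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Apple", "U.K.", "$", "1", "billion"]:
    print(word, word_shape(word))
# Apple Xxxxx
# U.K. X.X.
# $ $
# 1 d
# billion xxxx
```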

#### Understanding Tags and Labels
Calling _**spacy.explain()**_ with a tag or label returns a short description of it.
Ex: **_spacy.explain("VBZ")_** returns "verb, 3rd person singular present".

In [5]:
# We can visualize our sentence with displaCy visualizer
from spacy import displacy

displacy.render(doc, style='dep', jupyter=True, options={'distance':90})

### Named Entities

In [6]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


In [7]:
# Using displaCy again:
from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

### Word Vectors and Similarity
Predicting similarity is useful for building recommendation systems or flagging duplicates.

In [10]:
tokens = nlp(u'house apartment boat')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

house house 1.0
house apartment 0.58161336
house boat 0.47614244
apartment house 0.58161336
apartment apartment 1.0
apartment boat 0.5902201
boat house 0.47614244
boat apartment 0.5902201
boat boat 1.0


#### IMPORTANT
spaCy's small models (packages ending in sm) **don't come with word vectors**; instead, they include context-sensitive **tensors**. The similarity methods still work with these models, but the results won't be as good.

In [None]:
# To use real word vectors, we need to download a larger model
# (the leading "!" runs the line as a shell command from the notebook):
!python -m spacy download en_core_web_lg

In [3]:
import spacy

nlp = spacy.load('en_core_web_lg')
tokens = nlp(u'dog cat banana skjdsdk')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
banana True 6.700014 False
skjdsdk False 0.0 True


- **Text:** The original token text.
- **has vector:** Does the token have a vector representation?
- **Vector norm:** The L2 norm of the token's vector (the square root of the sum of the values squared)
- **OOV:** Out-of-vocabulary
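
The vector norm and the similarity scores in this section can both be worked out by hand. By default, spaCy's similarity is the cosine similarity of the two vectors. The sketch below uses plain Python lists in place of real word vectors; the helper names `l2_norm` and `cosine_similarity` and the example numbers are made up for illustration.

```python
import math


def l2_norm(vec):
    """Square root of the sum of the squared values."""
    return math.sqrt(sum(v * v for v in vec))


def cosine_similarity(a, b):
    """Dot product divided by the product of both norms."""
    norm_a, norm_b = l2_norm(a), l2_norm(b)
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector (e.g. an out-of-vocabulary token) has no direction.
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


print(l2_norm([3.0, 4.0]))                          # 5.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # 0.0 (unrelated directions)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # very close to 1 (same direction)
```

A token without a vector, like `skjdsdk` above, has a norm of 0.0, which is why the sketch treats zero-length vectors as a special case.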

### Pipelines
When calling **_nlp_** on a text, spaCy first tokenizes the text to create a **_Doc_** object. The _Doc_ is then processed in several steps, called the **processing pipeline**.
![title](https://spacy.io/assets/img/pipeline.svg)

- **Name:** ID of the pipeline component.
- **Component:** spaCy's implementation of the component.
- **Creates:** Objects, attributes and properties modified and set by the component.
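
The pipeline idea can be sketched without spaCy: a tokenizer creates the document, and each named component receives the document, adds something to it and passes it on. Everything in this example is hypothetical (a dictionary stands in for the _Doc_ object, and `title_flagger` and `token_counter` are invented components); it only illustrates the flow.

```python
def tokenizer(text):
    """Create the 'doc' (here just a dict) from raw text."""
    return {"text": text, "tokens": text.split()}


def title_flagger(doc):
    """Hypothetical component: flag tokens that are title-cased."""
    doc["is_title"] = [token.istitle() for token in doc["tokens"]]
    return doc


def token_counter(doc):
    """Hypothetical component: store the number of tokens."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc


# Each entry pairs a component name (its ID) with the callable itself.
pipeline = [("title_flagger", title_flagger), ("token_counter", token_counter)]


def run(text):
    doc = tokenizer(text)
    for name, component in pipeline:
        doc = component(doc)  # every component returns the processed doc
    return doc


doc = run("I love coffee")
print([name for name, _ in pipeline])  # ['title_flagger', 'token_counter']
print(doc["is_title"], doc["n_tokens"])  # [True, False, False] 3
```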

### Vocab, hashes and lexemes
When possible, spaCy stores data in a vocabulary, to be **shared by multiple documents**. spaCy encodes all strings to **hash values**, including entity labels like "ORG" and POS tags like "VERB". Internally, spaCy only operates with hash values.
- **Token:** A word, punctuation mark etc. in context, including its attributes, tags and dependencies.
- **Lexeme:** A "word type" with no context. Includes the word shape and flags, e.g. if it's lowercase, a digit or punctuation.
- **Doc:** A processed container of tokens in context.
- **Vocab:** The collection of lexemes.
- **StringStore:** The dictionary mapping hash values to strings, for example 3197928453018144401 → "coffee".

![title](https://spacy.io/assets/img/vocab_stringstore.svg)

The _**StringStore**_ object acts as a lookup table that works in both directions: you can pass in a string to get its hash value, or pass in a hash to get its string. This is useful when the same word appears in multiple documents and contexts, as storing the full string every time would take too much memory.
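
A two-way lookup table of this kind can be sketched in plain Python. This is a toy, not spaCy's implementation: the class `ToyStringStore` is hypothetical, and it derives its 64-bit IDs from SHA-256 as a stand-in for spaCy's own hash function, so the numbers will differ from the ones spaCy produces.

```python
import hashlib


def hash_string(text):
    """Map a string to a 64-bit integer (stand-in for spaCy's hash function)."""
    digest = hashlib.sha256(text.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyStringStore:
    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        """Remember the string so its hash can be resolved later."""
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return hash_string(item)  # string -> hash needs no stored data
        return self._by_hash[item]    # hash -> string fails if never added


store = ToyStringStore()
coffee_id = store.add("coffee")
print(store["coffee"] == coffee_id)  # True: the same string always hashes the same
print(store[coffee_id])              # coffee

# A store that has never seen the string cannot resolve the hash.
try:
    ToyStringStore()[coffee_id]
except KeyError:
    print("unknown hash")
```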

In [5]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')
print(doc.vocab.strings[u'coffee'])
print(doc.vocab.strings[3197928453018144401])

3197928453018144401
coffee


**_Lexemes_** contain the _context-independent_ information about a word. Its spelling and whether or not it consists of alphabetic characters won't change, so the hash value will remain the same.

In [6]:
import spacy

for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
         lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

I 4690420944186131903 X I I True False True en
love 3702023516439754181 xxxx l ove True False False en
coffee 3197928453018144401 xxxx c fee True False False en


- **Text:** The original text of the lexeme.
- **Orth:** The hash value of the lexeme.
- **Shape:** The abstract word shape of the lexeme.
- **Prefix:** By default, the first letter of the word string.
- **Suffix:** By default, the last three letters of the word string.
- **is alpha:** Does the lexeme consist of alphabetic characters?
- **is digit:** Does the lexeme consist of digits?

Hashes **cannot be reversed**: you cannot resolve a hash value back into a string if it's not in the dictionary. You therefore have to make sure all objects you create have access to the same vocabulary.

In [7]:
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee') #Original doc
print(doc.vocab.strings[u'coffee'])
print(doc.vocab.strings[3197928453018144401])

3197928453018144401
coffee


In [8]:
empty_doc = Doc(Vocab()) #New doc with empty Vocab
# Calling print(empty_doc.vocab.strings[3197928453018144401]) at this point
# would raise an error, because the empty Vocab does not know the string yet.

empty_doc.vocab.strings.add(u"coffee") #Adds 'coffee' and generates hash
print(empty_doc.vocab.strings[3197928453018144401])

coffee


In [9]:
new_doc = Doc(doc.vocab) #Creates new doc with the first doc's vocab
print(new_doc.vocab.strings[3197928453018144401])

coffee


### Serialization
If you have modified the pipeline, vocabulary, vectors or entities, or made updates to the model, you will need to **save your progress**, that is, everything contained in your _nlp_ object.
This means translating the contents and structure of the object into a format that can be saved, like a file or a byte string. This process is called serialization, and spaCy provides it through **built-in methods**.

##### Why Save the Vocab
Saving the vocabulary with the Doc is important, because the Vocab holds the context-independent information about the words, tags and labels, and their **hash values**. If the Vocab wasn't saved with the Doc, spaCy wouldn't know how to resolve those IDs back to strings.
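
The following toy example shows why the string table has to travel with the document. It does not use spaCy's serialization: the functions `serialize` and `deserialize` are hypothetical, JSON stands in for the real binary format, and `zlib.crc32` is used as a simple stand-in hash. The document is stored as a list of IDs, and the words can only be restored if the ID-to-string mapping was saved too.

```python
import json
import zlib


def toy_hash(text):
    """Stand-in hash: CRC32 of the UTF-8 bytes (not what spaCy uses)."""
    return zlib.crc32(text.encode("utf8"))


def serialize(words, include_strings=True):
    """Store the words as IDs, optionally together with the ID -> string table."""
    payload = {"ids": [toy_hash(word) for word in words]}
    if include_strings:
        payload["strings"] = {str(toy_hash(word)): word for word in words}
    return json.dumps(payload).encode("utf8")


def deserialize(data):
    """Resolve the IDs back to strings; unknown IDs cannot be recovered."""
    payload = json.loads(data.decode("utf8"))
    strings = payload.get("strings", {})
    return [strings.get(str(word_id), "<unknown>") for word_id in payload["ids"]]


words = ["I", "love", "coffee"]
print(deserialize(serialize(words)))
# ['I', 'love', 'coffee']
print(deserialize(serialize(words, include_strings=False)))
# ['<unknown>', '<unknown>', '<unknown>']
```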

In [None]:
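# Read the article text; the file is closed automatically afterwards
with open('suki_article.txt', 'r') as f:
    text = f.read()
doc = nlp(text)
doc.to_disk('/suki_article.bin')  # Save the processed Doc to disk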

In [15]:
# To look it up
from spacy.tokens import Doc #To create empty doc
from spacy.vocab import Vocab # To create an empty vocab

doc = Doc(Vocab()).from_disk('/suki_article.bin')

### Training
- **Training data:** Examples and their annotations.
- **Text:** The input text the model should predict a label for.
- **Label:** The label the model should predict.
- **Gradient:** Gradient of the loss function, which measures the difference between the predicted and the expected output.

![title](https://spacy.io/assets/img/training.svg)
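
The role of the gradient can be illustrated with a one-parameter model. This is a generic sketch of gradient descent, not spaCy's training code: the model (`weight * x`), the squared-error loss, the learning rate and the single training example are all assumptions chosen to keep the arithmetic easy to follow.

```python
def predict(weight, x):
    return weight * x


def loss(weight, x, label):
    """Squared difference between the prediction and the expected label."""
    return (predict(weight, x) - label) ** 2


def gradient(weight, x, label):
    """Derivative of the loss with respect to the weight."""
    return 2 * (predict(weight, x) - label) * x


weight = 0.0           # initial guess
x, label = 2.0, 4.0    # one training example: input text -> expected label
learning_rate = 0.1

initial_loss = loss(weight, x, label)
for step in range(20):
    # Move the weight a small step against the gradient to reduce the loss.
    weight -= learning_rate * gradient(weight, x, label)

print(initial_loss)              # 16.0
print(round(weight, 6))          # 2.0
print(loss(weight, x, label) < initial_loss)  # True
```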