# Part 1

In this part we discuss the building blocks of spaCy. It helps to show these things before applying spaCy to some real data.

## Prep

First we'll need a video that shows how we boot up a new virtualenv with spaCy and JupyterLab. 

In particular, we'll need folks to run this command beforehand. 

```
python -m spacy download en_core_web_md
```

Once that's taken care of, we can go to the next vid. 

## Tokens

The goal here is to show that you can load a spaCy model into an `nlp` variable that can then turn text into a `Doc` object. This `Doc` object has tokens with properties that the `nlp`-model has predicted for you. Here's one example:

In [2]:
import spacy 

nlp = spacy.load("en_core_web_md")

In [3]:
doc = nlp("Hi, my name is Vincent. I like to write Python")
for token in doc:
    print(token, token.pos_)

Hi INTJ
, PUNCT
my PRON
name NOUN
is AUX
Vincent PROPN
. PUNCT
I PRON
like VERB
to PART
write VERB
Python PROPN


You might be tempted to think that a token is a word. But that's not exactly true. Notice how in the previous example the punctuation is also a token? Here's another interesting example. 

In [4]:
doc = nlp("Python isn't just a language, it's a community!")
for token in doc:
    print(token)

Python
is
n't
just
a
language
,
it
's
a
community
!


Notice how `n't` and `'s` are tokens here? It makes sense when you consider that `n't` basically means `not` and `'s` implies `is`. spaCy constructs these tokens internally using rules that depend on the language. The rules for English are different from the rules for Dutch, just to name one example.

We won't focus on this too much, but it's good to keep in mind that a token doesn't always imply a word.
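
To see these tokenizer rules at work without a trained model, here is a minimal sketch. It assumes spaCy v3 and uses `spacy.blank("en")`, which only carries the rule-based English tokenizer; the `blank_nlp` name is made up for this example. `tokenizer.explain` reports which rule produced each token, which can be handy when a split looks surprising.

```python
import spacy

# A blank English pipeline: tokenizer rules only, no trained components.
blank_nlp = spacy.blank("en")

text = "Python isn't just a language, it's a community!"
tokens = [token.text for token in blank_nlp(text)]
print(tokens)

# tokenizer.explain pairs each token with the name of the rule that produced it.
for rule, token_text in blank_nlp.tokenizer.explain(text):
    print(f"{rule:<12} {token_text}")
```

Swapping `"en"` for another language code such as `"nl"` is one way to compare how the rules differ per language.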

## More Token Properties

I've made a function that can highlight some more properties that spaCy provides in a neat table format. This is a subset of all the properties that spaCy can calculate (mention [these docs](https://spacy.io/usage/linguistic-features#pos-tagging)). 

In [12]:
from wasabi import table

def text_to_doctable(txt):
    doc = nlp(txt)
    header = ("text", "lemma", "pos", "ent", "shape", "punct", "morph")
    data = [(tok.text, tok.lemma_, tok.pos_, tok.ent_type_, tok.shape_, tok.is_punct, tok.morph) for tok in doc]
    formatted = table(data, header=header, divider=True)
    print(formatted)

text_to_doctable("Hello internet. My name is Vincent Warmerdam. I like to write Python")


text        lemma       pos     ent      shape   punct   morph                         
---------   ---------   -----   ------   -----   -----   ------------------------------
Hello       hello       INTJ             Xxxxx   False                                 
internet    internet    NOUN             xxxx    False   Number=Sing                   
.           .           PUNCT            .       True    PunctType=Peri                
My          my          PRON             Xx      False   Number=Sing|Person=1|Poss=Yes|PronType=Prs
name        name        NOUN             xxxx    False   Number=Sing                   
is          be          AUX              xx      False   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
Vincent     Vincent     PROPN   PERSON   Xxxxx   False   Number=Sing                   
Warmerdam   Warmerdam   PROPN   PERSON   Xxxxx   False   Number=Sing                   
.           .           PUNCT            .       True    PunctType=Peri             

In particular, one of the most useful properties that spaCy provides here is the "named entity". Notice that `Vincent` is a PERSON? This property is attached to the token, but we can also get it from the `doc`. 

In [16]:
doc = nlp("Hi, my name is Vincent Warmerdam. I live near Amsterdam and I like to write Python")
for ent in doc.ents:
    print(ent, ent.label_)

Vincent Warmerdam PERSON
Amsterdam GPE


Detecting these entities is something that spaCy is pretty well known for. These models have been trained on large datasets and can generalize to a bunch of use-cases.
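
The link between the token-level and the document-level view of an entity can be sketched as follows. This is an illustration under assumptions: it uses a blank spaCy v3 pipeline, so nothing is predicted, and the `PERSON` span is set by hand to stand in for what a trained model would produce. The names `blank_nlp` and `person` are made up for the example.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: no trained NER, so the entity below is set by hand.
blank_nlp = spacy.blank("en")
doc = blank_nlp("Hi, my name is Vincent Warmerdam.")

# Token indices 5 and 6 are "Vincent" and "Warmerdam"; the end index is exclusive.
person = Span(doc, 5, 7, label="PERSON")
doc.ents = [person]

# Document-level view: one entity span covering two tokens.
for ent in doc.ents:
    print(ent.text, ent.label_, [token.text for token in ent])

# Token-level view: the same label, plus a B/I marker for begin/inside.
for token in person:
    print(token.text, token.ent_type_, token.ent_iob_)
```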

## displaCy

There's loads more stuff that spaCy can detect for you, like the grammatical dependencies. Showing that is easiest with the built-in `displacy` tool.

In [17]:
from spacy import displacy 

doc = nlp("Hi, my name is Vincent. I like to write Python")
displacy.render(doc)

This tool can also be used to display entities nicely. 

In [18]:
doc = nlp("Hi, my name is Vincent. I like to write Python")
displacy.render(doc, style="ent")

One nice thing about this visual is that it's easy to show that a `PERSON` entity can contain more than one token. Take this example.

In [19]:
doc = nlp("Hi, my name is Vincent Warmerdam. I like to write Python")
displacy.render(doc, style="ent")

One thing to remember is that our `nlp` model is a statistical model. So it won't be perfect, especially when the input text is different from the training data that spaCy used. One particular consequence of this is that capitalisation can matter a lot.

In [20]:
doc = nlp("Hi, my name is vincent. I like to write Python")
displacy.render(doc, style="ent")



## Document Properties

So far we've mainly shown properties on tokens. These are super useful, but there are also properties on the actual `Doc` itself that can be useful. The first is to show the sentences in the doc. 

In [21]:
doc = nlp("Hi, my name is Vincent. I like to write Python")
list(doc.sents)

[Hi, my name is Vincent., I like to write Python]
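
Each sentence is a span of tokens, which the sketch below makes visible by printing the token offsets. Note the assumption: it runs on a blank spaCy v3 pipeline with the rule-based `sentencizer` component added, which splits on punctuation. A trained pipeline like `en_core_web_md` typically derives sentence boundaries from its dependency parse instead, so results can differ on harder text.

```python
import spacy

blank_nlp = spacy.blank("en")
# Rule-based sentence splitter; an assumption here, not what en_core_web_md uses.
blank_nlp.add_pipe("sentencizer")

doc = blank_nlp("Hi, my name is Vincent. I like to write Python")
for sent in doc.sents:
    # Each sentence is a Span, so it carries token offsets into the Doc.
    print(sent.start, sent.end, sent.text)
```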

Similarly, you can also ask for all the noun chunks in a document.

In [22]:
doc = nlp("Star Wars is a very popular science fiction series.")
list(doc.noun_chunks)

[Star Wars, a very popular science fiction series]

Finally, you can also turn a `Doc` into a JSON representation. This is super useful if you want to expose the output of a spaCy model as an API.

In [23]:
doc = nlp("Hi, my name is Vincent. I like to write Python")
doc.to_json()

{'text': 'Hi, my name is Vincent. I like to write Python',
 'ents': [{'start': 15, 'end': 22, 'label': 'PERSON'}],
 'sents': [{'start': 0, 'end': 23}, {'start': 24, 'end': 46}],
 'tokens': [{'id': 0,
   'start': 0,
   'end': 2,
   'tag': 'UH',
   'pos': 'INTJ',
   'morph': '',
   'lemma': 'hi',
   'dep': 'intj',
   'head': 4},
  {'id': 1,
   'start': 2,
   'end': 3,
   'tag': ',',
   'pos': 'PUNCT',
   'morph': 'PunctType=Comm',
   'lemma': ',',
   'dep': 'punct',
   'head': 4},
  {'id': 2,
   'start': 4,
   'end': 6,
   'tag': 'PRP$',
   'pos': 'PRON',
   'morph': 'Number=Sing|Person=1|Poss=Yes|PronType=Prs',
   'lemma': 'my',
   'dep': 'poss',
   'head': 3},
  {'id': 3,
   'start': 7,
   'end': 11,
   'tag': 'NN',
   'pos': 'NOUN',
   'morph': 'Number=Sing',
   'lemma': 'name',
   'dep': 'nsubj',
   'head': 4},
  {'id': 4,
   'start': 12,
   'end': 14,
   'tag': 'VBZ',
   'pos': 'AUX',
   'morph': 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin',
   'lemma': 'be',
   'dep': 'R

There are a bunch more too, but I wanted to mention these because they tend to be pretty useful in my day to day work.
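
The `start` and `end` values in the JSON output are character offsets into `text`, so a consumer of such an API can recover the entities, sentences and tokens with plain string slicing. Here is a small sketch in pure Python. The dictionary is a shortened excerpt of the output above (three tokens, fewer keys), and `slice_text` is a helper made up for this example.

```python
import json

# Shortened excerpt of the doc.to_json() output shown above.
doc_json = {
    "text": "Hi, my name is Vincent. I like to write Python",
    "ents": [{"start": 15, "end": 22, "label": "PERSON"}],
    "sents": [{"start": 0, "end": 23}, {"start": 24, "end": 46}],
    "tokens": [
        {"id": 0, "start": 0, "end": 2, "pos": "INTJ"},
        {"id": 1, "start": 2, "end": 3, "pos": "PUNCT"},
        {"id": 2, "start": 4, "end": 6, "pos": "PRON"},
    ],
}


def slice_text(record, text):
    """Return the substring covered by a record's character offsets."""
    return text[record["start"]:record["end"]]


text = doc_json["text"]
entities = [(slice_text(ent, text), ent["label"]) for ent in doc_json["ents"]]
sentences = [slice_text(sent, text) for sent in doc_json["sents"]]
token_texts = [slice_text(tok, text) for tok in doc_json["tokens"]]

print(entities)
print(sentences)
print(token_texts)

# The dictionary only holds plain Python types, so it serializes directly.
payload = json.dumps(doc_json)
```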

## Data Structures 

So far we've seen three main data structures in spaCy, but it might be good to take the time to make that explicit. 

1. `Doc`. A collection of tokens together with the properties that the model estimated for them.
2. `Token`. An individual unit of text; the tokens together make up the document.
3. `Span`. A sequence of tokens, typically because they form a sentence, noun chunk or entity. A span belongs to a document and is made up of tokens.

This will be explained by making doodles, as well as this code.

In [24]:
doc = nlp("Hi. My name is Vincent.")
doc, type(doc)

(Hi. My name is Vincent., spacy.tokens.doc.Doc)

In [25]:
doc[0], type(doc[0])

(Hi, spacy.tokens.token.Token)

In [26]:
doc[:2], type(doc[:2])

(Hi., spacy.tokens.span.Span)

Note that a span can also contain a single token. It is still a `Span`, not a `Token`.

In [27]:
for ent in doc.ents:
    print(ent, type(ent))

Vincent <class 'spacy.tokens.span.Span'>


In [28]:
[tok for tok in doc]

[Hi, ., My, name, is, Vincent, .]

In [29]:
ent.start, ent.end, ent.start_char, ent.end_char

(5, 6, 15, 22)
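
The four numbers come in two flavours: `start` and `end` count tokens, while `start_char` and `end_char` count characters. The sketch below shows how they relate. It assumes spaCy v3 and a blank pipeline, so the span is created by slicing rather than by entity prediction; `blank_nlp`, `span` and `same_span` are names made up for this example.

```python
import spacy

blank_nlp = spacy.blank("en")
doc = blank_nlp("Hi. My name is Vincent.")

# Slicing by token index: token 5 is "Vincent", and the end index is exclusive.
span = doc[5:6]
print(span.start, span.end, span.start_char, span.end_char)

# Character offsets slice the raw text.
print(doc.text[span.start_char:span.end_char])

# Doc.char_span goes the other way: from character offsets to a Span.
same_span = doc.char_span(15, 22, label="PERSON")
print(same_span.text, same_span.label_)

# Offsets that do not line up with token boundaries give None.
print(doc.char_span(15, 20))
```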