In [0]:
# Install spaCy, then download the small English model.
# The leading "!" runs each line as a shell command from inside the notebook.
!pip install spacy
!python -m spacy download en_core_web_sm

In [3]:
# Import spaCy and load the English model.
# The _sm suffix stands for "small"; this model is enough for basic tasks.
import spacy
nlp = spacy.load('en_core_web_sm')

# Create a Doc object: calling nlp() runs the text through the processing pipeline.
# The u'' prefix marks a unicode string; in Python 3 every str is already unicode, so the prefix is optional.
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

# Print each token separately, token as in elements of the text
for token in doc:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is VERB aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


This doesn't look very user-friendly, but right away we see some interesting things happen:
1. Tesla is recognized to be a Proper Noun, not just a word at the start of a sentence
2. U.S. is kept together as one entity (we call this a 'token')

As we dive deeper into spaCy we'll see what each of these abbreviations means and how it is derived. We'll also see how spaCy can interpret the last three tokens combined, `$6 million`, as referring to ***money***.

___
# Pipeline
When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data.

<img src="https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg" width="600">

We can check to see what components currently live in the pipeline. In later sections we'll learn how to disable components and add new ones as needed.

In [5]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f2d69d7db38>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f2d64f68288>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f2d64f682e8>)]

In [4]:
nlp.pipe_names

['tagger', 'parser', 'ner']

___
## Tokenization
The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. We'll go into much more detail on tokenization in an upcoming lecture. For now, let's look at another example:

In [7]:
# tokens being annotated
doc2 = nlp(u"Tesla isn't   looking into startups anymore.")

for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is VERB aux
n't ADV neg
   SPACE 
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


Notice how `isn't` has been split into two tokens. spaCy recognizes both the root verb `is` and the negation attached to it. Notice also that both the extended whitespace and the period at the end of the sentence are assigned their own tokens.

It's important to note that even though `doc2` contains processed information about each token, it also retains the original text:

In [0]:
doc2

Tesla isn't   looking into startups anymore.
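The original text can be recovered because every token records the whitespace that followed it. Here is a minimal sketch of that round trip. It assumes a tokenizer-only pipeline built with `spacy.blank("en")`, which is not used elsewhere in this notebook but avoids needing a model download; the names `blank_nlp`, `demo_doc` and `rebuilt` are illustrative only.

```python
import spacy

# Assumption: a blank English pipeline, which contains only the tokenizer.
blank_nlp = spacy.blank("en")

text = "Tesla isn't   looking into startups anymore."
demo_doc = blank_nlp(text)

# token.whitespace_ is the trailing space after the token, or '' if there is none.
for token in demo_doc:
    print(repr(token.text), repr(token.whitespace_))

# Joining text_with_ws for every token reproduces the input exactly.
rebuilt = "".join(token.text_with_ws for token in demo_doc)
print(rebuilt == text)
print(demo_doc.text == text)
```

Both comparisons should print `True`, which is why slicing or printing a Doc never loses spacing from the source string.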

In [0]:
# A Doc object supports indexing; the index counts tokens, not characters
doc2[0]

Tesla

In [8]:
type(doc2[0])

spacy.tokens.token.Token

___
## Part-of-Speech Tagging (POS)
The next step after splitting the text up into tokens is to assign parts of speech. In the above example, `Tesla` was recognized to be a ***proper noun***. Here some statistical modeling is required. For example, words that follow "the" are typically nouns.

For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging

In [0]:
doc2[0].pos_

'PROPN'

___
## Dependencies
We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj` or the ***nominal subject*** of the sentence.

For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
<br>A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf)

In [0]:
# everything is annotated within doc object
doc2[0].dep_

'nsubj'

To see the full name of a tag use `spacy.explain(tag)`

In [0]:
spacy.explain('PROPN')

'proper noun'

In [0]:
spacy.explain('nsubj')

'nominal subject'
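Since `spacy.explain` works on any label string, it can be used to build a small glossary for the abbreviations printed earlier. The sketch below assumes only that spaCy is installed; the `labels` list and the placeholder text for unknown labels are illustrative choices.

```python
import spacy

# Labels taken from the printed output earlier in this notebook.
labels = ["PROPN", "VERB", "ADP", "nsubj", "aux", "prep", "pobj"]

# spacy.explain returns None for a label it does not know,
# so fall back to a placeholder string in that case.
glossary = {label: spacy.explain(label) or "(no description)" for label in labels}

for label, description in glossary.items():
    print(f"{label:<6} {description}")
```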

___
## Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist only of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [0]:
# Lemmas (the base form of the word):
print(doc2[4].text)
print(doc2[4].lemma_)

looking
look


In [9]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4])
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

looking
VERB
VBG / verb, gerund or present participle


In [0]:
# Word Shapes:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
U.S. : X.X.
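To make the shape notation concrete, here is a plain-Python approximation of how such a pattern can be derived. This is a sketch written for illustration, not spaCy's own implementation: the function name `sketch_shape` is hypothetical, and the rule of keeping at most four symbols per run is an assumption chosen to be consistent with the `Xxxxx` output above.

```python
def sketch_shape(text: str) -> str:
    """Approximate a word shape: X/x for letters, d for digits, other characters kept."""
    shape = []
    previous = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char
        # Track how long the current run of identical symbols is.
        if symbol == previous:
            run_length += 1
        else:
            run_length = 1
            previous = symbol
        # Keep at most four symbols per run so long words collapse to a short pattern.
        if run_length <= 4:
            shape.append(symbol)
    return "".join(shape)


print(sketch_shape("Tesla"))     # Xxxxx
print(sketch_shape("U.S."))      # X.X.
print(sketch_shape("startups"))  # xxxx
```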


In [0]:
# Boolean Values:
print(doc2[0].is_alpha)
print(doc2[0].is_stop)

True
False
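The attributes above can also be collected for every token at once. The following sketch assumes a tokenizer-only pipeline from `spacy.blank("en")`, on the basis that shape and the boolean flags are lexical attributes and do not need the tagger or parser. `like_num` is an additional token attribute not covered in the table, included here only as an extra column.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough for lexical attributes.
blank_nlp = spacy.blank("en")
attr_doc = blank_nlp("Tesla is looking at buying U.S. startup for $6 million")

# One row per token: text, shape, is_alpha, is_stop, like_num.
rows = [(t.text, t.shape_, t.is_alpha, t.is_stop, t.like_num) for t in attr_doc]

print(f"{'text':<10}{'shape':<8}{'alpha':<7}{'stop':<7}{'num'}")
for text, shape, is_alpha, is_stop, like_num in rows:
    print(f"{text:<10}{shape:<8}{str(is_alpha):<7}{str(is_stop):<7}{like_num}")
```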


___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`, where `start` and `stop` are token positions.

In [10]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')
print(doc3)

Although commonly attributed to John Lennon from his song "Beautiful Boy", the phrase "Life is what happens to us while we are making other plans" was written by cartoonist Allen Saunders and published in Reader's Digest in 1957, when Lennon was 17.


In [0]:
# Note that the slice is token-based, not character-based like a string slice.
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [0]:
type(life_quote)

spacy.tokens.span.Span
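A Span keeps track of both its token positions and its character offsets in the original text, which makes the difference from string slicing easy to see. This is a minimal sketch on a shortened version of the quote, assuming a tokenizer-only pipeline from `spacy.blank("en")`; the variable names are illustrative.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to build spans.
blank_nlp = spacy.blank("en")

text = 'The phrase "Life is what happens" was written by Allen Saunders.'
span_doc = blank_nlp(text)

# Token-based slice: tokens 2 through 7 cover the quotation, including both quote marks.
quote = span_doc[2:8]
print(quote.text)

# Token positions of the span inside the Doc.
print(quote.start, quote.end)

# Character offsets of the same span inside the original string.
print(quote.start_char, quote.end_char)
print(text[quote.start_char:quote.end_char] == quote.text)
```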

___
## Sentences


In [0]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [13]:
# Doc.sents yields each sentence that spaCy detected as a Span
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [14]:
doc4[6].is_sent_start

True