# Tokenization


In [3]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

In [4]:
# Create a string that includes a contraction and closing punctuation
mystring = "How's everything going?"
print(mystring)

How's everything going?


In [5]:
# Create a Doc object and explore tokens
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

How | 's | everything | going | ? | 

-  **Prefix**:	Character(s) at the beginning &#9656; `$ ( “ ¿`
-  **Suffix**:	Character(s) at the end &#9656; `km ) , . ! ”`
-  **Infix**:	Character(s) in between &#9656; `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`
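
To make the four rule types above concrete, here is a minimal pure-Python sketch of how they could interact. It is a toy, not spaCy's implementation: the rule sets `EXCEPTIONS`, `PREFIXES`, `SUFFIXES` and `INFIXES` and the helpers `split_chunk` and `tokenize` are hypothetical names invented for this illustration, and they cover only a handful of the characters listed above.

```python
import re

# Hypothetical, heavily simplified rule sets for illustration only.
EXCEPTIONS = {"St.": ["St."], "U.S.": ["U.S."], "Let's": ["Let", "'s"]}
PREFIXES = ("$", "(", "“", "¿")
SUFFIXES = ("km", ")", ",", ".", "!", "”")
INFIXES = re.compile(r"(\.\.\.|--|-|/)")


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    prefixes, suffixes = [], []
    while True:
        # 1. An exception rule decides the whole remaining chunk.
        if chunk in EXCEPTIONS:
            middle = EXCEPTIONS[chunk]
            break
        # 2. Peel one prefix off the front, then re-check from the top.
        prefix = next((p for p in PREFIXES if chunk.startswith(p)), None)
        if prefix and len(chunk) > len(prefix):
            prefixes.append(prefix)
            chunk = chunk[len(prefix):]
            continue
        # 3. Peel one suffix off the end, then re-check from the top.
        suffix = next((s for s in SUFFIXES if chunk.endswith(s)), None)
        if suffix and len(chunk) > len(suffix):
            suffixes.insert(0, suffix)
            chunk = chunk[:-len(suffix)]
            continue
        # 4. Nothing left to peel: split what remains on infixes.
        middle = [piece for piece in INFIXES.split(chunk) if piece]
        break
    return prefixes + middle + suffixes


def tokenize(text):
    """Split on whitespace first, then apply the rules to each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens


print(tokenize("Let's visit St. Louis in the U.S. for $8.99/month (snail-mail)!"))
```

Checking the exception table before peeling punctuation is what keeps `St.` and `U.S.` whole, while the trailing `!` and `)` of `(snail-mail)!` are still split off one at a time.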


spaCy will isolate punctuation that does **not** form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

In [4]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


<font color=red>Note that the **exclamation points**, **comma**, and the **hyphen** in 'snail-mail' are assigned their own tokens, yet both the email address and website are preserved.</font>
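
Because the email address and the URL survive as single tokens, they can be picked out with the token attributes `like_email` and `like_url`. The sketch below assumes a blank English pipeline (`spacy.blank("en")`) in place of `en_core_web_sm`, which should be enough here since only the tokenizer and lexical attributes are involved.

```python
import spacy

# Assumption: a blank pipeline stands in for the loaded model; it uses
# the same English tokenizer rules but needs no model download.
nlp = spacy.blank("en")
doc2 = nlp("We're here to help! Send snail-mail, email support@oursite.com "
           "or visit us at http://www.oursite.com!")

# Lexical attributes flag tokens that look like emails or URLs
emails = [t.text for t in doc2 if t.like_email]
urls = [t.text for t in doc2 if t.like_url]

print("emails:", emails)
print("urls:", urls)
```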

In [8]:
doc3 = nlp('It took me $8.99/month to buy a Pycharm')

for t in doc3:
    print(t)

It
took
me
$
8.99
/
month
to
buy
a
Pycharm


<font color=red>Here the **dollar sign** and the **slash** before "month" are assigned their own tokens, yet the dollar amount `8.99` is preserved.</font>

In [10]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


<font color=red>Here the abbreviations for "Saint" and "United States" are both preserved.</font>

## Counting tokens

In [11]:
len(doc)

5

## Indexing tokens

In [12]:
doc5 = nlp(u'It is better to give than to receive.')

# Retrieve the third token:
doc5[2]

better

In [13]:
# Retrieve three tokens from the middle:
doc5[2:5]

better to give

In [14]:
# Retrieve the last four tokens:
doc5[-4:]

than to receive.
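
The outputs above print as plain text, but indexing a `Doc` returns a `Token` object and slicing returns a `Span`. A short sketch to confirm this, assuming a blank English pipeline since only the tokenizer is needed:

```python
import spacy
from spacy.tokens import Span, Token

# Assumption: a blank pipeline is used so no trained model is required.
nlp = spacy.blank("en")
doc5 = nlp("It is better to give than to receive.")

third = doc5[2]      # a single Token
middle = doc5[2:5]   # a Span: a view onto the Doc, not a copy

print(type(third).__name__, "->", third.text)
print(type(middle).__name__, "->", middle.text)
print("Span covers token positions", middle.start, "to", middle.end)
```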

#### <font color=red>Tokens cannot be reassigned</font>

In [19]:
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')

In [16]:
# Try to change "My dinner was horrible" to "My dinner was delicious"
doc6[3] = doc7[3]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
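
Since item assignment is not supported, one possible workaround is to build a new `Doc` from an edited word list. This is a sketch rather than a recommended pattern: it assumes a blank English pipeline, and the new `Doc` carries only the words and spacing, not any annotations from the original.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank pipeline is used so no trained model is required.
nlp = spacy.blank("en")
doc6 = nlp("My dinner was horrible.")
doc7 = nlp("Your dinner was delicious.")

# Copy the words and the trailing-whitespace flags out of doc6
words = [t.text for t in doc6]
spaces = [bool(t.whitespace_) for t in doc6]

# Edit the plain Python list, then construct a fresh Doc from it
words[3] = doc7[3].text
new_doc = Doc(nlp.vocab, words=words, spaces=spaces)

print(new_doc.text)
```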

___
# Named Entities


In [38]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
    print(token.text, end=' | ')


Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [46]:
for ent in doc8.ents:
    print(f'{ent.text:{15}}{ent.label_:{10}}{spacy.explain(ent.label_):>{10}}')

Apple          ORG       Companies, agencies, institutions, etc.
Hong Kong      GPE       Countries, cities, states
$6 million     MONEY     Monetary values, including unit


<font color=red>Note how two tokens combine to form the entity `Hong Kong`, and three tokens combine to form the monetary entity:  `$6 million`</font>
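
One way to picture how several tokens become one entity is the per-token B/I/O scheme (begin, inside, outside), which spaCy exposes as `token.ent_iob_` and `token.ent_type_`. The sketch below does not call spaCy: the `tagged` list is written by hand to match the three entities printed above, and `group_entities` is a hypothetical helper name used only for this illustration.

```python
# Hand-written (text, iob, label) triples matching the entities shown above.
tagged = [
    ("Apple", "B", "ORG"), ("to", "O", ""), ("build", "O", ""), ("a", "O", ""),
    ("Hong", "B", "GPE"), ("Kong", "I", "GPE"), ("factory", "O", ""),
    ("for", "O", ""), ("$", "B", "MONEY"), ("6", "I", "MONEY"),
    ("million", "I", "MONEY"),
]


def group_entities(tagged_tokens):
    """Collapse B/I/O tags into (start, end, label) token ranges."""
    spans = []
    start, label = None, None
    for i, (_, iob, ent_label) in enumerate(tagged_tokens):
        if iob == "I" and start is not None:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, label))  # close the open entity
            start, label = None, None
        if iob == "B":
            start, label = i, ent_label  # open a new entity
    if start is not None:
        spans.append((start, len(tagged_tokens), label))
    return spans


spans = group_entities(tagged)
for start, end, label in spans:
    words = [text for text, _, _ in tagged[start:end]]
    print(label, (start, end), words)
```

The `(start, end)` pairs use the same half-open convention as slicing a `Doc`, so the range `(4, 6)` corresponds to the two tokens `Hong` and `Kong`.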

In [47]:
len(doc8.ents)

3

For more info on **named entities** visit https://spacy.io/usage/linguistic-features#named-entities

---
# Noun Chunks
Like `Doc.ents`, `Doc.noun_chunks` is another property of the `Doc` object. *Noun chunks* are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in [Sheb Wooley's 1958 song](https://en.wikipedia.org/wiki/The_Purple_People_Eater), a *"one-eyed, one-horned, flying, purple people-eater"* would be one long noun chunk.

In [48]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [49]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates


In [50]:
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:
    print(chunk.text)

He
a one-eyed, one-horned, flying, purple people-eater


We'll look at additional noun_chunks components besides `.text` in an upcoming section.<br>For more info on **noun_chunks** visit https://spacy.io/usage/linguistic-features#noun-chunks

___
# Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

## Visualizing the dependency parse
Run the cell below to import displaCy and display the dependency graphic.

In [59]:
from spacy import displacy

doc = nlp(u'Shan Jiang is learning NLP on Udemy')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 120})

The optional `'distance'` argument sets the distance between tokens. If the distance is made too small, text that appears beneath short arrows may become too compressed to read.

## Visualizing the entity recognizer

In [72]:
doc = nlp(u'Shan bought a box of Diet Coke from Amazon for $19')
displacy.render(doc, style='ent', jupyter=True)

___
## Creating Visualizations Outside of Jupyter
If you're using another Python IDE or writing a script, you can choose to have spaCy serve up HTML separately:

In [73]:
doc = nlp(u'This is a sentence.')
displacy.serve(doc, style='dep')

Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



127.0.0.1 - - [27/May/2020 09:41:00] "GET / HTTP/1.1" 200 3394
127.0.0.1 - - [27/May/2020 09:41:01] "GET /favicon.ico HTTP/1.1" 200 3394


Shutting down server on port 5000.


<font color=blue>**After running the cell above, click the link below to view the dependency parse**:</font>

http://127.0.0.1:5000
<br><br>
<font color=red>**To shut down the server and return to Jupyter**, interrupt the kernel either through the **Kernel** menu above, by hitting the black square on the toolbar, or by typing the keyboard shortcut `Esc`, `I`, `I`</font>
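
If running a server is not convenient, another option in a script is to call `displacy.render` with `jupyter=False`, which returns the markup as a string that can be written to a file. The sketch below makes two assumptions so that it runs without a trained model: the parse is written by hand and passed with `manual=True`, and the output filename `sentence.svg` is a placeholder. With a real `doc` from `nlp(...)`, you would pass the `doc` itself and omit `manual=True`.

```python
from pathlib import Path

from spacy import displacy

# Hand-written parse of "This is a sentence" in displaCy's manual format.
# The tags and arcs are illustrative, not output from a trained model.
parsed = {
    "words": [
        {"text": "This", "tag": "PRON"},
        {"text": "is", "tag": "AUX"},
        {"text": "a", "tag": "DET"},
        {"text": "sentence", "tag": "NOUN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"},
    ],
}

# jupyter=False makes render() return the SVG markup instead of displaying it
svg = displacy.render(parsed, style="dep", manual=True, jupyter=False,
                      options={"distance": 120})

output_path = Path("sentence.svg")  # placeholder filename
output_path.write_text(svg, encoding="utf-8")
print(f"Wrote {len(svg)} characters to {output_path}")
```

The resulting file can be opened directly in a browser, so no kernel interrupt is needed afterwards.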