# Natural Language Processing

## Tokenization

In [13]:
# Import

import spacy  # import the spaCy library

# Load language library

nlp = spacy.load('en_core_web_sm')

In [14]:
mystring = "'We\'re moving to L.A.!'"
print(mystring)

'We're moving to L.A.!'


In [15]:
doc = nlp(mystring)

for token in doc:

    print(token.text, end = ' | ')


# Prefixes, Suffixes and Infixes
# spaCy will isolate punctuation that does *not* form an integral part of a word.

# Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token.
# However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

' | We | 're | moving | to | L.A. | ! | ' | 
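To see which kind of rule produced each split, the tokenizer can be asked to explain itself. The following is a minimal sketch, assuming spaCy 2.2.3 or later (where `Tokenizer.explain` is available) and a blank English pipeline from `spacy.blank('en')`, which should apply the same English tokenization rules without needing a model download. The names `nlp_blank`, `doc_demo` and `explained` are illustrative and not part of the original notebook.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained components)
nlp_blank = spacy.blank("en")

text = "'We're moving to L.A.!'"
doc_demo = nlp_blank(text)

# explain() returns (rule_name, token_text) pairs; the rule name indicates
# whether a prefix, suffix, infix or special-case rule produced the token
explained = nlp_blank.tokenizer.explain(text)

for rule, token_text in explained:
    print(f"{token_text!r:12} {rule}")
```

The token texts reported by `explain` should line up one-to-one with the tokens in the `Doc`, which makes it a convenient debugging aid when a string is split in an unexpected place.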

In [16]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:

    print(t, end = ' | ')

# Note that the exclamation points, comma, and the hyphen in 'snail-mail' are assigned their own tokens,
# yet both the email address and website are preserved.

We | 're | here | to | help | ! | Send | snail | - | mail | , | email | support@oursite.com | or | visit | us | at | http://www.oursite.com | ! | 

In [17]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc3:

    print(t)

# Here the distance unit and dollar sign are assigned their own tokens,
# yet the dollar amount is preserved.

A
5
km
NYC
cab
ride
costs
$
10.30


In [18]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:

    print(t)

# Here the abbreviations for "Saint" and "United States" are both preserved.

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


## Counting Tokens

In [19]:
len(doc)

8

## Counting Vocab Entries

In [20]:
len(doc4.vocab)

# This number depends on the language library loaded at the start and on any
# new lexemes added to the `vocab` when a `Doc` is created

794
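The growth of the vocabulary can be observed directly. The sketch below is an illustration, not part of the original notebook: it assumes a blank English pipeline from `spacy.blank('en')`, so the counts will differ from the 794 reported for `en_core_web_sm`, and it uses a made-up word, `zorblaxian`, that is unlikely to exist in any vocabulary beforehand.

```python
import spacy

# Assumption: a blank English pipeline, so no model download is required
nlp_blank = spacy.blank("en")

size_before = len(nlp_blank.vocab)

# 'zorblaxian' is a made-up word used only for illustration
doc_new = nlp_blank("My zorblaxian neighbour waved.")

size_after = len(nlp_blank.vocab)

# Every Doc shares the vocab of the pipeline that created it
print(doc_new.vocab is nlp_blank.vocab)
print(size_before, size_after)
```

Because the `Doc` and the pipeline share a single `Vocab` object, `len(doc_new.vocab)` and `len(nlp_blank.vocab)` report the same number.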

## Tokens cannot be re-assigned

In [21]:
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')

# Try to change "My dinner was horrible" to "My dinner was delicious"

doc6[3] = doc7[3]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
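The failed assignment can be reproduced safely by catching the exception. The following sketch assumes a blank English pipeline (`spacy.blank('en')`), which is enough here because only tokenization is involved. It also shows one possible workaround, building a new `Doc` from the edited token texts; that workaround is an illustration, not something the original notebook prescribes, and the variable names are hypothetical.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline (tokenizer only)
nlp_blank = spacy.blank("en")

doc_a = nlp_blank("My dinner was horrible.")
doc_b = nlp_blank("Your dinner was delicious.")

try:
    doc_a[3] = doc_b[3]  # Doc objects do not support item assignment
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print("TypeError:", err)

# One possible workaround: build a new Doc from the edited token texts,
# reusing the original whitespace information
words = [t.text for t in doc_a]
words[3] = doc_b[3].text
spaces = [bool(t.whitespace_) for t in doc_a]
doc_c = Doc(nlp_blank.vocab, words=words, spaces=spaces)
print(doc_c.text)
```

The original `doc_a` is left unchanged; the edited sentence lives in a separate `Doc`.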

## Named Entities

Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [24]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:

    print(token.text, end = ' | ')

print('\n----')

for ent in doc8.ents:

    print(f"{ent.text} - {ent.label_} - {spacy.explain(ent.label_)}")

# Note how two tokens combine to form the entity `Hong Kong`,
# and three tokens combine to form the monetary entity: `$6 million`

# Named Entity Recognition (NER) is an important machine learning tool
# applied to Natural Language Processing.

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 
----
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit
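To make the relationship between tokens and entities concrete, the sketch below builds the same three entity spans by hand on a blank English pipeline. This is an illustration under stated assumptions: the labels are assigned manually with `Span` rather than predicted by a model, the token indices are read off the token output above, and names such as `doc_ents` are hypothetical.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank English pipeline, so no entities are predicted
nlp_blank = spacy.blank("en")
doc_ents = nlp_blank("Apple to build a Hong Kong factory for $6 million")

# Hand-written annotations (not model predictions): Span(doc, start, end, label)
doc_ents.ents = [
    Span(doc_ents, 0, 1, label="ORG"),
    Span(doc_ents, 4, 6, label="GPE"),
    Span(doc_ents, 8, 11, label="MONEY"),
]

for ent in doc_ents.ents:
    # start and end are token indices; end is exclusive
    print(ent.text, ent.label_, ent.start, ent.end, len(ent))
```

Each entity is a `Span`, that is, a slice of the `Doc` with a label attached, which is why `Hong Kong` covers two tokens and `$6 million` covers three.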


### Noun Chunks

Like `Doc.ents`, `Doc.noun_chunks` is another property of the `Doc` object. *Noun chunks* are "base noun phrases": flat phrases that have a noun as their head. You can think of a noun chunk as a noun plus the words describing it. For example, in [Sheb Wooley's 1958 song](https://en.wikipedia.org/wiki/The_Purple_People_Eater), a *"one-eyed, one-horned, flying, purple people-eater"* would be one long noun chunk.

In [25]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:

    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [26]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates


In [28]:
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:

    print(chunk.text)

He
a one-eyed, one-horned, flying, purple people-eater
