In [69]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [2]:
# Create a Doc object
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [7]:
for token in doc:
    print(token,token.pos_,token.dep_)

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


___
# spaCy Objects

After importing the spacy module in the cell above we loaded a **model** and named it `nlp`.<br>Next we created a **Doc** object by applying the model to our text, and named it `doc`.<br>spaCy also builds a companion **Vocab** object that we'll cover in later sections.<br>The **Doc** object that holds the processed text is our focus here.

In [8]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x17fb359c4a8>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x17fb3692ac8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x17fb3692b28>)]

___
## Tokenization
The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. We'll go into much more detail on tokenization in an upcoming lecture. For now, let's look at another example:

In [10]:
doc2 = nlp(u"Tesla isn't   looking into startups anymore.")

In [11]:
for token in doc2:
    print(token,token.pos_,token.dep_)

Tesla PROPN nsubj
is AUX aux
n't PART neg
   SPACE 
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


In [12]:
doc2[2]

n't

In [16]:
text="The first step in processing text is to split up all the component parts (words & punctuation) into tokens. These tokens are annotated inside the Doc object to contain descriptive information. We'll go into much more detail on tokenization in an upcoming lecture. For now, let's look at another example:"

In [17]:
text

"The first step in processing text is to split up all the component parts (words & punctuation) into tokens. These tokens are annotated inside the Doc object to contain descriptive information. We'll go into much more detail on tokenization in an upcoming lecture. For now, let's look at another example:"

In [21]:
doc3=nlp(text)

In [24]:
for token in doc3:
    print(token,token.pos_,token.dep_)

The DET det
first ADJ amod
step NOUN nsubj
in ADP prep
processing NOUN compound
text NOUN pobj
is AUX ROOT
to PART aux
split VERB xcomp
up ADP prt
all DET predet
the DET det
component NOUN compound
parts NOUN dobj
( PUNCT punct
words NOUN appos
& CCONJ cc
punctuation NOUN conj
) PUNCT punct
into ADP prep
tokens NOUN pobj
. PUNCT punct
These DET det
tokens NOUN nsubjpass
are AUX auxpass
annotated VERB ROOT
inside ADP prep
the DET det
Doc PROPN compound
object NOUN pobj
to PART aux
contain VERB xcomp
descriptive ADJ amod
information NOUN dobj
. PUNCT punct
We PRON nsubj
'll VERB aux
go VERB ROOT
into ADP prep
much ADV advmod
more ADJ amod
detail NOUN pobj
on ADP prep
tokenization NOUN pobj
in ADP prep
an DET det
upcoming ADJ amod
lecture NOUN pobj
. PUNCT punct
For ADP prep
now ADV pcomp
, PUNCT punct
let VERB ROOT
's PRON nsubj
look VERB ccomp
at ADP prep
another DET det
example NOUN pobj
: PUNCT punct


___
## Part-of-Speech Tagging (POS)
The next step after splitting the text up into tokens is to assign parts of speech. In the above example, `Tesla` was recognized to be a ***proper noun***. Here some statistical modeling is required. For example, words that follow "the" are typically nouns.
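To make the statistical idea concrete, here is a minimal sketch that counts which tag follows a determiner, using (token, tag) pairs copied from the tagged output printed in the Tokenization section above. It only illustrates learning tag patterns from data and is not how spaCy's tagger works internally; the helper name `tags_following` is hypothetical.

```python
from collections import Counter

# (token, POS) pairs copied from the first two sentences of the tagged output above
tagged = [
    ("The", "DET"), ("first", "ADJ"), ("step", "NOUN"), ("in", "ADP"),
    ("processing", "NOUN"), ("text", "NOUN"), ("is", "AUX"), ("to", "PART"),
    ("split", "VERB"), ("up", "ADP"), ("all", "DET"), ("the", "DET"),
    ("component", "NOUN"), ("parts", "NOUN"), ("(", "PUNCT"), ("words", "NOUN"),
    ("&", "CCONJ"), ("punctuation", "NOUN"), (")", "PUNCT"), ("into", "ADP"),
    ("tokens", "NOUN"), (".", "PUNCT"),
    ("These", "DET"), ("tokens", "NOUN"), ("are", "AUX"), ("annotated", "VERB"),
    ("inside", "ADP"), ("the", "DET"), ("Doc", "PROPN"), ("object", "NOUN"),
    ("to", "PART"), ("contain", "VERB"), ("descriptive", "ADJ"),
    ("information", "NOUN"), (".", "PUNCT"),
]

def tags_following(tagged_tokens, previous_tag):
    """Count the tags of tokens that directly follow a token tagged `previous_tag`."""
    counts = Counter()
    for (_, prev), (_, nxt) in zip(tagged_tokens, tagged_tokens[1:]):
        if prev == previous_tag:
            counts[nxt] += 1
    return counts

after_det = tags_following(tagged, "DET")
print(after_det.most_common())
```

Even on this tiny sample, nouns are the most frequent tag after a determiner, but adjectives and proper nouns also occur, which is why a tagger needs more context than a single preceding word.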

For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging

___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [25]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [26]:
life_quote=doc3[16:30]

In [27]:
life_quote

"Life is what happens to us while we are making other plans"

___
## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [28]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [30]:
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [31]:
doc4[6]

This

In [33]:
doc4[6].is_sent_start

True
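The flag checked above is enough to rebuild sentence segments. The following is a minimal sketch of that idea in plain Python, assuming the tokens and their start-of-sentence flags are already available as two parallel lists; `iter_sentences` is a hypothetical helper, not spaCy's implementation of `Doc.sents`.

```python
def iter_sentences(tokens, sent_starts):
    """Yield lists of tokens, starting a new list wherever the flag is True."""
    start = 0
    for i in range(1, len(tokens)):
        if sent_starts[i]:
            yield tokens[start:i]
            start = i
    if tokens:
        yield tokens[start:]

# Hand-tokenized version of the example text (punctuation is its own token)
tokens = ("This is the first sentence . This is another sentence . "
          "This is the last sentence .").split()

# Assumed flags: True at the first token of each sentence (positions 0, 6 and 11)
sent_starts = [i in (0, 6, 11) for i in range(len(tokens))]

sentences = list(iter_sentences(tokens, sent_starts))
for sentence in sentences:
    print(" ".join(sentence))
```

Position 6 is the second `This`, matching `doc4[6]` above.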

In [38]:
doc4[7].is_sent_start

# Tokenization
The first step in creating a `Doc` object is to break down the incoming text into component pieces or "tokens".

In [40]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [41]:
# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


In [42]:
doc=nlp(mystring)

In [46]:
for token in doc:
    print(token.text, end=' | ')

" | We | 're | moving | to | L.A. | ! | " | 

-  **Prefix**:	Character(s) at the beginning &#9656; `$ ( “ ¿`
-  **Suffix**:	Character(s) at the end &#9656; `km ) , . ! ”`
-  **Infix**:	Character(s) in between &#9656; `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`

## Prefixes, Suffixes and Infixes
spaCy will isolate punctuation that does *not* form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

In [47]:
text="We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!"

In [48]:
doc=nlp(text)

In [51]:
for token in doc:
    print(token.text, end='|')

We|'re|here|to|help|!|Send|snail|-|mail|,|email|support@oursite.com|or|visit|us|at|http://www.oursite.com|!|
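The prefix, suffix, and exception rules can be approximated with a short loop. This is a toy sketch under simplified assumptions: the character sets and the exception list are hand-picked, `toy_tokenize` is a hypothetical name, and infixes and contractions such as `'re` are deliberately left unsplit, so it does not reproduce spaCy's tokenizer.

```python
# Hand-picked, simplified rule sets (assumptions, not spaCy's actual rules)
PREFIXES = ('"', "(", "$", "¿")
SUFFIXES = ('"', ")", ",", ".", "!", "?")
EXCEPTIONS = {"L.A.", "U.S.", "St."}

def toy_tokenize(text):
    """Split on whitespace, then peel punctuation off both ends of each chunk."""
    tokens = []
    for chunk in text.split():
        prefix_tokens, suffix_tokens = [], []
        # Stop peeling as soon as the remaining chunk is a known exception
        while chunk and chunk not in EXCEPTIONS:
            if chunk.startswith(PREFIXES):
                prefix_tokens.append(chunk[0])
                chunk = chunk[1:]
            elif chunk.endswith(SUFFIXES):
                suffix_tokens.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        tokens.extend(prefix_tokens)
        if chunk:
            tokens.append(chunk)
        tokens.extend(suffix_tokens)
    return tokens

result = toy_tokenize('"We\'re moving to L.A.!"')
print(" | ".join(result))
```

Because `L.A.` is listed as an exception, its final period survives even though `.` is also a suffix character; an email address or URL is untouched because it neither starts nor ends with a listed character.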

## Tokens can be retrieved by index position and slice
`Doc` objects can be thought of as lists of `Token` objects. As such, individual tokens can be retrieved by index position, and spans of tokens can be retrieved through slicing:

In [52]:
text='It is better to give than to receive.'

In [53]:
doc=nlp(text)

In [55]:
doc[2]

better

In [57]:
doc[2:8]

better to give than to receive

In [59]:
doc[-5:]

give than to receive.

___
# Named Entities
Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [60]:
text='Apple to build a Hong Kong factory for $6 million'

In [61]:
doc=nlp(text)

In [68]:
for token in doc:
    print(token.text, end='|')

print('\n---------')

for ent in doc.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

Apple|to|build|a|Hong|Kong|factory|for|$|6|million|
---------
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit
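One way to picture an entity is as a labelled range of token positions. The sketch below rebuilds the three entity strings from hand-written offsets that match the output above; the tuple layout and the helper `span_text` are assumptions for illustration, though spaCy's own `Span` objects expose comparable `start`, `end`, and `label_` attributes.

```python
# (text, trailing_whitespace) pairs for the tokens printed above
tokens = [("Apple", " "), ("to", " "), ("build", " "), ("a", " "),
          ("Hong", " "), ("Kong", " "), ("factory", " "), ("for", " "),
          ("$", ""), ("6", " "), ("million", "")]

# (start, end, label) with token offsets; `end` is exclusive, as in slicing
entities = [(0, 1, "ORG"), (4, 6, "GPE"), (8, 11, "MONEY")]

def span_text(tokens, start, end):
    """Join the tokens in [start, end) using their original trailing whitespace."""
    pieces = [text + ws for text, ws in tokens[start:end]]
    return "".join(pieces).rstrip()

entity_texts = [(span_text(tokens, start, end), label)
                for start, end, label in entities]
for text, label in entity_texts:
    print(f"{text} - {label}")
```

Storing the trailing whitespace with each token is what lets `$`, `6`, and `million` be rejoined as `$6 million` rather than `$ 6 million`.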


---
# Noun Chunks
Similar to `Doc.ents`, `Doc.noun_chunks` is another property of the `Doc` object. *Noun chunks* are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in [Sheb Wooley's 1958 song](https://en.wikipedia.org/wiki/The_Purple_People_Eater), a *"one-eyed, one-horned, flying, purple people-eater"* would be one long noun chunk.

In [70]:
text='Autonomous cars shift insurance liability toward manufacturers.'

In [71]:
doc=nlp(text)

In [73]:
for chunk in doc.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [74]:
text='He was a one-eyed, one-horned, flying, purple people-eater.'

In [75]:
doc=nlp(text)

In [76]:
for chunk in doc.noun_chunks:
    print(chunk.text)

He
a one-eyed, one-horned, flying, purple people-eater


___
# Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

## Visualizing the dependency parse
Run the cell below to import displaCy and display the dependency graphic.

In [77]:
from spacy import displacy

In [78]:
text='Apple is going to build a U.K. factory for $6 million.'

In [79]:
text

'Apple is going to build a U.K. factory for $6 million.'

In [80]:
doc=nlp(text)

In [86]:
displacy.render(doc,style='dep',jupyter=True)

In [87]:
displacy.render(doc,style='ent',jupyter=True)

In [89]:
displacy.serve(doc,style='dep')

Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


# Stemming

In [91]:
import nltk
from nltk.stem.porter import PorterStemmer

In [92]:
p_stem=PorterStemmer()

In [93]:
words = ['run','runner','running','ran','runs','easily','fairly']

In [94]:
for word in words:
    print(word+'------------>'+p_stem.stem(word))

run------------>run
runner------------>runner
running------------>run
ran------------>ran
runs------------>run
easily------------>easili
fairly------------>fairli
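Results such as `easili` come from ordered suffix rules applied without any dictionary lookup. The following toy sketch implements just three rules in the spirit of the Porter stemmer (plural stripping, `-ing` removal, and a final `y` to `i`). It is a simplification written for this example, not the full algorithm that `PorterStemmer` runs, and `toy_stem` is a hypothetical name.

```python
VOWELS = set("aeiou")

def has_vowel(stem):
    return any(ch in VOWELS for ch in stem)

def toy_stem(word):
    """Apply three simplified suffix rules in a fixed order."""
    word = word.lower()

    # Rule 1: plural endings
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]

    # Rule 2: drop "-ing" if a vowel remains, then undouble a final consonant
    if word.endswith("ing") and has_vowel(word[:-3]):
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "lsz":
            word = word[:-1]

    # Rule 3: a final "y" becomes "i" when the rest of the word has a vowel
    if word.endswith("y") and has_vowel(word[:-1]):
        word = word[:-1] + "i"
    return word

words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']
stems = [toy_stem(word) for word in words]
for word, stem in zip(words, stems):
    print(word + '------------>' + stem)
```

Rule 3 explains `easili` and `fairli`, and the absence of any irregular-verb knowledge explains why `ran` is left unchanged.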


In [95]:
from nltk.stem.snowball import SnowballStemmer

In [99]:
s_stem=SnowballStemmer(language='english')

In [102]:
for word in words:
    print(word+'------------>'+s_stem.stem(word))

run------------>run
runner------------>runner
running------------>run
ran------------>ran
runs------------>run
easily------------>easili
fairly------------>fair


## Try it yourself!
#### Pass in some of your own words and test each stemmer on them. Remember to pass them as strings!

In [103]:
text='I am meeting him tomorrow at the meeting'

In [104]:
text

'I am meeting him tomorrow at the meeting'

In [105]:
for word in text.split():
    print(word+'--------->'+s_stem.stem(word))

I--------->i
am--------->am
meeting--------->meet
him--------->him
tomorrow--------->tomorrow
at--------->at
the--------->the
meeting--------->meet


# Lemmatization

In [107]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [108]:
text='I am a runner running in a race because I love to run since I ran today'

In [109]:
doc=nlp(text)

In [111]:
for token in doc:
    print(f'{token.text:{12}} {token.pos_:{10}} {token.lemma:{22}} {token.lemma_:{10}}')

I            PRON           561228191312463089 -PRON-    
am           AUX          10382539506755952630 be        
a            DET          11901859001352538922 a         
runner       NOUN         12640964157389618806 runner    
running      VERB         12767647472892411841 run       
in           ADP           3002984154512732771 in        
a            DET          11901859001352538922 a         
race         NOUN          8048469955494714898 race      
because      SCONJ        16950148841647037698 because   
I            PRON           561228191312463089 -PRON-    
love         VERB          3702023516439754181 love      
to           PART          3791531372978436496 to        
run          VERB         12767647472892411841 run       
since        SCONJ        10066841407251338481 since     
I            PRON           561228191312463089 -PRON-    
ran          VERB         12767647472892411841 run       
today        NOUN         11042482332948150395 today     


### Function to display lemmas
Since the display above is staggered and hard to read, let's write a function that displays the information we want more neatly.

In [117]:
def lem_func(text):
    doc=nlp(text)
    for token in doc:
        print(f'{token.text:{12}} {token.pos_:{12}} {token.lemma:{25}} {token.lemma_:{12}}')

In [120]:
abhra='I saw eighteen mice today!'

In [115]:
doc=nlp(text)

In [123]:
lem_func(abhra)

I            PRON                561228191312463089 -PRON-      
saw          VERB              11925638236994514241 see         
eighteen     NUM                9609336664675087640 eighteen    
mice         NOUN               1384165645700560590 mouse       
today        NOUN              11042482332948150395 today       
!            PUNCT             17494803046312582752 !           
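The column layout comes from the width given inside each f-string field, for example `{token.text:{12}}`. A small sketch of how that format spec behaves, using one row of values copied from the output above (the variable names are only for illustration):

```python
text, pos, lemma_hash, lemma = "saw", "VERB", 11925638236994514241, "see"

# A bare width pads strings on the right (left-aligned) and
# integers on the left (right-aligned)
row = f"{text:{12}} {pos:{12}} {lemma_hash:{25}} {lemma:{12}}"
print(repr(row))

# An explicit "<" forces left alignment for the integer column as well
aligned = f"{text:<12} {pos:<12} {lemma_hash:<25} {lemma:<12}"
print(repr(aligned))
```

The default right alignment of integers is why the hash column lines up on its last digit while the text columns line up on their first character.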


In [124]:
text="That's an enormous automobile"

In [125]:
lem_func(text)

That         DET                4380130941430378203 that        
's           AUX               10382539506755952630 be          
an           DET               15099054000809333061 an          
enormous     ADJ               17917224542039855524 enormous    
automobile   NOUN               7211811266693931283 automobile  


# Stop Words

In [126]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [127]:
print(nlp.Defaults.stop_words)

{'call', 'forty', '’m', 'meanwhile', 'only', 'show', 'first', 'when', 'against', 'n’t', 'whereupon', 'any', 'almost', 'regarding', 'everywhere', 'on', 'yourselves', 'than', 'without', 'himself', 'does', 'whereby', 'my', 'otherwise', 'mostly', 'sixty', 'third', 'whose', 'few', '‘m', 'thereafter', 'beyond', 'say', 'seemed', 'his', 'therefore', 'whenever', 'further', 'but', 'nobody', 'cannot', 'less', 'have', '‘ve', 'besides', 'such', 'anyhow', 'amount', 'either', 'whatever', 'could', 'keep', 'although', 'or', 'yours', 'perhaps', '’d', 'an', 'formerly', 'did', 'done', 'wherein', 'through', 'anyway', 'it', 'may', 'where', 'nowhere', 'she', 'well', 'because', 'themselves', 'always', 'were', "n't", 'however', 'except', 'thereby', 'make', 'and', 'none', 'back', '‘s', 'same', 'should', '’s', 'hereupon', 'nothing', 'so', 'nine', 'fifteen', 'no', 'above', 'doing', 'ever', '’re', 'another', 'other', 'to', 'towards', 'must', 'name', '‘d', 'take', 'us', 'every', 'becoming', 'together', 'not', 'whom

In [128]:
len(nlp.Defaults.stop_words)

326

## To see if a word is a stop word

In [129]:
nlp.vocab['Abhra'].is_stop

False

## To add a stop word
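There may be times when you wish to add a stop word to the default set. Perhaps you decide that `'btw'` (common shorthand for "by the way") should be considered a stop word. The cells below use `'Abhra'` as the example: the word is added to the default set, and the `is_stop` flag is set on its vocabulary entry.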

In [130]:
nlp.Defaults.stop_words.add('Abhra')

In [131]:
nlp.vocab['Abhra'].is_stop = True  # set the flag with a boolean

In [132]:
nlp.vocab['Abhra'].is_stop

True

In [133]:
len(nlp.Defaults.stop_words)

327

## To remove a stop word
Alternatively, you may decide that `'beyond'` should not be considered a stop word.

In [134]:
nlp.Defaults.stop_words.remove('beyond')

In [135]:
nlp.vocab['beyond'].is_stop = False  # must be the boolean False; the string 'False' is truthy

In [136]:
len(nlp.Defaults.stop_words)

326

In [137]:
nlp.vocab['beyond'].is_stop

False
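A typical use of a stop-word set is filtering tokens before further analysis. Here is a minimal sketch in plain Python, assuming a small hand-picked set rather than spaCy's full default list; `remove_stop_words` is a hypothetical helper.

```python
# Hypothetical hand-picked subset; the default set above has several hundred entries
stop_words = {"i", "am", "a", "in", "to", "the", "because", "since"}

def remove_stop_words(tokens, stop_words):
    """Keep only tokens whose lowercase form is not in the stop-word set."""
    return [token for token in tokens if token.lower() not in stop_words]

sentence = "I am a runner running in a race because I love to run"
content_words = remove_stop_words(sentence.split(), stop_words)
print(content_words)
```

Lowercasing before the lookup matters because the set stores lowercase entries, so `I` would otherwise slip through.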

# Vocabulary and Matching