# Natural Language Processing
Using spaCy and NLTK

NLTK vs. spaCy <br>
For many common NLP tasks, spaCy is much faster and more efficient, at the cost of the user not being able to choose the algorithmic implementation. <br>
NLTK was released in 2001. <br>
spaCy was released in 2015. <br>

However, spaCy does not include pre-built models for some applications, such as sentiment analysis, which is typically easier to perform with NLTK.

# Lesson 1 Basics

In [27]:
import spacy 
# Load the small English model (en_core_web_sm)
nlp=spacy.load('en_core_web_sm')

In [28]:
# Run the pipeline on a text to build a processed Doc
doc= nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [29]:
# token.pos is the integer ID of the coarse part-of-speech tag; token.pos_ is its string form
for token in doc:
    print(token.text,':',token.pos,token.pos_,token.dep_)

Tesla : 95 PROPN nsubj
is : 99 VERB aux
looking : 99 VERB ROOT
at : 84 ADP prep
buying : 99 VERB pcomp
U.S. : 95 PROPN compound
startup : 91 NOUN dobj
for : 84 ADP prep
$ : 98 SYM quantmod
6 : 92 NUM compound
million : 92 NUM pobj


In [30]:
nlp.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x1af87843248>),
 ('parser', <spacy.pipeline.DependencyParser at 0x1af8703f5e8>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x1af8703fb88>)]

In [31]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [32]:
doc2= nlp(u"Tesla isn't looking into startups anymore.")
for token in doc2:
    print(token.text,':',token.pos,token.pos_,token.dep_)

Tesla : 95 PROPN nsubj
is : 99 VERB aux
n't : 85 ADV neg
looking : 99 VERB ROOT
into : 84 ADP prep
startups : 91 NOUN pobj
anymore : 85 ADV advmod
. : 96 PUNCT punct


In [33]:
# Coarse-grained part-of-speech tag
doc2[0].pos_

'PROPN'

In [34]:
# Lemmatization: reduce the word to its base (dictionary) form
doc2[0].lemma_

'tesla'

In [35]:
# Original token text
doc2[0].text

'Tesla'

In [36]:
# Word shape: X = uppercase letter, x = lowercase letter, d = digit
doc2[0].shape_

'Xxxxx'

In [37]:
# Fine-grained part-of-speech tag
doc2[0].tag_

'NNP'

In [38]:
# Does the token consist only of alphabetic characters?
doc2[0].is_alpha

True

In [39]:
# Is the token an English stop word?
doc2[0].is_stop

False
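The attributes shown above fall into two groups: `pos_`, `tag_`, `dep_` and `lemma_` come from the trained pipeline, while `text`, `shape_`, `is_alpha` and `is_stop` depend only on the token text. The sketch below collects a few of the text-only attributes. It assumes `spacy.blank("en")` (a tokenizer-only pipeline, so no model download is needed); `nlp_blank`, `lexical_row` and the sample sentence with the extra `3` are names and data introduced here for illustration.

```python
import spacy

# Tokenizer-only English pipeline: enough for attributes derived from the token text.
nlp_blank = spacy.blank("en")
doc = nlp_blank("Tesla isn't looking into 3 startups anymore.")

def lexical_row(token):
    """Collect attributes that do not require a trained model."""
    return {
        "text": token.text,
        "shape": token.shape_,        # X = uppercase, x = lowercase, d = digit
        "is_alpha": token.is_alpha,   # alphabetic characters only
        "is_digit": token.is_digit,   # digits only
        "is_stop": token.is_stop,     # member of the English stop-word list
    }

rows = [lexical_row(t) for t in doc]
by_text = {row["text"]: row for row in rows}

for row in rows:
    print(row)
```

Note that `is_alpha` is `False` for the token `3`: the flag checks for alphabetic characters, not alphanumeric ones.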

Span: a slice of a Doc, such as a phrase, sentence, or section

In [40]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [41]:
life_quote=doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [42]:
type(life_quote)

spacy.tokens.span.Span

In [43]:
doc4=nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [44]:
for sentence in doc4.sents:
    print(sentence)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [45]:
doc4[6]

This

In [46]:
doc4[6].is_sent_start

True

In [47]:
doc4[7].is_sent_start

# Lesson 2 Tokenization, Named Entities, Noun Chunks

In [48]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [49]:
mystring='"We\'re moving to L.A.!"'
mystring

'"We\'re moving to L.A.!"'

In [50]:
print(mystring)

"We're moving to L.A.!"


In [51]:
doc=nlp(mystring)

In [52]:
for token in doc:
    print(token.text)

"
We
're
moving
to
L.A.
!
"


In [53]:
doc2=nlp(u"We're here to help! Send snail-mail,email support@ursite.com or visit us at http://www.oursite.com!")

In [54]:
doc2[2]

here

In [55]:
for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@ursite.com
or
visit
us
at
http://www.oursite.com
!


In [56]:
doc3=nlp(u"A 5km NYC cab ride costs $10.30")

In [57]:
for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


In [58]:
doc4=nlp(u"Let's visit St. Louis in the U.S. next year.")

In [59]:
for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


In [60]:
len(doc4)

11

In [61]:
len(doc4.vocab)

57852

In [62]:
doc5=nlp(u"It is better to give than receive.")

In [63]:
print(doc5[0])
# Note: this slices `doc` (the L.A. sentence from In [51]), not doc5
print(doc[3:6])

It
moving to L.A.


In [64]:
doc6=nlp(u"Apple to build a Hong Kong factor for $6 million.")

In [65]:
for token in doc6:
    print(token.text,end=' | ')

Apple | to | build | a | Hong | Kong | factor | for | $ | 6 | million | . | 

In [66]:
# Proper Nouns / Named entities
for entity in doc6.ents:
    print(entity)
    print(entity.label_)
    print(str(spacy.explain(entity.label_)))
    print('\n')

Apple
ORG
Companies, agencies, institutions, etc.


Hong Kong
GPE
Countries, cities, states


$6 million
MONEY
Monetary values, including unit




In [67]:
# Noun chunks

doc7=nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

In [68]:
for chunk in doc7.noun_chunks:
    print(chunk)

Autonomous cars
insurance liability
manufacturers


# Lesson 3 Tokenization Visualized

In [69]:
# displaCy is spaCy's built-in visualizer
from spacy import displacy

In [70]:
doc8=nlp(u"Apple is going to build a U.K. factory for $6 million.")

In [71]:
displacy.render(doc8,style='dep',jupyter=True,options={'distance':80})

In [72]:
doc9=nlp(u"Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.")

In [73]:
displacy.render(doc9,style='ent',jupyter=True)

In [74]:
doc = nlp(u"This is a sentence.")
displacy.serve(doc,style='dep')

# Once the server is running, open http://127.0.0.1:5000/ in a browser to view the rendering


    Serving on port 5000...
    Using the 'dep' visualizer


    Shutting down server on port 5000.



**Options**
- Zoom in out
- Size
- Background color
- Styling

In [75]:
import nltk
from nltk.stem import PorterStemmer,SnowballStemmer

In [76]:
p_stemmer=PorterStemmer()

In [77]:
words=['run','runner','ran','runs','easily','fairly','fairness']

In [78]:
for word in words:
    print(word +'------>'+p_stemmer.stem(word))

run------>run
runner------>runner
ran------>ran
runs------>run
easily------>easili
fairly------>fairli
fairness------>fair


In [79]:
s_stemmer=SnowballStemmer(language='english')
for word in words:
    print(word +'------>'+s_stemmer.stem(word))

run------>run
runner------>runner
ran------>ran
runs------>run
easily------>easili
fairly------>fair
fairness------>fair


In [80]:
words=['generous','generation','generously','generate']
print("Porter Stemmer")
for word in words:
    print(word +'------>'+p_stemmer.stem(word))
    
print("\nSnowball Stemmer")
for word in words:
    print(word +'------>'+s_stemmer.stem(word))

Porter Stemmer
generous------>gener
generation------>gener
generously------>gener
generate------>gener

Snowball Stemmer
generous------>generous
generation------>generat
generously------>generous
generate------>generat
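To make the differences between the two stemmers easier to spot, the rough sketch below keeps only the words on which they disagree. It uses the same `PorterStemmer` and `SnowballStemmer` classes as the cells above; the helper names `compare_stems` and `disagreements` are introduced here for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")

def compare_stems(words):
    """Return (word, porter_stem, snowball_stem) for every word."""
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]

def disagreements(words):
    """Keep only the rows where the two stemmers produce different stems."""
    return [row for row in compare_stems(words) if row[1] != row[2]]

sample = ["run", "runs", "easily", "fairly", "fairness", "generous"]
for word, p_stem, s_stem in disagreements(sample):
    print(f"{word:10} porter={p_stem:10} snowball={s_stem}")
```

On this sample only `fairly` and `generous` are printed, which matches the outputs recorded above.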


# Lesson 4 Lemmatization

- Unlike stemming, lemmatization considers a language's full vocabulary and applies morphological analysis to each word
- The lemma of 'was' is 'be'
- The lemma of 'mice' is 'mouse'
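The contrast with stemming can be sketched in plain Python. This is a toy illustration only, not how spaCy works internally: `IRREGULAR_LEMMAS`, `toy_stem` and `toy_lemma` are hypothetical names, and the lookup table holds just a handful of words where a real lemmatizer draws on a full vocabulary.

```python
# Toy lookup table standing in for a full vocabulary of irregular forms.
IRREGULAR_LEMMAS = {"was": "be", "am": "be", "ran": "run", "mice": "mouse"}

def toy_stem(word):
    """Strip a few common suffixes without consulting any vocabulary."""
    for suffix in ("ing", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up first; fall back to suffix stripping for unknown words."""
    lowered = word.lower()
    return IRREGULAR_LEMMAS.get(lowered, toy_stem(lowered))

for w in ["was", "mice", "ran", "cats"]:
    print(f"{w:>6}  stem={toy_stem(w):<6} lemma={toy_lemma(w)}")
```

Suffix stripping leaves `was`, `mice` and `ran` unchanged because no rule applies to them, whereas the vocabulary lookup maps them to `be`, `mouse` and `run`.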

In [81]:
import spacy 

In [82]:
nlp=spacy.load('en_core_web_sm')

In [83]:
doc1=nlp(u'I am a runner running in a race because I love to run since I ran today.')

In [84]:
for i in doc1:
    print(i.text,'\t',i.pos_,'\t',i.lemma,'\t',i.lemma_)

I 	 PRON 	 561228191312463089 	 -PRON-
am 	 VERB 	 10382539506755952630 	 be
a 	 DET 	 11901859001352538922 	 a
runner 	 NOUN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 ADP 	 16950148841647037698 	 because
I 	 PRON 	 561228191312463089 	 -PRON-
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 ADP 	 10066841407251338481 	 since
I 	 PRON 	 561228191312463089 	 -PRON-
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today
. 	 PUNCT 	 12646065887601541794 	 .


In [85]:
# Helper function that prints aligned columns using f-string width fields
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:{20}} {token.lemma_:{12}}')

In [86]:
show_lemmas(doc1)

I            PRON     561228191312463089 -PRON-      
am           VERB   10382539506755952630 be          
a            DET    11901859001352538922 a           
runner       NOUN   12640964157389618806 runner      
running      VERB   12767647472892411841 run         
in           ADP     3002984154512732771 in          
a            DET    11901859001352538922 a           
race         NOUN    8048469955494714898 race        
because      ADP    16950148841647037698 because     
I            PRON     561228191312463089 -PRON-      
love         VERB    3702023516439754181 love        
to           PART    3791531372978436496 to          
run          VERB   12767647472892411841 run         
since        ADP    10066841407251338481 since       
I            PRON     561228191312463089 -PRON-      
ran          VERB   12767647472892411841 run         
today        NOUN   11042482332948150395 today       
.            PUNCT  12646065887601541794 .           
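The `{value:{width}}` fields in `show_lemmas` are nested f-string format specifiers: the inner braces supply the column width. A small standalone sketch of the same formatting follows, using two rows copied from the output above; the constant names `TEXT_W`, `POS_W` and `HASH_W` are introduced here for illustration.

```python
rows = [
    ("running", "VERB", 12767647472892411841, "run"),
    ("I", "PRON", 561228191312463089, "-PRON-"),
]

TEXT_W, POS_W, HASH_W = 12, 6, 20  # same widths as the {12}, {6} and {20} fields

lines = []
for text, pos, lemma_hash, lemma in rows:
    # Strings are left-aligned and integers right-aligned by default.
    lines.append(f"{text:{TEXT_W}} {pos:{POS_W}} {lemma_hash:{HASH_W}} {lemma}")
print("\n".join(lines))

# An explicit '<' left-aligns the integer column as well.
left_aligned = f"{rows[1][2]:<{HASH_W}}|"
print(left_aligned)
```

This is why the hash column in the table above is flush right while the text columns are flush left.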


# Lesson 5 : Stop Words (spaCy ships 305 English stop words in the version used here)

In [87]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [88]:
print(nlp.Defaults.stop_words)

{'side', 'they', 'on', 'ourselves', 'beside', 'should', 'else', 'see', 'anyone', 'he', 'empty', 'a', 'ca', 'how', 'must', 'many', 'other', 'us', 'while', 'there', 'third', 'either', 'are', 'what', 'or', 'otherwise', 'front', 'somewhere', 'was', 'had', 'whereafter', 'please', 'first', 'once', 'may', 'anyhow', 'doing', 'did', 'where', 'none', 'became', 'here', 'nothing', 'twelve', 'her', 'between', 'move', 'ever', 'anyway', 'within', 'yourself', 'full', 'does', 'regarding', 'eight', 'same', 're', 'if', 'something', 'keep', 'just', 'seemed', 'when', 'nor', 'some', 'amongst', 'whatever', 'these', 'do', 'but', 'been', 'latter', 'against', 'hence', 'into', 'their', 'make', 'whoever', 'though', 'through', 'sometimes', 'thus', 'any', 'this', 'until', 'quite', 'anything', 'five', 'very', 'no', 'most', 'using', 'with', 'seeming', 'towards', 'another', 'forty', 'everything', 'nine', 'the', 'beforehand', 'for', 'fifty', 'whither', 'six', 'our', 'hundred', 'next', 'yet', 'unless', 'less', 'across',

In [89]:
print(len(nlp.Defaults.stop_words))

305


In [90]:
# Checking if word is a stop word
nlp.vocab['am'].is_stop

True

In [91]:
nlp.vocab['Tesla'].is_stop

False

In [92]:
# Adding into Stop word list
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True

In [93]:
nlp.vocab['btw'].is_stop

True

In [94]:
print(len(nlp.Defaults.stop_words))

306


In [95]:
# Removing Stop word from list
nlp.Defaults.stop_words.remove('btw')
nlp.vocab['btw'].is_stop = False

In [96]:
print(nlp.vocab['Tesla'].is_stop)
print(len(nlp.Defaults.stop_words))

False
305
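A typical use of the `is_stop` flag is to drop stop words before further processing. The sketch below is one way this might look; it assumes a tokenizer-only pipeline from `spacy.blank("en")`, which is sufficient because the stop-word flag depends only on the token text. `nlp_blank`, `content_words` and the sample sentence are introduced here for illustration.

```python
import spacy

# Tokenizer-only pipeline; stop-word and punctuation flags need no trained model.
nlp_blank = spacy.blank("en")

def content_words(text):
    """Return the tokens that are neither stop words nor punctuation."""
    return [tok.text for tok in nlp_blank(text) if not tok.is_stop and not tok.is_punct]

kept = content_words("Tesla is looking at buying a startup.")
print(kept)
```

The same list comprehension works on a Doc produced by the full `en_core_web_sm` pipeline.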


# Lesson 6 : Phrase Matching. Rule-based pattern matching over tokens, similar in spirit to regex

In [97]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [98]:
# Rule based Matching : Matcher
from spacy.matcher import Matcher

mymatcher=Matcher(nlp.vocab)

## Other token attributes
Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>

In [99]:
#Solarpower
pattern1 = [{'LOWER':'solarpower'}]

#Solar-power
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True},{'LOWER':'power'}]

#Solar power
pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]

In [100]:
mymatcher.add('SolarPower',None,pattern1,pattern2,pattern3)

In [101]:
doc=nlp(u'The Solar Power industry continues to grow as solarpower increases. Solar-power is amazing.')

In [102]:
found_matches = mymatcher(doc)

In [103]:
print(found_matches)
# Each match is a tuple: (match ID, start token index, end token index); the end index is exclusive

[(8656102463236116519, 1, 3), (8656102463236116519, 8, 9), (8656102463236116519, 11, 14)]


In [104]:
for match_id,start,end in found_matches:
    string_id=nlp.vocab.strings[match_id]  # pattern name stored under this hash
    span=doc[start:end]                    # matched tokens as a Span
    print(f'{match_id:{20}} {string_id:{20}} {span.text:{20}}')

 8656102463236116519 SolarPower           Solar Power         
 8656102463236116519 SolarPower           solarpower          
 8656102463236116519 SolarPower           Solar-power         
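The `mymatcher.add('SolarPower',None,pattern1,pattern2,pattern3)` call above follows the spaCy 2.x signature. As far as I know, newer releases (spaCy 3, and 2.2.2 onwards) take the patterns as a single list instead: `matcher.add(key, patterns)`. A minimal sketch of that form is below; it assumes a tokenizer-only `spacy.blank("en")` pipeline, which is enough here because `LOWER` and `IS_PUNCT` depend only on the token text. The names `nlp_blank`, `matcher` and `spans` are introduced for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")  # tokenizer only; no model download required
matcher = Matcher(nlp_blank.vocab)

patterns = [
    [{"LOWER": "solarpower"}],                                     # solarpower
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],  # solar-power
    [{"LOWER": "solar"}, {"LOWER": "power"}],                      # solar power
]
# Newer call form: the key first, then one list holding all patterns.
matcher.add("SolarPower", patterns)

doc = nlp_blank("The Solar Power industry continues to grow as solarpower increases. Solar-power is amazing.")
spans = sorted((start, end, doc[start:end].text) for _, start, end in matcher(doc))
print(spans)
```

The token indices agree with the `found_matches` output recorded above.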


In [105]:
# Remove the patterns registered under 'SolarPower'
mymatcher.remove('SolarPower')

In [106]:
# Matches 'solarpower' in any casing
pattern1=[{'LOWER':'solarpower'}]

# Matches 'solar' and 'power' separated by zero or more punctuation tokens, e.g. Solar-power, Solar--power
pattern2=[{'LOWER':'solar'},{'IS_PUNCT':True,'OP':'*'},{'LOWER':'power'}]

In [107]:
mymatcher.add('SolarPower',None,pattern1,pattern2)

In [108]:
doc=nlp(u'Solar--power is solarpower yay!')

In [109]:
found_matches1=mymatcher(doc)

In [110]:
print(found_matches1)

[(8656102463236116519, 0, 3), (8656102463236116519, 4, 5)]


In [111]:
for match_id,start,end in found_matches1:
    string_id=nlp.vocab.strings[match_id]
    span=doc[start:end]
    print(f'{match_id:{20}} {string_id:{20}} {span.text:{20}}')

 8656102463236116519 SolarPower           Solar--power        
 8656102463236116519 SolarPower           solarpower          
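The `'OP'` key is a quantifier on a single token pattern: `'!'` requires the token pattern to match zero times, `'?'` zero or one time, `'+'` one or more times, and `'*'` zero or more times. Because `'*'` also allows zero punctuation tokens, one pattern can cover both the spaced and the hyphenated spelling. A small sketch, assuming a tokenizer-only `spacy.blank("en")` pipeline and the list-style `Matcher.add` call of newer spaCy releases; the variable names and the sample text are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")  # tokenizer only; LOWER and IS_PUNCT need no model
matcher = Matcher(nlp_blank.vocab)

# 'OP': '*' lets the punctuation token appear zero or more times.
pattern = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "power"}]
matcher.add("SolarPower", [pattern])

doc = nlp_blank("Solar power and solar-power")
matched = sorted(doc[start:end].text for _, start, end in matcher(doc))
print(matched)
```

With this quantifier a separate two-token pattern for "solar power" is not needed.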


In [112]:
Doc2=nlp(u"My name is Rajat Dhawan hello.")

pattern1=[{'LOWER':'rajat'},{'LOWER':'dhawan'}]
mymatcher.add('RajatDhawan',None,pattern1)

foundmatches=mymatcher(Doc2)

for match_id,start,end in foundmatches:
    string_id=nlp.vocab.strings[match_id]
    span=Doc2[start:end]
    print(f'{match_id:{20}} {string_id:{20}} {span.text:{20}}')

In [113]:
Doc2=nlp(u"My name is Rajat Dhawan hello.")
print(mymatcher(Doc2))

[(7768932714519943303, 3, 5)]


In [114]:
Doc2=nlp(u"My name is Rajatz Dhawan hello.")
print(mymatcher(Doc2))

[]


# Lesson 7 : Matching Part 2

In [115]:
from spacy.matcher import PhraseMatcher

In [116]:
myphrasematch=PhraseMatcher(nlp.vocab)

In [117]:
with open('../Resources/UPDATED_NLP_COURSE/TextFiles/reaganomics.txt') as file:
    doc3=nlp(file.read())

In [118]:
doc3.text

'REAGANOMICS\nhttps://en.wikipedia.org/wiki/Reaganomics\n\nReaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.\n\nThe four pillars of Reagan\'s economic policy were to reduce the growth of government spending, reduce the federal income tax and capital gains tax, reduce government regulation, and tighten the money supply in order to reduce inflation.[2]\n\nThe results of Reaganomics are still debated. Supporters point to the end of stagflation, stronger GDP growth, and an entrepreneur revolution in the decades that followed.[3][4] Critics point to the widening income gap, an atmosphere of greed, and the national debt tripling in eight years which ultimately reverse

In [119]:
phraselist=['voodoo economics','supply-side economics','trickle-down economics','free-market economics']

In [120]:
phrase_patterns=[nlp(text) for text in phraselist]

In [121]:
myphrasematch.add('EconMatcher',None,*phrase_patterns)

In [122]:
found_matches=myphrasematch(doc3)

In [123]:
found_matches

[(3680293220734633682, 41, 45),
 (3680293220734633682, 49, 53),
 (3680293220734633682, 54, 56),
 (3680293220734633682, 61, 65),
 (3680293220734633682, 673, 677),
 (3680293220734633682, 2985, 2989)]

In [124]:
for match_id,start,end in found_matches:
    string_id=nlp.vocab.strings[match_id]
    span=doc3[start:end]
    print(f'{match_id:{20}} {string_id:{20}} {span.text:{20}}')

 3680293220734633682 EconMatcher          supply-side economics
 3680293220734633682 EconMatcher          trickle-down economics
 3680293220734633682 EconMatcher          voodoo economics    
 3680293220734633682 EconMatcher          free-market economics
 3680293220734633682 EconMatcher          supply-side economics
 3680293220734633682 EconMatcher          trickle-down economics
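By default the `PhraseMatcher` compares the exact token text, so a capitalised mention such as "Voodoo Economics" would not match the lowercase phrase list. One possible variation, sketched below, passes `attr="LOWER"` so that lowercase forms are compared, and builds the patterns with `nlp.make_doc`, which only tokenizes. The sketch assumes a tokenizer-only `spacy.blank("en")` pipeline and the list-style `add` call of newer spaCy releases; the sample sentence and variable names are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank("en")  # tokenizer only; no model download required
# attr="LOWER" compares lowercase token forms instead of the exact text.
phrase_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")

phrases = ["voodoo economics", "free-market economics"]
# make_doc runs only the tokenizer, which is all the matcher needs here.
phrase_matcher.add("EconMatcher", [nlp_blank.make_doc(p) for p in phrases])

doc = nlp_blank("Critics called it Voodoo Economics; advocates preferred free-market economics.")
found = sorted((start, end, doc[start:end].text) for _, start, end in phrase_matcher(doc))
for start, end, text in found:
    print(start, end, text)
```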
