spaCy:<br>
- open source NLP library<br>
- designed to handle NLP tasks with one efficient implementation of each common algorithm

NLTK:<br>
- Natural Language Toolkit, also open source

spaCy is usually faster than NLTK, but at the cost of not being able to choose which algorithm is used

spaCy does not ship pre-trained models for some applications, such as sentiment analysis

In [5]:
# install spaCy and the small English pipeline
# (the 'en' shortcut is deprecated as of spaCy v3.0, so use the full package name)
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm') # small English pipeline; en_core_web_lg is the large version
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [3]:
# create a document using the text below and the model defined above
doc = nlp(u'Tesla is looking at buying TeamTango, U.K, for $6 million') # will parse into tokens

In [4]:
for token in doc:
    print(f"{token.text:{10}} {token.pos_:{10}} {token.dep_}") # e.g. tells us Tesla is a proper noun and its syntactic dependency

Tesla      PROPN      nsubj
is         AUX        aux
looking    VERB       ROOT
at         ADP        prep
buying     VERB       pcomp
TeamTango  PROPN      dobj
,          PUNCT      punct
U.K        PROPN      appos
,          PUNCT      punct
for        ADP        prep
$          SYM        quantmod
6          NUM        compound
million    NUM        pobj


In [5]:
doc2 = nlp(u"Isn't is a word!")
for token in doc2:
    print(f"{token.text:{10}} {token.pos_:{10}} {token.dep_}")
# the tokenizer splits "Isn't" into "Is" and "n't", and the parser labels "n't" as a negation (neg)

Is         AUX        aux
n't        PART       neg
is         AUX        ROOT
a          DET        det
word       NOUN       attr
!          PUNCT      punct


In [6]:
# spacy.explain returns a short description of a tag or label
print(spacy.explain("nsubj"))

nominal subject


In [7]:
# Lemmatization, base form of the word
for a in nlp(u"Having"):
    print(a.lemma_)

have


In [8]:
doc3 = nlp(u"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.")

part = doc3[5:30] # tokens 5 through 29; the end index is exclusive

print(type(doc3))
print(type(part))

<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.span.Span'>


In [9]:
doc4 = nlp(u"First sentence. This is the second. Third sentence.")
for sentence in doc4.sents:
    print(sentence)

First sentence.
This is the second.
Third sentence.


TOKENIZATION<br>
- spaCy first splits on whitespace, then applies prefix, suffix, infix and exception rules to each chunk
- Exceptions are special-case rules that either split a string into several tokens (e.g. "Isn't" becomes "Is" and "n't") or prevent a string from being split when the other rules would otherwise apply

Tokens are the building blocks of a Doc object. Everything that helps us understand the meaning of the text is derived from the tokens and their relationships to one another
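The order of those rules can be sketched in plain Python. This is a toy illustration, not spaCy's tokenizer: the `PREFIXES`, `SUFFIXES` and `EXCEPTIONS` tables and the `toy_tokenize` function are hypothetical names made up for this example, and infix handling is left out.

```python
# Toy illustration only: NOT spaCy's tokenizer, just the same general idea.
PREFIXES = ('"', "(", "$")            # hypothetical prefix characters
SUFFIXES = ('"', ")", "!", ",", ".")  # hypothetical suffix characters
# Hypothetical special cases: a string mapped to a fixed list of tokens.
EXCEPTIONS = {"We're": ["We", "'re"], "U.S.A": ["U.S.A"]}


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():                     # 1. split on whitespace
        prefixes, suffixes = [], []
        # Stop peeling as soon as the remaining chunk is a known exception.
        while chunk and chunk not in EXCEPTIONS:
            if chunk.startswith(PREFIXES):         # 2. peel one leading character
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif chunk.endswith(SUFFIXES):         # 3. peel one trailing character
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        tokens.extend(prefixes)
        if chunk:
            # 4. exceptions decide whether the chunk is split or kept whole
            tokens.extend(EXCEPTIONS.get(chunk, [chunk]))
        tokens.extend(suffixes)
    return tokens


tokens = toy_tokenize('"We\'re moving to U.S.A! My email is Tango@Team.com"')
print(tokens)
```

The exception table is checked before any further peeling, which is why "U.S.A" survives as one token while the trailing "!" is still split off.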

In [10]:
mystr = '"We\'re moving to U.S.A! My email is Tango@Team.com"'
doc5 = nlp(mystr)
for token in doc5:
    print(f"{token.text:{15}} {token.pos_:{10}} {token.dep_}")

"               PUNCT      punct
We              PRON       nsubj
're             AUX        aux
moving          VERB       ROOT
to              ADP        prep
U.S.A           PROPN      pobj
!               PUNCT      punct
My              PRON       poss
email           NOUN       nsubj
is              AUX        ROOT
Tango@Team.com  X          attr
"               PUNCT      punct


In [11]:
mystr2 = "Apple to build a Hong Kong factory for $6 billion"
doc6 = nlp(mystr2)
for entity in doc6.ents:
    print(f"{str(entity):{15}} {entity.label_:{10}} {str(spacy.explain(entity.label_))}")

print("\n")
for chunk in doc6.noun_chunks:
    print(chunk)

Apple           ORG        Companies, agencies, institutions, etc.
Hong Kong       GPE        Countries, cities, states
$6 billion      MONEY      Monetary values, including unit


Apple
a Hong Kong factory


In [12]:
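from spacy import displacy # import the visualizer submodule explicitly

doc7 = nlp(u"Apple is going to build a U.K. factory for $20 billion")
displacy.render(doc7, style='dep', jupyter=True, options={'distance': 90}) # draw the dependency parse inline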

In [13]:
doc8 = nlp(u"Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $X million!")
spacy.displacy.render(doc8, style='ent',jupyter=True)

In [14]:
doc9 = nlp(u"This is my testing sentence, visit 127.0.0.1:5000 to see the image!")
spacy.displacy.serve(doc9, style='dep')




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


STEMMING<br>
- essentially chops letters off the end of a word until the stem is reached
- English has many irregular forms, so rule-based stemming often gives poor results

spaCy provides lemmatization rather than stemming, so we will use NLTK for stemming

For a common stemming algorithm, see the Porter stemmer. Porter also created the Snowball stemmer
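To make the "chop off letters" idea concrete, here is a deliberately naive suffix stripper. It is not the Porter or Snowball algorithm; the suffix list and the `naive_stem` function are assumptions made up for this sketch.

```python
# Toy suffix stripper, NOT the Porter or Snowball algorithm.
SUFFIXES = ("ing", "ly", "es", "s")  # hypothetical list, checked in this order


def naive_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["run", "runner", "ran", "runs", "easily", "fairly", "running"]
stems = {word: naive_stem(word) for word in words}
print(stems)
```

Even this small example shows the typical weaknesses: "ran" is never reduced to "run", and "running" becomes "runn" rather than a real word.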

In [78]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.6.7-py3-none-any.whl (1.5 MB)
     ---------------------------------------- 1.5/1.5 MB 8.6 MB/s eta 0:00:00
Installing collected packages: nltk
Successfully installed nltk-3.6.7




In [79]:
import nltk
from nltk.stem.porter import PorterStemmer

In [81]:
p_stemmer = PorterStemmer()
words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly']
for word in words:
    print(word + " -----> " + p_stemmer.stem(word))

run -----> run
runner -----> runner
ran -----> ran
runs -----> run
easily -----> easili
fairly -----> fairli


In [82]:
from nltk.stem.snowball import SnowballStemmer

In [83]:
s_stemmer = SnowballStemmer(language='english')
for word in words:
    print(word + " -----> " + s_stemmer.stem(word))

run -----> run
runner -----> runner
ran -----> ran
runs -----> run
easily -----> easili
fairly -----> fair


LEMMATIZATION<br>
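- considers the full vocabulary of the language rather than just chopping off suffixes
- e.g. the lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'
- lemmatization looks at the surrounding text to determine a word's part of speech, which can change its lemma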
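A minimal sketch of why the part of speech matters, using a lookup table keyed by word and tag. The table contents and the `toy_lemma` function are hypothetical; spaCy's lemmatizer is considerably more elaborate than a dictionary lookup.

```python
# Toy lookup table keyed by (lowercased word, part of speech). Hypothetical data.
LEMMA_TABLE = {
    ("was", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",     # "we are meeting today"
    ("meeting", "NOUN"): "meeting",  # "the meeting ran long"
}


def toy_lemma(word, pos):
    # Fall back to the lowercased word when the pair is not in the table.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(toy_lemma("Mice", "NOUN"))
print(toy_lemma("meeting", "VERB"))
print(toy_lemma("meeting", "NOUN"))
```

The same surface form "meeting" gets two different lemmas, which a stemmer working on the string alone cannot do.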

In [101]:
doc10 = nlp(u"I am a runner running in a race because I love to run since I ran today.")
for token in doc10:
    print(f"{token.text:{12}} {token.pos_:{7}} {token.lemma:<{25}} {token.lemma_:{15}}")

I            PRON    4690420944186131903       I              
am           AUX     10382539506755952630      be             
a            DET     11901859001352538922      a              
runner       NOUN    12640964157389618806      runner         
running      VERB    12767647472892411841      run            
in           ADP     3002984154512732771       in             
a            DET     11901859001352538922      a              
race         NOUN    8048469955494714898       race           
because      SCONJ   16950148841647037698      because        
I            PRON    4690420944186131903       I              
love         VERB    3702023516439754181       love           
to           PART    3791531372978436496       to             
run          VERB    12767647472892411841      run            
since        SCONJ   10066841407251338481      since          
I            PRON    4690420944186131903       I              
ran          VERB    12767647472892411841      run     
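
The long integers in the third column are hash IDs: spaCy stores each string once and refers to it by a 64-bit hash, while the underscore attribute (`token.lemma_`) gives the readable text. A minimal sketch of that idea follows, assuming a hypothetical `ToyStringStore` class, with SHA-256 standing in for spaCy's actual hash function (so the numbers will not match the ones above).

```python
import hashlib


class ToyStringStore:
    """Toy two-way mapping between strings and 64-bit integer IDs."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        # Stand-in hash: first 8 bytes of SHA-256, NOT what spaCy uses.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._by_id[key] = text
        return key

    def __getitem__(self, key):
        # Look the readable string back up from its integer ID.
        return self._by_id[key]


store = ToyStringStore()
run_id = store.add("run")
print(run_id, store[run_id])
```

Because the ID depends only on the string, "running", "run" and "ran" all share one lemma ID in the table above once they are reduced to "run".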

STOP WORDS<br>
- very common words such as 'a' and 'the' that carry little meaning on their own
- can be filtered out of the text before further processing
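Filtering itself is a simple membership test. The sketch below uses a tiny hypothetical stop list and a made-up `remove_stop_words` helper rather than spaCy's defaults.

```python
# Hypothetical, tiny stop list for illustration; spaCy's default list is much longer.
STOP_WORDS = {"a", "the", "is", "to", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare case-insensitively so "The" is filtered like "the".
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The cat is on a mat".split()
filtered = remove_stop_words(tokens)
print(filtered)
```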

In [109]:
print(len(nlp.Defaults.stop_words)) # default stop words in spacy

print(nlp.vocab['is'].is_stop)
print(nlp.vocab['test'].is_stop)

326
True
False


In [111]:
# add a stop word
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True

# remove a stop word
nlp.Defaults.stop_words.remove('btw')
nlp.vocab['btw'].is_stop = False

PHRASE MATCHING AND VOCAB

In [112]:
from spacy.matcher import Matcher

In [113]:
matcher = Matcher(nlp.vocab)

In [122]:
# the available pattern keys are listed in the Matcher documentation; an 'OP' key lets a token repeat (e.g. '*'), much like a regex quantifier
pattern1 = [{'LOWER':'solarpower'}] # one token whose lowercase form is 'solarpower'
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT':True},{'LOWER':'power'}] # 'solar', any punctuation token, then 'power', e.g. solar-power
pattern3 = [{'LOWER':'solar'}, {'LOWER':'power'}] # 'solar' followed by 'power', e.g. solar power

In [123]:
matcher.add('SolarPowerMatcher', [pattern1, pattern2, pattern3])

doc11 = nlp(u"Solar Power is great. Solar-power and Solarpower")

found_matches = matcher(doc11)
print(found_matches) # list of (match_id, start, end) tuples; start and end are token indices and end is exclusive

matcher.remove('SolarPowerMatcher')

[(6604624467252227415, 0, 2), (6604624467252227415, 5, 8), (6604624467252227415, 9, 10)]
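
How token patterns turn into `(start, end)` spans can be sketched without spaCy. This toy matcher only understands the two keys used above and works on a hand-written token list; `token_matches` and `toy_match` are hypothetical helpers, not part of any library.

```python
import string


def token_matches(token, spec):
    """Check one token string against one pattern dict (toy subset of keys)."""
    for key, value in spec.items():
        if key == "LOWER" and token.lower() != value:
            return False
        if key == "IS_PUNCT":
            is_punct = bool(token) and all(ch in string.punctuation for ch in token)
            if is_punct != value:
                return False
    return True  # keys this toy does not know are ignored


def toy_match(tokens, patterns):
    matches = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)  # one pattern dict per token
            window = tokens[start:end]
            if len(window) == len(pattern) and all(
                token_matches(tok, spec) for tok, spec in zip(window, pattern)
            ):
                matches.append((start, end))
    return matches


# Hand-written tokens for "Solar Power is great. Solar-power and Solarpower"
tokens = ["Solar", "Power", "is", "great", ".", "Solar", "-", "power", "and", "Solarpower"]
patterns = [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
]
matches = toy_match(tokens, patterns)
print(matches)
```

The spans it finds line up with the start and end indices in the Matcher output above; slicing the token list with a span, e.g. `tokens[5:8]`, recovers the matched text.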


In [134]:
from spacy.matcher import PhraseMatcher

In [135]:
matcher = PhraseMatcher(nlp.vocab)

In [139]:
with open('reaganomics.txt') as f:
    doc12 = nlp(f.read())

phrase = ['voodoo economics', 'supply-side economics', 'trickle-down economics']
phrase_patterns = [nlp(text) for text in phrase]

In [140]:
matcher.add('EconMatcher', phrase_patterns)

found_matches = matcher(doc12)

In [141]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc12[start:end]
    print(match_id, string_id, start, end, span.text)

3680293220734633682 EconMatcher 41 45 supply-side economics
3680293220734633682 EconMatcher 49 53 trickle-down economics
3680293220734633682 EconMatcher 54 56 voodoo economics
3680293220734633682 EconMatcher 673 677 supply-side economics
3680293220734633682 EconMatcher 2987 2991 trickle-down economics
