# Installation and Setup

Installation is a two-step process. First, install spaCy using either conda or pip. Next, download the specific model you want, based on language.<br> For more info visit https://spacy.io/usage/

### 1. From the command line or terminal:
> `conda install -c conda-forge spacy`
> <br>*or*<br>
> `pip install -U spacy`

> ### Alternatively you can create a virtual environment:
> `conda create -n spacyenv python=3 spacy=2`

### 2. Next, also from the command line (you must run this as admin or use sudo):

> `python -m spacy download en`

- This downloads the English language model (`en_core_web_sm`), which supplies the tagger, parser, and named-entity recognizer used in the examples below.

> ### If successful, you should see a message like:

> **`Linking successful`**<br>
> `    C:\Anaconda3\envs\spacyenv\lib\site-packages\en_core_web_sm -->`<br>
> `    C:\Anaconda3\envs\spacyenv\lib\site-packages\spacy\data\en`<br>
> ` `<br>
> `    You can now load the model via spacy.load('en')`
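
To confirm that both steps succeeded before opening a notebook, one option is to check that the packages can be found on the import path. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name, and it assumes the model was installed as the `en_core_web_sm` package shown in the message above.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of spaCy itself and of the (assumed) model package name.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports `missing`, rerun the matching install or download step in the same environment that the notebook kernel uses.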


In [1]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm') #loading a model

# Create a Doc object
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is VERB aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


In [2]:
nlp.pipeline  # inspect the processing pipeline: tagger, parser, and NER

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1a4c5036b00>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x1a4c51b6588>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1a4c51b65e8>)]

In [3]:
nlp.pipe_names  # just the names of the pipeline components

['tagger', 'parser', 'ner']

In [5]:
# tokenization: splitting the text into tokens
doc2 = nlp(u"Tesla isn't looking into startups anymore.")  # double quotes, so the apostrophe in "isn't" does not end the string

In [6]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is VERB aux
n't ADV neg
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


In [7]:
# we can index and subset these tokens too
doc2[0]

Tesla

__Each token exposes attributes such as `.pos_` and `.dep_`. The table below lists several more, with the values for `doc2[0]`:__

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|
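
As a rough illustration of where a shape such as `Xxxxx` comes from, the sketch below maps each character to `X`, `x`, or `d` and keeps other characters as they are. `word_shape` is an illustrative re-implementation written for this note, not a call into spaCy, and capping each run of identical shape characters at four is an assumption about how spaCy abbreviates long words.

```python
def word_shape(text):
    """Approximate a token's shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # punctuation and symbols are kept verbatim
        if symbol == last:
            run += 1
        else:
            run = 1
            last = symbol
        if run <= 4:  # assumed cap: at most four repeats of the same symbol
            shape.append(symbol)
    return "".join(shape)


for word in ("Tesla", "U.S.", "4.50", "startups"):
    print(word, word_shape(word))
```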

In [10]:
# a slice of a Doc is called a Span
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [11]:
life_quote = doc3[16:30]

print(life_quote)

"Life is what happens to us while we are making other plans"


## Tokenization

In [14]:
# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'  # the backslash escapes the apostrophe so it does not end the string
print(mystring)

"We're moving to L.A.!"


spaCy's tokenizer applies language-specific rules: it recognizes when characters belong together (such as the abbreviation `L.A.`) and when punctuation or contractions should be split off. The examples below show how it tokenizes more complex strings.

In [15]:
doc = nlp(mystring)

for token in doc:
    print(token.text)

"
We
're
moving
to
L.A.
!
"


In [16]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

In [19]:
for token in doc2:
    print(token)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


In [22]:
doc3 = nlp(u"A 5km New York City cab ride costs $4.50")

In [23]:
for t in doc3:
    print(t)

A
5
km
New
York
City
cab
ride
costs
$
4.50


In [24]:
doc4 = nlp(u"Let's visit St. Louis in the USA next year")

In [25]:
for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
USA
next
year


In [26]:
#count the number of tokens
len(doc4)

10
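
The splitting behaviour seen in the examples above can be imitated with a few hand-written rules. The sketch below is a toy, not spaCy's tokenizer: `toy_tokenize`, the `EXCEPTIONS` set, and the `CONTRACTIONS` list are all made up for illustration. It splits on whitespace, peels punctuation off both ends of each chunk unless the chunk is a known abbreviation, and then separates a trailing contraction.

```python
EXCEPTIONS = {"L.A.", "U.S.", "St."}  # hypothetical abbreviations kept whole
CONTRACTIONS = ("n't", "'re", "'s", "'ll", "'ve", "'d", "'m")
PUNCT = "\"'!?,.;:()"


def toy_tokenize(text):
    """Split text into tokens using whitespace, punctuation, and contraction rules."""
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel punctuation from the front of the chunk.
        while chunk and chunk[0] in PUNCT and chunk not in EXCEPTIONS:
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        # Peel punctuation from the end, stopping at a known abbreviation.
        while chunk and chunk[-1] in PUNCT and chunk not in EXCEPTIONS:
            suffixes.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Separate a trailing contraction such as 're or n't.
        middle = [chunk] if chunk else []
        for ending in CONTRACTIONS:
            if chunk.lower().endswith(ending) and len(chunk) > len(ending):
                middle = [chunk[:-len(ending)], chunk[-len(ending):]]
                break
        tokens.extend(prefixes + middle + suffixes)
    return tokens


print(toy_tokenize('"We\'re moving to L.A.!"'))
```

A real tokenizer needs many more rules (URLs, emails, units such as `5km`), which is why the examples above rely on spaCy's built-in rules instead.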

In [30]:
#spacy is also able to recognize named entities
doc8 = nlp(u"Apple to build a Hong Kong factory for $6 million")


In [31]:
#visualize how the tokens separate
for token in doc8:
    print(token.text, end = ' | ')

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [35]:
# spaCy recognizes some spans as named entities and assigns each one a label
for entity in doc8.ents:
    print(entity)
    print(entity.label_)
    print(str(spacy.explain(entity.label_)))
    print('\n')

Apple
ORG
Companies, agencies, institutions, etc.


Hong Kong
GPE
Countries, cities, states


$6 million
MONEY
Monetary values, including unit




In [36]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers")

In [38]:
#we can also see the 'noun chunks' in a sentence
for chunk in doc9.noun_chunks:
    print(chunk)

Autonomous cars
insurance liability
manufacturers


In [39]:
# displaCy visualizes dependency parses and named entities
from spacy import displacy

In [40]:
doc  = nlp(u"Apple is going to build a UK factory for $6 million")

In [43]:
# render the dependency parse inline in the notebook
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

In [44]:
doc = nlp(u"Over the last quarter, Apple sold nearly 20 thousand ipods for a profit of $6 million")

In [45]:
# highlight each named entity and show its label ('distance' only applies to the 'dep' style, so it is dropped here)
displacy.render(doc, style='ent', jupyter=True)

For more information about displacy, including styling options, visit https://spacy.io/usage/visualizers

## Stemming

Stemming chops letters off the end of a word according to fixed rules until a stem is reached. spaCy does not include a stemmer; it relies on lemmatization instead, because stemming is crude and often produces non-words. The examples below use NLTK's stemmers.
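
To see why rule-based suffix stripping is crude, here is a deliberately tiny stemmer. It is not the Porter algorithm: `toy_stem` and its rule list are invented for illustration, and the first matching suffix wins.

```python
# (suffix, replacement) pairs, checked in order; longer suffixes come first.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ness", ""),
    ("ies", "i"),
    ("ly", ""),
    ("ss", "ss"),  # keep a double s so 'class' is not cut to 'clas'
    ("s", ""),
]


def toy_stem(word):
    """Strip the first matching suffix; return the word unchanged otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for word in ["run", "runs", "easily", "fairly", "fairness"]:
    print(word + '------>' + toy_stem(word))
```

Even this small rule set turns `easily` into the non-word `easi`, the same kind of artifact the real stemmers produce.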

In [49]:
import nltk

from nltk.stem.porter import PorterStemmer

In [50]:
p_stemmer = PorterStemmer()

In [57]:
words = ['run', 'runner', 'runs', 'easily', 'fairly', 'fairness']

In [58]:
# note the odd stems this produces for 'easily' and 'fairly'
for word in words:
    print(word + '------>' + p_stemmer.stem(word))

run------>run
runner------>runner
runs------>run
easily------>easili
fairly------>fairli
fairness------>fair


In [53]:
# the Snowball stemmer is an improved version of the Porter stemmer
from nltk.stem.snowball import SnowballStemmer

In [55]:
snow = SnowballStemmer(language = 'english') #must pass a language to this stemmer

In [59]:
# Snowball does a little better here, notably with 'fairly'; the important thing is to understand the process
for word in words:
    print(word + '------>' + snow.stem(word))

run------>run
runner------>runner
runs------>run
easily------>easili
fairly------>fair
fairness------>fair


In [60]:
words = ['generate', 'generation', 'generous', 'generously']

In [61]:
for word in words:
    print(word + '------>' + snow.stem(word))

generate------>generat
generation------>generat
generous------>generous
generously------>generous


In [62]:
# lemmatization is usually a more effective way to reduce words to their base forms, but it's good to know how stemming works

## Lemmatization

Lemmatization considers a language's full vocabulary and applies a more sophisticated morphological analysis. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse', rather than a truncated stem such as 'mic'. Words are reduced based on how they are used in the sentence, so the surrounding text influences the result.
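
One way to picture the difference from stemming is a lookup keyed on both the word and its part of speech. The sketch below is a toy with a hand-written table, not how spaCy stores its lemma data; `LEMMA_TABLE` and `toy_lemma` are hypothetical names.

```python
# Hand-written (word, part-of-speech) -> lemma entries, for illustration only.
LEMMA_TABLE = {
    ("was", "VERB"): "be",
    ("am", "VERB"): "be",
    ("mice", "NOUN"): "mouse",
    ("ran", "VERB"): "run",
    ("running", "VERB"): "run",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemma(word, pos):
    """Look up the lemma for a word in a given role; fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(toy_lemma("mice", "NOUN"))
print(toy_lemma("meeting", "VERB"))
print(toy_lemma("meeting", "NOUN"))
```

The same surface form `meeting` gets a different lemma depending on its role, which is something a suffix-stripping stemmer cannot express.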

In [73]:
doc1 = nlp(u"I am a runner running in a race, becuase I love to run since I ran today")

In [74]:
for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

I 	 PRON 	 561228191312463089 	 -PRON-
am 	 VERB 	 10382539506755952630 	 be
a 	 DET 	 11901859001352538922 	 a
runner 	 NOUN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
, 	 PUNCT 	 2593208677638477497 	 ,
becuase 	 NOUN 	 3636336227294319702 	 becuase
I 	 PRON 	 561228191312463089 	 -PRON-
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 ADP 	 10066841407251338481 	 since
I 	 PRON 	 561228191312463089 	 -PRON-
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today


The function below prints the same table with aligned columns:

In [75]:
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [76]:
show_lemmas(doc1)

I            PRON   561228191312463089     -PRON-
am           VERB   10382539506755952630   be
a            DET    11901859001352538922   a
runner       NOUN   12640964157389618806   runner
running      VERB   12767647472892411841   run
in           ADP    3002984154512732771    in
a            DET    11901859001352538922   a
race         NOUN   8048469955494714898    race
,            PUNCT  2593208677638477497    ,
becuase      NOUN   3636336227294319702    becuase
I            PRON   561228191312463089     -PRON-
love         VERB   3702023516439754181    love
to           PART   3791531372978436496    to
run          VERB   12767647472892411841   run
since        ADP    10066841407251338481   since
I            PRON   561228191312463089     -PRON-
ran          VERB   12767647472892411841   run
today        NOUN   11042482332948150395   today


## Stop Words

Stop words are very common words that carry little information on their own (e.g. 'the', 'a').

In [88]:
#here are all the stop words in spacy
print(nlp.Defaults.stop_words)
len(nlp.Defaults.stop_words)

{'this', 'themselves', 'latterly', 'somehow', 'eight', 'really', 'keep', 'whatever', 'being', 'were', 'whole', 'hereafter', 'myself', 'it', 'how', 'formerly', 'via', 'her', 'five', "'ll", 'besides', 'became', 'through', 'put', 'unless', "'re", 'eleven', 'btw', 'too', '‘ve', 'the', 'some', 'beyond', 're', 'due', 'next', 'beforehand', 'noone', 'towards', 'else', 'around', 'whither', 'sometime', 'used', 'down', 'show', 'well', 'elsewhere', 'nothing', 'been', 'name', 'twelve', 'using', 'hundred', 'sixty', 'everyone', 'their', 'fifteen', 'herself', 'ever', 'into', 'would', 'nobody', 'only', 'other', 'together', "n't", 'hers', 'all', 'at', 'out', 'once', 'to', 'becomes', 'fifty', 'up', 'must', 'along', 'own', 'behind', 'but', 'toward', 'everything', 'each', 'whereas', 'why', 'thru', 'also', 'may', 'any', 'except', 'i', 'does', 'mostly', 'yet', 'by', 'can', 'no', 'or', 'namely', '‘d', 'yourself', 'amount', 'another', 'further', 'seems', 'back', 'over', 'something', 'more', 'here', 'someone', 

327

In [89]:
#check to see if a word is a stop word
nlp.vocab['is'].is_stop

True

In [93]:
# you can add your own stop words when a frequent word adds little information for your use case
nlp.Defaults.stop_words.add('lmfao')

In [94]:
# also flag the word's entry in the vocab as a stop word
nlp.vocab['lmfao'].is_stop = True

In [95]:
# the set grew by one
len(nlp.Defaults.stop_words)

328

In [99]:
# we can also remove a stop word that we want to keep in our analysis
nlp.Defaults.stop_words.remove('beyond')
nlp.vocab['beyond'].is_stop = False
nlp.vocab['beyond'].is_stop

False
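
A typical use of a stop list is to filter tokens before counting or comparing them. The sketch below shows the idea with plain Python sets; `STOP_WORDS` here is a tiny made-up list rather than spaCy's default set, and `remove_stop_words` is a hypothetical helper.

```python
STOP_WORDS = {"the", "a", "is", "to", "of"}  # made-up sample, not spaCy's list


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


# Customize a copy of the list, mirroring the add/remove steps above.
custom = set(STOP_WORDS)
custom.add("lmfao")
custom.discard("of")

tokens = ["The", "price", "of", "Tesla", "is", "rising", "lmfao"]
print(remove_stop_words(tokens))
print(remove_stop_words(tokens, custom))
```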

## Phrase Matching and Vocabulary

In [100]:
from spacy.matcher import Matcher



The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>


## Other token attributes
Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>

### Token wildcard
You can pass an empty dictionary `{}` as a wildcard to represent **any token**. For example, you might want to retrieve hashtags without knowing what might follow the `#` character:
>`[{'ORTH': '#'}, {}]`
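
To make the quantifier and wildcard semantics above concrete, here is a toy matcher over plain strings. It is a sketch, not spaCy's implementation: `token_matches`, `match_at`, and `find_matches` are hypothetical names, only the `ORTH`, `LOWER`, and `IS_PUNCT` attributes are handled, matching is greedy, and the `!` operator is left out.

```python
def token_matches(token, spec):
    """Test one token (a plain string) against every attribute in spec."""
    for key, expected in spec.items():
        if key == "OP":
            continue  # quantifiers are handled in match_at
        if key == "ORTH":
            actual = token
        elif key == "LOWER":
            actual = token.lower()
        elif key == "IS_PUNCT":
            actual = bool(token) and not any(c.isalnum() for c in token)
        else:
            raise ValueError(f"unsupported attribute: {key}")
        if actual != expected:
            return False
    return True  # an empty spec {} matches any token


def match_at(tokens, pattern, i, p):
    """Return the end index of a greedy match of pattern[p:] at tokens[i:], or None."""
    if p == len(pattern):
        return i
    spec = pattern[p]
    op = spec.get("OP", "1")  # "1" means exactly once (no quantifier given)
    here = i < len(tokens) and token_matches(tokens[i], spec)
    if op == "1":
        return match_at(tokens, pattern, i + 1, p + 1) if here else None
    if op == "?":
        if here:
            end = match_at(tokens, pattern, i + 1, p + 1)
            if end is not None:
                return end
        return match_at(tokens, pattern, i, p + 1)
    if op == "*":
        if here:
            end = match_at(tokens, pattern, i + 1, p)  # try one more repetition
            if end is not None:
                return end
        return match_at(tokens, pattern, i, p + 1)
    if op == "+":
        if not here:
            return None
        # After one required match, the rule behaves like '*'.
        star = pattern[:p] + [dict(spec, OP="*")] + pattern[p + 1:]
        return match_at(tokens, star, i + 1, p)
    raise ValueError(f"unsupported operator: {op}")


def find_matches(tokens, pattern):
    """Return (start, end) for every position where the pattern matches."""
    matches = []
    for start in range(len(tokens)):
        end = match_at(tokens, pattern, start, 0)
        if end is not None and end > start:
            matches.append((start, end))
    return matches


tokens = ["Solar", "--", "power", "is", "solarpower"]
flexible = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "power"}]
print(find_matches(tokens, flexible))
print(find_matches(["#", "nlp", "rocks"], [{"ORTH": "#"}, {}]))
```

Each dictionary consumes at most one token per repetition, which is why a phrase of several words needs several dictionaries in the pattern list.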

In [102]:
matcher = Matcher(nlp.vocab) #pass in the vocab to the matcher

In [103]:
pattern1 = [{'LOWER': 'solarpower'}]  # solarpower

pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]  # solar-power (or any other punctuation)

pattern3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]  # solar power

Each dictionary in a pattern describes exactly one token, so a multi-token phrase needs one dictionary per token. (Putting `'LOWER'` twice in a single dictionary silently keeps only the last value.) All three patterns are registered under the match ID 'SolarPower'.

In [104]:
matcher.add('SolarPower', None, pattern1, pattern2, pattern3)

In [106]:
doc = nlp(u"The Solar Power industry continues to grow as solarpower use increases, solar-power")

found_matches = matcher(doc)

print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 8, 9), (8656102463236116519, 12, 15)]


In [107]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 8 9 solarpower
8656102463236116519 SolarPower 12 15 solar-power


### Setting pattern options and quantifiers
You can make token rules optional by passing an `'OP':'*'` argument. This lets us streamline our patterns list:

In [None]:
# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

In [110]:
# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2)

In [112]:
doc2 = nlp(u"Solar--power is solarpower")

In [113]:
found_matches = matcher(doc2)

In [114]:
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 4, 5)]


In [115]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc2[start:end]                   # slice doc2, the document that was actually matched
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 0 3 Solar--power
8656102463236116519 SolarPower 4 5 solarpower


In [116]:
doc2

Solar--power is solarpower

___
## PhraseMatcher
In the section above we used token patterns to perform rule-based matching. An alternative, and often more efficient, method is to match on terminology lists. In this case we create a Doc object from each phrase in a list and pass those Docs to a `PhraseMatcher` instead.
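
Conceptually, phrase matching looks for a fixed token sequence inside a longer token sequence. The sketch below shows that idea on plain lists of strings; `find_phrases` is a hypothetical helper that compares tokens exactly, and it says nothing about how `PhraseMatcher` is implemented internally.

```python
def find_phrases(tokens, phrases):
    """Return (phrase_text, start, end) for each exact occurrence, ordered by start."""
    matches = []
    for phrase in phrases:
        n = len(phrase)
        if n == 0:
            continue  # an empty phrase would match everywhere
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == phrase:
                matches.append((" ".join(phrase), start, start + n))
    return sorted(matches, key=lambda m: m[1])


# Phrases are already split into tokens, the way a tokenizer would split them.
phrases = [["voodoo", "economics"], ["supply", "-", "side", "economics"]]
tokens = ["associated", "with", "supply", "-", "side", "economics",
          "or", "voodoo", "economics"]

for text, start, end in find_phrases(tokens, phrases):
    print(start, end, text)
```

As with the `Matcher` results, each hit is reported as a start and end token index, so the matched text can be recovered by slicing.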

In [117]:
from spacy.matcher import PhraseMatcher

In [118]:
matcher = PhraseMatcher(nlp.vocab)

In [124]:
import os
os.chdir('C:\\Users\\Sam Cannon\\Desktop\\Python\\Udemy Courses\\NLP\\UPDATED_NLP_COURSE\\TextFiles')

In [127]:
with open('reaganomics.txt') as f:
    doc3 = nlp(f.read())

In [128]:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [129]:
# convert each phrase into a Doc object so the matcher can compare token sequences
phrase_patterns = [nlp(text) for text in phrase_list]

In [132]:
matcher.add('EconMatcher', None, *phrase_patterns)  # unpack the list so each Doc is passed as a separate pattern

In [133]:
found_matches = matcher(doc3)

In [134]:
found_matches

[(3680293220734633682, 41, 45),
 (3680293220734633682, 49, 53),
 (3680293220734633682, 54, 56),
 (3680293220734633682, 61, 65),
 (3680293220734633682, 673, 677),
 (3680293220734633682, 2985, 2989)]

In [136]:
#now we can see where the matches were and what the words were in the document
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc3[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

3680293220734633682 EconMatcher 41 45 supply-side economics
3680293220734633682 EconMatcher 49 53 trickle-down economics
3680293220734633682 EconMatcher 54 56 voodoo economics
3680293220734633682 EconMatcher 61 65 free-market economics
3680293220734633682 EconMatcher 673 677 supply-side economics
3680293220734633682 EconMatcher 2985 2989 trickle-down economics


In [138]:
# we can also show the context around each match by widening the slice
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc3[max(start-10, 0):end+10]     # matched span plus up to 10 tokens on each side (clamped so the start never goes negative)
    print(match_id, string_id, start, end, span.text)

3680293220734633682 EconMatcher 41 45 during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo
3680293220734633682 EconMatcher 49 53 associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-
3680293220734633682 EconMatcher 54 56 economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by
3680293220734633682 EconMatcher 61 65 down economics or voodoo economics by political opponents, and free-market economics by political advocates.

The four pillars of Reagan
3680293220734633682 EconMatcher 673 677 At the same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian demand-
3680293220734633682 EconMatcher 2985 2989 against institutions.[66] His policies became widely known as "trickle-down economics", due to the significant c