## Visualizing 

In [4]:
import spacy

In [2]:
from spacy import displacy

In [6]:
nlp = spacy.load("en_core_web_sm")

In [7]:
doc = nlp(u"Apple is going to build a U.K. factory for $6 million.")

In [11]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 50})

In [12]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand ipads for a profit of $6 million.')

In [13]:
displacy.render(doc, style='ent', jupyter=True)

In [15]:
doc = nlp(u'This is a sentence.')


In [18]:
displacy.serve(doc, style = 'dep') # on port 5000

OSError: [Errno 48] Address already in use
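
The `OSError` above means another process is already listening on port 5000. One possible workaround, sketched here with only the standard library, is to ask the operating system for a free port and hand it to `displacy.serve`. This assumes your spaCy version accepts a `port` keyword argument on `serve`; the helper name `find_free_port` is made up for this note.

```python
import socket

def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))            # port 0 lets the OS pick a free port
        return s.getsockname()[1]    # the port number that was assigned

port = find_free_port()
print(port)

# Then, in the notebook (this call blocks until interrupted):
# displacy.serve(doc, style='dep', port=port)
```

The port is released as soon as the helper returns, so another process could in principle grab it before `serve` starts; for a local notebook that is rarely a problem.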

### docs 
https://spacy.io/usage/visualizers

## Stemming

Stemming catalogs related words by reducing each word to its root (stem), typically by chopping off suffixes.

* Porter stemmer, by Martin Porter
* Snowball stemmer (also known as Porter2), also by Martin Porter

In [20]:
import nltk



In [21]:
from nltk.stem.porter import PorterStemmer

In [22]:
p_stemmer = PorterStemmer()

In [34]:
words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness']


In [35]:
for word in words:
    print(word + '-------->' + p_stemmer.stem(word))

run-------->run
runner-------->runner
ran-------->ran
runs-------->run
easily-------->easili
fairly-------->fairli
fairness-------->fair


In [29]:
from nltk.stem.snowball import SnowballStemmer

In [31]:
s_stemmer = SnowballStemmer(language ='english')

In [36]:
for word in words:
    print(word + '-------->' +s_stemmer.stem(word))

run-------->run
runner-------->runner
ran-------->ran
runs-------->run
easily-------->easili
fairly-------->fair
fairness-------->fair


In [37]:
words = ['generous', 'generation', 'generously', 'generate']

In [38]:
for word in words:
    print(word + '-------->' +s_stemmer.stem(word))

generous-------->generous
generation-------->generat
generously-------->generous
generate-------->generat


## Lemmatization 

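Lemmatization applies morphological analysis to words, which makes it **more informative** than simple stemming: instead of chopping off suffixes, it reduces each word to a true root, its lemma.

Examples of finding the lemma:

* was -> be
* mice -> mouse
* meeting -> meet or meeting (depending on whether it is used as a verb or a noun)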
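
To make the contrast with stemming concrete, here is a toy sketch in plain Python. Neither function is how NLTK or spaCy actually work: `naive_stem` just strips a few suffixes, and `LEMMA_TABLE` is a tiny hand-written lookup invented for this note.

```python
def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ly", "es", "s"):
        # only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lemma table (illustrative only, not spaCy's data)
LEMMA_TABLE = {"was": "be", "mice": "mouse", "ran": "run", "meeting": "meet"}

def naive_lemma(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["was", "mice", "ran", "meeting"]:
    print(f"{w:10} stem={naive_stem(w):10} lemma={naive_lemma(w)}")
```

Suffix stripping leaves irregular forms such as "was", "mice" and "ran" untouched, while the lookup maps them to their lemmas. A real lemmatizer also uses the part of speech, which is how "meeting" can resolve to either "meet" or "meeting".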


In [40]:
doc1 = nlp(u'I am a runner running in a race because I love to run since I ran today')

In [42]:
for t in doc1:
    print(t.text, '\t', t.pos_, '\t', t.lemma, '\t', t.lemma_)  # t.lemma is the lemma's hash ID, t.lemma_ its string form

I 	 PRON 	 4690420944186131903 	 I
am 	 AUX 	 10382539506755952630 	 be
a 	 DET 	 11901859001352538922 	 a
runner 	 NOUN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 SCONJ 	 16950148841647037698 	 because
I 	 PRON 	 4690420944186131903 	 I
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 SCONJ 	 10066841407251338481 	 since
I 	 PRON 	 4690420944186131903 	 I
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today


In [43]:
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [44]:
doc2= nlp(u'I saw ten mice today~')

In [45]:
show_lemmas(doc2)

I            PRON   4690420944186131903    I
saw          VERB   11925638236994514241   see
ten          NUM    7970704286052693043    ten
mice         NOUN   1384165645700560590    mouse
today~       PROPN  5700512790091430924    today~


## Stop Words

Words like "a" and "the" appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers. We call these *stop words*, and they can be filtered from the text to be processed. spaCy holds a built-in list of English stop words (326 in the version used here).
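
As a minimal sketch of what "filtering" means, the snippet below drops stop words from a sentence using plain Python. The `STOP_WORDS` set is a small hand-picked stand-in, not spaCy's list, and splitting on whitespace is a simplification of real tokenization.

```python
# Small hand-picked stop list for illustration only;
# spaCy's own list lives in nlp.Defaults.stop_words.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "and"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the whitespace-separated tokens of `text` that are not stop words."""
    return [tok for tok in text.split() if tok.lower() not in stop_words]

kept = remove_stop_words("The cat is in the garden of a friend")
print(kept)  # ['cat', 'garden', 'friend']
```

With a spaCy `Doc`, the same idea would be a comprehension over tokens that checks `token.is_stop` instead of a hand-written set.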

In [47]:
print(nlp.Defaults.stop_words)

{'as', 'wherever', 'may', 'thereupon', 'using', 'from', 'he', '‘m', 'next', 'being', 'towards', 'thereafter', 'n‘t', 'above', 'same', 'anywhere', 'therein', 'former', 'her', 'whole', 'together', 'take', 'due', 'within', 'please', 'or', 'hereby', 'sixty', 'give', 'against', 'elsewhere', 'whenever', 'nowhere', 'move', 'during', 'serious', '’s', 'others', 'somewhere', 'eleven', 'their', 'nobody', 'no', 'something', 'do', 'nevertheless', 'hereafter', 'once', 'part', 'below', 'itself', 'whose', 'again', '’ve', 'thence', 'become', 'everyone', 'front', 'make', 'since', "'d", 'nor', 'every', 'noone', 'becomes', 'herein', 'to', 'whom', 'such', 'somehow', 'around', 'while', 'name', 'thereby', 'our', 'but', 'without', 'yours', 'nine', '‘re', 'all', 'still', 'now', 'so', 'beyond', 'upon', 'empty', 'herself', 'moreover', 'then', 'afterwards', 'what', 'enough', 'sometimes', 'also', 'latterly', 'my', 'quite', "'re", 'those', 'first', "'m", 'say', 'three', 'whereupon', 'am', 'whoever', 'were', 'himsel

In [48]:
len(nlp.Defaults.stop_words)

326

In [49]:
# how to check if a word is a stop word
nlp.vocab['myself'].is_stop

True

In [50]:
nlp.vocab['is'].is_stop

True

In [51]:
nlp.vocab['mystery'].is_stop

False

In [52]:
nlp.Defaults.stop_words.add('btw')

In [53]:
nlp.vocab['btw'].is_stop

True

In [54]:
len(nlp.Defaults.stop_words)

327

In [57]:
nlp.Defaults.stop_words.remove('btw')

KeyError: 'btw'

In [59]:
nlp.vocab['btw'].is_stop = False

In [60]:
nlp.vocab['btw'].is_stop

False

In [61]:
len(nlp.Defaults.stop_words)

326
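
The `KeyError` in cell 57 is ordinary Python set behavior: `set.remove` raises when the element is absent, which suggests `'btw'` had already been removed by an earlier run of that cell (an assumption; the notebook does not show it). A sketch with a plain set standing in for `nlp.Defaults.stop_words`:

```python
stop_words = {"a", "the", "btw"}   # plain set standing in for nlp.Defaults.stop_words

stop_words.remove("btw")           # fine: 'btw' is present
try:
    stop_words.remove("btw")       # second removal: the element is gone
except KeyError as err:
    print("KeyError:", err)

stop_words.discard("btw")          # discard never raises, present or not
print(sorted(stop_words))
```

Using `discard` makes the cell safe to re-run.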

## Phrase Matching and Vocabulary

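Rule-based matching works like a more powerful form of regular expressions, applied to tokens and their annotations rather than to raw characters.
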
### Rule-based Matching
spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

In [79]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

#### Creating patterns

In [81]:
pattern1 = [{'LOWER': 'solarpower'}]  # SolarPower
pattern3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]  # Solar power
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]  # Solar-power

In [84]:
matcher.add('SolarPower', [ pattern1, pattern2, pattern3])

### Other token attributes
Besides `LOWER` and `IS_PUNCT`, there are a variety of token attributes we can use to define matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>
<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>
</table>

#### Applying the matcher to a Doc object

In [85]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

In [86]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


`matcher` returns a list of tuples. Each tuple contains an ID for the match, followed by the start and end token indices, which map to the span `doc[start:end]`.

In [87]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


In [88]:
matcher.remove('SolarPower')

In [93]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', [pattern1, pattern2 ])

In [101]:
doc = nlp(u'Solar--power is a solarpower yay!')

In [102]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 5, 6)]


In [103]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 0 3 Solar--power
8656102463236116519 SolarPower 5 6 solarpower


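With `'OP': '*'`, the punctuation token may appear zero or more times, so a single pattern now covers the two-word form with or without hyphens. Here it matched `Solar--power`, while `pattern1` matched `solarpower`.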

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
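
To see what the quantifiers do at the token level, here is a toy matcher in plain Python. It is a sketch, not spaCy's algorithm: it supports only `'1'` (the default, exactly once), `'?'`, `'+'` and `'*'`, it is greedy, and it returns at most one match per start position. `match_at` and `is_punct` are names invented for this note.

```python
import string

def match_at(tokens, i, pattern):
    """Return the end index if `pattern` matches tokens starting at i, else None."""
    if not pattern:
        return i
    (test, op), rest = pattern[0], pattern[1:]
    min_count = 0 if op in ("?", "*") else 1
    max_count = 1 if op in ("1", "?") else len(tokens) - i
    # count how many consecutive tokens satisfy the test, up to max_count
    n = 0
    while n < max_count and i + n < len(tokens) and test(tokens[i + n]):
        n += 1
    # greedy: try the longest run first, then back off
    for k in range(n, min_count - 1, -1):
        end = match_at(tokens, i + k, rest)
        if end is not None:
            return end
    return None

def is_punct(tok):
    """True when every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in tok)

# Rough analogue of:
# [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]
pattern = [
    (lambda t: t.lower() == "solar", "1"),
    (is_punct, "*"),
    (lambda t: t.lower() == "power", "1"),
]

tokens = ["Solar", "--", "power", "beats", "solar", "power", "!"]
matches = []
for start in range(len(tokens)):
    end = match_at(tokens, start, pattern)
    if end is not None:
        matches.append((start, end))
print(matches)  # [(0, 3), (4, 6)]
```

Changing `"*"` to `"+"` in the pattern would require at least one punctuation token, so only the `(0, 3)` match would remain.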


#### https://spacy.io/usage/linguistic-features#section-rule-based-matching

### PhraseMatcher

In [105]:
# Import the PhraseMatcher class
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [122]:
with open('./UPDATED_NLP_COURSE/TextFiles/reaganomics.txt', encoding= 'unicode_escape')  as f:
    doc3 = nlp(f.read())

In [124]:
# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

# Pass the list of Doc objects to the matcher (same list-style API as Matcher.add above):
matcher.add('VoodooEconomics', phrase_patterns)

# Build a list of matches:
matches = matcher(doc3)

In [130]:
# (match_id, start, end)
matches

[(3473369816841043438, 41, 45),
 (3473369816841043438, 49, 53),
 (3473369816841043438, 54, 56),
 (3473369816841043438, 61, 65),
 (3473369816841043438, 673, 677),
 (3473369816841043438, 2986, 2990)]

In [127]:
doc3[:65]

REAGANOMICS
https://en.wikipedia.org/wiki/Reaganomics

Reaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics

In [132]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc3[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

3473369816841043438 VoodooEconomics 41 45 supply-side economics
3473369816841043438 VoodooEconomics 49 53 trickle-down economics
3473369816841043438 VoodooEconomics 54 56 voodoo economics
3473369816841043438 VoodooEconomics 61 65 free-market economics
3473369816841043438 VoodooEconomics 673 677 supply-side economics
3473369816841043438 VoodooEconomics 2986 2990 trickle-down economics


#### Viewing Matches
There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match: