spaCy Basics


In [35]:
import spacy

In [36]:
# en_core_web_sm is a small English pipeline trained on written web text
# (blogs, news, comments); it includes vocabulary, syntax, and entities.
nlp = spacy.load("en_core_web_sm")

In [37]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million.')  # run the pipeline; the sentence is split into Token objects

In [38]:
for token in doc:
  print(token.text, token.pos_)

Tesla NOUN
is AUX
looking VERB
at ADP
buying VERB
U.S. PROPN
startup NOUN
for ADP
$ SYM
6 NUM
million NUM
. PUNCT


In [39]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f9cc35c5440>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f9cc35c5210>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f9cbf0b0d50>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f9cbf03d280>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f9cbf0465f0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f9cbf0b0b50>)]

In [40]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [41]:
doc[0].pos_

'NOUN'

In [42]:
print(doc[0].text)

Tesla


In [43]:
print(doc[0].lemma_)

tesla


In [44]:
print(doc[0].shape_)

Xxxxx


In [45]:
doc2 = nlp(u"Hi. My name is Selen. Nice to meet you.")
for sentence in doc2.sents:
  print(sentence)

Hi.
My name is Selen.
Nice to meet you.


Tokenization

Tokenization breaks raw text into small units called tokens: words, punctuation marks, and pieces of contractions such as 're.

In [46]:
mystring = '" We\'re moving to L.A.! "'
print(mystring)

" We're moving to L.A.! "


In [47]:
doc = nlp(mystring)
for token in doc:
  print(token.text)

"
We
're
moving
to
L.A.
!
"


In [48]:
for token in doc:
  print(token.text, end=' | ')

" | We | 're | moving | to | L.A. | ! | " | 

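As a small illustration of what the tokenizer keeps track of, the sketch below prints each token with its character offset and checks that the original string can be rebuilt from the tokens. It assumes a blank English pipeline (spacy.blank("en")), which has the English tokenizer rules but no trained components, so no model download is needed; the name blank_nlp is only used for this example.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because
# tokenization is rule-based and does not need a trained model.
blank_nlp = spacy.blank("en")

text = '" We\'re moving to L.A.! "'
doc = blank_nlp(text)

# token.i is the token index, token.idx the character offset in the input.
for token in doc:
    print(token.i, token.idx, repr(token.text))

# text_with_ws keeps each token's trailing whitespace, so joining the
# pieces restores the input string exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because the offsets and whitespace are kept, a token or span can always be mapped back to its position in the raw text.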
In [49]:
doc2 = nlp(u"Apple to build Hong Kong factory for $6 million")
for entity in doc2.ents:
  print(entity)
  print(entity.label_)
  print(str(spacy.explain(entity.label_)))
  print('\n')

Hong Kong
GPE
Countries, cities, states


$6 million
MONEY
Monetary values, including unit




In [50]:
# displacy is spaCy's built-in visualizer
from spacy import displacy

In [51]:
doc = nlp(u"Apple is going to build a U.K. factory for $6 million")

In [52]:
# style='dep' draws the dependency parse; for other styles and options see https://spacy.io/usage/visualizers
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

In [53]:
# style='ent' highlights named entities; the 'distance' option only applies to style='dep'
displacy.render(doc, style='ent', jupyter=True)

Stemming

Stemming reduces a word to its stem or root by removing part of it, typically a suffix. The result is not always a dictionary word.

In [54]:
import nltk
from nltk.stem.porter import PorterStemmer 

In [55]:
p_stemmer = PorterStemmer()

In [56]:
words = ['runner', 'ran', 'runs', 'easily', 'run', 'fairly', 'fairness']

In [57]:
for word in words:
  print(word + ' ------> ' + p_stemmer.stem(word))

runner ------> runner
ran ------> ran
runs ------> run
easily ------> easili
run ------> run
fairly ------> fairli
fairness ------> fair


In [58]:
from nltk.stem.snowball import SnowballStemmer

In [59]:
s_stemmer = SnowballStemmer(language='english')

In [60]:
for word in words:
  print(word + ' ------> ' + s_stemmer.stem(word))

runner ------> runner
ran ------> ran
runs ------> run
easily ------> easili
run ------> run
fairly ------> fair
fairness ------> fair
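To make the idea of rule-based chopping concrete, here is a toy suffix stripper. It is not the Porter or Snowball algorithm; the suffix list, the minimum stem length, and the name toy_stem are assumptions made up for this sketch.

```python
# Toy suffix stripper, NOT the Porter or Snowball algorithm: it only
# illustrates chopping by fixed rules with no knowledge of the language.
SUFFIXES = ("ness", "ing", "ly", "s")  # hypothetical rule list, longest first
MIN_STEM_LENGTH = 3                    # hypothetical guard against over-stripping


def toy_stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= MIN_STEM_LENGTH:
            return lowered[: -len(suffix)]
    return lowered


words = ['runner', 'ran', 'runs', 'easily', 'run', 'fairly', 'fairness']
for word in words:
    print(word + ' ------> ' + toy_stem(word))
```

Like the two NLTK stemmers above, this sketch leaves 'ran' unchanged: no suffix rule can relate it to 'run'.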


Lemmatization

In stemming, part of the word is simply chopped off at the tail end to arrive at the stem. Different algorithms decide differently how many characters to chop off, but none of them knows the meaning of the word in the language it belongs to. In lemmatization, on the other hand, the algorithms have this knowledge. (https://towardsdatascience.com/lemmatization-in-natural-language-processing-nlp-and-machine-learning-a4416f69a7b6)

Lemmatization is more informative than stemming, which is why spaCy offers only lemmatization and no stemmer.

In [61]:
doc = nlp(u"I am a runner running in a race because I love to run since I ran today")
for token in doc:
  print(token.text, '\t\t', token.pos_, '\t\t', token.lemma_)

I 		 PRON 		 I
am 		 AUX 		 be
a 		 DET 		 a
runner 		 NOUN 		 runner
running 		 VERB 		 run
in 		 ADP 		 in
a 		 DET 		 a
race 		 NOUN 		 race
because 		 SCONJ 		 because
I 		 PRON 		 I
love 		 VERB 		 love
to 		 PART 		 to
run 		 VERB 		 run
since 		 SCONJ 		 since
I 		 PRON 		 I
ran 		 VERB 		 run
today 		 NOUN 		 today
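The difference from stemming can be sketched with a tiny dictionary lookup keyed on the word and its part of speech. This is only an illustration of "knowing the word", not how spaCy's lemmatizer is implemented; the table LEMMA_TABLE and the function toy_lemma are hypothetical and hand-written for this example.

```python
# Hypothetical, hand-written lookup table; real lemmatizers rely on much
# larger dictionaries and rules. Keys are (lowercased word, part of speech).
LEMMA_TABLE = {
    ("am", "AUX"): "be",
    ("ran", "VERB"): "run",
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",  # assumed entry: the noun keeps its form
}


def toy_lemma(word, pos):
    """Return the table entry, or the word unchanged if it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


samples = [("am", "AUX"), ("ran", "VERB"), ("running", "VERB"),
           ("running", "NOUN"), ("runner", "NOUN")]
for word, pos in samples:
    print(word, '\t', pos, '\t', toy_lemma(word, pos))
```

The lookup maps the irregular form 'ran' to 'run', which no suffix rule can do, and it can give the same surface form a different lemma depending on its part of speech.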


Stop Words

Stop words are the most common words in a language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of English stop words are “the”, “a”, “an”, “so”, and “what”. (https://towardsdatascience.com/text-pre-processing-stop-words-removal-using-different-libraries-f20bac19929a)

In [62]:
print(nlp.Defaults.stop_words)

{'‘ll', 'say', 'below', 'itself', 'quite', 'became', 'please', 'down', 'since', "'ve", 'hers', 'nor', 'throughout', 'beyond', 'bottom', "'s", 'third', 'which', 'at', 'ourselves', 'made', 'every', 'too', 'seemed', 'nevertheless', 'are', 'toward', 'anything', 'neither', 'ours', 'own', 'five', 'is', 'it', 'along', 'empty', 'the', 'thus', 'ever', 'nothing', 'during', 'go', 'enough', 'alone', 'me', '’ve', 'before', 'latter', 'else', 'few', 'mine', 'under', '‘m', 'those', 'yours', 'a', 'name', 'meanwhile', 'otherwise', 'first', 'beforehand', 'sometime', 'fifty', 'amount', 'being', 'though', 'be', 'somewhere', 'why', 'them', 'anyone', 'four', 'does', 'whither', 'more', '‘re', 'nine', 'top', 'take', '’s', 'least', 'here', 'whence', 'yourself', 'eight', 'while', 'amongst', 'several', 'have', 'six', 'they', 'everyone', 'now', 'move', 'was', 'and', 'full', 'further', '’re', 'thence', 'us', 'herself', 'hereby', 'side', 'from', 'therefore', 'some', 'become', 'most', 'done', 'whole', 'see', 'almost'

In [63]:
len(nlp.Defaults.stop_words)

326

In [64]:
nlp.vocab['is'].is_stop  # check whether a word is a stop word

True

In [65]:
nlp.Defaults.stop_words.add('btw') # to add a specific stop word

In [66]:
nlp.vocab['btw'].is_stop

True

In [69]:
nlp.Defaults.stop_words.remove('please') # to remove a stop word

In [70]:
nlp.vocab['please'].is_stop

False
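A typical use of a stop list is to filter tokens before further processing. The sketch below does this in plain Python with a tiny stop list taken from the examples in the text; the names STOP_WORDS and remove_stop_words are made up for this illustration, and spaCy's own list is much longer.

```python
# Tiny illustrative stop list taken from the examples in the text;
# spaCy's own list (nlp.Defaults.stop_words) is much longer.
STOP_WORDS = {"the", "a", "an", "so", "what"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercased form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "So what is the price of a ticket".split()
print(remove_stop_words(tokens))
# ['is', 'price', 'of', 'ticket']
```

With spaCy tokens the same filter can be written using the is_stop attribute of each token instead of a hand-made set.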

Phrase Matching and Vocabulary

In [71]:
from spacy.matcher import Matcher

In [73]:
matcher = Matcher(nlp.vocab)

In [86]:
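# Patterns for the three spellings we want to catch:
# solarpower (one token)
pattern1 = [{'LOWER': 'solarpower'}]
# solar-power (any punctuation token between the two words)
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]
# solar power (two consecutive tokens)
pattern3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]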

In [87]:
matcher.add('SolarPower', [pattern1, pattern2, pattern3])

In [88]:
doc = nlp(u"The Solar Power industry continues to grow as solarpower increases. Solar-power is amazing!")

In [89]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 8, 9), (8656102463236116519, 11, 14)]
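Each tuple returned by the matcher is (match_id, start, end): match_id is the hash of the rule name, and start and end are token offsets with end exclusive. The sketch below decodes such tuples back into the rule name and the matched text. It assumes spaCy v3 and a blank English pipeline (the name blank_nlp is only used here), which is enough because LOWER and IS_PUNCT are lexical attributes.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline suffices, because LOWER and
# IS_PUNCT are lexical attributes that need no trained model.
blank_nlp = spacy.blank("en")
matcher = Matcher(blank_nlp.vocab)
matcher.add("SolarPower", [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
])

doc = blank_nlp("The Solar Power industry continues to grow as "
                "solarpower increases. Solar-power is amazing!")

decoded = []
for match_id, start, end in matcher(doc):
    rule_name = blank_nlp.vocab.strings[match_id]  # hash back to the rule name
    span = doc[start:end]                          # end is exclusive
    decoded.append((rule_name, start, end, span.text))
    print(rule_name, start, end, span.text)
```

Slicing the Doc with the returned offsets is usually more convenient than working with the raw numbers, since the resulting Span carries the matched text.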


In [90]:
# phrase matching
from spacy.matcher import PhraseMatcher

In [92]:
matcher = PhraseMatcher(nlp.vocab)

In [102]:
with open('reaganomics.txt', encoding="utf8", errors='ignore') as f:
  doc = nlp(f.read())

In [103]:
phrases = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [110]:
phrase_patterns = [nlp(text) for text in phrases]

In [111]:
# spaCy v3 signature: the patterns are passed as a single list
matcher.add('EconMatcher', phrase_patterns)

In [113]:
found_matches = matcher(doc)

In [114]:
print(found_matches)

[(3680293220734633682, 41, 45), (3680293220734633682, 49, 53), (3680293220734633682, 54, 56), (3680293220734633682, 61, 65), (3680293220734633682, 673, 677), (3680293220734633682, 2986, 2990)]
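The offsets above are token positions, so matches of different phrases can have different lengths: a hyphenated phrase such as 'supply-side economics' spans four tokens, while 'voodoo economics' spans two. The sketch below shows this on a made-up sentence that stands in for reaganomics.txt, which is not reproduced here. It assumes spaCy v3 and a blank English pipeline; blank_nlp and the sample sentence are assumptions for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumptions: a blank English pipeline, and a made-up sentence standing
# in for reaganomics.txt.
blank_nlp = spacy.blank("en")
matcher = PhraseMatcher(blank_nlp.vocab)

phrases = ['voodoo economics', 'supply-side economics']
# make_doc only tokenizes, which is all exact phrase matching needs.
matcher.add('EconMatcher', [blank_nlp.make_doc(text) for text in phrases])

doc = blank_nlp("Critics called it voodoo economics, but supporters "
                "preferred supply-side economics.")

found = []
for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1]):
    span = doc[start:end]
    found.append((start, end, span.text))
    print(blank_nlp.vocab.strings[match_id], start, end, span.text)
```

Printing the span text next to the offsets makes it easy to verify which phrase produced each match.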
