### Stemming and Lemmatization:

Often, when searching text for a keyword, it helps if the search also returns variations of that word.
Lemmatization is usually more informative than stemming: it looks at the surrounding words to work out each word's part of speech before reducing it to a lemma. It works on individual words; it doesn't categorize phrases.

For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the stem for {boating, boats, boater}.
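
To make the search idea concrete, here is a minimal sketch of an index keyed by stem. The `STEMS` table and the `build_index` / `search` helpers are hypothetical and hand-written for this illustration; a real system would compute the stems with a stemmer rather than look them up.

```python
from collections import defaultdict

# Hypothetical hand-written stem table, for illustration only;
# a real system would compute stems with a stemmer.
STEMS = {
    "boat": "boat",
    "boats": "boat",
    "boating": "boat",
    "boater": "boat",
    "harbor": "harbor",
    "harbors": "harbor",
}


def to_stem(word):
    """Look up the stem; unknown words are indexed as themselves."""
    word = word.lower()
    return STEMS.get(word, word)


def build_index(words):
    """Map each stem to the set of surface forms that reduce to it."""
    index = defaultdict(set)
    for word in words:
        index[to_stem(word)].add(word)
    return index


def search(index, query):
    """Return every indexed word that shares a stem with the query."""
    return sorted(index.get(to_stem(query), set()))


index = build_index(["boats", "boating", "boater", "harbors", "sail"])
results = search(index, "boat")
print(results)
```

Searching for "boat" returns all three variants because they share one index key, which is exactly what a stemmer is meant to provide automatically.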

Stemming is a somewhat crude method for cataloging related words; it essentially chops letters off the end of a word until the stem is reached.

This works fairly well in most cases but, unfortunately, English has many exceptions where a more sophisticated process is required.
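
The suffix-chopping idea, and the way it breaks down on exceptions, can be sketched in a few lines. The suffix list and the `crude_stem` helper below are assumptions made up for this illustration; they are far simpler than any real stemming algorithm.

```python
# Hypothetical suffix list; "ing" is listed before "s" so longer endings win.
SUFFIXES = ("ing", "er", "ly", "es", "s")


def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["boats", "boating", "boater", "running", "ran", "easily"]:
    print(f"{word} -> {crude_stem(word)}")
```

The regular forms all collapse to "boat", but "running" becomes "runn" and the irregular "ran" is left untouched, so neither is linked back to "run".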

spaCy doesn't include a stemmer; it leans towards lemmatization instead. So, to understand the stemming process, let's jump over to the NLTK library!

We'll discuss both the Porter Stemmer and the Snowball Stemmer, two of the most widely used stemming algorithms.

### Porter Stemmer:

The algorithm employs several phases of word reduction, each with its own set of mapping rules. Here are a couple of them:

* The first phase defines simple suffix rules that strip or shorten a suffix, e.g. SSES --> SS (caresses --> caress) or S --> "" (cats --> cat)
* More sophisticated phases consider the length / complexity of the word (its measure, m) before applying a rule, e.g. (m > 0) ATIONAL --> ATE (relational --> relate) or (m > 0) EED --> EE (agreed --> agree)
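
To make the (m > 0) condition concrete, here is a minimal sketch of the measure and of a single conditional rule. The helper names `is_consonant`, `measure`, and `apply_rule` are made up for this illustration. This is not NLTK's implementation, and it covers only the two rules quoted above, not the full algorithm.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of vowel-to-consonant transitions in [C](VC)^m[V]."""
    pattern = "".join(
        "c" if is_consonant(stem, i) else "v" for i in range(len(stem))
    )
    return sum(
        1 for prev, cur in zip(pattern, pattern[1:]) if prev == "v" and cur == "c"
    )


def apply_rule(word, suffix, replacement):
    """Replace suffix only if the remaining stem has measure m > 0."""
    word = word.lower()
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > 0:
            return stem + replacement
    return word


for word, suffix, replacement in [
    ("relational", "ational", "ate"),
    ("rational", "ational", "ate"),
    ("agreed", "eed", "ee"),
    ("feed", "eed", "ee"),
]:
    print(f"{word} -> {apply_rule(word, suffix, replacement)}")
```

"rational" and "feed" are left alone because the part before the suffix ("r" and "f") has m = 0; the condition is what stops the rule from over-stemming short words.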


### Snowball Stemmer:

Snowball is the name of a stemming language developed by Martin Porter. The English Snowball stemmer (also known as Porter2) offers a slight improvement over the original Porter Stemmer. Let's go ahead and see how to use these stemmers in Python with NLTK.

In [1]:
import nltk # import Natural language toolkit

In [2]:
from nltk.stem.porter import PorterStemmer # importing the Algorithm

In [3]:
p_stemmer = PorterStemmer() # instantiating the PorterStemmer class

In [4]:
words = ['run', 'ran', 'running', 'easily', 'fairly', 'nicely', 'awesomely', 'fairness', 'far']

for word in words:
    print(word + ' -----> ' + (p_stemmer.stem(word)))
    
# As we can see, some adverbs, e.g. "easily" and "fairly", are reduced to odd-looking stems ending in 'li',
# which is what the Porter Stemmer rules produce.

run -----> run
ran -----> ran
running -----> run
easily -----> easili
fairly -----> fairli
nicely -----> nice
awesomely -----> awesom
fairness -----> fair
far -----> far


In [5]:
# Let's go ahead and stem them now with the Snowball Stemmer

from nltk.stem.snowball import SnowballStemmer

In [6]:
s_stemmer = SnowballStemmer(language = 'english') # the Snowball Stemmer needs to be told which language to use

In [7]:
words

['run',
 'ran',
 'running',
 'easily',
 'fairly',
 'nicely',
 'awesomely',
 'fairness',
 'far']

In [8]:
for word in words:
    print(word + " -----> " + s_stemmer.stem(word))
    
# Notice that it extracts the same stem, 'fair', for both 'fairness' and 'fairly'; that's a slight improvement
# over the Porter Stemmer!

run -----> run
ran -----> ran
running -----> run
easily -----> easili
fairly -----> fair
nicely -----> nice
awesomely -----> awesom
fairness -----> fair
far -----> far


In [9]:
# Let's take another example:

words1 = ['generous', 'generously', 'generation', 'generate']

print('PORTER STEMMER!')
print('\n')
for word in words1:
    print(word + ' -----> ' + p_stemmer.stem(word))
    
print('\n')
print('===================================')
print('\n')
print('SNOWBALL STEMMER!')
print('\n')
for word in words1:
    print(word + ' ------> ' + s_stemmer.stem(word))

# Now we can clearly see how the two algorithms differ in the way they map related words
# back to a particular stem!

PORTER STEMMER!


generous -----> gener
generously -----> gener
generation -----> gener
generate -----> gener




SNOWBALL STEMMER!


generous ------> generous
generously ------> generous
generation ------> generat
generate ------> generat
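
The two runs above are easier to compare side by side. The sketch below assumes NLTK is installed, as in the cells above; `compare_stemmers` is a hypothetical helper written for this comparison, not part of NLTK.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer


def compare_stemmers(words):
    """Return (word, porter_stem, snowball_stem) rows for side-by-side reading."""
    porter = PorterStemmer()
    snowball = SnowballStemmer(language="english")
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]


rows = compare_stemmers(["fairly", "fairness", "generous", "generation"])

print(f"{'word':<12}{'porter':<10}{'snowball':<10}")
for word, p_stem, s_stem in rows:
    marker = "" if p_stem == s_stem else "  <- differs"
    print(f"{word:<12}{p_stem:<10}{s_stem:<10}{marker}")
```

Marking only the rows where the stems differ makes it quick to spot which words the two algorithms treat differently on a longer word list.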


### Lemmatization with spaCy:

In [10]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(u"I am a runner, running in a race because I love to run since I ran today!")

# This sentence has several words with closely related meanings, e.g. running, run, ran
# Now let's see how spaCy reduces them to their lemmas

for token in doc:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)
    
# token.text   -> text of the token
# token.pos_   -> part-of-speech tag of the token
# token.lemma  -> hash value (integer ID) of the lemma in the vocabulary of the loaded model (en_core_web_sm)
# token.lemma_ -> the lemma itself, as a string

I 	 PRON 	 561228191312463089 	 -PRON-
am 	 VERB 	 10382539506755952630 	 be
a 	 DET 	 11901859001352538922 	 a
runner 	 NOUN 	 12640964157389618806 	 runner
, 	 PUNCT 	 2593208677638477497 	 ,
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 ADP 	 16950148841647037698 	 because
I 	 PRON 	 561228191312463089 	 -PRON-
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 ADP 	 10066841407251338481 	 since
I 	 PRON 	 561228191312463089 	 -PRON-
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today
! 	 PUNCT 	 17494803046312582752 	 !
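
Notice that "running", "run" and "ran" share the same `token.lemma` number: the integer is a hash of the lemma string, so equal strings always get equal IDs. The sketch below shows the idea with a toy two-way store. `ToyStringStore` is hypothetical and uses BLAKE2b as a stand-in hash, so its numbers will not match spaCy's; in spaCy itself the corresponding lookup is exposed as `nlp.vocab.strings`.

```python
import hashlib


class ToyStringStore:
    """Two-way mapping between strings and 64-bit integer IDs (illustrative only)."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(text):
        # Stand-in hash: 8-byte BLAKE2b digest; spaCy's own hash function differs.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, text):
        """Store the string and return its integer ID."""
        key = self.hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        # An int looks up the stored string; a str returns its hash.
        if isinstance(key, int):
            return self._by_hash[key]
        return self.hash_string(key)


store = ToyStringStore()
run_id = store.add("run")
print(run_id, store[run_id])
```

Storing integers on each token and keeping the strings in one shared table is a common way to save memory, which is presumably why both `lemma` and `lemma_` are offered.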


In [11]:
def show_lemmas(text):
    for token in text:
        print(f"{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}")

In [12]:
show_lemmas(doc) # prints in nice readable form

I            PRON   561228191312463089     -PRON-
am           VERB   10382539506755952630   be
a            DET    11901859001352538922   a
runner       NOUN   12640964157389618806   runner
,            PUNCT  2593208677638477497    ,
running      VERB   12767647472892411841   run
in           ADP    3002984154512732771    in
a            DET    11901859001352538922   a
race         NOUN   8048469955494714898    race
because      ADP    16950148841647037698   because
I            PRON   561228191312463089     -PRON-
love         VERB   3702023516439754181    love
to           PART   3791531372978436496    to
run          VERB   12767647472892411841   run
since        ADP    10066841407251338481   since
I            PRON   561228191312463089     -PRON-
ran          VERB   12767647472892411841   run
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !
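
Unlike a stemmer, a lemmatizer can use the part-of-speech tag to pick the right base form. Here is a minimal sketch of that idea as a lookup keyed by (word, POS). The `LEMMA_TABLE` entries and the `lookup_lemma` helper are hypothetical and hand-written; spaCy's actual lemmatizer is more elaborate than a small table.

```python
# Hypothetical (word, POS) -> lemma table, for illustration only.
LEMMA_TABLE = {
    ("ran", "VERB"): "run",
    ("running", "VERB"): "run",
    ("am", "VERB"): "be",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lookup_lemma(word, pos):
    """Return the lemma for a (word, POS) pair, falling back to the lowercased word."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)


tagged = [
    ("I", "PRON"),
    ("ran", "VERB"),
    ("to", "PART"),
    ("the", "DET"),
    ("meeting", "NOUN"),
]
lemmas = [lookup_lemma(word, pos) for word, pos in tagged]
print(lemmas)
```

The same surface form "meeting" maps to "meet" as a verb but stays "meeting" as a noun, a distinction that a suffix-stripping stemmer cannot make.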


### Stop Words:

Words like "a" and "the" appear so frequently in text that they carry little information of their own, so they don't need to be tagged as thoroughly as nouns, verbs, and other parts of speech. We call these "stop words", and they can be filtered out of the text before it is processed.
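
Filtering itself is just a membership test against a set. The sketch below uses a small hypothetical stop list and a made-up `remove_stop_words` helper; it does not use spaCy, whose built-in list is much longer.

```python
# Small hypothetical stop list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The cat is sitting in a corner of the garden".split()
content = remove_stop_words(tokens)
print(content)
```

Six of the ten tokens are dropped, leaving only the words that carry the content of the sentence.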

spaCy ships with a built-in list of some 305 stop words for English (in the version used here). Let's go ahead and see how it works in spaCy!

In [15]:
print(nlp.Defaults.stop_words) # prints the set of all the built-in stop words

{'out', 'fifteen', 'these', 'hereafter', 'some', 'though', 'because', 'make', 'elsewhere', 'anyone', 'am', 'further', 'around', 'itself', 'perhaps', 'does', 'somehow', 'a', 'else', 'please', 'per', 'any', 'hers', 'more', 'then', 'was', 'whereas', 'three', 'yourselves', 'did', 'too', 'can', 'at', 'seeming', 'so', 'former', 'formerly', 'that', 'via', 'of', 'here', 'had', 'another', 'full', 'without', 'your', 'whereafter', 'last', 'their', 'the', 'being', 'cannot', 'four', 'very', 'yourself', 'be', 'latter', 'fifty', 'seems', 'could', 'with', 'themselves', 'again', 'whom', 'only', 'me', 'sometimes', 'this', 'ca', 'his', 'becoming', 'all', 'by', 'may', 'herein', 'besides', 'many', 'no', 'are', 'everywhere', 'became', 'anyhow', 'as', 'been', 'my', 'several', 'amongst', 'next', 'whenever', 'where', 'less', 'who', 'always', 'wherein', 'call', 'two', 'would', 'for', 'hence', 'seem', 'thereupon', 'when', 'top', 'while', 'even', 'it', 'made', 'every', 'before', 'enough', 'someone', 'five', 'she'

In [17]:
nlp.vocab['is'].is_stop # reports that 'is' is a stop word

True

In [18]:
nlp.vocab['mystery'].is_stop # reports that 'mystery' is not a stop word

False

In [24]:
nlp.Defaults.stop_words.add('btw') # adds 'btw' as a stop word
nlp.vocab['btw'].is_stop = True

In [25]:
len(nlp.Defaults.stop_words) # one more than before, confirming that 'btw' was added to the set

306

In [27]:
nlp.vocab['btw'].is_stop # confirms that 'btw' is now a stop word

True

In [30]:
# Let's now go ahead and remove some of the stop words from the built-in spacy set

nlp.Defaults.stop_words.remove('few')

nlp.vocab['few'].is_stop = False

In [34]:
nlp.vocab['few'].is_stop # 'few' is no longer flagged as a stop word

False

### Good Job!