## Stemming

Stemming is the process of reducing the morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. Conceptually, a stemming algorithm reduces words as follows:

    “chocolates”, “chocolatey”, “choco”  => root word is “chocolate”
    “retrieval”, “retrieved”, “retrieves”  => root word is “retrieve”
    “playing”, “played”, “playfully”  => root word is “play”

Here the root word formed is called the ‘stem’. The stem does not need to exist as a word or have a meaning: stems are generated simply by stripping suffixes and prefixes.
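
As a rough illustration of suffix stripping, here is a minimal sketch of a toy stemmer. The `naive_stem` function and the `SUFFIXES` tuple are hypothetical names made up for this example; real stemmers such as Porter apply many more rules and conditions.

```python
# Toy suffix stripper, for illustration only (not the Porter algorithm).
# SUFFIXES and naive_stem are hypothetical names invented for this sketch.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["playing", "played", "plays", "retrieved", "retrieves"]:
    print(w, " : ", naive_stem(w))
```

Under these assumptions, `retrieved` and `retrieves` both reduce to `retriev`, which works as a stem even though it is not a dictionary word.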

NLTK provides us with the `PorterStemmer`, `LancasterStemmer`, and `SnowballStemmer` classes.


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [None]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Tokenize the sentence and load the English stop word list
example_sent = "This is sample sentence, showing the use of the stop word filtration"
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

filtered_tokens = []

for w in word_tokens:
    if w not in stop_words:
        filtered_tokens.append(w)

print(word_tokens)
print(filtered_tokens)

# alternative way: the same filter written as a list comprehension
words = [word for word in word_tokens if word not in stop_words]
print(words)

['This', 'is', 'sample', 'sentence', ',', 'showing', 'the', 'use', 'of', 'the', 'stop', 'word', 'filtration']
['This', 'sample', 'sentence', ',', 'showing', 'use', 'stop', 'word', 'filtration']
['This', 'sample', 'sentence', ',', 'showing', 'use', 'stop', 'word', 'filtration']
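
Note that `'This'` survives the filter above: NLTK's English stop word list is lowercase, so the capitalized token does not match. A minimal sketch of a case-insensitive filter that also drops punctuation tokens follows. To stay self-contained it uses a small hand-written stand-in for the stop word set (an assumption for this example), not the full NLTK list.

```python
# Small stand-in for set(stopwords.words('english')); the real list is much longer.
stop_words = {"this", "is", "the", "of"}

word_tokens = ['This', 'is', 'sample', 'sentence', ',', 'showing',
               'the', 'use', 'of', 'the', 'stop', 'word', 'filtration']

# Lowercase each token before the lookup and keep alphabetic tokens only.
filtered = [w for w in word_tokens if w.lower() not in stop_words and w.isalpha()]
print(filtered)
```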


### Porter Stemmer
The Porter stemmer uses suffix stripping to produce stems. It does not follow linguistic rules for producing stems in different cases; for this reason, the stems it generates are often not actual English words.

In [None]:
from nltk.stem.porter import PorterStemmer

In [None]:
# Reduce words to their stems (create the stemmer once and reuse it)
porter = PorterStemmer()
stemmed = [porter.stem(w) for w in words]
print(stemmed)

['thi', 'sampl', 'sentenc', ',', 'show', 'use', 'stop', 'word', 'filtrat']


### LancasterStemmer

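The Lancaster stemmer is more aggressive than the Porter stemmer: it tends to strip more of each word, so its stems are often shorter.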

In [None]:
from nltk.stem import LancasterStemmer

words = ['sincerely', 'electricity', 'roughly', 'ringing']
lancaster = LancasterStemmer()
for w in words:
    print(w, " : ", lancaster.stem(w))

sincerely  :  sint
electricity  :  elect
roughly  :  rough
ringing  :  ring
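
To see how much more aggressive Lancaster is, the same words can be run through both stemmers. This is a small sketch using only the `PorterStemmer` and `LancasterStemmer` classes from `nltk.stem`; neither needs any downloaded data. The variable names are chosen for this example.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Print each word with its Porter and Lancaster stems side by side.
for w in ['sincerely', 'electricity', 'roughly', 'ringing']:
    print(f"{w:12} porter: {porter.stem(w):10} lancaster: {lancaster.stem(w)}")
```

On words like these, the Lancaster stem is typically no longer than the Porter stem, at the cost of being harder to recognize.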


### SnowballStemmer

The Snowball stemmer is also known as the Porter2 stemming algorithm. It is an improved version of the Porter stemmer that fixes some of its issues.

In [None]:
# importing modules
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

snowball_stemmer = SnowballStemmer(language='english')

# sentence = "Programmers program with programming languages"
sentence = "Snowball Stemmer: It is a stemming algorithm which is also known as the Porter2 stemming algorithms as it is a better version of the Porter Stemmer since some issues of it were fixed in this stemmer."
words = word_tokenize(sentence)

for w in words:
    print(w, " : ", snowball_stemmer.stem(w))
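
A quick way to see the effect of the Porter2 changes is to stem the same words with both algorithms. The sketch below uses a few hand-picked words (an assumption for this example, not an exhaustive test); adverbs ending in "-ly" are one place where the two can disagree.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

# Print the Porter stem and the Snowball (Porter2) stem for each word.
for w in ['generously', 'fairly', 'running']:
    print(w, " : ", porter.stem(w), " | ", snowball.stem(w))
```

For `generously`, Porter returns `gener` while Snowball keeps the more readable `generous`.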

## Lemmatization

Lemmatization extracts the base (dictionary) form of a word.

- The extracted word is called the `lemma`, and it is a valid dictionary word.
- We have the `WordNet` corpus, and the generated lemma will be available in this corpus.
- NLTK provides the WordNet Lemmatizer, which uses the WordNet database to look up the lemmas of words.
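
The dictionary-lookup idea can be sketched with a toy table. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names for this example and are not part of NLTK; the point is only that irregular forms map to real dictionary words, which suffix stripping alone would not recover.

```python
# Toy lemma dictionary, for illustration only. WordNet is far larger,
# and its lookup also depends on the part of speech.
LEMMA_TABLE = {"were": "be", "was": "be", "feet": "foot", "mice": "mouse"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMA_TABLE.get(word.lower(), word)

for w in ["were", "feet", "mice", "table"]:
    print(w, " : ", toy_lemmatize(w))
```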

Note: Stemming is generally much faster than lemmatization, as it doesn’t need to look words up in a dictionary and just follows its algorithm to generate the root words.

Text preprocessing often includes stemming, lemmatization, or both. __Many people find these two terms confusing.__

Some treat the two as the same. __In practice, lemmatization is often preferred over stemming because it performs a morphological analysis of the words__.

### WordNetLemmatizer

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [None]:
wnl = WordNetLemmatizer()
lemmed = wnl.lemmatize('fixing')
print(lemmed)

fixing


In [None]:
lemmed = wnl.lemmatize('fixing', 'v')
print(lemmed)

fix


In [None]:
# Reduce words to their root form
sentence = "Snowball Stemmer: It is a stemming algorithm which is also known as the Porter2 stemming algorithm as it is a better version of the Porter Stemmer since some issues of it were fixed in this stemmer."
words = word_tokenize(sentence)
# print(words)
lemmed = [wnl.lemmatize(w.lower()) for w in words]
print(lemmed)

['snowball', 'stemmer', ':', 'it', 'is', 'a', 'stemming', 'algorithm', 'which', 'is', 'also', 'known', 'a', 'the', 'porter2', 'stemming', 'algorithm', 'a', 'it', 'is', 'a', 'better', 'version', 'of', 'the', 'porter', 'stemmer', 'since', 'some', 'issue', 'of', 'it', 'were', 'fixed', 'in', 'this', 'stemmer', '.']


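For the WordNet lemmatizer to work well, we need to pass each word's part-of-speech (POS) tag along with the word; without a tag, every word is treated as a noun. The tags can be obtained by analyzing the sentence with a POS tagger.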

### TextBlob

In [None]:
!pip install textblob

In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
from textblob import TextBlob, Word

In [None]:
# Lemmatize a word
word = 'stripes'
w = Word(word)
print(w)
w.lemmatize()
#> stripe

stripes


'stripe'

In [None]:
# sentence = "The striped bat be hang on their foot for best"
sentence = "The were was supposed to be be"
sent = TextBlob(sentence)
print(sent)
print(sent.tags)

The were was supposed to be be
[('The', 'DT'), ('were', 'VBD'), ('was', 'VBD'), ('supposed', 'VBN'), ('to', 'TO'), ('be', 'VB'), ('be', 'VB')]


In [None]:
w = Word('were')
# Both a Penn Treebank tag and a WordNet POS tag are accepted here
print(w.lemmatize('VBD'))
print(w.lemmatize('v'))

be
be


In [None]:
# Convert each Penn Treebank POS tag into the WordNet tag understood by the lemmatizer
tag_dict = {
    # "D": 'd',
    "J": 'a',
    "N": 'n',
    "V": 'v',
    "R": 'r'
}

words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]
lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
print(sent.tags)
print(words_and_tags)
print(lemmatized_list)

[('The', 'DT'), ('were', 'VBD'), ('was', 'VBD'), ('supposed', 'VBN'), ('to', 'TO'), ('be', 'VB'), ('be', 'VB')]
[('The', 'n'), ('were', 'v'), ('was', 'v'), ('supposed', 'v'), ('to', 'n'), ('be', 'v'), ('be', 'v')]
['The', 'be', 'be', 'suppose', 'to', 'be', 'be']


In [None]:
# The same (word, tag) pairs also work with NLTK's WordNetLemmatizer
wnl_lemmas = [wnl.lemmatize(wd, tag) for wd, tag in words_and_tags]
print(wnl_lemmas)

['The', 'be', 'be', 'suppose', 'to', 'be', 'be']
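
The tagging step does not have to come from TextBlob. The following is a minimal sketch that does the same with NLTK alone, using `nltk.word_tokenize` and `nltk.pos_tag`. The helper names `to_wordnet_pos` and `lemmatize_sentence` are made up for this example, and falling back to noun for unmapped tags is an assumption that mirrors `tag_dict` above.

```python
import nltk
from nltk.stem import WordNetLemmatizer

def to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return {"J": "a", "N": "n", "V": "v", "R": "r"}.get(penn_tag[:1], "n")

def lemmatize_sentence(sentence):
    """Tokenize, POS-tag, and lemmatize a sentence using NLTK only."""
    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [lemmatizer.lemmatize(word, to_wordnet_pos(tag)) for word, tag in tagged]

# The tag mapping itself needs no downloaded resources.
print([to_wordnet_pos(t) for t in ["DT", "VBD", "VBN", "JJ", "RB"]])
```

Calling `lemmatize_sentence(sentence)` additionally requires the tokenizer, tagger, and WordNet resources downloaded earlier with `nltk.download`.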


## Other Lemmatizers

- spaCy Lemmatizer
- CLiPS Pattern
- Stanford CoreNLP
- Gensim Lemmatizer
- TreeTagger