# Stemming, Lemmatization, Stopwords, Case Folding, N-grams, and HTML Tags

This notebook contains examples of stemming, lemmatization, stopword removal, case folding, n-grams, building a basic vocabulary, and HTML tag removal.

# Exploring Tokenization

In [1]:
import nltk

In [2]:
plurals = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating',
           'siezing', 'itemization', 'traditional', 'reference', 'colonizer', 'plotted', 'having', 'generously']

# Porter Stemmer

In [3]:
from nltk.stem.porter import PorterStemmer 
stemmer = PorterStemmer()
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have gener


# Snowball Stemmer

In [4]:
from nltk.stem.snowball import SnowballStemmer
print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [5]:
stemmer2 = SnowballStemmer(language='english')
singles = [stemmer2.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have generous


# Wordnet Lemmatizer

In [6]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer 

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Elisabetta\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [7]:
lemmatizer = WordNetLemmatizer()
s = "We are putting effort into our understanding of Lemmatization"
token_list = s.split()
print("The tokens are: ", token_list)
lemmatized_output = ' '.join([lemmatizer.lemmatize(token) for token in token_list])
print("The lemmatized output is: ", lemmatized_output)

The tokens are:  ['We', 'are', 'putting', 'effort', 'into', 'our', 'understanding', 'of', 'Lemmatization']
The lemmatized output is:  We are putting effort into our understanding of Lemmatization


## POS Tagging

In [8]:
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(token_list)
pos_tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Elisabetta\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('We', 'PRP'),
 ('are', 'VBP'),
 ('putting', 'VBG'),
 ('effort', 'NN'),
 ('into', 'IN'),
 ('our', 'PRP$'),
 ('understanding', 'NN'),
 ('of', 'IN'),
 ('Lemmatization', 'NN')]

## POS tag Mapping

In [9]:
from nltk.corpus import wordnet

# A common helper, widely used by NLP practitioners.

def get_part_of_speech_tags(token):
    
    """Map the first letter of the token's POS tag to the WordNet POS
    constant that lemmatize() accepts.
    Only verbs, nouns, adjectives and adverbs are covered; any other tag
    falls back to noun. The token is tagged in isolation, without sentence context."""

    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    
    tag = nltk.pos_tag([token])[0][1][0].upper()
    print(tag)
    
    return tag_dict.get(tag, wordnet.NOUN)

In [10]:
get_part_of_speech_tags('good')

J


'a'

## Wordnet Lemmatizer with POS Tag Information

In [11]:
print(token_list)
lemmatized_output_with_POS_information = [lemmatizer.lemmatize(token, get_part_of_speech_tags(token)) for token in token_list]
print(' '.join(lemmatized_output_with_POS_information))

['We', 'are', 'putting', 'effort', 'into', 'our', 'understanding', 'of', 'Lemmatization']
P
V
V
N
I
P
V
I
N
We be put effort into our understand of Lemmatization
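Because the helper tags each token on its own, 'understanding' is tagged V here, whereas the in-sentence tagging in the POS Tagging cell labelled it NN. The following is a minimal sketch of mapping tags that were assigned in context instead. It reuses the `pos_tags` output shown earlier as literal data, the name `to_wordnet_pos` is hypothetical, and the single letters are assumed to equal `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`.

```python
# Tags copied from the nltk.pos_tag(token_list) output shown earlier.
pos_tags = [('We', 'PRP'), ('are', 'VBP'), ('putting', 'VBG'), ('effort', 'NN'),
            ('into', 'IN'), ('our', 'PRP$'), ('understanding', 'NN'),
            ('of', 'IN'), ('Lemmatization', 'NN')]

# Assumed to equal wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV.
TAG_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return TAG_TO_WORDNET.get(penn_tag[:1].upper(), "n")

# Pair every token with the WordNet POS derived from its in-context tag.
wordnet_tags = [(token, to_wordnet_pos(tag)) for token, tag in pos_tags]
print(wordnet_tags)
```

Each `(token, pos)` pair could then be passed to `lemmatizer.lemmatize(token, pos)`, so that 'understanding' is treated as a noun rather than a verb.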


## Lemmatization vs Stemming

In [12]:
stemmer2 = SnowballStemmer(language='english')
stemmed_sentence = [stemmer2.stem(token) for token in token_list]
print(' '.join(stemmed_sentence))

we are put effort into our understand of lemmat


# Stopwords

In [13]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
", ".join(stop)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Elisabetta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


"further, yours, we, do, again, more, your, until, should, ma, wasn't, been, you'd, i, its, the, doesn, while, be, such, against, just, shouldn, too, am, o, they, some, during, about, it's, s, re, nor, doesn't, hasn, hadn't, these, for, myself, then, once, themselves, where, after, y, whom, through, doing, there, no, when, his, to, ours, by, hadn, into, each, our, t, needn, is, has, ll, own, don't, himself, ain, did, should've, you're, herself, down, weren, shouldn't, their, does, mustn't, in, this, or, aren, you, theirs, wasn, ourselves, have, you've, yourselves, being, will, same, didn't, you'll, had, with, so, haven't, why, having, as, than, from, are, over, yourself, mustn, if, who, d, wouldn't, up, which, at, them, under, she, below, couldn't, of, was, other, but, mightn't, a, won, shan't, couldn, above, can, what, needn't, out, isn, were, didn, hasn't, an, between, all, he, that'll, because, any, off, don, me, here, most, not, m, wouldn, hers, mightn, those, before, only, aren't,

In [14]:
wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']

stop = set(stopwords.words('english'))

sentence = "how are we putting into our understanding of Lemmatization"

for word in wh_words:
    stop.remove(word)

sentence_after_stopword_removal = [token for token in sentence.split() if token not in stop]
" ".join(sentence_after_stopword_removal)

'how putting understanding Lemmatization'
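The membership test above is case-sensitive, and the stopword list printed earlier is all lowercase, so a capitalised token such as 'We' would not be filtered. A small sketch of comparing the lowercased token while keeping the original spelling in the output. The miniature `stop` set here is a hypothetical stand-in for `stopwords.words('english')`.

```python
# Hypothetical miniature stopword set standing in for stopwords.words('english').
stop = {"we", "are", "into", "our", "of"}

sentence = "We are putting effort into our understanding of Lemmatization"

# Case-sensitive check: 'We' is not equal to 'we', so it survives.
case_sensitive = [t for t in sentence.split() if t not in stop]

# Lowercase only for the comparison; the original token is what gets kept.
case_insensitive = [t for t in sentence.split() if t.lower() not in stop]

print(" ".join(case_sensitive))
print(" ".join(case_insensitive))
```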

# Case Folding

In [15]:
s = "We are putting efforts into our understanding of Lemmatization"
s = s.lower()
s

'we are putting efforts into our understanding of lemmatization'
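`str.lower()` is enough for most English text. Python also offers `str.casefold()`, a more aggressive folding intended for caseless matching. The sketch below compares the two; the German words are an illustrative assumption and do not come from this notebook.

```python
words = ["Lemmatization", "STRASSE", "Straße"]

# lower() keeps the German sharp s, so the two spellings still differ.
lowered = [w.lower() for w in words]

# casefold() expands the sharp s to 'ss', so both spellings compare equal.
folded = [w.casefold() for w in words]

print(lowered)
print(folded)
```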

# N-grams

An `n`-gram is a contiguous sequence of `n` items taken from a sample of text, where the items can be characters or words and `n` is any integer from 1 upwards. Sliding a window of length `n` across the words of a sentence yields every `n`-gram of that length.
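To make the sliding window concrete, here is a minimal pure-Python sketch that works on both word lists and character strings. The function name `make_ngrams` is hypothetical and is not part of NLTK.

```python
def make_ngrams(items, n):
    """Return all contiguous length-n windows over a sequence as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted views of the sequence; zip stops at the shortest view,
    # so a sequence of length L yields L - n + 1 windows (or none if n > L).
    return list(zip(*(items[i:] for i in range(n))))

words = "Natural Language Processing is a branch of AI".split()
word_bigrams = make_ngrams(words, 2)

# A string is a sequence of characters, so the same function gives character n-grams.
char_trigrams = ["".join(gram) for gram in make_ngrams("branch", 3)]

print(word_bigrams[:2])
print(char_trigrams)
```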

In [16]:
from nltk.util import ngrams
s = "Natural Language Processing is a branch of AI"
tokens = s.split()
unigrams = list(ngrams(tokens, 1))
[" ".join(token) for token in unigrams]

['Natural', 'Language', 'Processing', 'is', 'a', 'branch', 'of', 'AI']

In [17]:
bigrams = list(ngrams(tokens, 2))
[" ".join(token) for token in bigrams]

['Natural Language',
 'Language Processing',
 'Processing is',
 'is a',
 'a branch',
 'branch of',
 'of AI']

In [18]:
trigrams = list(ngrams(tokens, 3))
[" ".join(token) for token in trigrams]

['Natural Language Processing',
 'Language Processing is',
 'Processing is a',
 'is a branch',
 'a branch of',
 'branch of AI']

`n`-grams are useful for building features from a text corpus for machine learning algorithms such as SVMs. They also underpin applications such as autocorrection, sentence autocompletion, and text summarization.
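As an illustration of `n`-grams as features, the sketch below counts word bigrams per document with `collections.Counter`. The two-sentence corpus and the name `ngram_counts` are hypothetical; a real pipeline would typically turn these counts into fixed-length vectors before training a classifier.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Hypothetical two-document corpus, for illustration only.
corpus = ["the cat sat on the mat", "the cat ate the fish"]
features = [ngram_counts(doc, 2) for doc in corpus]

# A Counter returns 0 for bigrams that a document does not contain.
print(features[0]["the cat"], features[1]["the cat"])
print(features[0]["the mat"], features[1]["the mat"])
```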

# Building a basic vocabulary

In [19]:
s = "Natural Language Processing is a branch of AI"
tokens = set(s.split())
vocabulary = sorted(tokens)
vocabulary

['AI', 'Language', 'Natural', 'Processing', 'a', 'branch', 'is', 'of']
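A sorted vocabulary is often used to give each word a stable integer id. The following sketch, with the hypothetical names `word_to_index` and `encode`, shows one way to turn a new sentence into a count vector over that vocabulary.

```python
s = "Natural Language Processing is a branch of AI"
vocabulary = sorted(set(s.split()))

# Give every vocabulary entry a stable integer id.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def encode(text):
    """Return a count vector over the vocabulary; unknown words are skipped."""
    vector = [0] * len(vocabulary)
    for token in text.split():
        if token in word_to_index:
            vector[word_to_index[token]] += 1
    return vector

print(word_to_index["AI"])
print(encode("AI is a branch of AI research"))
```

Words outside the vocabulary ('research' here) are dropped in this sketch; reserving a dedicated unknown-word slot is a common alternative.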

# Removing HTML Tags

In [20]:
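from bs4 import BeautifulSoup

html = "<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"

# Name the parser explicitly; otherwise BeautifulSoup guesses one and emits a warning.
soup = BeautifulSoup(html, "html.parser")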
text = soup.get_text()
print(text)

My First HeadingMy first paragraph.
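In the output above the heading and the paragraph are fused ('HeadingMy') because nothing separates the text nodes. As an alternative sketch that does not use BeautifulSoup, the standard-library `html.parser.HTMLParser` can collect the text nodes so they can be joined with a space. The class name `TextExtractor` is hypothetical, and this sketch does not skip the contents of `<script>` or `<style>` elements.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())

html = "<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"

extractor = TextExtractor()
extractor.feed(html)
extractor.close()

# Join the text nodes with a space so adjacent elements stay separated.
text = " ".join(extractor.chunks)
print(text)
```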
