We'll cover the following topics:
* Lexicons
* Phonemes, graphemes, and morphemes
* Tokenization
* Understanding word normalization

## Lexicons
A lexicon can be thought of as a dictionary of terms that are called lexemes.
For instance, the terms used by medical practitioners can be thought of as a lexicon for their
profession. As an example, when building an algorithm to convert a physical
prescription provided by a doctor into an electronic form, the lexicon would
be composed primarily of medical terms.
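
As a minimal sketch of the prescription example, a lexicon can be modelled as a set of known terms that incoming tokens are checked against. The term list and the `find_lexemes` helper below are hypothetical and only illustrate the idea; a real medical lexicon would be far larger.

```python
# Hypothetical mini-lexicon of medical terms (illustrative only).
medical_lexicon = {"paracetamol", "ibuprofen", "tablet", "mg", "dosage"}


def find_lexemes(text, lexicon):
    """Return the tokens of `text` that appear in `lexicon`, in order."""
    tokens = text.lower().split()
    return [token for token in tokens if token in lexicon]


prescription = "Take one paracetamol tablet of 500 mg twice daily"
matches = find_lexemes(prescription, medical_lexicon)
print(matches)
```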

## Phonemes, graphemes, and morphemes
* __Phonemes__ are the units of speech sound that can differentiate one word from
another in a language.
* __Graphemes__ are groups of one or more letters that represent these
individual sounds, or phonemes. The word spoon consists of five letters that
represent four phonemes, identified by the graphemes s, p, oo, and n.
* A __morpheme__ is the smallest meaningful unit in a language. The word unbreakable
is composed of three morphemes:
    * un: a bound morpheme signifying not
    * break: the root morpheme
    * able: a bound morpheme (a suffix here) signifying can be done
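
The spoon example can be sketched in code by greedily matching the longest known grapheme at each position. The grapheme inventory and the `split_graphemes` helper are assumptions made for illustration; real grapheme-to-phoneme mapping is considerably more involved.

```python
# Hypothetical inventory of multi-letter graphemes to keep together.
MULTI_LETTER_GRAPHEMES = {"oo", "sh", "ch", "th"}


def split_graphemes(word, multi=MULTI_LETTER_GRAPHEMES):
    """Split `word` into graphemes, preferring the longest known grapheme."""
    longest = max(len(g) for g in multi)
    graphemes = []
    i = 0
    while i < len(word):
        # Try the longest candidate first and fall back to a single letter.
        for size in range(longest, 0, -1):
            chunk = word[i:i + size]
            if size == 1 or chunk in multi:
                graphemes.append(chunk)
                i += len(chunk)
                break
    return graphemes


spoon_graphemes = split_graphemes("spoon")
print(spoon_graphemes)
```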

## __Tokenization__

## Blankline Tokenizer

In [17]:
import nltk
from nltk.tokenize import BlanklineTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.\n\n I want a book as well"
tokenizer = BlanklineTokenizer()
tokenizer.tokenize(s)

['A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.',
 'I want a book as well']

## WordPunct Tokenizer

In [2]:
from nltk.tokenize import WordPunctTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.\n I want a book as well"
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(s)

['A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$',
 '3000',
 '.',
 '0',
 '-',
 '$',
 '8000',
 '.',
 '0',
 'in',
 'USA',
 '.',
 'I',
 'want',
 'a',
 'book',
 'as',
 'well']
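
The output above splits the text into runs of word characters and runs of punctuation. A rough approximation of that behaviour with the standard `re` module is sketched below; the pattern is an assumption chosen to reproduce this output, not a statement about how NLTK implements the tokenizer.

```python
import re

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.\n I want a book as well"

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (that is, punctuation).
word_punct_tokens = re.findall(r"\w+|[^\w\s]+", s)
print(word_punct_tokens[8:17])
```

Note how the price is broken into `$`, `3000`, `.`, and `0`, which is why a custom pattern can be preferable for text containing amounts.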

## Regular expressions-based tokenizers

In [1]:
from nltk.tokenize import RegexpTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')  # raw string avoids invalid-escape warnings
tokenizer.tokenize(s)


['A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$3000.0',
 '-',
 '$8000.0',
 'in',
 'USA',
 '.']

The `\w+|\$[\d\.]+|\S+` regular expression allows three alternative patterns, which are tried from left to right at each position:
* First alternative: `\w+`. Here, `\w` matches any word character (for ASCII text, equivalent to `[a-zA-Z0-9_]`).
The `+` is a greedy quantifier that matches one or more times, as many
times as possible.
* Second alternative: `\$[\d\.]+`. Here, `\$` matches the character `$`, `\d` matches a
digit between 0 and 9, `\.` matches the character `.` (period), and `+` again
matches one or more times.
* Third alternative: `\S+`. Here, `\S` matches any non-whitespace character and `+`
acts the same way as in the preceding two alternatives.
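
To see which alternative produces each token, the same pattern can be run with the standard `re` module using named groups. This is only a sketch: the group names `word`, `price`, and `other` are labels introduced here for illustration.

```python
import re

# Same three alternatives as above, each wrapped in a named group.
PATTERN = re.compile(r"(?P<word>\w+)|(?P<price>\$[\d\.]+)|(?P<other>\S+)")

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."

# m.lastgroup holds the name of the alternative that matched.
labelled = [(m.group(), m.lastgroup) for m in PATTERN.finditer(s)]
for token, alternative in labelled:
    print(f"{token!r:12} matched by {alternative}")
```

The prices are captured whole by the second alternative, while the hyphen and the final period fall through to the third.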

## Treebank tokenizer
The Treebank tokenizer does a good job of splitting contractions, such as doesn't into does and
n't. It also identifies periods at the ends of lines and splits them off as separate tokens. Punctuation such
as commas is split off if followed by whitespace.

In [2]:
from nltk.tokenize import TreebankWordTokenizer
s = "I'm going to buy a Rolex watch that doesn't cost more than $3000.0"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(s)

['I',
 "'m",
 'going',
 'to',
 'buy',
 'a',
 'Rolex',
 'watch',
 'that',
 'does',
 "n't",
 'cost',
 'more',
 'than',
 '$',
 '3000.0']

## TweetTokenizer

Social media has given rise to an informal language wherein
people tag each other using their social media handles and use a lot of emoticons, hashtags,
and abbreviated text to express themselves. We need tokenizers that can parse such
text and make it more understandable. TweetTokenizer caters to this use case.

In [5]:
from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx Watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer(preserve_case=False) # by default preserve_case is kept as True
tokenizer.tokenize(s)

['@amankedia',
 "i'm",
 'going',
 'to',
 'buy',
 'a',
 'rolexxxxxxxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']

## __Word Normalization__

Word normalization brings different forms of a word to a common form. We can bring words to their root form in the
dictionary. For instance, am, are, and is can be identified by their root form, be. On another
front, we can remove inflections from words to bring them down to the same form. The words
car, cars, and car's can all be identified as car.

Also, common words that occur very frequently and do not convey much meaning, such as
the articles a, an, and the, can be removed. Wh- words, such as when, why, where, and who,
do not carry much information in most contexts and are likewise removed as part of a technique
called stopword removal. However, all of these steps depend heavily on the use case.

In [6]:
plurals = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating',
           'siezing', 'itemization', 'traditional', 'reference', 'colonizer', 'plotted', 'having', 'generously']

## Stemming
Suppose we want to bring the words computer, computerization, and computerize into one
word, compute. The process that does this is called stemming. As part of stemming, a crude
attempt is made to remove the inflectional forms of a word and bring them to a base form
called the stem. The chopped-off pieces are referred to as affixes.

The two most common stemming algorithms are the
__Porter stemmer__ and the __Snowball stemmer__.

The Porter stemmer supports the English language,
whereas the Snowball stemmer, which is an improvement on the Porter stemmer, supports
multiple languages.
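
To make the idea of chopping off affixes concrete, here is a deliberately naive suffix stripper. It is a toy sketch, not the Porter or Snowball algorithm: the suffix list, the `toy_stem` name, and the minimum stem length are all assumptions made for illustration.

```python
# Hypothetical suffix list, ordered longest first; NOT the Porter or Snowball rules.
SUFFIXES = ("ization", "ize", "ing", "ed", "er", "s")


def toy_stem(word, suffixes=SUFFIXES, min_stem_length=3):
    """Repeatedly strip known suffixes while the remaining stem stays long enough."""
    stem = word.lower()
    stripped = True
    while stripped:
        stripped = False
        for suffix in suffixes:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_length:
                stem = stem[:-len(suffix)]
                stripped = True
                break
    return stem


words = ["computer", "computerization", "computerize", "meeting", "owned"]
toy_stems = [toy_stem(word) for word in words]
print(toy_stems)
```

All three computer-related words collapse to the same stem, and that stem need not be a dictionary word. Real stemmers apply ordered rule sets with many more conditions than this.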

## Porter Stemmer 

In [10]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have gener


## Snowball Stemmer

In [11]:
from nltk.stem.snowball import SnowballStemmer
print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [12]:
stemmer2 = SnowballStemmer(language='english')
singles = [stemmer2.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have generous


In most cases, the Snowball stemmer's output is the same as that of the
Porter stemmer, except for generously, where the Porter stemmer outputs gener and the
Snowball stemmer outputs generous.

## __Lemmatization__

Lemmatization is a process wherein the context is used to convert a word to its meaningful
base form. It helps in grouping together words that have a common base form and so can
be identified as a single item. The base form is referred to as the lemma of the word and is
also sometimes known as the dictionary form.

A lemmatizer tries to identify the part-of-speech (POS) tags based on the context
in order to pick the appropriate lemma.
The most commonly used lemmatizer is the WordNet lemmatizer.
Others include the spaCy lemmatizer, the TextBlob lemmatizer, and the Gensim
lemmatizer.

## WordNet lemmatizer
As part of WordNet, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive
synonyms (synsets), each expressing distinct concepts. These synsets are interlinked using
lexical and conceptual semantic relationships.



In [14]:
# Run nltk.download('wordnet') once in Python to fetch the WordNet data
from nltk.stem import WordNetLemmatizer 


In [15]:
lemmatizer = WordNetLemmatizer()
s = "We are putting in efforts to enhance our understanding of Lemmatization"
token_list = s.split()
print("The tokens are: ", token_list)
lemmatized_output = ' '.join([lemmatizer.lemmatize(token) for token in token_list])
print("The lemmatized output is: ", lemmatized_output)

The tokens are:  ['We', 'are', 'putting', 'in', 'efforts', 'to', 'enhance', 'our', 'understanding', 'of', 'Lemmatization']
The lemmatized output is:  We are putting in effort to enhance our understanding of Lemmatization


As can be seen, the WordNet lemmatizer did not do much here. Out of are, putting,
efforts, and understanding, only efforts was converted to its base form, effort.
By default, the lemmatizer treats every token as a noun.

The WordNet lemmatizer works well if the `POS tags` are also provided as inputs.
Manually annotating each word in a text corpus with its POS tag is impractical, so a POS tagger is used instead.

## POS Tagging

In [20]:
## nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(token_list)
pos_tags

[('We', 'PRP'),
 ('are', 'VBP'),
 ('putting', 'VBG'),
 ('in', 'IN'),
 ('efforts', 'NNS'),
 ('to', 'TO'),
 ('enhance', 'VB'),
 ('our', 'PRP$'),
 ('understanding', 'NN'),
 ('of', 'IN'),
 ('Lemmatization', 'NN')]

As can be seen, the POS tagger returns a list of tuples of the form (token, POS tag).
Now, the POS tags need to be converted to a form that can be understood by the
WordNet lemmatizer and sent in as input along with the tokens.

## POS tag Mapping

In [21]:
from nltk.corpus import wordnet

# A common helper for mapping Penn Treebank tags to WordNet POS constants.
# Note: it tags each token in isolation, so sentence context is not used.

def get_part_of_speech_tags(token):
    """Map the token's POS tag to the WordNet POS constant lemmatize() accepts.

    Only verbs, nouns, adjectives and adverbs are distinguished;
    everything else falls back to noun."""

    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    
    tag = nltk.pos_tag([token])[0][1][0].upper()
    
    return tag_dict.get(tag, wordnet.NOUN)
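
The helper above tags one token at a time. An alternative sketch is to tag the whole sentence once and then map each Penn Treebank tag by its first letter. The version below works on the tag list printed earlier so that it runs without NLTK; the single-letter values are assumed to equal `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV`, and the names `TAG_TO_WORDNET` and `to_wordnet_pos` are hypothetical.

```python
# Assumed to match wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV.
TAG_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def to_wordnet_pos(penn_tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return TAG_TO_WORDNET.get(penn_tag[:1].upper(), default)


# Tags copied from the pos_tag output shown earlier (tagged in sentence context).
sentence_tags = [('We', 'PRP'), ('are', 'VBP'), ('putting', 'VBG'), ('in', 'IN'),
                 ('efforts', 'NNS'), ('to', 'TO'), ('enhance', 'VB'), ('our', 'PRP$'),
                 ('understanding', 'NN'), ('of', 'IN'), ('Lemmatization', 'NN')]

wordnet_tags = [(token, to_wordnet_pos(tag)) for token, tag in sentence_tags]
print(wordnet_tags)
```

Each pair could then be passed to `lemmatizer.lemmatize(token, pos)`. Because understanding is tagged as a noun in context, this approach may produce a different lemma for it than the per-token helper does.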

## Wordnet Lemmatizer with POS Tag Information


In [22]:
lemmatized_output_with_POS_information = [lemmatizer.lemmatize(token, get_part_of_speech_tags(token)) for token in token_list]
print(' '.join(lemmatized_output_with_POS_information))

We be put in effort to enhance our understand of Lemmatization


The following conversions happened:
* are to __be__
* putting to __put__
* efforts to __effort__
* understanding to __understand__

In [23]:
# Let's compare with the Snowball stemmer
stemmer2 = SnowballStemmer(language='english')
stemmed_sentence = [stemmer2.stem(token) for token in token_list]
print(' '.join(stemmed_sentence))

we are put in effort to enhanc our understand of lemmat


As can be seen, the WordNet lemmatizer makes a sensible and context-aware conversion of
the token into its base form, unlike the stemmer, which simply tries to chop the affixes from the
token.

## spaCy lemmatizer

spaCy comes with pretrained models that can parse text and figure out the
various properties of the text, such as POS tags, named-entity tags, and so on, with a simple
function call. The prebuilt models identify the POS tags and assign a lemma to each token,
unlike the WordNet lemmatizer, where the POS tags need to be explicitly provided.

`pip install spacy && python -m spacy download en_core_web_sm`

In [None]:
import spacy
# Load the small English pipeline; the 'en' shortcut is not available in spaCy 3.x.
nlp = spacy.load('en_core_web_sm')
doc = nlp("We are putting in efforts to enhance our understanding of Lemmatization")
" ".join([token.lemma_ for token in doc])

## Stopwords
Stopwords are words such as a, an, the, in, at, and so on that occur frequently in text corpora
and do not carry a lot of information in most contexts. These words, in general, are required
for the completion of sentences and making them grammatically sound.

In [3]:
## nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
", ".join(stop)

"above, few, didn't, all, we're, over, needn, by, they'd, re, then, aren, same, he'll, once, of, m, didn, wasn't, own, your, below, she'll, couldn, no, it's, doing, d, me, be, ours, until, at, should've, isn't, wasn, do, we'd, whom, you're, further, it'd, have, was, on, other, shan, they're, so, before, yourself, hasn, with, it, you've, t, him, any, this, and, both, those, again, yourselves, s, hers, but, out, most, myself, his, too, being, into, or, for, very, just, mustn't, wouldn't, we, than, ain, such, yours, its, them, she, hasn't, he, weren, after, i've, if, itself, wouldn, her, are, should, he's, mightn, there, ve, couldn't, don, herself, that, won't, had, you'll, won, been, is, it'll, now, ma, their, about, did, does, my, needn't, they've, she's, against, how, during, up, some, you, not, i, she'd, themselves, weren't, o, ourselves, a, shan't, mightn't, we've, has, ll, these, hadn, only, while, i'm, where, in, having, i'd, our, himself, mustn, each, doesn't, haven, haven't, that

If you look closely, you'll notice that Wh- words such as who, what, when, why, how, which,
where, and whom are part of this list of stopwords. However, these words are very
significant in use cases such as question answering and question classification.
Measures should be taken to ensure that these words
are not filtered out when the text corpus undergoes stopword removal.

In [5]:
wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']

stop = set(stopwords.words('english'))

sentence = "how are we putting in efforts to enhance our understanding of Lemmatization"

# Take the Wh- words out of the stopword set so that they survive stopword removal
# in question-answering contexts.
for word in wh_words:
    stop.discard(word)  # discard() does not raise if the word is absent

sentence_after_stopword_removal = [token for token in sentence.split() if token not in stop]
" ".join(sentence_after_stopword_removal)

'how putting efforts enhance understanding Lemmatization'

Six stopwords were removed from the sentence: are, we, in, to,
our, and of.

## Case Folding
As part of case
folding, all the letters in the text corpus are converted to lowercase. For example, The and the
are treated as the same token after case folding.

In [6]:
s = "We are putting in efforts to enhance our understanding of Lemmatization"
s = s.lower()
s

'we are putting in efforts to enhance our understanding of lemmatization'
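
A short sketch of why this matters: after lowercasing, variants of the same word collapse into one entry. Python also provides `str.casefold()`, a more aggressive form of lowercasing intended for caseless matching; the German example word is an assumption chosen only to show the difference.

```python
# Variants of the same word collapse to a single entry after lowercasing.
words = ["The", "the", "THE"]
folded = {word.lower() for word in words}
print(folded)

# casefold() goes further than lower() for some non-English characters.
german = "Straße"
print(german.lower())
print(german.casefold())
```

For English-only text the two methods behave the same, so `lower()` is usually sufficient.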

## N-grams

Until now, we have focused on tokens of size 1, which means only one word. Sentences
generally contain names of people and places and other open compound terms, such as
living room and coffee mug. These phrases convey a specific meaning when two or more
words are used together. When used individually, they carry a different meaning
altogether and the inherent meaning behind the compound terms is somewhat lost. The
usage of multiple tokens to represent such inherent meaning can be highly beneficial for the
NLP tasks being performed.

An n-gram is a contiguous sequence of n tokens. When n is equal to 1,
these are termed unigrams. Bigrams, or 2-grams, refer to pairs of words, such as dinner
table. Phrases such as United Arab Emirates, comprising three words, are termed
trigrams or 3-grams. This naming system can be extended to larger n-grams, but most NLP
tasks use only trigrams or lower.

In [7]:
## Bi-grams
from nltk.util import ngrams
s = "Natural Language Processing is the way to go"
tokens = s.split()
bigrams = list(ngrams(tokens, 2))
[" ".join(token) for token in bigrams]

['Natural Language',
 'Language Processing',
 'Processing is',
 'is the',
 'the way',
 'way to',
 'to go']

In [9]:
## Tri-grams
s = "Natural Language Processing is the way to go"
tokens = s.split()
trigrams = list(ngrams(tokens, 3))
[" ".join(token) for token in trigrams]

['Natural Language Processing',
 'Language Processing is',
 'Processing is the',
 'is the way',
 'the way to',
 'way to go']
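
The sliding window behind these outputs can also be written in plain Python, which may help show what an n-gram function does. This is a minimal sketch and `simple_ngrams` is a hypothetical helper, not part of NLTK.

```python
def simple_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    # Zip the token list with copies of itself shifted by 1, 2, ..., n-1.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "Natural Language Processing is the way to go".split()
bigram_phrases = [" ".join(gram) for gram in simple_ngrams(tokens, 2)]
trigram_phrases = [" ".join(gram) for gram in simple_ngrams(tokens, 3)]
print(bigram_phrases)
print(trigram_phrases)
```

A sentence with k tokens yields k - n + 1 n-grams, so the eight tokens here give seven bigrams and six trigrams.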

## Building a basic vocabulary

In [10]:
s = "Natural Language Processing is the way to go"
tokens = set(s.split())
vocabulary = sorted(tokens)
vocabulary

['Language', 'Natural', 'Processing', 'go', 'is', 'the', 'to', 'way']
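
A vocabulary is often paired with an index so that text can be represented as integers. The sketch below extends the cell above under that assumption; the names `token_to_id` and `encoded` are introduced here for illustration.

```python
s = "Natural Language Processing is the way to go"
tokens = s.split()

# Sorted unique tokens; uppercase letters sort before lowercase ones.
vocabulary = sorted(set(tokens))

# Assign each vocabulary entry its position as an integer ID.
token_to_id = {token: index for index, token in enumerate(vocabulary)}

# Represent the original sentence as a sequence of IDs.
encoded = [token_to_id[token] for token in tokens]
print(encoded)
```

Applying case folding before building the vocabulary would merge entries that differ only in capitalization.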

## Removing HTML Tags

In [11]:
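from bs4 import BeautifulSoup

html = "<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"

# Name the parser explicitly to avoid the "no parser was explicitly specified" warning.
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)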

My First HeadingMy first paragraph.
