# Lemmatization

Before lemmatization, the document has to be tokenized.
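As a rough illustration of why tokenization comes first, the sketch below splits a string into tokens with a regular expression and then looks each token up in a tiny hand-written lemma table. The names `TOY_LEMMAS`, `tokenize`, and `lemmatize_tokens` are hypothetical and are not part of spaCy or NLTK; real lemmatizers rely on much larger dictionaries and on part-of-speech information.

```python
import re

# Hypothetical toy lemma table, for illustration only.
TOY_LEMMAS = {
    "is": "be",
    "forms": "form",
    "analysed": "analyse",
    "identified": "identify",
}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def lemmatize_tokens(tokens):
    """Look up each token in the toy table; unknown tokens are returned unchanged."""
    return [TOY_LEMMAS.get(token.lower(), token) for token in tokens]


tokens = tokenize("The forms, once identified, can be analysed.")
print(tokens)
print(lemmatize_tokens(tokens))
```

The lookup only works because the text has already been cut into individual tokens; applied to the raw string, it would find nothing to match.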

In [2]:
# Text taken from https://en.wikipedia.org/wiki/Lemmatisation

document = "Lemmatisation (or lemmatization) in linguistics is the process of grouping together \
the inflected forms of a word so they can be analysed as a single item, identified by the word's \
lemma, or dictionary form."

## Using spaCy

In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [5]:
doc = nlp(document)
tokens = [token.text for token in doc]

print("Original Document: \n{}".format(document))
print("===" * 10)

for token in doc:
    if token.text != token.lemma_:
        print("Original : {0}, New: {1}".format(token.text, token.lemma_))

Original Document: 
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.
Original : Lemmatisation, New: lemmatisation
Original : linguistics, New: linguistic
Original : is, New: be
Original : grouping, New: group
Original : inflected, New: inflect
Original : forms, New: form
Original : they, New: -PRON-
Original : analysed, New: analyse
Original : identified, New: identify


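spaCy lowercases the lemma (`Lemmatisation` becomes `lemmatisation`) and reduces inflected forms such as past participles, gerunds, and plurals to their base form (`analysed` becomes `analyse`, `grouping` becomes `group`, `forms` becomes `form`). The pronoun "they" is normalized to `"-PRON-"`, the placeholder lemma that spaCy 2.x assigns to pronouns.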
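To make the three kinds of change easier to tell apart, the following sketch labels a few of the (text, lemma) pairs from the spaCy output above. `classify_change` is a hypothetical helper written for this example, not a spaCy function, and the pairs are hard-coded so the block runs without a language model.

```python
# (text, lemma) pairs copied from the spaCy output above.
PAIRS = [
    ("Lemmatisation", "lemmatisation"),
    ("is", "be"),
    ("grouping", "group"),
    ("forms", "form"),
    ("they", "-PRON-"),
]


def classify_change(text, lemma):
    """Label how a lemma differs from its surface form (hypothetical helper)."""
    if text == lemma:
        return "unchanged"
    if lemma == "-PRON-":
        return "pronoun placeholder"
    if text.lower() == lemma:
        return "case only"
    return "base form"


for text, lemma in PAIRS:
    print("{0} -> {1}: {2}".format(text, lemma, classify_change(text, lemma)))
```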

## Using NLTK

In [6]:
import nltk

nltk.download("wordnet")
wordnet_lemmatizer = nltk.stem.WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ThanhNS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
tokens = nltk.word_tokenize(document)

print("Original Document: \n{}".format(document))
print("===" * 10)

for token in tokens:
    lemmatized_token = wordnet_lemmatizer.lemmatize(token)
    
    if token != lemmatized_token:
        print("Original : {0}, New: {1}".format(token, lemmatized_token))

Original Document: 
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.
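
`WordNetLemmatizer.lemmatize` accepts an optional `pos` argument and treats every word as a noun when it is omitted, so verb forms such as "analysed" or "identified" are generally returned as-is by the loop above. A common workaround is to tag the tokens first and translate each tag into a WordNet part of speech. The sketch below shows that translation; `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, the tags are assumed to be Penn Treebank tags such as those produced by `nltk.pos_tag`, and the lemmatizer is passed in as a callable so the block runs without NLTK data.

```python
# WordNet part-of-speech codes (the values of nltk.corpus.wordnet.NOUN, VERB, ADJ, ADV).
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("V"):
        return VERB
    if tag.startswith("J"):
        return ADJ
    if tag.startswith("RB"):
        return ADV
    return NOUN


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Apply lemmatize(word, pos) to each (word, tag) pair."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Stand-in for WordNetLemmatizer.lemmatize, used only to show which POS is passed.
def fake_lemmatize(word, pos):
    return "{0}/{1}".format(word, pos)


print(lemmatize_tagged([("is", "VBZ"), ("forms", "NNS")], fake_lemmatize))
```

With NLTK installed and a tagger resource downloaded, the same helper could be called as `lemmatize_tagged(nltk.pos_tag(tokens), wordnet_lemmatizer.lemmatize)`.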
