<CENTER> <H1>Program for Lemmatizing Words Using WordNet</H1></CENTER>

Lemmatization goes beyond simple word reduction (stemming): it draws on a language’s full vocabulary and a morphological analysis of each word to remove inflectional endings only and return the base or dictionary form of the word, known as the lemma.
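To make the distinction concrete, here is a minimal sketch contrasting blind suffix stripping with a dictionary lookup. It is illustration only: `strip_suffix`, `lookup_lemma`, and the three-entry `toy_lemmas` table are hypothetical stand-ins, not NLTK functions and not how WordNet is implemented.

```python
def strip_suffix(word):
    """Naive word reduction: chop a common ending without consulting a dictionary."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


# Toy dictionary standing in for a real vocabulary such as WordNet (assumption).
toy_lemmas = {"babies": "baby", "feet": "foot", "dogs": "dog"}


def lookup_lemma(word):
    """Return the dictionary form if the toy table knows the word, else the word itself."""
    return toy_lemmas.get(word, word)


for word in ["babies", "feet", "dogs"]:
    print(word, "| stripped:", strip_suffix(word), "| lemma:", lookup_lemma(word))
```

Suffix stripping turns 'babies' into the non-word 'bab' and cannot touch the irregular plural 'feet', while the lookup returns real dictionary forms for both.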

In [None]:
import nltk 
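# Downloads every NLTK resource (large); this notebook only uses the tokenizer, POS tagger, and WordNet data
nltk.download('all')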

# WordNet lemmatization without POS tagging

In [21]:
# Importing the package
from nltk.stem import WordNetLemmatizer

# Create WordNetLemmatizer object
wnl = WordNetLemmatizer()
 
# Single-word lemmatization examples (uppercase input)
list1 = ['KITES','BABIES','DOGS','FLYING','SMILING','DRIVING',
         'DIED','TRIED','FEET','CALFS','CHILDREN','WOMEN']
for word in list1:
    print(word + " ---> " + wnl.lemmatize(word))

print('*'*30,'AFTER CONVERTING TO LOWERCASE','*'*30)


# Single-word lemmatization examples (same words in lowercase)
list1 = ['kites', 'babies', 'dogs', 'flying', 'smiling',
         'driving', 'died', 'tried', 'feet','calfs','children','women']
for word in list1:
    print(word + " ---> " + wnl.lemmatize(word))


KITES ---> KITES
BABIES ---> BABIES
DOGS ---> DOGS
FLYING ---> FLYING
SMILING ---> SMILING
DRIVING ---> DRIVING
DIED ---> DIED
TRIED ---> TRIED
FEET ---> FEET
CALFS ---> CALFS
CHILDREN ---> CHILDREN
WOMEN ---> WOMEN
****************************** AFTER CONVERTING TO LOWERCASE ******************************
kites ---> kite
babies ---> baby
dogs ---> dog
flying ---> flying
smiling ---> smiling
driving ---> driving
died ---> died
tried ---> tried
feet ---> foot
calfs ---> calf
children ---> child
women ---> woman
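The uppercase words above come back unchanged, which suggests the lookup is case-sensitive. A minimal sketch of a wrapper that lowercases each word first, assuming lowercase output is acceptable: the helper `lemmatize_ignoring_case` and its `lemmatize` parameter are hypothetical names, and `toy_lemmatize` is a stand-in lookup table so the block runs without any NLTK data.

```python
def lemmatize_ignoring_case(word, lemmatize):
    """Lowercase `word` before handing it to `lemmatize`.

    `lemmatize` is any callable that takes a string and returns its lemma,
    for example the `wnl.lemmatize` method created earlier.
    """
    return lemmatize(word.lower())


# Stand-in lemmatizer for illustration only: a tiny lookup table, not WordNet.
toy_lemmas = {"kites": "kite", "feet": "foot"}


def toy_lemmatize(word):
    return toy_lemmas.get(word, word)


for token in ["KITES", "Feet", "dogs"]:
    print(token, "--->", lemmatize_ignoring_case(token, toy_lemmatize))
```

With NLTK loaded, the same helper could be called as `lemmatize_ignoring_case('KITES', wnl.lemmatize)`. Lowercasing everything also discards the capitalization of proper nouns, so this is a trade-off rather than a universal fix.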


In [13]:
# sentence lemmatization examples
string = ('the cat is sitting with the bats on the striped mat under many badly flying geese')

# Converting String into tokens
list2 = nltk.word_tokenize(string)
print(list2)

lemmatized_string = ' '.join([wnl.lemmatize(word) for word in list2])

print(lemmatized_string)

['the', 'cat', 'is', 'sitting', 'with', 'the', 'bats', 'on', 'the', 'striped', 'mat', 'under', 'many', 'badly', 'flying', 'geese']
the cat is sitting with the bat on the striped mat under many badly flying goose


# WordNet lemmatization with POS tagging

In [9]:
from nltk.corpus import wordnet

In [6]:
# Map a Penn Treebank tag (as returned by nltk.pos_tag) to the matching
# WordNet POS constant; return None for tags WordNet does not cover
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:         
        return None

In [7]:
sentence = 'the cat is sitting with the bats on the striped mat under many badly flying geese'
 
# tokenize the sentence and find the POS tag for each token
pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence)) 
 
print(pos_tagged)

[('the', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('with', 'IN'), ('the', 'DT'), ('bats', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('striped', 'JJ'), ('mat', 'NN'), ('under', 'IN'), ('many', 'JJ'), ('badly', 'RB'), ('flying', 'VBG'), ('geese', 'JJ')]


In [10]:
# we use our own pos_tagger function to make things simpler to understand.
wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged))
print(wordnet_tagged)

[('the', None), ('cat', 'n'), ('is', 'v'), ('sitting', 'v'), ('with', None), ('the', None), ('bats', 'n'), ('on', None), ('the', None), ('striped', 'a'), ('mat', 'n'), ('under', None), ('many', 'a'), ('badly', 'r'), ('flying', 'v'), ('geese', 'a')]


In [12]:
lemmatizer = WordNetLemmatizer()
lemmatized_sentence = []
for word, tag in wordnet_tagged:
    if tag is None:
        # if there is no available tag, append the token as is
        lemmatized_sentence.append(word)
    else:       
        # else use the tag to lemmatize the token
        lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
lemmatized_sentence = " ".join(lemmatized_sentence)
 
print(lemmatized_sentence)

the cat be sit with the bat on the striped mat under many badly fly geese
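In the output above, 'geese' is left as is because the tagger labelled it JJ, so it was lemmatized as an adjective. One possible workaround, sketched here as a heuristic rather than a recommended method, is to retry as a noun whenever the tagged part of speech leaves the word unchanged. The names `penn_to_wordnet`, `lemmatize_with_fallback`, and `toy_lemmatize` are hypothetical, and the single-letter codes are the ones printed in the `wordnet_tagged` output.

```python
# Single-letter POS codes as printed in the wordnet_tagged output above.
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if there is no match."""
    prefix_map = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}
    return prefix_map.get(tag[:1])


def lemmatize_with_fallback(tagged_tokens, lemmatize):
    """Lemmatize (word, penn_tag) pairs; retry as a noun if the tagged POS changes nothing.

    `lemmatize` is any callable taking (word, pos), e.g. `lemmatizer.lemmatize`.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = penn_to_wordnet(tag)
        if pos is None:
            # No WordNet POS for this tag: keep the token as is.
            lemmas.append(word)
            continue
        lemma = lemmatize(word, pos)
        if lemma == word and pos != NOUN:
            # Heuristic (assumption): the tagger may be wrong, so try the noun reading.
            lemma = lemmatize(word, NOUN)
        lemmas.append(lemma)
    return lemmas


# Stand-in lemmatizer for illustration only: a tiny lookup table, not WordNet.
toy_table = {("is", "v"): "be", ("flying", "v"): "fly", ("geese", "n"): "goose"}


def toy_lemmatize(word, pos):
    return toy_table.get((word, pos), word)


sample = [("the", "DT"), ("is", "VBZ"), ("flying", "VBG"), ("geese", "JJ")]
print(lemmatize_with_fallback(sample, toy_lemmatize))
```

With NLTK loaded, the call would be `lemmatize_with_fallback(pos_tagged, lemmatizer.lemmatize)`. The fallback can also rewrite a correctly tagged adjective or adverb that happens to have a noun reading, so it is worth checking the results on real text before relying on it.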


# Inference

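1. Lemmatization does not work on uppercase input: every uppercase word is returned unchanged, so words should be converted to lowercase first.
2. Without a POS tag, plural nouns are reduced to their singular form (for example, 'babies' becomes 'baby' and 'feet' becomes 'foot').
3. Without a POS tag, the lemmatizer treats every word as a noun, so it does not strip the suffixes 'ing' and 'ed': 'flying' and 'died' are returned unchanged.
4. For the same reason, sentence lemmatization without POS tagging leaves the verbs 'is', 'sitting', and 'flying' untouched and reduces only the nouns 'bats' and 'geese'.
5. With POS tagging, the sentence "the cat is sitting with the bats on the striped mat under many badly flying geese" becomes "the cat be sit with the bat on the striped mat under many badly fly geese", so the verbs are reduced to their base forms, which the untagged run missed. However, 'geese' was tagged as an adjective (JJ) and therefore stays 'geese', whereas the untagged run reduced it to 'goose'.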