# Lemmatization

Lemmatization is the process of reducing an inflected word to its lemma, the base form under 
which it is listed in a dictionary. Unlike a stem, a lemma is always a valid word, and finding it 
depends on the word's part of speech. WordNetLemmatizer uses WordNet's built-in morphy() function 
for lemmatization. The input word is returned unchanged if it is not found in WordNet. The pos 
argument gives the part-of-speech category of the input word; it defaults to noun ('n').
Consider an example of lemmatization in NLTK:

In [6]:
import nltk

In [1]:
from nltk.stem import WordNetLemmatizer

In [2]:
lemmatizer_output=WordNetLemmatizer()

In [7]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rosel\AppData\Roaming\nltk_data...


True

In [8]:
lemmatizer_output.lemmatize('working')


'working'

In [9]:
lemmatizer_output.lemmatize('working',pos='v')

'work'

In [10]:
lemmatizer_output.lemmatize('works')

'work'
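
The calls above show that the result depends on the pos argument. When words come with Penn Treebank tags (for example, from a part-of-speech tagger), those tags have to be translated into the single-letter codes that lemmatize() accepts: 'n', 'v', 'a', and 'r'. The following is a minimal sketch of such a translation. The helper name penn_to_wordnet_pos is hypothetical and not part of NLTK, and falling back to noun for unmapped tags is an assumption that mirrors the lemmatizer's own default.

```python
# Hypothetical helper (not part of NLTK): translate a Penn Treebank tag
# into the single-letter pos code accepted by WordNetLemmatizer.lemmatize().
def penn_to_wordnet_pos(tag, default="n"):
    prefix_to_pos = {
        "JJ": "a",  # adjectives: JJ, JJR, JJS
        "VB": "v",  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
        "NN": "n",  # nouns: NN, NNS, NNP, NNPS
        "RB": "r",  # adverbs: RB, RBR, RBS
    }
    # Assumption: any tag outside these four groups falls back to noun.
    return prefix_to_pos.get(tag[:2].upper(), default)


print(penn_to_wordnet_pos("VBG"))  # v
print(penn_to_wordnet_pos("NNS"))  # n
print(penn_to_wordnet_pos("DT"))   # n (fallback)
```

The returned code can then be passed on, as in lemmatizer_output.lemmatize('working', pos=penn_to_wordnet_pos('VBG')).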

The WordNetLemmatizer class may be described as a wrapper around the WordNet corpus. It uses 
the morphy() function of WordNetCorpusReader to extract a lemma. If no lemma is found, the word 
is returned in its original form. For example, for works, the lemma returned is the singular form, 
work. By contrast, working is returned unchanged unless pos='v' is given, because the word is 
looked up as a noun by default.
Let's consider the following code, which illustrates the difference between stemming and 
lemmatization:

In [11]:
from nltk.stem import PorterStemmer

In [13]:
stemmer_output=PorterStemmer()

In [14]:
stemmer_output.stem('happiness')

'happi'

In [15]:
from nltk.stem import WordNetLemmatizer

In [16]:
lemmatizer_output=WordNetLemmatizer()

In [17]:
lemmatizer_output.lemmatize('happiness')

'happiness'

In the preceding code, stemming converts happiness to happi, which is not a real word.
Lemmatization returns happiness unchanged, because happiness is already a valid noun
lemma in WordNet.
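
The contrast can be made concrete with a toy sketch: a stemmer strips suffixes by rule and may leave a non-word, whereas a lemmatizer looks the word up and returns it unchanged when there is nothing to map it to. Everything below is invented for illustration. The suffix list, the lookup table, and the function names are assumptions, and neither function reproduces the Porter algorithm or WordNet's morphy().

```python
# Toy illustration only: the suffix list and lookup table are made up and
# do not reproduce PorterStemmer or WordNetLemmatizer.
TOY_SUFFIXES = ("ness", "ing", "s")
TOY_LEMMAS = {"works": "work", "geese": "goose"}


def toy_stem(word):
    """Strip the first matching suffix, whether or not a real word remains."""
    for suffix in TOY_SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up; return it unchanged when no entry exists."""
    return TOY_LEMMAS.get(word, word)


for w in ("happiness", "works", "geese"):
    print(w, toy_stem(w), toy_lemmatize(w))
# happiness happi happiness
# works work work
# geese geese goose
```

The rule-based function handles works but mangles happiness and misses the irregular plural geese. The lookup-based function handles only what its table covers.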


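# Similarity measure

The edit_distance() function in nltk.metrics computes the edit (Levenshtein) distance between two 
strings: the minimum number of single-character insertions, deletions, and substitutions needed 
to turn one string into the other.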

In [18]:
from nltk.metrics import edit_distance

In [19]:
edit_distance("relate","relation")

3

In [20]:
edit_distance("suggestion","calculation")


7
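
To see where these numbers come from, the distance can be computed by hand with the standard dynamic-programming recurrence for Levenshtein distance. The following is a minimal sketch that assumes unit cost for each insertion, deletion, and substitution and no transpositions. The function name levenshtein is chosen here for illustration; this is not NLTK's implementation.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein("relate", "relation"))         # 3
print(levenshtein("suggestion", "calculation"))  # 7
```

For relate and relation, the three edits are one substitution (e to i) and two insertions (o and n).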

Applying similarity measures using Jaccard's coefficient

Jaccard's coefficient, or the Tanimoto coefficient, is a measure of the overlap of two 
sets, X and Y.

It may be defined as follows:

• Jaccard(X,Y)=|X∩Y|/|X∪Y|
• Jaccard(X,X)=1
• Jaccard(X,Y)=0 if X∩Y=∅

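Example (note that NLTK's jaccard_distance() returns the distance 1 - Jaccard(X,Y), not the coefficient itself):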

>>> from nltk.metrics import jaccard_distance
>>> X = {10, 20, 30, 40}
>>> Y = {20, 30, 60}
>>> print(jaccard_distance(X, Y))
0.6
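
The result can be verified directly from the definition using Python's built-in set operations. This is a minimal sketch, not NLTK's implementation. The function names are chosen here for illustration, and treating two empty sets as identical is an assumption made to avoid dividing by zero.

```python
def jaccard_coefficient(x, y):
    """|X ∩ Y| / |X ∪ Y| for two sets."""
    union = x | y
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(x & y) / len(union)


def jaccard_dist(x, y):
    """(|X ∪ Y| - |X ∩ Y|) / |X ∪ Y|, which equals 1 - Jaccard(X, Y)."""
    union = x | y
    if not union:
        return 0.0  # assumption: consistent with the coefficient above
    return (len(union) - len(x & y)) / len(union)


X = {10, 20, 30, 40}
Y = {20, 30, 60}
print(jaccard_coefficient(X, Y))  # 0.4
print(jaccard_dist(X, Y))         # 0.6
```

Here the intersection {20, 30} has 2 elements and the union {10, 20, 30, 40, 60} has 5, so the coefficient is 2/5 and the distance is 3/5.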

Good to know:

For further experiments, NLTK provides more than a hundred corpora at: 
http://www.nltk.org/nltk_data/
