# Lemmatization

Lemmatization is closely related to stemming, but it is a more advanced technique. It aims to return a root that actually belongs to the language, so the result is a valid base word rather than a truncated string.

### Stemming vs Lemmatization
Stemming may produce a string that is not an actual word. Lemmatization performs the conversion properly by using a vocabulary, normally aiming to remove only inflectional endings and to return the base or dictionary form of a word, which is known as the lemma.
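
To make the contrast concrete, here is a minimal sketch in plain Python. The suffix list and the small lemma dictionary are hypothetical toy data invented for illustration; they are not the rules or vocabulary that NLTK or spaCy use.

```python
# Toy illustration only: hypothetical suffix rules and a tiny hand-written
# lemma dictionary, not the algorithms NLTK or spaCy actually use.
SUFFIXES = ("ies", "ing", "ed", "es", "s")

LEMMA_DICT = {
    "studies": "study",
    "was": "be",
    "services": "service",
    "established": "establish",
}

def toy_stem(word):
    """Strip the first matching suffix without checking the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a vocabulary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)

for w in ["studies", "was", "services", "established"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# studies -> stud | study
# was -> wa | be
# services -> servic | service
# established -> establish | establish
```

The stemmer yields fragments such as `stud` and `servic`, while the dictionary lookup always returns a real word, which is the property lemmatization is after.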

In [30]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

In [32]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [14]:
lemmatizer = WordNetLemmatizer()

In [15]:
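def lemmatize_text(text):
  tokens = word_tokenize(text)
  # lemmatize() defaults to pos='n', so every token is treated as a noun;
  # this is why the verb 'was' comes out as 'wa' in the output below
  lemmas = [lemmatizer.lemmatize(token) for token in tokens]
  return lemmas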

In [16]:
input_text = "SpaceX is an American aerospace manufacturer, space transportation services and communications company headquartered in Hawthorne, California. It was established by Elon Musk"

lemmatize_text(input_text)

['SpaceX',
 'is',
 'an',
 'American',
 'aerospace',
 'manufacturer',
 ',',
 'space',
 'transportation',
 'service',
 'and',
 'communication',
 'company',
 'headquartered',
 'in',
 'Hawthorne',
 ',',
 'California',
 '.',
 'It',
 'wa',
 'established',
 'by',
 'Elon',
 'Musk']

In [60]:
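def lemmatize_text_with_pos_tag(text):
  lemmas = []
  for token, tag in pos_tag(word_tokenize(text)):
    # Reduce the Penn Treebank tag to a WordNet POS code via its first letter
    pos = tag[0].lower()
    # Penn Treebank marks adjectives as JJ*, while WordNet expects 'a'
    if pos == 'j':
      pos = 'a'
    # Anything that is not an adjective, adverb, noun or verb is treated as a noun
    if pos not in ['a', 'r', 'n', 'v']:
      pos = 'n'
    lemmas.append(lemmatizer.lemmatize(token, pos))
  return lemmas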

In [62]:
lemmatize_text_with_pos_tag(input_text)

['SpaceX',
 'be',
 'an',
 'American',
 'aerospace',
 'manufacturer',
 ',',
 'space',
 'transportation',
 'service',
 'and',
 'communication',
 'company',
 'headquarter',
 'in',
 'Hawthorne',
 ',',
 'California',
 '.',
 'It',
 'be',
 'establish',
 'by',
 'Elon',
 'Musk']
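
The POS-aware function depends on translating Penn Treebank tags (what `pos_tag` returns) into the four single-letter codes that `WordNetLemmatizer.lemmatize` accepts. The sketch below spells that translation out as a standalone helper. The name `penn_to_wordnet` and the choice of noun as the fallback are assumptions made for illustration, not part of NLTK.

```python
# Hypothetical helper: maps a Penn Treebank tag (as returned by nltk.pos_tag)
# to one of the POS codes WordNetLemmatizer accepts: 'a', 'r', 'n' or 'v'.
def penn_to_wordnet(tag, default="n"):
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("N"):
        return "n"  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS (but not the particle tag RP)
    return default  # everything else falls back to noun

for tag in ["JJ", "VBD", "NNP", "RB", "RP", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
# JJ -> a
# VBD -> v
# NNP -> n
# RB -> r
# RP -> n
# IN -> n
```

Matching on `RB` rather than on the first letter alone keeps the particle tag `RP` from being treated as an adverb. Whether that distinction matters depends on the text being processed.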

## spaCy

In [17]:
import spacy

In [19]:
nlp = spacy.load('en_core_web_sm')

In [28]:
def lemmatize_text_spacy(text):
  doc = nlp(text)
  # spaCy v2 returns the placeholder '-PRON-' as the lemma of pronouns; keep the original token instead
  return [word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc]

In [29]:
lemmatize_text_spacy(input_text)

['SpaceX',
 'be',
 'an',
 'american',
 'aerospace',
 'manufacturer',
 ',',
 'space',
 'transportation',
 'service',
 'and',
 'communication',
 'company',
 'headquarter',
 'in',
 'Hawthorne',
 ',',
 'California',
 '.',
 'It',
 'be',
 'establish',
 'by',
 'Elon',
 'Musk']
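
To see where the two approaches disagree, the outputs can be compared position by position. The sketch below copies the POS-aware NLTK output and the spaCy output shown above into two lists. The helper name `diff_lemmas` is hypothetical, and the comparison assumes both tokenizers produced the same number of tokens, as they did for this sentence.

```python
# Outputs copied from the NLTK (POS-aware) and spaCy cells above
nltk_pos_lemmas = ['SpaceX', 'be', 'an', 'American', 'aerospace', 'manufacturer',
                   ',', 'space', 'transportation', 'service', 'and',
                   'communication', 'company', 'headquarter', 'in', 'Hawthorne',
                   ',', 'California', '.', 'It', 'be', 'establish', 'by',
                   'Elon', 'Musk']
spacy_lemmas = ['SpaceX', 'be', 'an', 'american', 'aerospace', 'manufacturer',
                ',', 'space', 'transportation', 'service', 'and',
                'communication', 'company', 'headquarter', 'in', 'Hawthorne',
                ',', 'California', '.', 'It', 'be', 'establish', 'by',
                'Elon', 'Musk']

def diff_lemmas(a, b):
    """Return (index, lemma_a, lemma_b) for every position where the lists differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

print(diff_lemmas(nltk_pos_lemmas, spacy_lemmas))
# [(3, 'American', 'american')]
```

For this sentence the only difference is that spaCy lowercases the lemma of `American`.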