# Lemmatization
In contrast to stemming, lemmatization goes beyond rule-based word reduction and considers a language's full vocabulary to apply a *morphological analysis* to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence. This makes lemmatization considerably more sophisticated than stemming.
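To make the contrast concrete, here is a minimal pure-Python sketch that sets a naive suffix stripper beside a dictionary lookup. The suffix list, the `LEMMA_TABLE` entries, and the function names `toy_stem` and `toy_lemma` are all made up for illustration; this is not how spaCy or any real stemmer is implemented.

```python
# Toy comparison of suffix stripping (stemming) and dictionary lookup (lemmatization).
# Both the suffix list and the lemma table are illustrative assumptions.

SUFFIXES = ("ing", "es", "s", "ed")

# Hypothetical hand-written lemma table; real lemmatizers use far larger resources.
LEMMA_TABLE = {"was": "be", "mice": "mouse", "ran": "run", "running": "run"}


def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of the vocabulary."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up in the table, falling back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["was", "mice", "ran", "running"]:
    print(f"{w:10} {toy_stem(w):10} {toy_lemma(w)}")
```

In this sketch the suffix stripper turns `running` into `runn` and leaves `mice` and `ran` untouched, while the lookup maps all of them to a dictionary form. Irregular forms like these are where vocabulary knowledge matters most.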

In [0]:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

In [0]:
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")


# Note the alignment strategy: each token is converted with str() before padding,
# since Token objects do not support f-string format specifications.
for token in doc1:
    print(f"{str(token):15} {token.lemma_}")

I               -PRON-
am              be
a               a
runner          runner
running         run
in              in
a               a
race            race
because         because
I               -PRON-
love            love
to              to
run             run
since           since
I               -PRON-
ran             run
today           today


### Function to display lemmas
Since retyping the formatting loop above is tedious, let's write a function that displays the information we want more neatly, including each token's dependency label.

In [0]:
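def show_lemma(text):
    """Print each token's text, dependency label, and lemma in aligned columns."""
    doc = nlp(text)
    for t in doc:
        # Pad the text to 15 characters and the dependency label to 10.
        print(f"{str(t):15} {str(t.dep_):10} {t.lemma_}")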

Here we're using an **f-string** to format the printed text by setting minimum field widths for the token text and the dependency label. Strings are left-aligned within their field by default.
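The field-width behaviour can be checked in isolation. The sketch below uses a hypothetical `Word` class as a stand-in for any object that defines `__str__` but not `__format__`; according to the comments in the earlier cell, spaCy's `Token` behaves this way, which is why the code calls `str()` first.

```python
class Word:
    """Hypothetical stand-in for an object that defines __str__ but not __format__."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


w = Word("saw")

padded = f"{str(w):15}|"     # strings are left-aligned by default
nested = f"{str(w):{15}}|"   # the width may itself be an expression in braces
right = f"{str(w):>15}|"     # '>' right-aligns instead

try:
    f"{w:15}"                # object.__format__ rejects a non-empty format spec
    raised = False
except TypeError:
    raised = True

print(padded)
print(right)
print(raised)
```

The `{15}` form used in `show_lemma` and the plain `15` produce the same padding. The nested braces are only needed when the width comes from a variable.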

In [22]:
doc2 = u"I saw eighteen mice today!"
show_lemma(doc2)

I               nsubj    -PRON-
saw             ROOT     see
eighteen        nummod   eighteen
mice            dobj     mouse
today           npadvmod today
!               punct    !


<font color=green>Notice that the lemma of `saw` is `see`, `mice` is the plural form of `mouse`, and yet `eighteen` is its own number, *not* an expanded form of `eight`.</font>

In [24]:
doc3 = "I am meeting him tomorrow at the meeting."

show_lemma(doc3)

I               nsubj      -PRON-
am              aux        be
meeting         ROOT       meet
him             dobj       -PRON-
tomorrow        npadvmod   tomorrow
at              prep       at
the             det        the
meeting         pobj       meeting
.               punct      .


Here the lemma of `meeting` is determined by its Part of Speech tag.
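The idea that the same surface form can map to different lemmas depending on its tag can be sketched as a lookup keyed on both word and part of speech. The `POS_LEMMAS` table and the `lemma_for` helper are hypothetical and are not spaCy's actual data or API; in spaCy itself, the tag assigned to a token can be inspected through `token.pos_`.

```python
# Hypothetical (word, part-of-speech) -> lemma table; not spaCy's actual data.
POS_LEMMAS = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def lemma_for(word, pos):
    """Return the lemma for a word given its POS tag, defaulting to the lowercased word."""
    key = (word.lower(), pos)
    return POS_LEMMAS.get(key, word.lower())


# Tags are supplied by hand here; a real pipeline would get them from a tagger.
tagged = [("meeting", "VERB"), ("meeting", "NOUN"), ("saw", "VERB"), ("saw", "NOUN")]
for word, pos in tagged:
    print(f"{word:10} {pos:6} {lemma_for(word, pos)}")
```

Keying on the pair rather than on the word alone is what lets the two occurrences of `meeting` in the sentence above resolve differently.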

In [25]:
doc4 = "That's an enormous automobile"

show_lemma(doc4)

That            nsubj      that
's              ROOT       be
an              det        an
enormous        amod       enormous
automobile      attr       automobile


Note that lemmatization does *not* reduce words to their most basic synonym - that is, `enormous` doesn't become `big` and `automobile` doesn't become `car`.


We should point out that although lemmatization looks at surrounding text to determine a given word's part of speech, it does not categorize phrases. In an upcoming lecture we'll investigate *word vectors and similarity*.



We can see that lemmatization is generally a better alternative to stemming, since it recovers the base form of morphologically inflected words more accurately. We will therefore rely on it more often than on stemming.