# Lemmatization

In contrast to stemming, lemmatization goes beyond trimming word endings: it draws on a language's full vocabulary to apply a *morphological analysis* to each word. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence.
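
To make the contrast concrete, here is a toy sketch in plain Python. The `naive_stem` function and the tiny `LEMMA_TABLE` dictionary are hypothetical stand-ins invented for this illustration. They are not how spaCy or any real stemmer works internally; they only suggest why suffix stripping alone cannot recover 'be' from 'was' or 'mouse' from 'mice'.

```python
# Toy illustration only: real stemmers and lemmatizers are far more elaborate.

def naive_stem(word):
    """Strip a few common suffixes without consulting any vocabulary."""
    for suffix in ('ing', 'ed', 's'):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table standing in for a language's vocabulary
LEMMA_TABLE = {'was': 'be', 'mice': 'mouse', 'ran': 'run', 'running': 'run'}

def toy_lemma(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return LEMMA_TABLE.get(word, word)

for word in ('was', 'mice', 'ran', 'running'):
    print(f'{word:{10}} {naive_stem(word):{10}} {toy_lemma(word)}')
```

In this sketch the stemmer leaves `was`, `mice` and `ran` untouched and cuts `running` down to `runn`, while the lookup maps each form to a real dictionary word.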

In [1]:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

In [3]:
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

I 	 PRON 	 757862 	 -PRON-
am 	 VERB 	 536 	 be
a 	 DET 	 506 	 a
runner 	 NOUN 	 527611 	 runner
running 	 VERB 	 1022 	 run
in 	 ADP 	 522 	 in
a 	 DET 	 506 	 a
race 	 NOUN 	 1598 	 race
because 	 ADP 	 636 	 because
I 	 PRON 	 757862 	 -PRON-
love 	 VERB 	 949 	 love
to 	 PART 	 504 	 to
run 	 VERB 	 1022 	 run
since 	 ADP 	 892 	 since
I 	 PRON 	 757862 	 -PRON-
ran 	 VERB 	 1022 	 run
today 	 NOUN 	 1188 	 today


<font color=green>In the above sentence, `running`, `run` and `ran` all map to the same lemma, `run`, so the three forms are treated as a single word.</font>
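
One practical consequence is that counting by lemma merges inflected forms. The sketch below is plain Python and does not call spaCy: the `(text, lemma)` pairs are copied by hand from the verb rows of the output above, and the variable names are illustrative only.

```python
from collections import Counter

# (token text, lemma) pairs copied from the verb rows of the output above
pairs = [('am', 'be'), ('running', 'run'), ('love', 'love'),
         ('run', 'run'), ('ran', 'run')]

surface_counts = Counter(text for text, _ in pairs)
lemma_counts = Counter(lemma for _, lemma in pairs)

print(surface_counts['run'])  # counts only the exact form 'run'
print(lemma_counts['run'])    # counts 'running', 'run' and 'ran' together
```

With a real `Doc`, the same tally could presumably be built with `Counter(token.lemma_ for token in doc1)`.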

### Function to display lemmas
Since the display above is staggered and hard to read, let's write a function that displays the information we want more neatly.

In [4]:
def show_lemmas(text):
    # `text` is a spaCy Doc; print one aligned row per token
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

Here we're using an **f-string** to format the printed text by setting minimum field widths and left-aligning the lemma hash value.
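
The nested braces can look odd at first. As a small stand-alone sketch (the sample values are illustrative and nothing here depends on spaCy), the `{12}` inside the format spec is simply evaluated to the width `12`, so `{text:{12}}` behaves like `{text:12}`. The explicit `<` matters for the hash because integers are right-aligned by default, whereas strings are left-aligned.

```python
text, pos, lemma_hash = 'saw', 'VERB', 678

# Strings are left-aligned by default, so a bare width pads on the right
padded_text = f'{text:{12}}'

# Integers are right-aligned by default, hence the explicit '<'
left_hash = f'{lemma_hash:<{22}}'
right_hash = f'{lemma_hash:{22}}'

print(repr(padded_text))
print(repr(left_hash))
print(repr(right_hash))
```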

In [5]:
doc2 = nlp(u"I saw eighteen mice today!")

show_lemmas(doc2)

I            PRON   757862                 -PRON-
saw          VERB   678                    see
eighteen     NUM    275550                 eighteen
mice         NOUN   4015                   mouse
today        NOUN   1188                   today
!            PUNCT  558                    !


<font color=green>Notice that the lemma of `saw` is `see` and the lemma of `mice` is `mouse`, and yet `eighteen` is its own lemma, *not* an expanded form of `eight`.</font>

In [6]:
doc3 = nlp(u"I am meeting him tomorrow at the meeting.")

show_lemmas(doc3)

I            PRON   757862                 -PRON-
am           VERB   536                    be
meeting      VERB   2568                   meet
him          PRON   757862                 -PRON-
tomorrow     NOUN   3621                   tomorrow
at           ADP    584                    at
the          DET    501                    the
meeting      NOUN   4027                   meeting
.            PUNCT  453                    .


<font color=green>Here the lemma of `meeting` is determined by its part-of-speech tag: the verb becomes `meet`, while the noun stays `meeting`.</font>
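
A rough way to picture this, as a toy sketch rather than spaCy's actual implementation, is a lemma lookup keyed on the word together with its part-of-speech tag. The `POS_LEMMAS` table and the `lemma_for` helper below are hypothetical names invented for this example.

```python
# Hypothetical table keyed on (word, part of speech); not spaCy's real data
POS_LEMMAS = {
    ('meeting', 'VERB'): 'meet',
    ('meeting', 'NOUN'): 'meeting',
}

def lemma_for(word, pos):
    """Return the lemma for a (word, pos) pair, defaulting to the word."""
    return POS_LEMMAS.get((word, pos), word)

print(lemma_for('meeting', 'VERB'))
print(lemma_for('meeting', 'NOUN'))
```

The same surface form yields two different lemmas here only because the tag differs, which is why the tagger has to run before the lemma can be chosen.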

In [7]:
doc4 = nlp(u"That's an enormous automobile")

show_lemmas(doc4)

That         DET    514                    that
's           VERB   536                    be
an           DET    591                    an
enormous     ADJ    5713                   enormous
automobile   NOUN   279811                 automobile


<font color=green>Note that lemmatization does *not* reduce words to their most basic synonym; that is, `enormous` doesn't become `big` and `automobile` doesn't become `car`.</font>

Note that although lemmatization looks at the surrounding text to determine a given word's part of speech, it does not categorize phrases. In an upcoming lecture we'll investigate *word vectors and similarity*.