# Stemming and Lemmatization

A common step in standardizing text is reducing the different forms of a word to a single root or stem. This is useful because it lets us analyze word meaning without having to pore over each inflectional form of a word separately. It also speeds up processing, since there are fewer distinct words to handle.

[What's the difference between stemming and lemmatizing?](https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming)

"**Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. **Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. 

If confronted with the token "saw", **stemming** might return just "s", whereas **lemmatization** would attempt to return either "see" or "saw" depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma." See [this post](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) for more information.
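To make the contrast in the quote concrete, here is a toy sketch in plain Python. It is not how any real stemmer or lemmatizer is implemented: the suffix list, the `LEMMA_TABLE` lookup, and the function names `crude_stem` and `toy_lemmatize` are all made up for illustration. The point is only that a stemmer applies blind string rules, while a lemmatizer consults a vocabulary and the part of speech.

```python
# Toy illustration only. Real stemmers and lemmatizers are far more elaborate.
# The suffix list and lookup table below are hypothetical.

SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

# Hypothetical mini-vocabulary keyed by (token, part of speech)
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("studies", "VERB"): "study",
    ("studies", "NOUN"): "study",
}


def crude_stem(word):
    """Chop the first matching suffix off `word`, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos):
    """Look up the dictionary form for (word, pos); fall back to the word."""
    return LEMMA_TABLE.get((word, pos), word)


print(crude_stem("studies"))         # studi  (not a real word)
print(toy_lemmatize("studies", "NOUN"))  # study
print(toy_lemmatize("saw", "VERB"))  # see
print(toy_lemmatize("saw", "NOUN"))  # saw
```

Notice that the stem "studi" is not a dictionary word, whereas the lemma is, and that the lemma of "saw" depends on how the token is used.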

So what do we do? Use pretrained models from [spaCy](https://spacy.io/)! It does all the tokenization and lemmatization for you.

[Read more about spaCy's pretrained models](https://spacy.io/models)

# Lemmatizing with spaCy

Let's start by loading the trained model. We don't need the Named Entity Recognition (NER) or the text classification capabilities of the model, so we don't load them to make everything faster. If we wanted to lemmatize a language other than English, we would just need to download a trained model for that language using spaCy and change the 'en_core_web_sm' to the name of that model. Everything else would be the same.

In [None]:
import spacy

# Load the small pretrained model
nlp = spacy.load("en_core_web_sm", disable=["ner", "textcat"])
print(type(nlp))

In [None]:
# spaCy expects a string as input, so let's use .join to force our list into a string  
words = ' '.join(tokens)
words

In [None]:
doc = nlp(words)
doc

In [None]:
# We can now get the lemma of any word using the .lemma_ attribute
doc[5].lemma_

In [None]:
# ...Or even the part of speech
doc[5].pos_

In [None]:
# Note that spaCy also has its own stopwords list
nlp.Defaults.stop_words

Check out the [spaCy documentation](https://spacy.io/api/token#attributes) for more information about all the linguistic features that spaCy allows you to access as attributes.

Now, let's create a function that takes in a list of tokens and lemmatizes it using spaCy.

In [None]:
# Define our function
def lemmatize(tokens):
    """Return the lemmas for each word in `tokens`."""
    
    # spaCy models operate on strings, not lists, so we join the tokens back
    # into a single string of words
    words = ' '.join(tokens)
    
    # this line does all sorts of processing, including the lemmatization.
    # `doc` will be like a list of tokens that we can iterate over
    doc = nlp(words)
    
    # each token in `doc` holds information about that token. The `lemma_`
    # attribute holds the lemma of that token represented as a string. For
    # performance reasons, `lemma` (without the trailing underscore) holds
    # an integer hash ID of the lemma, which we'll rarely ever need.
    return [token.lemma_ for token in doc]

In [None]:
lemmas = lemmatize(tokens)
print(lemmas)

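# Notice that spaCy v2 lemmatizes pronouns (e.g. "you", "I", "your") in a funny way.
# It just tells us that they are pronouns (the placeholder "-PRON-"), rather than
# giving us something like "your" -> "you". spaCy v3 no longer uses this placeholder,
# so your output may differ depending on the installed version.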


# N-grams, skip-grams, and BERT?

Are you interested in tokenizing more than just single words in order to capture more context? An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a "contiguous sequence of n items from a given sample of text or speech."

[Check out this clever solution for n-gramizing text](https://stackoverflow.com/questions/17531684/n-grams-in-python-four-five-six-grams)
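As a minimal sketch of the idea, the function below builds n-grams from a list of tokens using `zip` over staggered slices. The function name `ngrams` and the example sentence are our own choices for illustration, and this is just one common idiom among several.

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in `tokens`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n staggered views of the list; zip stops at the shortest one,
    # so we never run past the end of `tokens`
    return list(zip(*(tokens[i:] for i in range(n))))


example_tokens = ["the", "quick", "brown", "fox"]
print(ngrams(example_tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(ngrams(example_tokens, 3))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```

If `n` is larger than the number of tokens, the result is simply an empty list.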

Do you want even more context? We will learn more about [skip-grams](https://en.wikipedia.org/wiki/Word2vec) and [BERT](https://www.searchenginejournal.com/google-bert-update/332161/#close) later in this course. 


"The skip-gram architecture weighs nearby context words more heavily than more distant context words."
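To get a feel for what "weighs nearby context words more heavily" could mean, here is a rough sketch that pairs each word with its neighbours inside a window and attaches a weight that shrinks with distance. The function name `skipgram_pairs` and the weighting formula are assumptions chosen for illustration: the weight equals the chance of keeping a word at that distance if the window size were drawn uniformly from 1 to `window`. It does not train any model.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context, weight) for every context word within `window`.

    The weight (window - distance + 1) / window is an illustrative choice,
    not a fixed part of any library's API.
    """
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue  # a word is not its own context
            distance = abs(i - j)
            yield target, tokens[j], (window - distance + 1) / window


for pair in skipgram_pairs(["a", "b", "c"], window=2):
    print(pair)
# ('a', 'b', 1.0)
# ('a', 'c', 0.5)
# ('b', 'a', 1.0)
# ('b', 'c', 1.0)
# ('c', 'a', 0.5)
# ('c', 'b', 1.0)
```

Adjacent words get the full weight of 1.0, while words two positions away get 0.5.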

"The BERT algorithm (Bidirectional Encoder Representations from Transformers) is a deep learning algorithm related to natural language processing. It helps a machine to understand what words in a sentence mean, but with all the nuances of context."