# Lesson 4: Lemmatization Nuances in Natural Language Processing with spaCy

### Lesson Overview

Welcome back! As we move ahead in our Natural Language Processing journey, today's lesson is about a fundamental component of NLP preprocessing — **Lemmatization**. We will get hands-on with the spaCy library to implement lemmatization on our text data.

By the end of the lesson, you should be skilled in explaining and implementing lemmatization in your data preprocessing pipeline for NLP tasks.

### Understanding Lemmatization

Lemmatization, in the context of Natural Language Processing, is the process of reducing a word to its base (dictionary) form, called the lemma.

Let's take an example: suppose we have a verb in its present participle form, like **flying**. The base form of flying is **fly**, so lemmatizing flying gives fly. Similarly, **better** is reduced to **good**, **mice** becomes **mouse**, and so on.

So, why lemmatization? Well, while dealing with natural language, it happens quite frequently that we encounter different forms of the same word. For a machine, **better**, **good**, and **best** are different words, even though they essentially express the same thing. When we perform tasks like text classification, these different forms are treated as different features, thus increasing the dimensionality of our dataset. By lemmatizing, we can reduce these variations to their root form, thereby reducing the number of features and making our model more efficient.
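To make the feature-reduction argument concrete, here is a minimal sketch that counts distinct words before and after lemmatization. The `TOY_LEMMAS` table and the `lemmatize` helper are hypothetical stand-ins for a real lemmatizer such as spaCy's, used only so the example runs without any model installed.

```python
# Toy lemma table: a hypothetical stand-in for a real lemmatizer (not spaCy's data).
TOY_LEMMAS = {
    "better": "good",
    "best": "good",
    "mice": "mouse",
    "flying": "fly",
    "flew": "fly",
}


def lemmatize(word: str) -> str:
    """Return the lemma from the toy table, or the lowercased word itself."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


words = ["good", "better", "best", "mouse", "mice", "fly", "flying", "flew"]

# Each distinct surface form would be a separate feature in a bag-of-words model.
raw_vocab = set(words)
# After lemmatization, inflected forms collapse onto a shared lemma.
lemma_vocab = {lemmatize(w) for w in words}

print(len(raw_vocab), "distinct surface forms")
print(len(lemma_vocab), "distinct lemmas:", sorted(lemma_vocab))
```

In this toy vocabulary, eight surface forms collapse to three lemmas; on real text the reduction depends on the corpus and the lemmatizer.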

### spaCy's Capability on Lemmatization

spaCy offers a convenient and efficient way to perform lemmatization on text. When a pretrained spaCy pipeline such as `en_core_web_sm` processes text, it performs lemmatization by default and keeps the lemma (or root form) of each word as an attribute of the word. This attribute can be accessed as `token.lemma_`, where `token` is the word we're dealing with.

Now, let's move onto the practical implementation.

### Implementing Lemmatization using spaCy

Let's perform lemmatization on an example sentence.

```python
import spacy

nlp = spacy.load('en_core_web_sm')
sentence = "The striped bats are hanging on their feet and ate best fishes"
doc = nlp(sentence)

for token in doc:
    print(token.text, token.lemma_)
```
In the above code, we initially load the English language model using `nlp = spacy.load("en_core_web_sm")`. We then use this model to process our sentence and convert it to a doc, which is essentially a collection of tokens (or words).

Finally, we iterate over each token in the doc and print the token and its corresponding lemma. The lemma of a token can be accessed using the `lemma_` attribute of the token.

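The output of the above code will be similar to the following, with one token and its lemma per line (exact lemmas can vary between model versions):

The the
striped stripe
bats bat
are be
hanging hang
on on
their their
feet foot
and and
ate eat
best good
fishes fish
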
This output demonstrates how each word from our sentence is processed and reduced to its lemma form. Notice how "bats" is converted to "bat", and "ate" to "eat", showcasing the effectiveness of lemmatization in normalizing text.

### Use Case for Lemmatization

So how does this help in real-world Natural Language Processing tasks? Lemmatization reduces the various inflected forms of a word to a single form. This can significantly reduce the number of unique words in our text (which, in case of text data, means reducing the number of features) without losing significant meaning.

Text classification, sentiment analysis, and topic modeling are just a few NLP tasks that can benefit from the dimensionality reduction lemmatization offers. A smaller feature set makes the dataset more manageable and training more computationally efficient, and it can often lead to more accurate machine learning models.
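As a rough illustration of why this helps tasks that compare documents, the sketch below measures word overlap between two short texts before and after lemmatization. The `TOY_LEMMAS` table, the `lemmatize` helper, and the `jaccard` function are assumptions made for this example; in practice the lemmas would come from spaCy.

```python
# Toy lemma table: a hypothetical stand-in for lemmas produced by a real model.
TOY_LEMMAS = {
    "ate": "eat",
    "eats": "eat",
    "fishes": "fish",
    "bats": "bat",
}


def lemmatize(word: str) -> str:
    """Look the word up in the toy table; unknown words map to themselves."""
    return TOY_LEMMAS.get(word, word)


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)


doc_a = "bats ate fishes".split()
doc_b = "bat eats fish".split()

# Raw tokens share nothing, so the two documents look unrelated.
raw_overlap = jaccard(set(doc_a), set(doc_b))
# Lemmas coincide, so the shared meaning becomes visible as shared features.
lemma_overlap = jaccard({lemmatize(w) for w in doc_a}, {lemmatize(w) for w in doc_b})

print("overlap on raw tokens:", raw_overlap)
print("overlap on lemmas:", lemma_overlap)
```

The example is deliberately extreme; real documents usually share some raw tokens already, and the gain from lemmatization is smaller.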

### Lesson Summary and Practice

Well done on reaching this point! Today you learned about lemmatization, its importance in NLP data preprocessing, and how to utilize spaCy to perform lemmatization. This knowledge is an integral part of any NLP pipeline and will assist you greatly in future tasks.

Up next, we'll be practicing using spaCy's lemmatization functionality on actual datasets and assessing its effects on our text data. This will reinforce your understanding and further boost your spaCy mastery! See you in the next lesson.


## Change the Sentence for Lemmatization

Stellar Navigator, you're doing a fantastic job! For a small twist, change the sentence in our code to "Explorers venture through the immense void separating our planet from the red Martian surface, seeking new horizons". Run the lemmatization process using spaCy and observe the output. Keep going!

```python
import spacy

nlp = spacy.load('en_core_web_sm')

sentence = "Telescopes uncover the striking hues of far-off nebulae and galaxies, mesmerizing observers across the globe."
doc = nlp(sentence)

for token in doc:
    print(token.text, token.lemma_)
```

Let's modify the sentence in the code to "Explorers venture through the immense void separating our planet from the red Martian surface, seeking new horizons" and run the lemmatization process using spaCy. Here’s the updated code:
```python
import spacy

nlp = spacy.load('en_core_web_sm')

sentence = "Explorers venture through the immense void separating our planet from the red Martian surface, seeking new horizons."
doc = nlp(sentence)

# Print each token next to its lemma
for token in doc:
    print(token.text, token.lemma_)
```
Running this code requires the `en_core_web_sm` model. If the model is not installed in your environment, `spacy.load` raises an error; you can install it with `python -m spacy download en_core_web_sm`.

Once the model is available, you should see output similar to this (the comma and the final period are also printed as separate tokens; they are omitted here):

Explorers explorer
venture venture
through through
the the
immense immense
void void
separating separate
our our
planet planet
from from
the the
red red
Martian Martian
surface surface
seeking seek
new new
horizons horizon


This output shows how each word from the sentence is processed and reduced to its lemma form.
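Since the examples in this lesson depend on the `en_core_web_sm` model being present, a small pre-flight check can be handy. The sketch below relies on the fact that pip-installed spaCy models are importable Python packages; the `model_is_installed` helper is a hypothetical name introduced here and uses only the standard library.

```python
import importlib.util

MODEL_NAME = "en_core_web_sm"


def model_is_installed(name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(name) is not None


if model_is_installed(MODEL_NAME):
    print(f"{MODEL_NAME} is installed, so spacy.load('{MODEL_NAME}') should work.")
else:
    # Assumption: the model is installed through spaCy's download command.
    print(f"{MODEL_NAME} not found. Try: python -m spacy download {MODEL_NAME}")
```

This only checks that the package can be found; it does not verify that the model version is compatible with the installed spaCy version.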

## Lemmatizing Reuters Dataset with spaCy

Glad to see you here, Stellar Navigator! Your task is to adjust the raw_text value in our code. Currently, it lemmatizes a string about Mars, but let's examine a real-world dataset. Please replace that string with the first 100 characters from the Reuters dataset, obtained using the 'test/14826' file ID. Observe the lemma output!

```python
import spacy
import nltk
from nltk.corpus import reuters 

# Download Reuters dataset
nltk.download('reuters')

# Load English language model from spaCy
nlp = spacy.load('en_core_web_sm')

# Get the raw text from the sentence and limit it to the leading 100 characters
raw_text = "Mars is the fourth planet in our solar system"

# Process the sentence using the English language model
doc = nlp(raw_text)

# Loop through each token in doc
for token in doc:
    # Print each token's text and lemma
    print(token.text, token.lemma_)

```

Let's adjust the code to replace the `raw_text` value with the first 100 characters from the Reuters dataset using the 'test/14826' file ID. Here’s the updated code:
```python
import spacy
import nltk
from nltk.corpus import reuters

# Download Reuters dataset
nltk.download('reuters')

# Load English language model from spaCy
nlp = spacy.load('en_core_web_sm')

# Get the raw text from the Reuters dataset and limit it to the leading 100 characters
raw_text = reuters.raw('test/14826')[:100]

# Process the sentence using the English language model
doc = nlp(raw_text)

# Loop through each token in doc
for token in doc:
    # Print each token's text and lemma
    print(token.text, token.lemma_)
```

This code will fetch the first 100 characters from the specified Reuters article and perform lemmatization on it. If you run this code in your local environment, you should see the lemmatized output for the text extracted from the Reuters dataset. 
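One thing to keep in mind is that slicing by character count, as in `[:100]`, can cut the last word in half, and the lemmatizer will then see a fragment instead of a real word. The following sketch shows one possible way to trim a slice back to whole words; `clip_to_whole_words` is a hypothetical helper, not part of spaCy or NLTK, and the sample string stands in for the Reuters text.

```python
import re


def clip_to_whole_words(text: str, limit: int) -> str:
    """Return at most `limit` leading characters, dropping a trailing partial word."""
    clipped = text[:limit]
    # The cut is mid-word when the characters on both sides of it are non-whitespace.
    cut_mid_word = (
        len(text) > limit
        and not text[limit].isspace()
        and not clipped[-1:].isspace()
    )
    if cut_mid_word:
        # Remove the trailing run of non-whitespace characters (the partial word).
        clipped = re.sub(r"\S+$", "", clipped)
    return clipped.rstrip()


sample = "Mars is the fourth planet in our solar system"

print(repr(sample[:10]))                       # plain slice ends inside "the"
print(repr(clip_to_whole_words(sample, 10)))   # partial word removed
```

For the exercise itself the plain slice is what is asked for; this variant is only worth considering when partial tokens at the end would distort your results.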


## Lemmatization on Reuters Dataset with spaCy

Good effort, Space Voyager! Here's another checkpoint. Fill in the blanks (_____). Download the reuters data from nltk, then retrieve the raw text of the file ID test/14826 and limit it to the leading 50 characters. Further, use spaCy to process the sentences and print the root form (lemma) of each token. Carry on!

```python
import ____ 
import nltk
from nltk.corpus import ____ 

nltk.download('reuters')
# Specify a fileid for the reuters corpus
raw_text = ____('test/14826')[:50]

nlp = ____.load('en_core_web_sm')
sentence = raw_text
doc = ____(sentence)

for _____ in doc:
    print(_____.text, _____.lemma_)

```



Here’s the completed code with the blanks filled in:
```python
import spacy
import nltk
from nltk.corpus import reuters

nltk.download('reuters')
# Specify a fileid for the reuters corpus
raw_text = reuters.raw('test/14826')[:50]

nlp = spacy.load('en_core_web_sm')
sentence = raw_text
doc = nlp(sentence)

for token in doc:
    print(token.text, token.lemma_)
```
This code will download the Reuters dataset, retrieve the raw text from the specified file ID, limit it to the first 50 characters, and then use spaCy to process the text and print the lemma of each token.

## Integrating Lemmatization into Text Processing Pipeline

Good job, Stellar Navigator! Now, add a crucial line of code to finish the lemmatization pipeline. Fill in the TODO comment by appending lemmas of non-stopwords and non-punctuation characters to our list. Be mindful of our dear Reuters dataset when illustrating your prowess!

```python
import spacy
import nltk
from nltk.corpus import reuters 

# Download the Reuters corpus; quiet=True suppresses informational messages
nltk.download('reuters', quiet=True)
raw_text = reuters.raw('test/14828')[:50]

nlp = spacy.load('en_core_web_sm')
sentence = raw_text
doc = nlp(sentence)

# Initialize an empty list to store the lemmas
lemmas = []

# Loop through each token in the doc
for token in doc:
    # TODO: Add the non-stopwords and non-punctuation token's lemma to the list


# Print the list of lemmas
print(lemmas)
```

To complete the lemmatization pipeline, you can fill in the TODO comment by appending the lemmas of non-stopwords and non-punctuation characters to the `lemmas` list. Here's the updated code:

```python
import spacy
import nltk
from nltk.corpus import reuters

# Download the Reuters corpus; quiet=True suppresses informational messages
nltk.download('reuters', quiet=True)
raw_text = reuters.raw('test/14828')[:50]

nlp = spacy.load('en_core_web_sm')
sentence = raw_text
doc = nlp(sentence)

# Initialize an empty list to store the lemmas
lemmas = []

# Loop through each token in the doc
for token in doc:
    # Add the non-stopwords and non-punctuation token's lemma to the list
    if not token.is_stop and not token.is_punct:
        lemmas.append(token.lemma_)

# Print the list of lemmas
print(lemmas)
```

In this code, the condition checks if the token is not a stopword and not punctuation before appending its lemma to the `lemmas` list. This ensures that only meaningful words are included in the final output.
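The same filter can be packaged as a small reusable function. The sketch below is one way to do it, written so that it runs without a spaCy model: `content_lemmas` is a hypothetical helper, and `fake_doc` is a hand-built stand-in whose objects only mimic the `text`, `lemma_`, `is_stop`, and `is_punct` attributes of spaCy tokens.

```python
from types import SimpleNamespace
from typing import Iterable, List


def content_lemmas(tokens: Iterable) -> List[str]:
    """Collect lemmas of tokens that are neither stop words nor punctuation."""
    return [t.lemma_ for t in tokens if not t.is_stop and not t.is_punct]


# Hand-built stand-ins for spaCy tokens (attribute values are chosen by hand).
fake_doc = [
    SimpleNamespace(text="The", lemma_="the", is_stop=True, is_punct=False),
    SimpleNamespace(text="bats", lemma_="bat", is_stop=False, is_punct=False),
    SimpleNamespace(text="are", lemma_="be", is_stop=True, is_punct=False),
    SimpleNamespace(text="hanging", lemma_="hang", is_stop=False, is_punct=False),
    SimpleNamespace(text=".", lemma_=".", is_stop=False, is_punct=True),
]

print(content_lemmas(fake_doc))
```

Because the function only reads those four attributes, it should also accept a real spaCy `doc`, for example `content_lemmas(nlp(raw_text))`.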

## Lemmatization with spaCy on the Reuters Dataset

Fantastic progress, Stellar Navigator! Now, put your acquired skills to the test. Write a Python program that retrieves the Reuters dataset file 'test/14829' and uses spaCy's English language model to tokenize and lemmatize the text, filtering out all stop words along the way. Finally, print each token alongside its corresponding lemma.

```python
import spacy
import nltk
from nltk.corpus import reuters 

# TODO: Download the Reuters dataset

# TODO: Load the English language model from spaCy

# TODO: Get the raw text from the 'test/14829' file in the Reuters dataset

# TODO: Process the sentence using the English language model

# TODO: Loop through each token in doc, and print each token's text and lemma, filtering stop words along the way

```

Here's a complete Python program that retrieves the Reuters dataset file 'test/14829', uses spaCy's English language model to tokenize and lemmatize the text, and prints each token alongside its corresponding lemma. It filters out all stop words as requested, and additionally skips punctuation tokens:

```python
import spacy
import nltk
from nltk.corpus import reuters

# Download the Reuters dataset
nltk.download('reuters', quiet=True)

# Load the English language model from spaCy
nlp = spacy.load('en_core_web_sm')

# Get the raw text from the 'test/14829' file in the Reuters dataset
raw_text = reuters.raw('test/14829')

# Process the sentence using the English language model
doc = nlp(raw_text)

# Loop through each token in doc, and print each token's text and lemma, filtering stop words along the way
for token in doc:
    if not token.is_stop and not token.is_punct:
        print(f'Token: {token.text}, Lemma: {token.lemma_}')
```

In this program, the following steps are performed:

1. The Reuters dataset is downloaded using NLTK.
2. The English language model is loaded from spaCy.
3. The raw text from the specified Reuters file is retrieved.
4. The text is processed to create a `doc` object.
5. Each token is checked to ensure it is not a stop word or punctuation, and then both the token and its lemma are printed.