# Lesson 5: Understanding and Implementing POS Tagging with spaCy

## Introduction to the Lesson

Hello and welcome to this lesson on **Understanding and Implementing Part-of-speech (POS) Tagging with spaCy**!

In this lesson, we'll discuss what POS tagging means, its importance in Natural Language Processing (NLP), and how we can effortlessly perform it using spaCy. By the end of this lesson, you should be able to process a text and tag each token (word) with its corresponding POS using spaCy.

## Introduction to POS Tagging

POS tagging is the process of assigning a part-of-speech label (noun, verb, adjective, etc.) to each token (word) in a given text. For example, in the sentence "Sam eats quickly.", "Sam" is a proper noun, "eats" is a verb, and "quickly" is an adverb. This matters because the meaning of a sentence depends heavily on the part of speech of each of its words.

POS tagging identifies not only the categories a word can belong to, but also how the word is actually used within the sentence. For instance, "book" can be a noun ("Sam reads a book.") or a verb ("Book a ticket for me."), and POS tagging helps in distinguishing between these uses.
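To see why the word alone is not enough, here is a minimal sketch built on a hand-written, hypothetical mini-lexicon (it is not spaCy's data or API). A plain lookup can list the tags a word may take, but it cannot choose between them; that choice needs the surrounding context, which is what a tagger supplies.

```python
# Hypothetical mini-lexicon: each word maps to the POS tags it can take.
LEXICON = {
    "book": {"NOUN", "VERB"},
    "sam": {"PROPN"},
    "reads": {"VERB"},
    "a": {"DET"},
    "ticket": {"NOUN"},
}

def possible_tags(word):
    """Return the sorted list of tags the lexicon allows for a word."""
    return sorted(LEXICON.get(word.lower(), set()))

def is_ambiguous(word):
    """A word is ambiguous when the lexicon allows more than one tag."""
    return len(LEXICON.get(word.lower(), set())) > 1

for word in ["Book", "ticket"]:
    print(word, possible_tags(word), "ambiguous:", is_ambiguous(word))
```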

POS tagging plays a crucial role in NLP tasks such as parsing, text-to-speech conversion, machine translation, and the extraction of relationships and entities. For example, in information extraction, if you want to extract all named entities that are organizations from some text, knowing that a word is a proper noun (`NNP` in the detailed Penn Treebank tag set) may not be enough; you also need its context among the other words in the text.

## Understanding POS Tagging Implementation with spaCy

Implementing POS tagging in spaCy is pretty straightforward. However, it's important to note that POS tagging in spaCy is statistical, meaning it is based on trained models that consider the context of the words in the text. When we process a text with the `nlp` object, spaCy tokenizes the text and runs its pipeline components (including the tagger) to produce a `Doc` object. This `Doc` object carries all the computed attributes and properties that we can delve into. For POS tagging, we focus on two token attributes:

- **pos_**: This is the simple part-of-speech tag, using the Universal POS tag set. It provides a general POS tag, like 'NOUN', 'VERB', 'ADV', etc.
- **tag_**: This is the detailed part-of-speech tag using the Penn Treebank POS tag set. It provides detailed POS information, like 'VBZ' (verb, 3rd person singular present), 'RB' (adverb), etc.
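If a tag label is unfamiliar, spaCy's `spacy.explain` function returns a short description from its built-in glossary. The sketch below looks up the labels mentioned in the list above; it needs only the `spacy` package, not a downloaded model. The helper name `describe_labels` is made up for this illustration.

```python
import spacy

def describe_labels(labels):
    """Map each tag label to spaCy's glossary description (None if not in the glossary)."""
    return {label: spacy.explain(label) for label in labels}

# Coarse Universal POS labels followed by fine-grained Penn Treebank labels.
descriptions = describe_labels(["NOUN", "VERB", "ADV", "VBZ", "RB"])
for label, description in descriptions.items():
    print(f"{label:6} {description}")
```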

## Performing POS Tagging on a Sample Text Using spaCy

The power of any learning lies in the doing. Let's roll up our sleeves and dive into some code. For POS tagging with spaCy, we need to process the text and loop through the token properties of the processed `Doc` object. Let's look at how to do that.

First, we import the spaCy library and load the English language model.

```python
import spacy
nlp = spacy.load("en_core_web_sm")
```

Define a sentence of English text that we want to perform POS tagging on:

```python
text = "I am learning NLP and using spaCy for POS tagging."
```

Process this text using the `nlp` function to create a `Doc` object:

```python
doc = nlp(text)
```

Perform POS tagging on each token in the `Doc` object using a for loop:

```python
for token in doc:
    print(f"{token.text:10} {token.lemma_:10} {token.pos_:10} {token.tag_:10}")
```

This small piece of code will give you the POS tagging information for each word in the text, providing the word (`token.text`), its base form (`token.lemma_`), simple POS (`token.pos_`), and detailed POS tag (`token.tag_`).
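The width of 10 in each replacement field pads the value to a minimum of ten characters, which is what lines the output up in columns. A minimal sketch on a hand-written row (the helper `format_row` is hypothetical and does not call spaCy):

```python
def format_row(*fields, width=10):
    """Left-align each field in a column of at least `width` characters."""
    return " ".join(f"{field:{width}}" for field in fields)

row = format_row("am", "be", "AUX", "VBP")
print(repr(row))

# The width is a minimum: longer values are not truncated.
print(repr(format_row("tokenization")))
```

Because the width is only a minimum, unusually long tokens push the later columns to the right; a larger width can be passed if the text contains long words.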

## Understanding the Output and Next Steps

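With the `en_core_web_sm` model, the output should look similar to the following (exact tags and lemmas can vary between model versions):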

```text
I          I          PRON       PRP       
am         be         AUX        VBP       
learning   learn      VERB       VBG       
NLP        NLP        PROPN      NNP       
and        and        CCONJ      CC        
using      use        VERB       VBG       
spaCy      spacy      NOUN       NN        
for        for        ADP        IN        
POS        POS        PROPN      NNP       
tagging    tagging    NOUN       NN        
.          .          PUNCT      .         
```

This output illustrates the tokenization and POS tagging of a text. Each row represents a token (word or punctuation) from the original text. The columns include the token itself, its lemma (base form), its simple POS tag, and its detailed POS tag. This tagging helps in understanding not just the role of each word in the sentence, but also its base form, which is crucial for many NLP tasks.
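As one way to read this output at a glance, the sketch below tallies the coarse POS column. The `(token, pos)` pairs are copied by hand from the table above rather than produced by running a model.

```python
from collections import Counter

# (token, coarse POS) pairs copied from the output table above.
tagged = [
    ("I", "PRON"), ("am", "AUX"), ("learning", "VERB"), ("NLP", "PROPN"),
    ("and", "CCONJ"), ("using", "VERB"), ("spaCy", "NOUN"), ("for", "ADP"),
    ("POS", "PROPN"), ("tagging", "NOUN"), (".", "PUNCT"),
]

# Count how often each coarse tag occurs, most frequent first.
pos_counts = Counter(pos for _, pos in tagged)
for pos, count in pos_counts.most_common():
    print(f"{pos:6} {count}")
```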

## Lesson Summary and Upcoming Tasks

What a journey that was! We moved from understanding POS tagging to implementing it using spaCy and working through a practical example. You've now gained a significant NLP skill and you should be able to process a text using spaCy and perform POS tagging on it.

As we transition into the practice tasks, it's important to remember that learning is an iterative process. The more you do, the more you learn and reinforce that learning. The next practice tasks will allow you to apply what you've learnt today and cement this new knowledge. Happy tagging!


## Refining Output Format of POS Tagging

Great job, Stellar Navigator! Your next task is to change the printed output format. Currently, it prints each token's details on one line. Modify it so that each token's details are printed in a more intuitive sentence structure like: "The word X is a Part of Speech. In its base form, it is Lemma. The detailed POS tag for X is POS Tag."

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I am learning NLP and using spaCy for POS tagging."
doc = nlp(text)

for token in doc:
    print(f"{token.text} {token.lemma_} {token.pos_} {token.tag_}")

```

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I am learning NLP and using spaCy for POS tagging."
doc = nlp(text)

for token in doc:
    print(f"The word '{token.text}' is a '{token.pos_}'. In its base form, it is '{token.lemma_}'. The detailed POS tag for '{token.text}' is '{token.tag_}'.")
```

This version prints the details of each token in the specified sentence structure.

## POS tagging on a Real-world Text Document

Awesome work, Stellar Navigator! Now, let's use the same process to perform POS tagging on a real-world text. Your task is to replace the simple English string with the first document from `nltk.corpus.reuters`. Remember to import the `reuters` corpus from `nltk.corpus` and then process the document as a single string. Observe the output of the POS tags on a more complex text.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

text_string = "I am learning NLP and using spaCy for POS tagging."

doc = nlp(text_string)

for token in doc:
    print(f'The word "{token.text}" is a {token.pos_}. In its base form, it is "{token.lemma_}". The detailed POS tag for "{token.text}" is {token.tag_}.')

```

Here's the updated code that uses the first document from the `nltk.corpus.reuters` dataset for POS tagging. This will allow you to observe the output of the POS tags on a more complex text:

```python
import spacy
import nltk
from nltk.corpus import reuters

# Download the Reuters dataset if not already downloaded
nltk.download('reuters', quiet=True)

# Load the English language model from spaCy
nlp = spacy.load("en_core_web_sm")

# Get the raw text from the first document in the Reuters dataset
text_string = reuters.raw(reuters.fileids()[0])

# Process the document as a single string
doc = nlp(text_string)

# Print the POS tagging details for each token
for token in doc:
    print(f'The word "{token.text}" is a {token.pos_}. In its base form, it is "{token.lemma_}". The detailed POS tag for "{token.text}" is {token.tag_}.')
```

In this code:

1. The Reuters dataset is imported and downloaded if necessary.
2. The first document from the Reuters dataset is retrieved and processed as a single string.
3. The POS tagging details for each token in the document are printed in the specified format.

This will give you a more complex text to analyze.

## Analyzing Verb Usage in Reuters News

Superb work, Stellar Navigator! Are you ready to hunt for verbs in the Reuters news? Fill in the blanks (____) to load the spaCy model, prepare the text, iterate through its tokens, collect their tags, and, finally, print out the top five most frequently used verbs. Let's analyze the Reuters corpus!

```python
import spacy
import pandas as pd
from nltk.corpus import reuters

nlp = ____

text_list = [' '.join(reuters.words(file_id)) for file_id in reuters.fileids()[:5]]
text = ____ 

doc = ____

tokens = []
lemmas = []
pos_tags = []
detailed_pos_tags = []

for token in ____:
    tokens.append(____)
    lemmas.append(____)
    pos_tags.append(____)
    detailed_pos_tags.append(____)

df = pd.DataFrame(list(zip(tokens, lemmas, pos_tags, detailed_pos_tags)), 
                  columns = ['Token', 'Lemma', 'POS', 'Detailed_POS'])

top_verbs = df[df['POS'] == 'VERB']['Lemma'].value_counts().head(5)
print(____)

```

Here's the completed code to load the spaCy model, preprocess the text, iterate through it, tag it, and print out the top five most frequently used verbs from the Reuters corpus:

```python
import spacy
import pandas as pd
from nltk.corpus import reuters

nlp = spacy.load("en_core_web_sm")

text_list = [' '.join(reuters.words(file_id)) for file_id in reuters.fileids()[:5]]
text = ' '.join(text_list)

doc = nlp(text)

tokens = []
lemmas = []
pos_tags = []
detailed_pos_tags = []

for token in doc:
    tokens.append(token.text)
    lemmas.append(token.lemma_)
    pos_tags.append(token.pos_)
    detailed_pos_tags.append(token.tag_)

df = pd.DataFrame(list(zip(tokens, lemmas, pos_tags, detailed_pos_tags)), 
                  columns = ['Token', 'Lemma', 'POS', 'Detailed_POS'])

top_verbs = df[df['POS'] == 'VERB']['Lemma'].value_counts().head(5)
print(top_verbs)
```

In this code:

- The spaCy model is loaded with `spacy.load("en_core_web_sm")`.
- The text is created by joining the words from the first five Reuters articles.
- The `doc` variable processes the text with spaCy.
- The loop iterates through the tokens in the `doc`, collecting their text, lemmas, part-of-speech tags, and detailed tags.
- Finally, the top five most frequently used verbs are printed.
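The same count can also be done without a DataFrame. Here is a minimal alternative sketch using `collections.Counter`; the `(lemma, pos)` pairs are hand-written for illustration and are not real Reuters output.

```python
from collections import Counter

# Illustrative (lemma, coarse POS) pairs; in the exercise these would come
# from token.lemma_ and token.pos_ for each token in doc.
pairs = [
    ("say", "VERB"), ("trade", "NOUN"), ("say", "VERB"), ("rise", "VERB"),
    ("be", "AUX"), ("export", "NOUN"), ("rise", "VERB"), ("say", "VERB"),
    ("expect", "VERB"),
]

# Keep only lemmas tagged as VERB, then take the five most frequent.
verb_counts = Counter(lemma for lemma, pos in pairs if pos == "VERB")
top_verbs = verb_counts.most_common(5)
print(top_verbs)
```

Note that filtering on `VERB` leaves out auxiliaries such as "be", which carry the separate `AUX` tag in the Universal POS tag set.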



## Frequency Analysis on Adjectives Using POS Tagging

Great work so far, Stellar Navigator! Are you ready for the next step? Your task is to create a DataFrame containing only adjectives. Then, find and print the top five most frequently used adjectives in our text document. I can't wait to see your result!

```python
import spacy
from nltk.corpus import reuters
import pandas as pd

nlp = spacy.load("en_core_web_sm")

text_list = reuters.words(reuters.fileids()[0])
text = ' '.join(word for word in text_list)

doc = nlp(text)

tokens = []
lemmas = []
pos_tags = []
detailed_pos_tags = []

for token in doc:
    tokens.append(token.text)
    lemmas.append(token.lemma_)
    pos_tags.append(token.pos_)
    detailed_pos_tags.append(token.tag_)

df = pd.DataFrame(list(zip(tokens, lemmas, pos_tags, detailed_pos_tags)), 
                  columns =['Token', 'Lemma', 'POS', 'Detailed_POS'])

# TODO: Create dataframe of adjectives only
# TODO: Print the 5 most common adjectives


```

Here's the completed code to create a DataFrame containing only adjectives and to find and print the top five most frequently used adjectives from the Reuters corpus:

```python
import spacy
from nltk.corpus import reuters
import pandas as pd

nlp = spacy.load("en_core_web_sm")

text_list = reuters.words(reuters.fileids()[0])
text = ' '.join(word for word in text_list)

doc = nlp(text)

tokens = []
lemmas = []
pos_tags = []
detailed_pos_tags = []

for token in doc:
    tokens.append(token.text)
    lemmas.append(token.lemma_)
    pos_tags.append(token.pos_)
    detailed_pos_tags.append(token.tag_)

df = pd.DataFrame(list(zip(tokens, lemmas, pos_tags, detailed_pos_tags)), 
                  columns=['Token', 'Lemma', 'POS', 'Detailed_POS'])

# Create dataframe of adjectives only
adjectives_df = df[df['POS'] == 'ADJ']

# Print the 5 most common adjectives
top_adjectives = adjectives_df['Lemma'].value_counts().head(5)
print(top_adjectives)
```

### Explanation:

1. **DataFrame Creation:**  
   The code filters the original DataFrame `df` to create a new DataFrame `adjectives_df` that contains only the rows where the part-of-speech (POS) tag is 'ADJ' (for adjectives).

2. **Finding Top Adjectives:**  
   The code then counts the occurrences of each lemma in the `adjectives_df` and retrieves the top five most frequently used adjectives using `value_counts().head(5)`.

3. **Output:**  
   Finally, it prints the top five adjectives along with their counts.

This code will help you analyze the frequency of adjectives in the text document from the Reuters corpus.
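Counting the `Lemma` column rather than the `Token` column groups inflected forms together. A small sketch with hand-written rows (illustrative values, not actual Reuters output or guaranteed spaCy lemmas) shows the difference:

```python
import pandas as pd

# Hand-written rows for illustration only.
df = pd.DataFrame({
    "Token": ["higher", "high", "highest", "new", "New"],
    "Lemma": ["high", "high", "high", "new", "new"],
    "POS": ["ADJ"] * 5,
})

adjectives_df = df[df["POS"] == "ADJ"]

# Surface forms are all distinct, so every count is 1.
token_counts = adjectives_df["Token"].value_counts()
# Lemmas merge the inflected and capitalized variants.
lemma_counts = adjectives_df["Lemma"].value_counts()

print(token_counts)
print(lemma_counts)
```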


## Exploring the POS Usages of a Single Word

Well done, Stellar Navigator! Now, let's wrap it up with a fun exercise. Pick a word from the Reuters dataset and identify all the different part-of-speech usages of this word. Let's explore the word "back". Get started; words are waiting to be revealed!

```python
import spacy
from nltk.corpus import reuters
import pandas as pd

word_to_explore = '____'

nlp = spacy.load("en_core_web_sm")

# TODO: Load the first 50 documents from the Reuters dataset

# TODO: Perform POS tagging on the text documents just like you did before

# TODO: Create a dataframe that reflects a POS tagging structure

# TODO: Filter the dataframe to display POS tags for your selected word

# TODO: Display the count of each POS category your word falls into

```

Here's the completed code to explore the word "back" in the Reuters dataset, identifying all the different parts of speech usages of this word:

```python
import spacy
from nltk.corpus import reuters
import pandas as pd

word_to_explore = 'back'

nlp = spacy.load("en_core_web_sm")

# Load the first 50 documents from the Reuters dataset
text_list = [' '.join(reuters.words(file_id)) for file_id in reuters.fileids()[:50]]
text = ' '.join(text_list)

# Perform POS tagging on the text documents
doc = nlp(text)

tokens = []
lemmas = []
pos_tags = []
detailed_pos_tags = []

for token in doc:
    tokens.append(token.text)
    lemmas.append(token.lemma_)
    pos_tags.append(token.pos_)
    detailed_pos_tags.append(token.tag_)

# Create a dataframe that reflects a POS tagging structure
df = pd.DataFrame(list(zip(tokens, lemmas, pos_tags, detailed_pos_tags)), 
                  columns=['Token', 'Lemma', 'POS', 'Detailed_POS'])

# Filter the dataframe to display POS tags for your selected word
filtered_df = df[df['Token'].str.lower() == word_to_explore]

# Display the count of each POS category your word falls into
pos_count = filtered_df['POS'].value_counts()
print(pos_count)
```

### Explanation:

1. **Word Selection:**  
   The variable `word_to_explore` is set to "back".

2. **Loading Documents:**  
   The code loads the first 50 documents from the Reuters dataset and combines their words into a single text string.

3. **POS Tagging:**  
   The text is processed with spaCy to perform part-of-speech tagging.

4. **DataFrame Creation:**  
   A DataFrame `df` is created to store the tokens, lemmas, POS tags, and detailed POS tags.

5. **Filtering for the Selected Word:**  
   The DataFrame is filtered to create `filtered_df`, which contains only the rows where the token matches "back" (case insensitive).

6. **Counting POS Categories:**  
   Finally, the code counts the occurrences of each POS category for the word "back" and prints the results.

This will allow you to explore the different usages of the word "back" in the context of the Reuters dataset.
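A case-insensitive match on a pandas column goes through the `.str` accessor, since a Series itself has no `lower()` method. A minimal sketch on hand-written rows (illustrative values, not actual Reuters output):

```python
import pandas as pd

# Hand-written rows for illustration; not actual Reuters output.
df = pd.DataFrame({
    "Token": ["Back", "back", "back", "pay", "back"],
    "POS": ["ADV", "ADV", "NOUN", "VERB", "VERB"],
})

word_to_explore = "back"

# .str.lower() lowercases every value, so "Back" and "back" both match.
mask = df["Token"].str.lower() == word_to_explore
pos_count = df.loc[mask, "POS"].value_counts()
print(pos_count)
```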
