# Lesson 3: Mastering Advanced Tokenization Techniques in NLP with spaCy

### Introduction to spaCy and its Installation

In Natural Language Processing (NLP), spaCy is one of the most popular libraries. It is designed for large-scale information extraction and provides robust implementations of common NLP tasks such as tokenization, part-of-speech tagging, and named entity recognition.

To get started with spaCy, you need to install the library on your device. You can install spaCy by running the following pip command in your terminal or command prompt:


`pip install -U spacy`


If your development environment supports the `!` shell escape (as Jupyter notebooks do), you can run the same command from a code cell instead:


`!pip install -U spacy`


Additionally, we need to download a model to perform NLP tasks with spaCy. For this lesson, we will be using `en_core_web_sm`, a small English language model for spaCy. This command should be executed at the terminal/command prompt:


`python -m spacy download en_core_web_sm`


#### Loading the English Model

Once spaCy and the English language model are installed, we can load the model into our Python environment and start using it. spaCy also provides larger models with more capabilities, but we use the small model here because it is quicker to download and requires less memory.

Check out the following code block that imports the `spacy` library and loads the English language model:


```python
import spacy
nlp = spacy.load('en_core_web_sm')
```


The `nlp` object is now a spaCy `Language` pipeline: calling it on a string returns a `Doc` containing the tokens and the annotations produced by the model's components.
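If the model download was skipped, `spacy.load` raises an `OSError`. The sketch below shows one way to handle that case. `load_pipeline` is a hypothetical helper, not part of spaCy, and the fallback to `spacy.blank("en")` is an assumption that only suits tokenization, since a blank pipeline has no trained components.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Load a trained pipeline, or fall back to a tokenizer-only English pipeline.

    Hypothetical helper for illustration; not part of the spaCy API.
    """
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        print(f"Model '{name}' not found; falling back to spacy.blank('en').")
        return spacy.blank("en")


nlp = load_pipeline()

# A trained pipeline lists its components here; a blank pipeline lists none.
print(nlp.pipe_names)
```

Printing `nlp.pipe_names` is a quick way to confirm which annotation steps will run when text is processed.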

#### Process a Text Using spaCy

In this section, we'll dive into how we can use the loaded spaCy model to analyze some text. When we process a piece of text with the model, several operations occur. First, the text is tokenized, or split up into individual words or symbols called tokens. Then, the model performs a range of annotation steps, using statistical models to make predictions about each token—for instance, whether a token is a named entity, or what part of speech a word is.

In this basic example, we're mainly interested in the tokenization process. Let's give it a try:


```python
doc = nlp("I am learning Natural Language Processing with spaCy")
for token in doc:
    print(token.text)
```


The output of the above code will be:

```sh
I
am
learning
Natural
Language
Processing
with
spaCy
```

This code takes the string `"I am learning Natural Language Processing with spaCy"`, processes it through the NLP pipeline, and then iterates through the resultant `doc` object, printing out the text of each token. Under the hood, spaCy is tokenizing the string for us.
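Each token carries more than its text. As a minimal sketch, the example below prints a few token attributes for the same sentence. It assumes a tokenizer-only pipeline from `spacy.blank("en")` is enough here, because the attributes shown do not depend on a trained model; `spacy.load('en_core_web_sm')` could be used in its place.

```python
import spacy

# Tokenizer-only pipeline: enough for token text, offsets, and lexical flags.
nlp = spacy.blank("en")
doc = nlp("I am learning Natural Language Processing with spaCy")

for token in doc:
    # token.i is the token's position in the doc; token.idx is its character offset.
    print(token.i, token.idx, token.text, token.is_alpha, token.is_title)
```

The character offset in `token.idx` makes it possible to map a token back to its position in the original string.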

### Lesson Summary and What to Expect Next

In this lesson, we covered how to get started with spaCy, from installing the library and language model to loading the model and processing a simple piece of text. This foundation will serve as a springboard for the more advanced topics that this lesson did not cover.

In the upcoming lessons, we will delve deeper into building a comprehensive NLP pipeline with spaCy. You will get hands-on experience with tokenization, POS tagging, and lemmatization, which are crucial for many NLP applications. I recommend revisiting the concepts and running the code discussed in this lesson to ensure a solid understanding of spaCy's functionality.

Great job so far, and stay tuned for more NLP adventures with spaCy!

## Filter Non-Alphabetic Stopword Tokens

Fantastic job, Stellar Navigator!

For your next challenge, apply what you’ve learned to change the string that is tokenized from "I am attending a practical lesson on Natural Language Processing with spaCy" to "I am participating in a practical lesson on Natural Language Processing with spaCy." Observe how this changes the tokens.

Let's get coding!

```python
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp("I am attending a practical lesson on Natural Language Processing with spaCy")
for token in doc:
    print(token.text)

```

Abacus AI ChatLLM:
> Routing to Claude Sonnet 3.5



I'll help you modify the code to use the new string and compare the tokenization results.

```python
import spacy

nlp = spacy.load('en_core_web_sm')

# Original text
print("Original text tokenization:")
doc1 = nlp("I am attending a practical lesson on Natural Language Processing with spaCy")
for token in doc1:
    print(token.text)

print("\nNew text tokenization:")
# Modified text
doc2 = nlp("I am participating in a practical lesson on Natural Language Processing with spaCy")
for token in doc2:
    print(token.text)

```
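To see the difference between the two sentences without reading both lists, the token lists can be compared directly. This is a minimal sketch; it uses `spacy.blank("en")` on the assumption that a tokenizer-only pipeline is enough for this comparison, and the variable names are illustrative.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; assumed sufficient for this comparison

old_text = "I am attending a practical lesson on Natural Language Processing with spaCy"
new_text = "I am participating in a practical lesson on Natural Language Processing with spaCy"

old_tokens = [token.text for token in nlp(old_text)]
new_tokens = [token.text for token in nlp(new_text)]

# Tokens that appear in only one of the two sentences, in sentence order.
removed = [t for t in old_tokens if t not in new_tokens]
added = [t for t in new_tokens if t not in old_tokens]

print(f"Token counts: {len(old_tokens)} -> {len(new_tokens)}")
print("Removed:", removed)
print("Added:", added)
```

Replacing one word with a two-word phrase adds one token, which is why the second list is longer by one.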

## Identifying Out-of-Vocabulary and Digital Tokens

Stellar job so far, NLP explorer!

Now here's a new challenge for you: fill in the spaces marked with ____ to make the code tokenize any sentences you choose. Remember how to utilize the spaCy model? See you at the finish line!

```python
import spacy

nlp = spacy.load('en_core_web_sm')

doc1 = ____
for token in doc1:
    print(token.text)

doc2 = ____
for token in doc2:
    print(token.text)

```

Abacus AI ChatLLM:

I'll help you complete the code with two interesting sentences to demonstrate spaCy's tokenization:

```python
import spacy

nlp = spacy.load('en_core_web_sm')

print("First sentence tokenization:")
doc1 = nlp("The quick brown fox jumps over the lazy dog")
for token in doc1:
    print(token.text)

print("\nSecond sentence tokenization:")
doc2 = nlp("OpenAI's GPT-4 represents a significant leap in AI technology")
for token in doc2:
    print(token.text)
```

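This code should produce output along the following lines. How `GPT-4` is split depends on spaCy's tokenizer rules for hyphens next to digits, so run the code to confirm the result on your installed version: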

First sentence tokenization:
```sh
The
quick
brown
fox
jumps
over
the
lazy
dog
```

Second sentence tokenization:
```sh
OpenAI
's
GPT
-
4
represents
a
significant
leap
in
AI
technology
```

I chose these two sentences because they demonstrate different aspects of spaCy's tokenization:
1. The first sentence is the classic pangram that contains every letter of the alphabet, showing basic word tokenization
2. The second sentence includes a possessive clitic ('s), a hyphenated term (GPT-4), and technical names, showing how spaCy handles more complex text elements

Notice how spaCy:
- Separates the possessive 's from OpenAI
- Applies its infix rules to the hyphenated term GPT-4
- Keeps names such as OpenAI and AI as single tokens
- Treats any punctuation it splits off as a token in its own right

Would you like to try different sentences or explore other aspects of these tokenized texts?
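Rather than taking a listed tokenization on trust, the tokenizer can report which rule produced each token. The sketch below uses `Tokenizer.explain`, which returns pairs of a rule label and the token text. It assumes `spacy.blank("en")`, which should apply the same English tokenizer rules without needing the trained model.

```python
import spacy

nlp = spacy.blank("en")
text = "OpenAI's GPT-4 represents a significant leap in AI technology"

# Each entry is a (rule_label, token_text) pair describing how the token was produced.
explained = nlp.tokenizer.explain(text)

for rule, token_text in explained:
    print(f"{token_text!r:15} {rule}")
```

Running this on your own installation shows how `'s` and `GPT-4` are actually handled by your spaCy version.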

## Counting Stop Word Tokens

Great job, Space Explorer! Now, fill in the blanks (____) to tokenize the first document in the Reuters corpus using spaCy. Let's put our knowledge into practice and see tokenization in action!

```python
from nltk.corpus import reuters
import spacy

nlp = spacy.____('en_core_web_sm')

doc = nlp(____)

for token in doc:
    print(token.text)

```

Abacus AI ChatLLM:

I'll help you complete the code to tokenize the first document in the Reuters corpus:

```python
from nltk.corpus import reuters
import spacy

nlp = spacy.load('en_core_web_sm')

# Get the first document from Reuters corpus and convert it to string
first_doc = reuters.raw(reuters.fileids()[0])
doc = nlp(first_doc)

print("Tokenizing the first Reuters document:")
for token in doc:
    print(token.text)
```

If NLTK's Reuters corpus is not already downloaded, you might need to first run:

```python
import nltk
nltk.download('reuters')
```

This code will:
1. Load the Reuters corpus using NLTK
2. Get the first document using `reuters.fileids()[0]` and convert it to raw text
3. Process the text using spaCy's English model
4. Print each token on a new line

The output will show all tokens from the first Reuters document, including words, punctuation, and special characters. Each token will appear on a separate line.

Note that the Reuters corpus contains financial news articles from the 1980s, so you might see terms related to business, economics, and international trade. Would you like to explore any specific aspects of the tokenized text, such as analyzing the types of tokens or their frequencies?
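As one way to analyze the types of tokens, the sketch below groups tokens into a few categories using spaCy's token flags. The sample sentence is a hypothetical stand-in for a Reuters document so that the example runs without the corpus, and the category names are illustrative.

```python
from collections import Counter

import spacy

nlp = spacy.blank("en")  # tokenizer only; these flags do not need a trained model

# Hypothetical sample; swap in reuters.raw(reuters.fileids()[0]) once the corpus is available.
sample_text = "The company said profits rose 12 pct in 1986, and the outlook is good."
doc = nlp(sample_text)

kinds = Counter()
for token in doc:
    if token.is_punct:
        kinds["punctuation"] += 1
    elif token.like_num:
        kinds["number-like"] += 1
    elif token.is_stop:
        kinds["stop word"] += 1
    else:
        kinds["other"] += 1

print(kinds)
```

The order of the checks matters: each token is counted once, under the first category it matches.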

## Identifying Token Capitalization in Text

Fantastic work, Space Wanderer!

Your latest achievement has unlocked a new mission: Add the missing lines to this code to tokenize the document from the Reuters dataset and print the total number of unique tokens. Keep soaring!

```python
from nltk.corpus import reuters
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(reuters.raw(reuters.fileids()[0]))

# TODO: Tokenize and store unique tokens
unique_tokens =

# TODO: Print total number of unique tokens

```

Abacus AI ChatLLM:

I'll help you complete the code to count unique tokens in the first Reuters document:

```python
from nltk.corpus import reuters
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(reuters.raw(reuters.fileids()[0]))

# Create a set of unique tokens (converting to lowercase to avoid duplicates due to capitalization)
unique_tokens = {token.text.lower() for token in doc if not token.is_space}

# Print total number of unique tokens
print(f"Total number of unique tokens: {len(unique_tokens)}")

# Optional: Print a sample of unique tokens to verify (sorted, because sets are unordered)
print("\nFirst 10 unique tokens (alphabetical):")
for token in sorted(unique_tokens)[:10]:
    print(token)
```

This code:
1. Creates a set of unique tokens using a set comprehension
2. Converts all tokens to lowercase to avoid counting capitalized versions as different tokens
3. Uses `token.is_space` to filter out whitespace tokens
4. Counts and prints the total number of unique tokens using `len()`
5. Optionally shows the first 10 unique tokens for verification

Some additional features we could add:
- Filter out punctuation: `if not token.is_punct`
- Filter out stop words: `if not token.is_stop`
- Count token frequencies: use a Counter object
- Filter by token length: `if len(token.text) > 1`

Would you like to explore any of these additional features or analyze the unique tokens in a different way?
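Several of these options can be combined. The sketch below counts token frequencies with `collections.Counter` while skipping whitespace, punctuation, and stop words. The sample text is a hypothetical stand-in for a Reuters document, and `spacy.blank("en")` is used on the assumption that a tokenizer-only pipeline is enough for these flags.

```python
from collections import Counter

import spacy

nlp = spacy.blank("en")

# Hypothetical sample text standing in for a Reuters document.
doc = nlp("Oil prices rose. Oil exporters said prices may rise again.")

# Lowercase each token and skip whitespace, punctuation, and stop words.
frequencies = Counter(
    token.text.lower()
    for token in doc
    if not (token.is_space or token.is_punct or token.is_stop)
)

print(f"Unique tokens after filtering: {len(frequencies)}")
print(frequencies.most_common(3))
```

Because a `Counter` is keyed by token text, `len(frequencies)` gives the unique-token count and the values give how often each one occurs.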

## Filtering Tokens Using a Simple Pipeline

Great work, Space Voyager! Let's put your skills to the test. Try writing a script from scratch to tokenize the first five documents from the Reuters corpus using spaCy. Filter out stop word tokens along the way. You are familiar with how to load the corpus, the spaCy model, and the tokenization process. Let's finish this strong!

```python
# TODO: Import the necessary libraries

# TODO: Load the spaCy English model

# TODO: Use a loop to iterate over the first five documents from the Reuters corpus

    # TODO: Tokenize each document with spaCy

    # TODO: Print the text of each non stop word token from the document

```

Abacus AI ChatLLM:

I'll help you create a script to tokenize and filter the first five Reuters documents:

```python
# Import necessary libraries
from nltk.corpus import reuters
import spacy

# Load spaCy English model
nlp = spacy.load('en_core_web_sm')

# Get first five document IDs
first_five_docs = reuters.fileids()[:5]

# Process each document
for doc_id in first_five_docs:
    print(f"\n### Document: {doc_id} ###")

    # Get raw text and process with spaCy
    raw_text = reuters.raw(doc_id)
    doc = nlp(raw_text)

    # Print non-stop word tokens
    print("Non-stop word tokens:")
    tokens = [token.text for token in doc if not token.is_stop and not token.is_space]

    # Print first 10 tokens with their index to avoid overwhelming output
    for i, token in enumerate(tokens[:10], 1):
        print(f"{i}. {token}")

    if len(tokens) > 10:
        print(f"... and {len(tokens) - 10} more tokens")

    print(f"Total non-stop word tokens: {len(tokens)}")
```

This script:
1. Imports the required libraries (NLTK Reuters corpus and spaCy)
2. Loads the English language model
3. Gets the first five document IDs from Reuters
4. For each document:
   - Retrieves the raw text
   - Processes it with spaCy
   - Filters out stop words and whitespace
   - Prints the first 10 non-stop word tokens
   - Shows the total count of non-stop word tokens

Additional features we could add:
- Filter out punctuation: `and not token.is_punct`
- Convert to lowercase: `token.text.lower()`
- Filter by token length: `and len(token.text) > 1`
- Count token frequencies using `collections.Counter`
- Analyze parts of speech or named entities

Would you like to explore any of these additional features or analyze the documents in a different way?
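When several documents are processed, `nlp.pipe` can take the texts as a batch instead of calling `nlp` once per document. The sketch below applies the same stop-word filter that way. The two sample texts are hypothetical stand-ins for Reuters documents, and `spacy.blank("en")` is assumed to be sufficient because only tokenization and lexical flags are used.

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical sample texts; with NLTK available these could be
# [reuters.raw(doc_id) for doc_id in reuters.fileids()[:5]].
texts = [
    "Asian exporters fear damage from the trade rift.",
    "The bank raised its lending rates on Monday.",
]

# nlp.pipe processes the texts as a stream and yields one Doc per input text.
filtered = [
    [token.text for token in doc if not (token.is_stop or token.is_punct or token.is_space)]
    for doc in nlp.pipe(texts)
]

for index, tokens in enumerate(filtered, 1):
    print(f"Document {index}: {tokens}")
```

For five short documents the difference is negligible, but batching tends to matter more as the number of texts grows.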