# Lesson 2: Installing and Getting Started with spaCy for NLP

### Lesson Overview

In this lesson, we'll sharpen our understanding of tokenization, exploring more advanced aspects such as handling special token types and working with different cases. Using the Reuters dataset and spaCy, our versatile NLP library, we'll go beyond basic tokenization, implementing strategies to handle punctuation, numbers, non-alphabetical characters, and stopwords. This lesson aims to deepen our NLP expertise and make text preprocessing even more effective.

### Tokenization and Its Role in NLP

First, let's revisit tokenization. In our previous lesson, we introduced tokenization as the process of splitting text into smaller pieces, called tokens. These tokens are the basic building blocks in NLP, enabling us to process and analyze text more efficiently. It's like slicing a cake to serve it: each slice (a token) is one piece of the overall content (the cake).
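As a quick recap, the sketch below tokenizes a single sentence. It assumes spaCy is installed and uses `spacy.blank("en")`, a tokenizer-only English pipeline, so no trained model has to be downloaded. The example sentence is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough for this recap
nlp = spacy.blank("en")

doc = nlp("Each slice of the cake is a token.")

# A Doc is a sequence of Token objects; .text returns each token's string
tokens = [token.text for token in doc]
print(tokens)
```

Notice that the final full stop becomes a token of its own, which is exactly the kind of special token this lesson deals with.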

Different types of tokens exist, each serving a unique purpose in NLP. Today, we will explore four types: punctuation tokens, numerical tokens, non-alphabetic tokens, and stopword tokens.

### Exploring Different Types of Tokens

Knowing the types of tokens we are working with is fundamental for successful NLP tasks. Let's take a closer look at each one:

- **Punctuation Tokens**: These are tokens composed of punctuation marks such as full stops, commas, exclamation marks, etc. Although often disregarded, punctuation can sometimes hold significant meaning, affecting the interpretation of the text.

- **Numerical Tokens**: These represent numbers found in the text. Depending on the context, numerical tokens can provide valuable information or, alternatively, can serve as noise that you might want to filter out.

- **Non-Alphabetic Tokens**: These tokens contain at least one character that is not a letter. They include numbers, punctuation, symbols, and whitespace, so this category overlaps with the two above.

- **Stopword Tokens**: Generally, these are common words like 'is', 'at', 'which', 'on'. In many NLP tasks, stopwords are filtered out as they often provide little to no meaningful information.

### Dealing with Special Cases in Tokenization

Special tokens like those mentioned often need to be treated differently depending on the task at hand. For instance, while punctuation might be critical for sentiment analysis (imagine an exclamation mark to express excitement), you may wish to ignore it while performing tasks like topic identification.

spaCy exposes simple boolean token attributes for these token types. `token.is_punct` flags punctuation tokens, `token.like_num` flags tokens that look like numbers (digits such as `10` as well as number words such as `ten`), `token.is_alpha` flags purely alphabetic tokens (so `not token.is_alpha` selects the non-alphabetic ones), and `token.is_stop` flags stopwords. Filtering a token list by any of these flags takes a single list comprehension.
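Before moving to a full Reuters document, it may help to see the four flags on one short sentence. This is a minimal sketch under two assumptions: spaCy is installed, and a blank English pipeline (`spacy.blank("en")`) is used instead of a trained model, since these flags are computed from the token text alone. The sentence is invented for illustration.

```python
import spacy

# Assumption: lexical flags work without a trained model, so a blank pipeline suffices
nlp = spacy.blank("en")
doc = nlp("Oil prices rose 5 percent in April, and ten firms cheered!")

punct = [t.text for t in doc if t.is_punct]          # punctuation marks
numeric = [t.text for t in doc if t.like_num]        # digits and number words
non_alpha = [t.text for t in doc if not t.is_alpha]  # anything that is not letters only
stopwords = [t.text for t in doc if t.is_stop]       # common function words

print("Punctuation:    ", punct)
print("Number-like:    ", numeric)
print("Non-alphabetic: ", non_alpha)
print("Stopwords:      ", stopwords)
```

Here `ten` counts as number-like even though it is alphabetic, which is why `like_num` and `not is_alpha` select different sets of tokens.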

### Interactive Code Run: Classifying Tokens with spaCy

Let's now run our example code and see these methods in action.

```python
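# Import necessary modules
# (assumes the model and corpus are installed:
#  `python -m spacy download en_core_web_sm` and `nltk.download('reuters')`)
import spacy
from nltk.corpus import reuters

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")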

# Get the raw document from reuters dataset
doc_id = reuters.fileids()[0]
doc_text = reuters.raw(doc_id)

# Pass the raw document to the nlp object
doc = nlp(doc_text)

# Get all punctuation tokens in the document
punctuation_tokens = [token.text for token in doc if token.is_punct]
print('Punctuation Tokens: ', punctuation_tokens)

# Get all numerical tokens in the document
numerical_tokens = [token.text for token in doc if token.like_num]
print('Numerical Tokens: ', numerical_tokens)

# Get all non-alphabetical tokens in the document
non_alpha_tokens = [token.text for token in doc if not token.is_alpha]
print('Non-Alphabetic Tokens: ', non_alpha_tokens)

# Get all stopword tokens in the document
stopword_tokens = [token.text for token in doc if token.is_stop]
print('Stopword Tokens: ', stopword_tokens)

# Extract the unique alphabetic tokens that are not stopwords
# (is_alpha already excludes punctuation, so no separate is_punct check is needed)
non_stop_alpha_tokens = list({token.text for token in doc
                              if token.is_alpha and not token.is_stop})

print('\nUnique non-stopword, non-punctuation alphabetic tokens: ', non_stop_alpha_tokens)
```

The output will look similar to the following (illustrative and truncated; the exact tokens depend on the document and the model version):

```sh
Punctuation Tokens:  ['.', '.', '.', ',', '.', ',', ',', ',', ',', '.', ...] (truncated for brevity)
Numerical Tokens:  ['1986', '15', '5', '5', '15']
Non-Alphabetic Tokens:  ['1986', '15', '.', '.', '5', ',', '5', '15', ',', ...] (truncated for brevity)
Stopword Tokens:  ['for', 'the', 'to', 'of', 'and', ...] (truncated for brevity)

Unique non-stopword, non-punctuation alphabetic tokens:  ['Lorem', 'Ipsum', 'Dolor', 'Sit', 'Amet', ...] (truncated for brevity)
```

This output shows spaCy classifying tokens into punctuation, numerical, non-alphabetic, and stopword tokens. It also shows the extraction of unique alphabetic tokens that are neither stopwords nor punctuation, a common preprocessing step in NLP.
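Deduplicating with `set` discards token order and treats `Trade` and `trade` as different tokens. If that matters for your task, one possible variant (a sketch, not part of the lesson's code) lowercases each token with `token.lower_` and deduplicates with `dict.fromkeys`, which keeps first-seen order. The sentence and the blank pipeline are assumptions made for illustration.

```python
import spacy

# Assumption: a blank English pipeline is enough, since only lexical flags are used
nlp = spacy.blank("en")
doc = nlp("Trade talks stalled. Exporters fear the trade rift will hurt exports.")

# Keep alphabetic non-stopword tokens, lowercased so case variants collapse
filtered = [t.lower_ for t in doc if t.is_alpha and not t.is_stop]

# dict.fromkeys removes duplicates while preserving first-seen order
unique_tokens = list(dict.fromkeys(filtered))
print(unique_tokens)
```

Whether to fold case depends on the task: it reduces vocabulary size, but it also erases distinctions such as proper nouns versus common words.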

### Summary and Upcoming Practice Tasks

Great work getting through this lesson! Today, we deepened our understanding of tokenization in NLP by looking at four special token types (punctuation, numerical, non-alphabetic, and stopword tokens), how to identify them with spaCy, and why and when they matter in NLP applications.

Now it's time to cement this knowledge with some hands-on practice. Up next are exercises that ask you to implement the techniques we covered today. Working through them is key to mastering token classification and building a strong foundation for more complex NLP tasks. Let's dive in!

## Changing the String for Tokenization

Great job, Stellar Navigator! Let's delve deeper into token analysis. For this task, update the code to collect the tokens that are both stop words and non-alphabetic. Create a new list, `non_alpha_stopword_tokens`, store the matching tokens in it, and print the result.

```python
# Firstly, let's import spaCy and load the English language model
import spacy
nlp = spacy.load('en_core_web_sm')

# import the Reuters Corpus
from nltk.corpus import reuters

# Then select a document from the Reuters Corpus
doc_id = reuters.fileids()[9]
doc_text = reuters.raw(doc_id)

# And Process the text with spaCy
doc = nlp(doc_text)

# Now, let's analyze these tokens more
punctuation_tokens = [token.text for token in doc if token.is_punct]
print('Punctuation tokens: ', punctuation_tokens)

numerical_tokens = [token.text for token in doc if token.like_num]
print('Numerical tokens: ', numerical_tokens)

non_alpha_tokens = [token.text for token in doc if not token.is_alpha]
print('Non-alphabetic tokens: ', non_alpha_tokens)

stopword_tokens = [token.text for token in doc if token.is_stop]
print('Stopword tokens: ', stopword_tokens)

```

Here's the updated code. It builds a new list, `non_alpha_stopword_tokens`, containing the tokens that are both stop words and non-alphabetic, and prints the result.

```python
# Firstly, let's import spaCy and load the English language model
import spacy
nlp = spacy.load('en_core_web_sm')

# Import the Reuters Corpus
from nltk.corpus import reuters

# Then select a document from the Reuters Corpus
doc_id = reuters.fileids()[9]
doc_text = reuters.raw(doc_id)

# And process the text with spaCy
doc = nlp(doc_text)

# Now, let's analyze these tokens more
punctuation_tokens = [token.text for token in doc if token.is_punct]
print('Punctuation tokens: ', punctuation_tokens)

numerical_tokens = [token.text for token in doc if token.like_num]
print('Numerical tokens: ', numerical_tokens)

non_alpha_tokens = [token.text for token in doc if not token.is_alpha]
print('Non-alphabetic tokens: ', non_alpha_tokens)

stopword_tokens = [token.text for token in doc if token.is_stop]
print('Stopword tokens: ', stopword_tokens)

# Create a new list for tokens that are both stop words and non-alphabetic
non_alpha_stopword_tokens = [token.text for token in doc if not token.is_alpha and token.is_stop]
print('Non-alphabetic Stopword tokens: ', non_alpha_stopword_tokens)
```

This code will now output the tokens that are both stop words and non-alphabetic, allowing for a more refined analysis of the text.
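It may seem that no token can be both a stop word and non-alphabetic. In English text, the typical matches are contraction pieces such as `'s` or `n't`: spaCy's tokenizer splits them off, and recent spaCy versions include them in the default English stop list (worth verifying against your installed version). The sketch below assumes a blank English pipeline and an invented sentence.

```python
import spacy

# Assumption: the default English stop list of the installed spaCy version
# contains contraction pieces such as "'s" and "n't"
nlp = spacy.blank("en")
doc = nlp("She's sure the deal isn't final.")

# Tokens that are stop words but contain a non-letter character
non_alpha_stopword_tokens = [t.text for t in doc if t.is_stop and not t.is_alpha]
print(non_alpha_stopword_tokens)
```

If this list comes back empty on your own documents, that usually just means the text contains no contractions.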

## Tokenize Sentences with Missing Code

Nice work, Stellar Navigator! It's time to use more built-in token attributes. Use `is_oov` to find out-of-vocabulary tokens (words that are not in the language model's predefined vocabulary) and `is_digit` to identify tokens that consist entirely of digits. Fill in the blanks (____) in the code to accomplish this.

```python
# Firstly, import spaCy and load the English language model
# We use the medium model because `is_oov` relies on word vectors,
# which the small model does not include
import spacy
nlp = spacy.load('en_core_web_md')

# Secondly, import the Reuters Corpus
from nltk.corpus import reuters

# Select a document from the Reuters Corpus
doc_id = reuters.fileids()[9]
doc_text = reuters.raw(doc_id)

# Process the text with spaCy
doc = nlp(doc_text)

# Let's analyze these tokens more!
oov_tokens = [token.text for token in doc if ____]
print('Out-of-vocabulary tokens: ', oov_tokens)

digit_tokens = [token.text for token in doc if ____]
print('Digit tokens: ', digit_tokens)

```

Here's the completed code with the appropriate attributes filled in to identify out-of-vocabulary tokens and digit tokens:

```python
# Firstly, import spaCy and load the English language model
# We use the medium model because `is_oov` relies on word vectors,
# which the small model does not include
import spacy
nlp = spacy.load('en_core_web_md')

# Secondly, import the Reuters Corpus
from nltk.corpus import reuters

# Select a document from the Reuters Corpus
doc_id = reuters.fileids()[9]
doc_text = reuters.raw(doc_id)

# Process the text with spaCy
doc = nlp(doc_text)

# Let's analyze these tokens more!
oov_tokens = [token.text for token in doc if token.is_oov]
print('Out-of-vocabulary tokens: ', oov_tokens)

digit_tokens = [token.text for token in doc if token.is_digit]
print('Digit tokens: ', digit_tokens)
```

In this code, `token.is_oov` finds out-of-vocabulary tokens, and `token.is_digit` identifies tokens that consist entirely of digits. Both are useful for spotting rare words and raw numbers in a document.
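`is_digit` is stricter than the `like_num` attribute used earlier in the lesson, and the difference is easy to miss. The following sketch compares the two on an invented sentence. It assumes a blank English pipeline, so `is_oov` is left out here because it needs a model with word vectors.

```python
import spacy

# Assumption: a blank English pipeline; is_digit and like_num need no trained model
nlp = spacy.blank("en")
doc = nlp("They sold 5 units, then ten more, for 1,000 dollars at 3.5 percent.")

# is_digit: the token text consists of digit characters only
digit_tokens = [t.text for t in doc if t.is_digit]

# like_num: the token resembles a number (digits, separators, number words)
number_like_tokens = [t.text for t in doc if t.like_num]

print("Digit tokens:       ", digit_tokens)
print("Number-like tokens: ", number_like_tokens)
```

Use `is_digit` when you need plain integers only, and `like_num` when formatted numbers and number words should count too.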

## Tokenizing First Reuters Document with spaCy

You're doing great, Stellar Navigator!

Let's make it a bit more challenging, shall we? Fill in the blank spots (____) to make the code work.

It should count the number of stop words in the text.
Print the count of stop word tokens.

```python
# Firstly, import spaCy and load the English language model
import spacy
nlp = spacy.load('en_core_web_sm')

# Secondly, import the Reuters Corpus
from nltk.corpus import reuters

# Select a document from the Reuters Corpus
doc_id = reuters.fileids()[11]
doc_text = reuters.raw(doc_id)

# Process the text with spaCy
doc = nlp(doc_text)

# Let's analyze these tokens more!
stopword_tokens = [token.text for token in doc if ____]
print('Number of stopword tokens: ', ____)

```

Here's the completed code with the blanks filled in to count the number of stop words in the text:

```python
# Firstly, import spaCy and load the English language model
import spacy
nlp = spacy.load('en_core_web_sm')

# Secondly, import the Reuters Corpus
from nltk.corpus import reuters

# Select a document from the Reuters Corpus
doc_id = reuters.fileids()[11]
doc_text = reuters.raw(doc_id)

# Process the text with spaCy
doc = nlp(doc_text)

# Let's analyze these tokens more!
stopword_tokens = [token.text for token in doc if token.is_stop]
print('Number of stopword tokens: ', len(stopword_tokens))
```

In this code, `token.is_stop` is used to filter the stop words, and `len(stopword_tokens)` counts the number of stop word tokens. This will print the total count of stop words in the selected document.
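A raw count is often more informative next to a breakdown. As a possible extension (a sketch with an invented sentence and a blank English pipeline, not the exercise's required solution), `collections.Counter` can tally each stop word, and dividing by the number of alphabetic tokens gives the share of stop words in the text.

```python
from collections import Counter

import spacy

# Assumption: a blank English pipeline is enough because is_stop is a lexical flag
nlp = spacy.blank("en")
doc = nlp("The market is up and the mood is good.")

# Tally stop words case-insensitively
stopword_counts = Counter(t.lower_ for t in doc if t.is_stop)
total_stopwords = sum(stopword_counts.values())

# Share of stop words among alphabetic tokens
alpha_total = sum(1 for t in doc if t.is_alpha)
stopword_share = total_stopwords / alpha_total

print("Stop word counts: ", stopword_counts.most_common())
print("Stop word share:  ", round(stopword_share, 2))
```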

## Calculating Unique Tokens in Document

Great job, Celestial Traveler!

For this task, complete the code to collect and print the uppercase and lowercase tokens in the document. Use what you have learned in the lesson to analyze the tokens!

```python
# Firstly, import spaCy and load the English language model
import spacy
nlp = spacy.load('en_core_web_sm')

# Secondly, import the Reuters Corpus
from nltk.corpus import reuters

# Select a document from the Reuters Corpus
doc_id = reuters.fileids()[2]
doc_text = reuters.raw(doc_id)

# Process the text with spaCy
doc = nlp(doc_text)

# Let's analyze these tokens more!
capitalized_tokens = [token.text for token in doc if token.is_title]
print('Capitalized tokens: ', capitalized_tokens)

# TODO: Print uppercase tokens

# TODO: Print lowercase tokens

```

Here's the completed code with the print statements filled in to display the uppercase and lowercase tokens in the document:

```python
# Firstly, import spaCy and load the English language model
import spacy
nlp = spacy.load('en_core_web_sm')

# Secondly, import the Reuters Corpus
from nltk.corpus import reuters

# Select a document from the Reuters Corpus
doc_id = reuters.fileids()[2]
doc_text = reuters.raw(doc_id)

# Process the text with spaCy
doc = nlp(doc_text)

# Let's analyze these tokens more!
capitalized_tokens = [token.text for token in doc if token.is_title]
print('Capitalized tokens: ', capitalized_tokens)

# Print uppercase tokens
uppercase_tokens = [token.text for token in doc if token.is_upper]
print('Uppercase tokens: ', uppercase_tokens)

# Print lowercase tokens
lowercase_tokens = [token.text for token in doc if token.is_lower]
print('Lowercase tokens: ', lowercase_tokens)
```

In this code, `token.is_upper` is used to filter uppercase tokens, and `token.is_lower` is used to filter lowercase tokens. The results will display the respective tokens from the selected document.
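The three case flags are mutually exclusive for ordinary words, which a short example makes concrete. This sketch assumes a blank English pipeline and an invented sentence.

```python
import spacy

# Assumption: a blank English pipeline; the case flags depend only on the token text
nlp = spacy.blank("en")
doc = nlp("NASA hired Alice to study black holes.")

title_tokens = [t.text for t in doc if t.is_title]  # first letter uppercase, rest lowercase
upper_tokens = [t.text for t in doc if t.is_upper]  # all letters uppercase
lower_tokens = [t.text for t in doc if t.is_lower]  # all letters lowercase

print("Capitalized tokens: ", title_tokens)
print("Uppercase tokens:   ", upper_tokens)
print("Lowercase tokens:   ", lower_tokens)
```

Tokens without any letters, such as the final full stop, match none of the three flags.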

## Tokenizing Multiple Reuters Documents with spaCy

Fantastic effort, Stellar Navigator! Now, for the final challenge: Write a simple pipeline to tokenize a sentence, filter out punctuation, stop words, and non-alphabetic tokens, and then print out the remaining tokens. Remember, practice is the key to mastering any skill!

```python
# Firstly, import spaCy and load the English language model
import spacy
nlp = spacy.load('en_core_web_sm')

# TODO: Define a sentence to process


# TODO: Process the sentence with spaCy


# TODO: Step 1: Filter out punctuation


# TODO: Step 2: Filter out stop words


# TODO: Step 3: Filter out non-alphabetic tokens


# TODO: Print the remaining tokens

```

Here's the completed code for a simple pipeline that tokenizes a sentence, filters out punctuation, stop words, and non-alphabetic tokens, and then prints out the remaining tokens:

```python
# Firstly, import spaCy and load the English language model
import spacy
nlp = spacy.load('en_core_web_sm')

# Define a sentence to process
sentence = "The quick brown fox jumps over the lazy dog!"

# Process the sentence with spaCy
doc = nlp(sentence)

# Step 1: Filter out punctuation
tokens_no_punct = [token for token in doc if not token.is_punct]

# Step 2: Filter out stop words
tokens_no_stop = [token for token in tokens_no_punct if not token.is_stop]

# Step 3: Filter out non-alphabetic tokens
remaining_tokens = [token.text for token in tokens_no_stop if token.is_alpha]

# Print the remaining tokens
print('Remaining tokens: ', remaining_tokens)
```

In this code, we define a sentence, process it with spaCy, and then filter out punctuation, stop words, and non-alphabetic tokens step by step. Finally, the remaining tokens are printed out. This pipeline effectively demonstrates the tokenization and filtering process!
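The three steps can also be folded into one pass. The sketch below wraps them in a hypothetical helper, `content_tokens` (a name introduced here for illustration), and assumes a blank English pipeline. Because `is_alpha` is already false for punctuation, the explicit punctuation check becomes redundant.

```python
import spacy

# Assumption: a blank English pipeline is enough for these lexical flags
nlp = spacy.blank("en")


def content_tokens(text):
    """Return the alphabetic, non-stopword tokens of `text` (hypothetical helper)."""
    # is_alpha rejects punctuation and numbers; is_stop rejects stop words
    return [t.text for t in nlp(text) if t.is_alpha and not t.is_stop]


remaining_tokens = content_tokens("The quick brown fox jumps over the lazy dog!")
print("Remaining tokens: ", remaining_tokens)
```

The step-by-step version is easier to debug because each intermediate list can be inspected, while the single-pass version is more compact once the filters are settled.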